Crawling the web is very resource intensive. You'll need thousands of machines, and probably a pretty sizable IPv4 allocation to go with that. You'll find that a lot of sites allow Googlebot and maybe a few other crawlers, but don't allow you -- because crawling puts too much load on their sites.
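To make the "Googlebot yes, you no" situation concrete, here's a rough sketch of what a polite crawler runs into. The robots.txt contents, the example.com URL and the "MyNewCrawler" user agent are all made up for illustration; the parsing uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site welcomes Googlebot and shuts out everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if this robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"  # placeholder URL
    print(allowed("Googlebot", page))     # True
    print(allowed("MyNewCrawler", page))  # False
```

A crawler that respects robots.txt simply never gets to see that site, and one that ignores it is likely to get its IPs blocked, which is part of why the IPv4 allocation matters.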
Once you have a snapshot of the web, you have two problems. The first is that your snapshot is already out of date; you're going to have to continuously update it. The second is that you have to figure out how to turn that enormous pile of data into something useful. That's probably going to take thousands more servers, plus a lot of development work to figure out what's useful.
And then, if you do manage to get decent results, you have two more problems. The first is speed -- to compete with Google, you need to be fast, and to be fast, you need to be close to users, which means you need datacenters spread throughout your market area. The second is branding: even if your results are objectively and subjectively better in a blind comparison, people are going to prefer the Google results because they have Google branding.
It's not an insurmountable barrier, but it's pretty big.