Here is some input based on my experience building a similar project at my former company. (We did not quite get to 2B pages, but were close to ~300M):
For creating a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Now, obviously, re-crawling a massive index frequently/regularly is going to need/consume some huge amounts of bandwidth + CPU cycles. Here is how we had optimized the resource utilization:
Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.
Corresponding to each indexed URL, also store a sort-of 'crawl-history' (If space is a constraint, don't store each version of the URL, store only the latest one). On each re-crawl, store two data fields: time-stamp and a boolean if the URL content has changed since last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then, prioritize the re-crawls based on the update frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently).
If you need any more help/input, let me know and I'll be happy to do what I can.
HTH and all the best moving forward.