The explosion of cheap CPU and storage means that a single server with a few terabytes of disk can serve up a billion or more spam pages. And seemingly everyone who gets into the game starts with "I know, we'll create a lot of web sites that link to this thing I'm trying to get to rank in Google results..." Worse, when it doesn't work, they don't bother taking that crap down; they just link to it from more and more other sites in an attempt to boost its host's authority. That doesn't work either (for getting PageRank).
But what it means is that 99.9% of all new web pages created on a given day are created by robots, algorithms, or other agencies with no motive to provide value, merely to provide "inventory" for advertisements. You are lucky if you can pull a billion "real" web pages out of a crawl frontier of 100 billion URIs.
That said, if you have ever wondered why domains that used to host legitimate web sites suddenly become huge spam havens, it is because spammers buy up a domain as soon as it expires and try to exploit its previous reputation as a non-spam site to push link authority into some crawl (generally Google's).
For the most part, we just provide general facts about the web, and we've been contacted by academics on more than one occasion for data sets.
We also calculate MozRank, which is our version of Google's PageRank, as well as some in-house higher-level metrics like PageAuthority and DomainAuthority, which are machine-learning models derived from all of the other metrics we compute.
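For readers unfamiliar with how a PageRank-style link metric works, here is a minimal sketch of the classic power-iteration approach on a toy link graph. This is purely illustrative: MozRank is a metric in this same spirit, but its actual implementation, damping factor, and scaling are not described here, and the `pagerank` function and toy graph below are invented for the example.

```python
# Illustrative sketch of power-iteration PageRank on a toy link graph.
# NOT Moz's actual MozRank implementation; damping/iterations are the
# conventional textbook defaults.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    # Collect every page mentioned anywhere, as source or target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets the baseline "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: "a" is linked to by both "b" and "c", so it should rank highest.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

The intuition this captures is exactly why the link-spam schemes above fail for ranking: rank flowing out of a page is proportional to the rank flowing *into* it, so a farm of zero-authority spam sites has almost nothing to pass along.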