Free 5 Billion Page Web Index Now Available from Common Crawl Foundation

(readwriteweb.com)

201 points · pooyak · 14y ago · 39 comments

36 comments · 14 top-level

mthoms · 14y ago · 5 in thread

I'd love to see Gabriel weigh in on this. I wonder if Duck Duck Go will be able to take advantage of this resource?

epi0Bauqu · 14y ago

I love stuff like this effort. The more open data sources, the better for everyone. I'm sure we (DuckDuckGo) will find a way to make use of it :)

wcchandler · 14y ago

This was my first thought, too. I can't seem to hit the resource to check it out but even if the content is "stale" it can be used for a couple different reasons. Initial snapshot of pages, decent starting index on self crawling (instead of reliance on BOSS or Bing), content differentiation, nullifying search bias (if existent)... But I'm not really a search guy so I could be jaded on its importance.

coderdude · 14y ago

Nova Spivack said that the crawls have been going for several years. There's a good chance that many of the pages in the archive are unacceptably outdated for indexing purposes.

ahadrana · 14y ago

Hi. I work for commoncrawl. We are about to start an improved recrawl and will be doing this more frequently going forward. In the process we will also consolidate our data on S3 to keep it relevant. But, as with any crawl of the Internet, there is a lot of noise in there. We spent most of 2011 tweaking the algorithms to improve the freshness and quality of the crawl, and hopefully this work starts to show results in 2012.

jerfelix · 14y ago

I'm not sure whether major portions of their archive are unacceptably outdated.

But I am sure that it would be a logic failure to conclude that it must be out of date simply because they've been indexing for several years. With that logic, Google would be further out of date, having indexed for over a decade.

pooyak (OP) · 14y ago · 5 in thread

One interesting discussion from here: http://www.commoncrawl.org/common-crawl-enters-a-new-phase/ It says the cost of running a Hadoop job to scan all 5 billion documents is on the order of $100.

Does anyone know how this compares to, let's say, Yahoo BOSS? Is it even comparable?

ahadrana · 14y ago

Hi, I work at commoncrawl, so I will try to answer your question. We store our crawl data on S3 in the form of 100MB compressed archives and there are between 40,000 and 50,000 such files in commoncrawl’s bucket today. The key to scanning such a large set of files efficiently on EC2 is to have each of your Mappers (assuming you are running Hadoop) open multiple S3 streams in parallel to maintain some desired level of throughput. For example, assuming that you can maintain on average a 1MByte/sec throughput per S3 stream, and you start 10 parallel streams per Mapper, you should be able to sustain a throughput of 80 Mbits/sec or 10 MBytes/sec. If you were to run one Mapper per EC2 small instance, and start 100 such instances, this would yield an aggregate throughput of close to 3TB/hour. At that rate, you would need 16 hours to scan 50TB of data, or a total of 1600 machine hours at $0.085 per hour, costing you somewhere in the neighborhood of $130.00. Of course, you would then need to add in the cost of running any subsequent aggregation / data consolidation jobs and the cost of storing your final data on S3. So, the $100.00 number is generally in the ballpark but final numbers may vary :-)

As far as comparisons to Yahoo BOSS are concerned, no, we are definitely not comparable to Yahoo BOSS or other such APIs that run on top of an already built (and properly ranked) inverted index of the web. At this stage we only produce bulk snapshots of what we crawl, and we are focusing our engineering resources on improving the frequency and coverage of the crawl (the results of which will hopefully start to bear fruit in early 2012). Perhaps at some point in the near future, we can partner with the community to build a rudimentary full-text inverted index of the Web that we can make available in bulk via S3 as well.
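The throughput and cost estimate in the reply above can be reproduced with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, not anything published by Common Crawl: the function name `estimate_scan` and its parameter names are made up for illustration, the defaults are the figures quoted in the comment, and it assumes decimal units (1 TB = 1,000,000 MB) and perfectly sustained throughput.

```python
def estimate_scan(dataset_tb=50.0, stream_mb_per_s=1.0, streams_per_mapper=10,
                  instances=100, price_per_hour=0.085):
    """Estimate wall-clock time and EC2 cost for one full scan of the crawl.

    All defaults come from the figures quoted in the thread; the function
    itself is a hypothetical helper, not part of any Common Crawl tooling.
    """
    # Throughput of one Mapper: parallel S3 streams times per-stream rate.
    mapper_mb_per_s = stream_mb_per_s * streams_per_mapper
    # One Mapper per instance, so cluster throughput scales with instance count.
    cluster_mb_per_s = mapper_mb_per_s * instances
    # Convert MB/s to TB/hour using decimal units (1 TB = 1,000,000 MB).
    cluster_tb_per_hour = cluster_mb_per_s * 3600 / 1_000_000
    wall_hours = dataset_tb / cluster_tb_per_hour
    machine_hours = wall_hours * instances
    return {
        "mapper_mb_per_s": mapper_mb_per_s,
        "cluster_tb_per_hour": cluster_tb_per_hour,
        "wall_hours": wall_hours,
        "machine_hours": machine_hours,
        "cost_usd": machine_hours * price_per_hour,
    }


if __name__ == "__main__":
    for name, value in estimate_scan().items():
        print(f"{name}: {value:.2f}")
```

With unrounded numbers this comes out to about 3.6 TB/hour, roughly 14 hours of wall-clock time, and about $118, which is consistent with the comment's more conservative "close to 3TB/hour" and "neighborhood of $130" once some slack for imperfect throughput is allowed.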

joda_ · 14y ago

Hey ahadrana, I haven't found anything about the page ranks on the website, are they included? Do you know if it is possible to go only through the metadata of the crawl, say to get the page ranks for a list of pages, or do you have to go through the full crawl?

Aloisius · 14y ago

Does BOSS still exist? I was under the impression that it was defunct.

michels24 · 14y ago

I was the former GM of Yahoo BOSS (was there from pre-launch through 11/09). BOSS does still exist - http://developer.yahoo.com/search/boss/. It is now a paid API under the umbrella of Yahoo Developer Network. The pricing plan (http://developer.yahoo.com/search/boss/#pricing) is based on query type and volume. Unfortunately there is no self-serve advertising model (meaning if you incorporate Y!/Bing search ads, the service is free). It's important to note though that this is the Bing search index, not the old Yahoo Search index that is effectively shut down. The original BOSS product was based on Yahoo! Search.

From what I have heard BOSS continues to do very well and is pointed at internally as how to turn an API into a real business and product.

One more note, I am now at Factual where we are very happy consumers of the CommonCrawl service.

nethsix · 14y ago

Yes. With Google no longer providing a search result API (not even a paid version, the last I checked), people are turning to BOSS/Bing/(anything else?)

patio11 · 14y ago · 3 in thread

I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling. They were far ahead of everyone else back then, but one could conceive of a rag-tag group of companies, institutions, and individuals pooling their resources and getting a crawl about 10% as good. These days, on the externally visible evidence they're probably several orders of magnitude better than everybody else on the planet combined.

Take crawl freshness. If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.

ahadrana · 14y ago

Hi, I work at commoncrawl. We have spent our time (in 2011) improving our algorithms, and hopefully this effort will start to show real results (with respect to crawl frequency and relevancy) in 2012. But you are right, it is pretty unlikely that our crawl will be able to be fully competitive with the likes of Google etc., multi-billion dollar corporations who dedicate huge amounts of engineering and hardware resources to stay competitive in this field.

funthree · 14y ago

It is not "Google etc., multi-billion dollar corporations" it is just Google.

anamax · 14y ago

> I was hoping for Yahoo, Amazon, or Microsoft to throw a lot of resources at this about 5~8 years ago. Since then, Google kind of ran away with the game in crawling.

In the 2004 timeframe, Yahoo was crawling about the same number of pages as Google. (More some months.)

> If I publish a new blog post, it gets crawled and added to the Google index in seconds. Other crawling efforts take weeks between refreshes.

Time from crawl to appearing in search results is a different issue.

wisty · 14y ago · 2 in thread

Is there a sample dataset?

I think all projects should have sample datasets. It simplifies a lot of things, and in this case stops hundreds of geeks burning through bandwidth before they realize they don't have a clue what they are going to do with the data.

ahadrana · 14y ago

We hear you. Could you define some criteria as to the type and size of sample data you would like to see? We are working on producing more targeted/limited collections, like perhaps all of the most recently published blog posts etc.

showerst · 14y ago

Perhaps two sets, one that's just a few hundred kilobytes that contains a few sample .arc files to test against the format, and then one larger 'training' set that's small enough to test against offline (maybe like 100MB?) but large enough to contain a good sample of the possible content.

dotBen · 14y ago · 2 in thread

Although I'm personally all for open distribution of crawl data like this and all of my personal websites are CC-licensed, isn't there something to be said for the copyright status of the pages in the crawl file?

The crawl file presumably contains the contents of websites and so the owners of those websites could assert that Common Crawl Foundation is distributing their work without permission or license.

There are all sorts of republishing/splog 'opportunities' with this crawl data that go beyond the original expected use.

Surprisingly, I couldn't see anything about this covered in the FAQs.

ahadrana · 14y ago

Hi, you can view our terms of use at http://www.commoncrawl.org/about/terms-of-use/full-terms-of-.... We adhere to the robots.txt standard, try to do all our crawling above board, and (strictly personal opinion here) we are definitely not in the business of diminishing or subverting people's rights with regards to the content they produce. There are many other options available to those who are determined to crawl a site's content, whether the site owner wants them to or not. Our goal is to democratize access to our crawl for the betterment of the Web ecosystem as a whole and we believe storing the data on S3 and making it accessible to a wide audience is the right way to accomplish this goal.

ohashi · 14y ago

I see it in the ToS:

http://www.commoncrawl.org/about/terms-of-use/

-Violate other people’s rights (IP, proprietary, etc.)

ChuckMcM · 14y ago · 2 in thread

I wonder if crooks will try to exploit this crawl. As a person who has an index of the web like this, it has been interesting to see what they look for. SSNs and credit card numbers are common, as are sites running older versions of PHP software or exploitable shopping carts.

yaix · 14y ago

It makes it very easy for people to steal vast amounts of your content and republish it on their own sites, with ads all around it.

Many content sites have protections in place to recognize bots by their behavior or use "honeypots" to tell bots apart from human visitors and thus avoid large scale content theft.

noahc · 14y ago

Presumably those protections would prevent this bot from collecting data as well?

rb2k_ · 14y ago · 1 in thread

Oh nice. I've been doing a lot of crawling myself (http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo...) and I'd love to get my hands on this data. I hope they'll segment their data a bit further.

I personally would LOVE to have a simple list of the domain names themselves without all of the connections and documents.

Also: Why not just use bittorrent to distribute it?

Aloisius · 14y ago

I imagine they don't use bittorrent because it is both very large (TBs) and changes frequently.

With S3, you could boot up a bunch of Hadoop processes, pull it (without incurring any bandwidth costs I believe), process it and dump out whatever you want.
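The "pull it, process it" step, combined with the several parallel streams per worker suggested earlier in the thread, might look roughly like the sketch below. This is an illustration under stated assumptions, not Common Crawl's code: `scan_archives` is a made-up name, and `fetch` is a hypothetical callable that you would supply to download one archive (for example, a thin wrapper around whatever S3 client you use).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def scan_archives(keys: Iterable[str],
                  fetch: Callable[[str], bytes],
                  streams: int = 10) -> int:
    """Download every archive with up to `streams` concurrent fetches.

    `fetch` is a caller-supplied (hypothetical) function mapping an archive
    key to its bytes. Returns the total number of bytes read.
    """
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # The pool runs at most `streams` fetch calls at the same time,
        # mirroring the "multiple S3 streams per Mapper" idea.
        return sum(len(body) for body in pool.map(fetch, keys))
```

A real job would parse each archive inside the worker instead of returning its bytes, so that memory use stays bounded; counting bytes here just keeps the example small.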

hnwh · 14y ago · 1 in thread

I don't see any links to download their Hadoop classes..

ahadrana · 14y ago

Sorry, our github repository had some accidental check-ins that we needed to remove. I will share the link to the code shortly.

mhp · 14y ago · 1 in thread

"Well this has to be a first for a software company"

I'll just leave this here: http://training.fogcreek.com

pragmatic · 14y ago

You sure you commented on the correct article?

rgrieselhuber · 14y ago

> We do not use Nutch for the purposes of crawling, but instead utilize a custom crawl infrastructure to strictly limit the rate at which we crawl individual web hosts.

Were there any other reasons to not use Nutch (performance, etc.)?

I'd love to hear more about the stack you're using to perform the crawls. If you don't mind sharing, it would be very interesting to read about the costs involved in gathering this data (how many machines, how long did it take, etc.)

Any plans to open source that as well? In addition to a general lack of open web crawl data freely available, there are precious few open source projects (if any) that produce high quality crawlers able to deal with the modern web.

rshm · 14y ago

It is good news for anyone with an eye on building a vertical search engine. With your device, the total cost of seed data (assuming about 40TB) comes to below one thousand dollars.

LisaG · 14y ago

New Common Crawl blog post addressing many of the questions raised here last week. http://www.commoncrawl.org/answers-to-recent-community-quest...

I work at Common Crawl. Thanks for all the interest and the good questions! Lisa

corbet · 14y ago

So what is the license for all of this data? It seems murky at best...

pablohoffman · 14y ago

I initially submitted this post, but then deleted it and resubmitted with a link to the original post on the Common Crawl blog: http://news.ycombinator.com/item?id=3208853

I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems this is not always the case.
