102TB of New Crawl Data Available

(commoncrawl.org)

237 points · LisaG · 12y ago · 37 comments

boyter · 12y ago · 5 in thread

I love Common Crawl, but as I commented before, I still want to see a subset available for download, something like the top million sites. Certainly a few tiers of data, say 50GB, 100GB and 200GB.

I really think a subset like this will increase the value, as it would allow people writing search engines (for fun or profit) to suck a copy down locally and work away. It's something I would like to do for sure.

LisaG (OP) · 12y ago

There will be news about a subset sometime next month!

malandrew · 12y ago

Ideally beyond the top sites, these subsets would be available as verticals, so that people can focus on specialized search engines.

While it's nice to have generalist search engines, it would be even better to be able to unbundle the generalist search engines completely. Verticals such as the following would be nice:

1) Everything Linux, Unix, or both

2) Everything open-source

3) Only news & current events

4) Popular culture globally and by country

5) Politics globally and by country

6) Everything software engineering

7) Everything hardware engineering

8) Everything maker community

9) Everything financial markets

10) Everything medicine / health (sans obvious quackery)

11) etc.

Maybe make a tool that lets the community write subset-creation recipes that extract data of a certain type, which the community can then fork and improve over time.

The ship has sailed on creating a generalist search engine, but specialist search engines are total greenfield.

hkmurakami · 12y ago

Would love to have even smaller subsets (like 5GB) that students can casually play around with to practice and learn tools and algorithms :) (if it's not too much trouble!)

daivd · 12y ago

One subset for each TLD would be nice. Or, if you can afford more CPU-power, per language, using a good open language detector.

boyter · 12y ago

Fantastic news. Will be looking forward to seeing it.

DigitalSea · 12y ago · 2 in thread

I've yet to find an excuse to download some of this data to play with. I have a feeling my ISP would send around a bunch of suits to collect the bill payment in person if I were ever to go over my 500GB monthly limit by downloading 102TB of data, haha. I would still like to download a subset of the data; from what I've read, that idea is already in the works. I just can't think of what I would do with it, perhaps a machine-learning project.

msoad · 12y ago

I'm on Comcast and download around 3TB/month with no problem. But seriously, why should you download big data to work with it? It's cheaper and faster to do it in the 'cloud'!

recuter · 12y ago

Why not grab it to/with a VPS?

sirsar · 12y ago · 2 in thread

> We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.

Where can I read more about this?

ldng · 12y ago

The "Resources" section of the post you haven't read?

sirsar · 12y ago

No, I mean the difference between the file types.
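
For readers with the same question: WARC is the container format the crawl data now uses, and a WAT file is itself WARC-formatted, with records whose payload is JSON metadata about a captured page. Below is a minimal sketch of reading one such record. The sample record, its URL, and its JSON fields are invented for illustration and are not real Common Crawl output; `parse_warc_record` is a hypothetical helper, not part of any library.

```python
import json

def parse_warc_record(raw: bytes):
    """Split one WARC-style record into (version, headers, payload)."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length says how many payload bytes follow the headers.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]

# Hypothetical sample record with a JSON payload (assumed field names).
body = json.dumps({"links": [{"url": "http://example.com/a"}]}).encode("utf-8")
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: application/json\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(sample)
metadata = json.loads(payload)
```

Real files are gzipped and hold many records back to back, so an actual reader would loop over the stream rather than parse a single in-memory record.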

iamtechaddict · 12y ago · 2 in thread

Is there a way to access the data (a small subset, say 30-40GB) without an AWS account? It requires a credit card, and as a student I don't have one.

wodow · 12y ago

Some of the older data (2009) is available on archive.org: https://archive.org/details/commoncrawl

iamtechaddict · 12y ago

Thanks a lot. It'll be very helpful, I'm sure.

rwg · 12y ago · 1 in thread

I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening, and I slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.

It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...

Aloisius · 12y ago

I'd have to check the last crawl settings, but I believe the last crawl was set to truncate at 1 MB (response body size, so that could be 1 MB uncompressed or 1 MB compressed depending on what the source web server sent out).

At one point I tried out a 10 MB limit. We try to limit crawls to webpages, and few are that big, but occasionally we'd hit sites on ISDN-speed connections that would slow down the whole thing.

For the next crawl, we'll mark which pages are truncated and which aren't (an oversight in the last crawl) so at least you can skip over them.

Also, hopefully you'll find the new metadata files to be a little clearer. We switched over to the same format the Internet Archive uses, and it contains quite a bit more data (truncated XPath paths for each link, for instance).
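
The idea of capping a response body and recording whether it was cut can be sketched as follows. This is only an illustration under assumptions: `LIMIT_BYTES`, `cap_body`, and the `pages` sample are hypothetical names and data, not the crawler's actual code or settings.

```python
LIMIT_BYTES = 1024 * 1024  # assumed 1 MB cap, taken from the comment above

def cap_body(body: bytes, limit: int = LIMIT_BYTES):
    """Return (possibly shortened body, truncated flag)."""
    truncated = len(body) > limit
    return body[:limit], truncated

# Hypothetical consumer: keep only documents that were stored in full.
pages = {
    "http://example.com/small": b"<html>ok</html>",
    "http://example.com/huge": b"x" * (LIMIT_BYTES + 1),
}
complete = [
    url for url, body in pages.items()
    if not cap_body(body)[1]
]
```

An explicit flag matters because a stored body of exactly the limit size could be either a complete document or a cut one; length alone cannot tell them apart.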

danso · 12y ago · 1 in thread

Very cool... though I have to say, CC is a constant reminder that whatever you put on the Internet will basically remain in the public eye for the perpetuity of electronic communication. There exist ways to remove your (owned) content from archive.org and Google, but once some other independent scraper catches it, you can't really do much about it.

bollacker · 12y ago

I think about this from George Santayana's perspective: "Those who cannot remember the past are condemned to repeat it." I feel like we need our past recorded (good, bad, AND ugly). It keeps us civil and humble.

GigabyteCoin · 12y ago · 1 in thread

Can anyone give me a quick rundown on how exactly one gains access to all of this data?

I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.

Perhaps I just don't know what I'm looking at?

wpietri · 12y ago

Did you try this?

http://commoncrawl.org/get-started/

I haven't tried that one, but I've poked at others in the Amazon Common Datasets collection:

http://aws.amazon.com/datasets

If you're already familiar with using Amazon's virtual servers, it's pretty straightforward.

I also note that the Common Crawl project publishes code here:

https://github.com/commoncrawl/commoncrawl

kohanz · 12y ago

I'm curious to hear how people are using Common Crawl data.

rb2k_ · 12y ago

Is there an easy way to grab JUST a list of uniq domains?

That would be a great starter for all sorts of fun little weekend experiments.
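
Given any list of crawled URLs, deduplicating hosts takes only a few lines. A minimal sketch, assuming the URLs have already been pulled out of the crawl metadata; `unique_hosts` and the sample list are hypothetical.

```python
from urllib.parse import urlsplit

def unique_hosts(urls):
    """Return the sorted set of hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname  # lowercased; None if there is no host
        if host:
            hosts.add(host)
    return sorted(hosts)

# Made-up input for illustration.
sample_urls = [
    "http://Example.com/a",
    "https://example.com/b",
    "http://news.example.org/",
    "not a url",
]
hosts = unique_hosts(sample_urls)
```

This yields hostnames rather than registered domains, so `news.example.org` and `example.org` would count separately. Collapsing them would need a public-suffix list, which this sketch leaves out.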

ma2rten · 12y ago

It would be great if Common Crawl (or anyone else) would also release a document-term index for its data. If you had an index, you could do a lot more things with this data.
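
For anyone unfamiliar with the term, a document-term (inverted) index maps each term to the documents that contain it. A toy sketch of the idea follows, assuming a naive lowercase alphanumeric tokenizer; `build_index`, `search`, and the sample documents are made up for illustration and say nothing about how such an index would be built at crawl scale.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in TOKEN.findall(text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up documents for illustration.
docs = {1: "Common Crawl data", 2: "crawl the web", 3: "open data"}
index = build_index(docs)
hits = search(index, "crawl data")
```

With the index in hand, a query becomes a set intersection over a few posting lists instead of a scan over every document.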

ecaron · 12y ago

Anyone have a good understanding of the difference between this and http://www.dotnetdotcom.org/? I've seen Dotbot in my access logs more than CommonCrawl, so I'm more inclined to believe they have a wider - but not deeper - spread.

recuter · 12y ago

Anybody want to take a guess at what percentage these 2B pages represent out of the total surface web at least? I can't find reliable figures; the numbers are all over the place. 5 percent?

kordless · 12y ago

Ah, distributed crawling. What a great idea. :)

csmuk · 12y ago

Well that would take 3.5 years to download on my Internet connection!
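
As a back-of-the-envelope check, the figures in this thread can be turned into a line speed and a data-cap estimate. This sketch assumes decimal units (1 TB = 10^12 bytes), a 365-day year, and a fully saturated link; the function names are hypothetical.

```python
def implied_mbit_per_s(total_tb: float, years: float) -> float:
    """Sustained line speed needed to move total_tb in the given time."""
    bits = total_tb * 1e12 * 8
    seconds = years * 365 * 24 * 3600
    return bits / seconds / 1e6

def months_under_cap(total_tb: float, cap_gb_per_month: float) -> float:
    """Months needed when limited to cap_gb_per_month of transfer."""
    return total_tb * 1000 / cap_gb_per_month

speed = implied_mbit_per_s(102, 3.5)   # the 3.5-year estimate above
months = months_under_cap(102, 500)    # the 500GB monthly cap mentioned earlier
```

Under these assumptions, 3.5 years corresponds to roughly 7.4 Mbit/s sustained, and a 500GB monthly cap would stretch the download to 204 months, or 17 years.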

manismku · 12y ago

That's great and cool stuff.
