I just finished crawling 5.19B web pages, Ask Me Anything

19 points · dor_jack · 9y ago · 19 comments

I WAS JUST RATE LIMITED BY HN, SO I'M GOING TO ANSWER YOUR QUESTIONS UNDER A NEW ACCOUNT: dor_jack_2

savethefuture · 9y ago · 6 in thread

What did you discover?

dor_jack (OP) · 9y ago

We are processing the data as we speak. However, the way technology adoption shifts depending on where a company is based is truly incredible.

Will update this in a few days with more data.

savethefuture · 9y ago

That will be an interesting correlation: seeing how frameworks, tech, or even design elements vary by geographical location.

savethefuture · 9y ago

How did you crawl so many sites, and how did you discover them: search engines, IP ranges, or another method?

dor_jack (OP) · 9y ago

The platform we used provided its own seed list and took it from there.

savethefuture · 9y ago

How long did it take? What type of data did you record?

dor_jack (OP) · 9y ago

It took us about 13 days. We recorded resources of all types: text/*, image/*, application/*

As one would expect, the vast majority of the data recorded is text/* (HTML, ...)
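
For a sense of scale, the figures quoted in this thread (5.19B pages in about 13 days) imply a sustained average fetch rate. A rough sketch of that arithmetic, assuming the crawl ran continuously for the full 13 days, which the thread does not confirm:

```python
# Back-of-the-envelope crawl throughput from the figures quoted in the thread.
# Assumption (not stated by the OP): the crawl ran non-stop for all 13 days.
PAGES = 5.19e9                    # pages crawled, from the thread title
DAYS = 13                         # "about 13 days"
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages: float, days: float) -> float:
    """Average fetch rate over the whole crawl."""
    return pages / (days * SECONDS_PER_DAY)


rate = pages_per_second(PAGES, DAYS)
print(f"{rate:,.0f} pages/second on average")
```

Under that assumption the average comes out to roughly 4,600 pages per second across all hosts combined.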

dm_i386 · 9y ago · 1 in thread

What tools did you use? What had to be custom-written and why?

dor_jack_2 · 9y ago

We tried a bunch of technologies like Nutch, Heritrix, StormCrawler, ... and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.

As for processing the data we crawled, we are using ArchiveSpark (https://github.com/helgeho/ArchiveSpark)

Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 for processing the results.

maurtinshkreli · 9y ago · 1 in thread

How much did it cost?

dor_jack (OP) · 9y ago

It was an all-inclusive deal: 420 TB at $0.06 per GB = $25,804
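
The quoted total only works out if one terabyte is counted as 1,024 GB; at 1,000 GB per TB the same rate gives a lower figure. A quick check of both conversions, assuming the 0.06 rate is in US dollars per GB:

```python
# Check the quoted crawl cost: 420 TB at 0.06 per GB.
# Assumption: the rate is USD per GB; only the TB-to-GB factor is varied.
TERABYTES = 420
RATE_PER_GB = 0.06


def crawl_cost(terabytes: int, rate_per_gb: float, gb_per_tb: int) -> float:
    """Total cost for a given TB-to-GB conversion factor."""
    return terabytes * gb_per_tb * rate_per_gb


binary_cost = crawl_cost(TERABYTES, RATE_PER_GB, 1024)   # 1 TB = 1,024 GB
decimal_cost = crawl_cost(TERABYTES, RATE_PER_GB, 1000)  # 1 TB = 1,000 GB

print(f"binary:  ${binary_cost:,.2f}")
print(f"decimal: ${decimal_cost:,.2f}")
```

The binary conversion gives $25,804.80, which matches the figure in the comment once the cents are dropped.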

tlack · 9y ago · 1 in thread

What did you do to avoid winding up in endless GET URL loops? How deep did you get per site, and how did you schedule follow-up requests?

dor_jack_2 · 9y ago

Loop/spam prevention was done by Mixnode; I'm not sure how they do it.

The data does not follow a DFS or BFS pattern, so pages per site vary greatly with a host's server capacity and anti-crawling configs.

There was a minimum of 10 seconds between follow-up requests to the same website unless robots.txt specified a lower delay. Pretty polite...
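
The per-host delay rule described above can be sketched with Python's standard `urllib.robotparser`. This is only an illustration of the rule as stated in the comment, not Mixnode's scheduler, whose internals are not described in this thread; `effective_delay` and `HostScheduler` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 10.0  # seconds; the gap between same-site requests quoted above


def effective_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Use the robots.txt Crawl-delay only when it is lower than the default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None when no Crawl-delay is set
    if declared is not None and declared < DEFAULT_DELAY:
        return float(declared)
    return DEFAULT_DELAY


class HostScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self) -> None:
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, host: str, now: float) -> float:
        """Seconds still to wait before `host` may be fetched again."""
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_request(self, host: str, delay: float, now: float) -> None:
        """Note that `host` was fetched at `now` and must rest for `delay`."""
        self._next_allowed[host] = now + delay


# Usage: a host that declares a 2-second Crawl-delay.
scheduler = HostScheduler()
delay = effective_delay("User-agent: *\nCrawl-delay: 2\n")
scheduler.record_request("example.com", delay, now=100.0)
print(scheduler.wait_time("example.com", now=101.0))
```

Passing `now` in explicitly keeps the sketch deterministic; a real crawler would read it from a monotonic clock.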

joshpen188 · 9y ago · 1 in thread

Why didn't you use Common Crawl instead?

dor_jack_2 · 9y ago

For our purposes, Common Crawl's corpus was missing too many websites (possibly due to the robots.txt configs of those websites). Also, we needed some deep coverage, which CC could not provide.

itburnslikeice · 9y ago · 1 in thread

But why?

dor_jack (OP) · 9y ago

Our company is in the Marketing Intelligence (MI) industry. We needed to measure the penetration of multiple technologies in different countries.

grzm · 9y ago

If you're rate-limited, you can contact the mods via the Contact link in the footer.
