Triv.io donates URL index to Common Crawl

(commoncrawl.org)

51 points · LisaG · 13y ago · 16 comments

rb2k_ · 13y ago

What I'd love to see: a simple list of domains. No information about content, no full URLs, just the domain name.

aseidl · 13y ago

A number of the sponsoring orgs will grant you access to their zone files if you're willing to do a little paperwork (HelloFax is awesome) and have a valid reason for access. Granted, it's not the complete list of registered domains (those without nameservers on file won't show up), but it's pretty close.

The list of sponsoring orgs for each TLD is at: http://www.iana.org/domains/root/db

soult · 13y ago

Extracting such a list from the generated index only takes a small script and a few hours to download the 200+ GB index file. That is a lot less than the slightly bigger script and the months or years of downloading and processing 80+ TB of arc files that extracting all domains would previously have taken.

Anyways, if you want a copy of domains from the index file just send me a mail to the address in my profile.
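
As a rough illustration of the "small script" idea, here is a minimal Python sketch. It assumes the URLs have already been pulled out of the index into plain text, one URL per line; the actual index format is not described in this thread, so that input format and the function name `extract_domains` are assumptions. It collects host names rather than registered domains, so `www.example.com` and `example.com` count separately.

```python
from urllib.parse import urlsplit


def extract_domains(urls):
    """Return the sorted, de-duplicated host names found in an iterable of URL strings.

    Hypothetical helper: assumes one URL per item, with or without a scheme.
    """
    domains = set()
    for url in urls:
        url = url.strip()
        if not url:
            continue
        # urlsplit only recognises the host when a scheme or a leading "//" is present.
        if "://" not in url:
            url = "//" + url
        try:
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed entries (for example a broken IPv6 literal) are skipped.
            continue
        if host:
            domains.add(host)
    return sorted(domains)
```

Because the function accepts any iterable of strings, an open text file can be passed directly and is read line by line; only the set of distinct host names is held in memory.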

thefreeman · 13y ago

Where is the index file located in the s3 bucket?

adventured · 13y ago

You can get the top million sites from Quantcast as a free download. Alexa used to offer something similar, but I don't see it on their site any longer (they try to sell their data through AWS, so that's probably why).

ghewgill · 13y ago

You can find the Alexa top 1 million domain list at http://www.alexa.com/topsites (it's on the right hand side).

gourneau · 13y ago

Can someone dig that up and link it, please?

srobertson · 13y ago

That would be great. Out of curiosity, mind describing what you'd use such a list for?

rb2k_ · 13y ago

Usually as a starting point for a crawler when checking statistics about the web.

Crawling is much faster if you don't have to "spider" all the links and check if you've already visited them or not.

With a big enough list, you can just iterate over those domains and collect the numbers directly (average number of links on a website, how often JavaScript framework x vs. y gets used, how many sites have an HTML5 doctype yet, ...).
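
To make the statistics idea concrete, here is a small Python sketch that computes two of the numbers mentioned above for a single page: the count of `<a href>` links and whether the page declares the HTML5 doctype. It works on HTML text that has already been fetched; the class name `PageStats` and the function `page_stats` are made up for illustration, and fetching each domain's front page is left out.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects a link count and an HTML5-doctype flag for one page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.html5_doctype = False

    def handle_decl(self, decl):
        # For "<!DOCTYPE html>" the parser passes the text "DOCTYPE html".
        if decl.strip().lower() == "doctype html":
            self.html5_doctype = True

    def handle_starttag(self, tag, attrs):
        # Count only anchors that actually carry an href attribute.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1


def page_stats(html_text):
    """Return (number of <a href> links, True if the doctype is the HTML5 one)."""
    parser = PageStats()
    parser.feed(html_text)
    parser.close()
    return parser.link_count, parser.html5_doctype
```

Running `page_stats` over one page per domain and averaging the first value gives the "average number of links" figure; summing the second gives the HTML5 doctype count.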

soult · 13y ago

Is it just me or is the data file not available for free despite being in the Amazon Public Dataset S3 bucket?

Edit: The problem seems to be fixed now.

srobertson · 13y ago

Let me double check that for you. Were you using a valid aws-id and secret?

soult · 13y ago

No, usually you can download them without sending any aws-id if they are on the Public Datasets S3 bucket, e.g.

    wget https://s3.amazonaws.com/aws-publicdatasets/common-crawl/crawl-001/2008/06/19/0/1213886083018_0.arc.gz
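
The same unauthenticated download can be sketched in Python using only the standard library. This assumes the object is publicly readable over plain HTTPS, as the `wget` line implies; whether it is still served from this location is not something the sketch checks. The helper names `public_s3_url` and `download` are hypothetical.

```python
import shutil
import urllib.request


def public_s3_url(bucket, key):
    """Build the path-style HTTPS URL used in the wget example above."""
    return f"https://s3.amazonaws.com/{bucket}/{key}"


def download(url, dest_path):
    """Stream a public object to disk without sending any AWS credentials."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Copy in chunks so large archive files are never held fully in memory.
        shutil.copyfileobj(response, out)
```

For example, `public_s3_url("aws-publicdatasets", "common-crawl/crawl-001/2008/06/19/0/1213886083018_0.arc.gz")` reproduces the URL from the command above, which can then be passed to `download`.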

brianr · 13y ago

Nice work triv.io!
