10% of the top million sites are dead

(ccampbell.io)

375 points · Soupy · 3y ago · 143 comments

103 comments · 33 top-level

gojomo3y ago· 7 in thread

Many issues with this analysis, some others have already mentioned, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (Author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions – depending on what Majestic means by 'subnets'.)

• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied with some error response) can be a more aggressive drop-connection.

• `curl` given a naked hostname likely attempts a plain HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on plain-HTTP port anymore.

• Author's burst of activity could've triggered other rate-limits/failures - either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill-down into individual failures to get a better idea to what extent this might be happening.
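
A minimal sketch of the retry order these points suggest: try HTTPS before plain HTTP, try the `www.` name as well as the listed one, and send a browser-like User-Agent. The function names, the User-Agent string, and the timeout are illustrative assumptions, not what the article's author ran.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent; some sites drop requests that identify as curl.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def candidate_urls(domain):
    """Return URLs to try: HTTPS first, then HTTP, for the listed name and its www. variant."""
    hosts = [domain] if domain.startswith("www.") else [domain, "www." + domain]
    return [f"{scheme}://{host}/" for scheme in ("https", "http") for host in hosts]


def fetch_status(url, timeout=10):
    """Return the HTTP status code, or None if no HTTP response arrived."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # a 4xx/5xx still proves a server answered
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return None  # DNS failure, refused or dropped connection, timeout


def probe(domain, fetch=fetch_status):
    """Return (url, status) for the first variant that answers, else (None, None)."""
    for url in candidate_urls(domain):
        status = fetch(url)
        if status is not None:
            return url, status
    return None, None
```

Calling `probe("wixsite.com")` would only report the domain as unresponsive after all four variants fail, which is a stricter test than a single plain-HTTP request to the bare name.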

If you want to probe if domains are still active:

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
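
The last bullet can be sketched with nothing but the standard library. This is a rough illustration under my own assumptions (the port list, the timeout, and the function names are made up for the example); the `whois` and MX checks need tools beyond the standard library, so they are not shown.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, unreachable, timed out, or the name did not resolve


def open_services(host, ports=(25, 80, 443)):
    """Return the ports (SMTP, HTTP, HTTPS by default) that accept connections."""
    return [port for port in ports if port_open(host, port)]
```

An empty list from `open_services` is weak evidence on its own, since a firewall may simply be dropping packets from the probing address.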

If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.

thematrixturtle3y ago

> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the search can't tell apart an actual website from a blank domain parking page.

gojomo3y ago

> Majestic promotes their list as the "top 1 million websites of the world"

Well, the source URL provided by the article author initially claims, “The million domains we find with the most referring subnets”. Then it makes a contradictory comment mentioning ‘websites’. At best we can say Majestic is vague and/or confused about what they’re providing – but given the author’s results, I suspect this list contains domains with no guarantee Majestic ever saw a live HTTP service on these domains.

> Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

How about I cite HN user ~gojomo, who for nearly a decade wrote & managed web crawling software for the Internet Archive. He says: “Sites that don’t want to be crawled use every tactic you can imagine to repel unwanted crawlers, including unceremoniously instant-dropping open connections from disfavored IPs and User-Agents. Sadly, given Google’s dominance, many give a free pass to only Google IPs & User-Agents, and maybe a few other search-engines.”

zinekeller3y ago

> Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

Most major search engines have dedicated blocks of addresses and use unique user-agents. If you literally just send wget or curl requests, you will be identified as a "bad" crawler almost immediately.

Semaphor3y ago

> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

We use stage.www.domain.tld for the staging/testing site, but that's about it ;)

justsomehnguy3y ago

> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

https://psnow.ext.hpe.com/doc/c04543743.pdf?jumpid=in_lit-ps...

https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04543743.pdf

https://h22204.www2.hpe.com/NEP

https://h30125.www3.hpe.com/hpcsn/?hpp

https://h41370.www4.hpe.com/quickspecs/overview.html

spc4763y ago

It dawned on me when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list, I had some issues with their bot [2] and they blocked themselves from crawling my site.

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12

useruserabc3y ago

As near as I can tell, these are the top 1,000,000 domains referred to by other websites they crawled.

The report is described as "The million domains we find with the most referring subnets"[1] and a referring subnet is a host with a webpage which points at the domain.

So to the grandparent, presumably if something is "linking" to these domains, they probably were meant to be websites.

[1] https://majestic.com/reports/majestic-million

[2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet
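
To make the metric concrete, here is a sketch of ranking by referring subnets. It assumes a "subnet" means an IPv4 /24 block, which is my reading of the glossary's C-subnet entry and not something I have confirmed; the function names and the input shape are made up, and this is not Majestic's actual pipeline.

```python
import ipaddress
from collections import defaultdict


def referring_subnets(links):
    """Map each target domain to the set of /24 networks that link to it.

    `links` is an iterable of (referring_ipv4_address, target_domain) pairs.
    """
    subnets = defaultdict(set)
    for referring_ip, target_domain in links:
        # strict=False masks off the host bits, e.g. 203.0.113.77 -> 203.0.113.0/24
        network = ipaddress.ip_network(f"{referring_ip}/24", strict=False)
        subnets[target_domain].add(network)
    return subnets


def rank_by_subnets(links):
    """Return domains sorted by distinct referring /24 count, highest first."""
    counts = {domain: len(nets) for domain, nets in referring_subnets(links).items()}
    return sorted(counts, key=lambda domain: (-counts[domain], domain))
```

Nothing in this ranking looks at whether the target still answers, so a domain can stay high on such a list for as long as old links to it remain online.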

bioemerl3y ago· 7 in thread

I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.

mike_hock3y ago

Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain much more than what can be called a "site," especially in 2022 when the internet has been all but destroyed and all that's left is a corporate oligopoly.

MonkeyMalarky3y ago

How many are placeholder pages thrown up by registrars like Network Solutions?

denton-scratch3y ago

If they're placeholder pages, they're not dead. Those 10% are not responding at all; the requests aren't reaching any HTTP server.

ehsankia3y ago

How is "top" defined here? If they were dead, wouldn't they fairly quickly stop being "top"?

EDIT: the article uses a list sorted by inlinks, and I guess other websites don't necessarily update broken links, but that may be less true in the modern age where we have tools and automated services to automatically warn us about dead links on our websites.

nine_k3y ago

I can expect large SEO spam clusters of "sites" with many links inside a cluster to make them look legit. For some time such bits of SEO spam were on top of certain google searches and enjoyed significant traffic, putting them firmly into "top 1M".

Once a particular SEO trick is understood and "deoptimized" by Google, these "sites" no longer make money, and get abandoned.

Swizec3y ago

Blows my mind that my blog is 210863rd on that list. That makes the web feel somehow smaller than I thought it was.

wincent3y ago

Eyeing you jealously from my position at 237,014 on the list... We're almost neighbors, I guess.

smugma3y ago· 6 in thread

I downloaded the file and looked at the second entry marked 000 in his file, which refers to wixsite.com.

It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.

It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked such as wixstatic.com don't appear to be valid), and his methodology is bad, or at least his describing the sites as dead is misleading.

code1234567893y ago

wixsite.com is a domain for free sites built on Wix, so if your username on Wix is smugma, and your site name is mysite, then you'll have a URL like smugma.wixsite.com/mysite for your Home page.

That's why this domain is in the top

smugma3y ago

Correct, that's why it's in the top. Your example further confirms why the author's methodology is broken.

zinekeller3y ago

> other sites I checked such as wixstatic.com don't appear to be valid

But docs.wixstatic.com is valid.

winddude3y ago

100% agree his methodology is broken. Another example like this is googleapis.com. If I remember correctly there are quite a number of domains like this in the Majestic Million.

Not to mention a number of his requests may have been blocked.

quickthrower23y ago

He takes this into account by generously considering any returned response code as “not dead”.

> there’s a longtail of sites that had a variety of non-200 response codes but just to be conservative we’ll assume that they are all valid

mort963y ago

That doesn't take this into account, no. `curl wixsite.com` returns a "Could not resolve host" error; it doesn't return a response code, so the author would consider it invalid, even though `curl www.wixsite.com` does return a response (a 301 redirect to www.wix.com).
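
A small sketch of that distinction, separating "the name does not resolve" from "a server answered". The function names and the injectable `resolver` parameter are my own, added so the logic can be exercised without touching the network.

```python
import socket


def resolves(host, resolver=socket.getaddrinfo):
    """Return True if the host name resolves to at least one address."""
    try:
        return bool(resolver(host, 80))
    except socket.gaierror:
        return False  # the case curl reports as "Could not resolve host"


def dns_verdict(domain, resolver=socket.getaddrinfo):
    """Classify a listed domain by whether the apex and www. names resolve."""
    apex = resolves(domain, resolver)
    www = resolves("www." + domain, resolver)
    if apex:
        return "apex resolves"
    if www:
        return "only www resolves"
    return "neither resolves"
```

A domain that comes back as "only www resolves" is exactly the wixsite.com case: no response code for the bare name, but not dead either.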

gravitate3y ago· 6 in thread

> Domain normalization is a bitch

I’m a no-www advocate. All my sites can be accessed from the Apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache’s .HTACCESS to rewrite the www to the Apex.

Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...

noizejoy3y ago

> I’m a no-www advocate.

I used to feel the same way. — Until the arrival of so many new TLDs.

Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band

gravitate3y ago

I hear you. But a redirect is a good solution in that case.

macintux3y ago

25 years ago I added a rule to my employer’s firewall to allow the bare domain to work on our web server.

Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.

A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.

Been much too long to have any clue now what we did wrong.

JackMcMack3y ago

You probably created a cname from the apex to www? This problem still exists today.

From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME record is present at a node, no other data should be present; this ensures that the data for a canonical name and its aliases cannot be different."

So if you're looking up the MX record for domain, but happen to find a cname for domain to www.domain , it will follow that and won't find any MX records for www.domain.

The correct approach is to create a cname record from www.domain to domain, and have the A record (and MX and other records) on the apex.

Most DNS providers have a proprietary workaround to create dns-redirects on the apex (such as AWS Route53 Alias records) and serve them as A records, but those rarely play nice with external resources.
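
The rule quoted above is easy to check mechanically. Here is a minimal sketch that flags any name carrying a CNAME next to other record types; the record tuples and the function name are hypothetical, and it ignores the DNSSEC record types that are allowed to sit beside a CNAME.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return names that have a CNAME alongside any other record type.

    `records` is an iterable of (name, record_type) pairs, e.g. ("example.com", "MX").
    DNSSEC types (RRSIG, NSEC), which may coexist with a CNAME, are not special-cased.
    """
    types_by_name = defaultdict(set)
    for name, record_type in records:
        types_by_name[name.lower().rstrip(".")].add(record_type.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# The broken setup: a CNAME on the apex, which also needs SOA, NS and MX records.
broken = [("example.com", "SOA"), ("example.com", "NS"), ("example.com", "MX"),
          ("example.com", "CNAME"), ("www.example.com", "A")]
# The suggested setup: address and mail records on the apex, CNAME on www.
fixed = [("example.com", "SOA"), ("example.com", "NS"), ("example.com", "MX"),
         ("example.com", "A"), ("www.example.com", "CNAME")]
```

Because an apex always carries SOA and NS records, a CNAME there can never pass this check, which is why the CNAME belongs on `www`.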

agraddy3y ago

I'm a www advocate and reroute my domains from apex domain to www. When you use an apex domain, you have to use an A record which means if you have a server outage it is going to take time to update the record to point at a new IP address. If you use www with a CNAME, the final server IP can be quickly switched assuming you've set the CNAME and network up for that functionality - you can't do that with an apex domain.

account423y ago

That doesn't make any sense at all - the CNAME just points to somewhere else with an A (and AAAA in $current_year) record. It adds another point you can switch around but doesn't let you switch it any quicker. How quickly you can effectively change what the domain points to is determined by the TTL of the record (within limits) which can be lowered for any record.

the_biot3y ago· 5 in thread

By what possible criteria are these the "top" million sites, if 10% are dead? I'd start with questioning that data.

kjeetgill3y ago

Dude, it's the second sentence of the first paragraph:

> For my purposes, the Majestic Million dataset felt like the perfect fit as it is ranked by the number of links that point to that domain (as well as taking into account diversity of the origin domains as well).

MatthiasPortzel3y ago

And moreover, the author’s conclusion is that the dataset is bad.

> While I had expected some cleanliness issues, I wasn’t expecting to see this level of quality problems from a dataset that I’ve seen referenced pretty extensively across the web

the_biot3y ago

Yeah, but they're still providing a dataset that's just plain bad. It's hardly relevant how many sites link to some other site, if it's dead.

winddude3y ago

part of the problem is it's not the number of links, it's referring subnets. Fairly certain this includes script tags.

deltree73y ago

Exactly!

Garbage In == Garbage Out

softwaredoug3y ago· 5 in thread

My current beliefs about how people use and trust information on the Web.

First, trust is _everything_ on the Web, it is the thing people first think of when arriving on some information. But how people evaluate trust has changed dramatically over the last 10 years.

- Trust now comes almost exclusively from social proof. Searching reddit, youtube, etc and other extremely _moderated_ sources of information, where the most work is done to ensure content comes from actual human beings. How many of us now google `<topic> reddit` instead of just `<topic>`?

- Of course a lot of this trust is misplaced. There's a very thin line between influencers and cult leaders / snake oil salesmen. Our last President used this hack really effectively.

- Few trust Google's definition of trust anymore -- essentially page rank. This made more sense when the Web essentially was social, where inbound links were very organic. Now with the trust in general Web sites evaporated, the main 'inbound links' anyone cares about come from individuals or community they trust or identify with. They don't trust Google's algorithm (it's too opaque, and too easily gamed).

This of course means the fracturing of truth away from elites. Sometimes this could be a good thing, but in many cases cough Covid cough it might be pretty disastrous for misinformation

mountainriver3y ago

> How many of us now google `<topic> reddit` instead of just `<topic>`

I sure hope not, Reddit is a horrible place for information

romanhn3y ago

When I have a specific technical question, I append "stackoverflow" to my search queries. When I want to read a discussion, I add "reddit" (or "hacker news").

failTide3y ago

I use the strategy for a few things - including when I want to get reviews of a product or service. There's still potential for manipulation there, but you can judge the replies based on the user history - and you know that businesses aren't able to delete or hide bad reviews there.

But in general I agree with you - reddit is full of misinformation, propaganda and astroturfing

wolverine8763y ago

> How many of us now google `<topic> reddit` instead of just `<topic>`?

One of us lives in a bubble. I don't trust Reddit for anything, or YouTube or any social media. IME, it's mis/disinformation - not only a lack of information, but a negative; it leaves me believing something false. My experience is, and plenty of research shows, that we have no way to sort truth from fiction without prior expertise in the domain. The misinformation and disinformation on social media, and its persuasiveness, is very well known. The results are evident before us, in the madness and disasters, in dead people, in threats to freedom, prosperity, and stability.

Why would people in this community, who are aware of these issues, trust social media? How is that working out?

> This of course means the fracturing of truth away from elites. Sometimes this could be a good thing

I think that's mis/disinformation. 'Elite' is a loaded, negative (in this context) word. It makes the question about power and the conclusion inevitable.

Making it about power distracts from the core issue of knowledge, which is truth. I want to hear from the one person, or one of the few people, with real knowledge about a topic; I don't want to hear from others.

In matters of science the authority of thousands is not worth the humble reasoning of one single person.

Brian_K_White3y ago

They already acknowledge the problem of trusting the crowd, but you seem to not acknowledge the problem of trusting a central dispensary. In fact it's unwise to trust either one. Everything has to be evaluated case by case. The same source should be trusted for one thing today, and not for some other thing tomorrow.

phkahler3y ago· 4 in thread

Read that again folks:

"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”. "

This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.

jayd163y ago

The flip side is anyone can run these kinds of tools against your site easily and cheaply.

stevemk14ebr3y ago

your point has a truth behind it for sure, but there's a large difference between serving requests and making requests. Many sites are simple html and css pages, but many others also have complex backends. It's those that often are hard to scale and why the cloud is hugely popular, maintaining and scaling the backend is hard

phkahler3y ago

Oh absolutely, but he also said this:

> I found that my local system could easily handle 512 parallel processes, with my CPU @ ~35% utilization, 2GB of RAM usage, and a constant 1.5MB down on the network.

Another thing that happened in the early web days was Apache. People needed a web server and it did the job correctly. Nobody ever really noticed that it had terrible performance, so early on infrastructure went to multiple servers and load balancers and all that jazz. Now with nginx, fast multi-core, and speedy networks even at home, it's possible to run sites with a hundred thousand users a day at home on a laptop. Not that you'd really want to do exactly that but it could be done.

Because of this I think an alternative to github would be open source projects hosted on peoples home machines. CI/CD might require distributing work to those with the right hardware variants though.

cratermoon3y ago

> you don't need to rent or provision "big infrastructure" unless you're actually quite big.

Or if you have hard response-time requirements. I really don't think it would be good to, for example, wait an hour to process the data from 800K earthquake sensors and send out an alert to nearby affected areas.

superb-owl3y ago· 4 in thread

One of the few things I like about blockchain is the promise of a less ephemeral web.

bergenty3y ago

Is that actually true? Don’t most nodes hold heavily compressed pointers only, while only a percentage of nodes host the entire blockchain? I mean if what you’re saying is true then each node needs to have a copy of the entire internet which isn’t reasonable.

superb-owl3y ago

I'm thinking about things like Filecoin, a blockchain which is meant to power IPFS. To be fair though, IPFS itself is not a blockchain

matkoniecz3y ago

One of many things I dislike about cryptoscams is making promises which are lies.

deltree73y ago

spoken like someone who is clueless about Blockchain

altdataseller3y ago· 3 in thread

All these top million lists are very good at telling you the top most 10K-50K sites on the web. After that, you're going into 'crapshoot' land, where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.

giantrobot3y ago

> where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

That's literally the phenomenon the article is describing.

altdataseller3y ago

Ok let me reword it differently: the 500,000th most popular site on these lists most likely isn't the 500,000th most visited and it might not even be in the top 5 million. These data sources are so bad at capturing popularity after 50k sites or so simply because they don't have enough data

TuringNYC3y ago

How are people determining the "top" sites? We do some of this at work and we pay SimilarWeb a giant sum of money, are people able to find site traffic in inexpensive ways which allow for these analyses?

mouzogu3y ago· 3 in thread

whenever i go through my bookmarks, i tend to find maybe 5-10% are now 404.

this is why i like the archive.ph project so much and using it more as a kind of bookmarking service.

syedkarim3y ago

What’s the benefit to using archive.ph instead of archive.org (Internet Archive)? Seems like the latter is much more likely to be around for awhile.

tropicalfruit3y ago

i find archive.ph does a better job of preserving the page as is (it also takes a screenshot) compared to internet archive which can be flaky at best.

i also find archive.ph much faster at searching, and the browser extension is really useful too.

the faq does a great job of explaining too https://archive.ph/faq

system23y ago

archive.ph = Russian federation website. Blocked by most firewalls by default.

tete3y ago· 2 in thread

The biggest problem I find is that it seems to be pretty "outdated" to keep redirects in place, if you move stuff. So many links to news websites, etc. will cause a redirect to either / or a 404 (which is a very odd thing to redirect to in my opinion).

If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.

I also wonder what will happen to text on websites that rely on AJAX and JavaScript when they break because a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.

Another worry is the ever-increasing size of these scripts making archiving more expensive.

Kye3y ago

You can often pop the URL into the Wayback Machine to bring up the last live copy. It's better at handling dynamic stuff the more recent it is. Older stuff, especially early AJAX pages, is just gone because the crawler couldn't handle it at the time. It's far from a perfect solution, especially in light of the big publishers finally getting their excuse to go after the Internet Archive legally. It's a good silo, but just as vulnerable as any other.

nikisweeting3y ago

ArchiveWeb.page + ReplayWeb.page are the best I've found at handling ajax loaded content.

MonkeyMalarky3y ago· 2 in thread

Last time I tried to crawl that many domains, I ran into problems with my ISP's DNS server. I ended up using a pool of public DNS servers to spread out all the requests. I'm surprised that wasn't an issue for the author?

wumpus3y ago

You have to run your own resolver. Crawling 101.

MonkeyMalarky3y ago

This is of course the correct answer. It just felt like shaving a big yak at the time.

yajjackson3y ago· 2 in thread

Tangential, but I love the format for your site. Any plans to do a "How I built this blog" post?

kerbersos3y ago

Likely using Hugo with the congo theme

SoupyOP3y ago

Yup, nailed it. Hugo with Congo theme (and a few minor layout tweaks). Hosted on cloudflare pages for free

kozziollek3y ago· 2 in thread

Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And cities have their websites, so I guess it's OK.

But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!

drdaeman3y ago

Those are called Public Suffixes or effective TLDs (eTLDs): https://en.wikipedia.org/wiki/Public_Suffix_List

And you're entirely correct that author should've referred to such list.
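
For illustration, a sketch of how a suffix list changes what counts as "the domain". The suffix set below is a tiny hand-picked sample and the function name is made up; a real run should load the full Public Suffix List, which also has wildcard and exception rules that this ignores.

```python
# Tiny hand-picked sample; a real run should load the full Public Suffix List.
SAMPLE_SUFFIXES = {"com", "pl", "nz", "info.pl", "biz.pl", "govt.nz", "school.nz"}


def split_registrable(hostname, suffixes=SAMPLE_SUFFIXES):
    """Return (registrable_domain, public_suffix) using the longest matching suffix.

    Returns (None, suffix) when the hostname is itself a public suffix, and
    (None, None) when no suffix in the set matches.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Trying the longest candidate first means "govt.nz" wins over plain "nz".
    for start in range(len(labels)):
        suffix = ".".join(labels[start:])
        if suffix in suffixes:
            if start == 0:
                return None, suffix  # e.g. "info.pl" is a suffix, not a site
            return ".".join(labels[start - 1:]), suffix
    return None, None
```

With this, `info.pl` and `biz.pl` come back with no registrable domain at all, so a checker could skip them instead of counting them as dead sites.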

slyall3y ago

I think the problem is that the original source needs to use that list as well. Just looked through .nz and they list several sites ( govt.nz , school.nz, gen.nz ) that don't exist since all the domains are one level below.

They even list govt.nz as the top site. In fact that doesn't exist (although www.govt.nz does, since it is a kind of government portal).

I see they list an old employer of mine who got bought 15 years ago and whose website has been redirecting for 10 years.

ghostly_s3y ago· 2 in thread

Wow, I would not have suspected `tee` is able to handle multiple processes writing to the same file. Doesn't seem to be mentioned on the man-page, either.

remram3y ago

All tee does is write its standard input (a single file descriptor) to a file (a single one) and its own output. xargs is the thing running multiple processes (and they inherit the same standard output, your shell's).

What you're seeing is Linux being able to handle multiple processes writing to the same file.
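
In this pipeline the shared "file" is the pipe feeding tee. POSIX guarantees that a write of at most PIPE_BUF bytes to a pipe is not interleaved with other writers, so short lines survive intact as long as each one goes out in a single write. Here is a sketch that demonstrates this; the child program and function name are made up, and whether curl emits each result line in one write is an assumption I have not checked.

```python
import os
import subprocess
import sys

# Child program: one os.write() call per line, straight to its standard output (fd 1).
CHILD = (
    "import os, sys\n"
    "for i in range(int(sys.argv[2])):\n"
    "    os.write(1, f'{sys.argv[1]}:{i}\\n'.encode())\n"
)


def run_writers(workers=8, lines_each=200):
    """Start `workers` processes sharing one pipe and return every line read from it."""
    read_fd, write_fd = os.pipe()
    children = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD, f"w{n}", str(lines_each)],
            stdout=write_fd,  # every child inherits the same pipe, as under xargs
        )
        for n in range(workers)
    ]
    os.close(write_fd)  # so the reader sees EOF once every child has exited
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()  # plays the role of tee: the single reader of the pipe
    for child in children:
        child.wait()
    return data.decode().splitlines()
```

The lines arrive in an unpredictable order, but none of them should be torn; a program that builds one line out of several small writes gets no such guarantee.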

ghostly_s3y ago

Well, then that's a Linux feature I was unaware of. I found this SO[1] question with two conflicting answers that have almost the same number of votes, and even the "yes you can do this" answer seems to have enough caveats that it doesn't sound like a great idea.

1. https://stackoverflow.com/questions/7842511/safe-to-have-mul...

pahool3y ago· 2 in thread

zombo.com still kicking!

system23y ago

The png rotates with this:

.rotate {animation: rotation .5s infinite linear;}

I think it wasn't like this before. They must've updated it at one point.

blowski3y ago

Yes, when Flash went end of life, they were forced to adopt a new tech strategy.

gumby3y ago· 2 in thread

His 'www' logic is flawed: https://www.example.com and https://example.com need not return the same results, but his checking code sends the output straight to /dev/null so he has no way of knowing.
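
A sketch of what keeping the output could look like: record a small fingerprint per response instead of discarding it, then compare the two hosts. The names are made up, and comparing exact bytes is a crude assumption, since dynamic pages can differ between two requests to the same site.

```python
import hashlib


def fingerprint(status, final_url, body):
    """Summarise one response as (status, final URL, SHA-256 hex digest of the body)."""
    return status, final_url, hashlib.sha256(body).hexdigest()


def same_site(apex, www):
    """Treat two fingerprints as one site if they end at the same URL or serve identical bytes."""
    return apex[1] == www[1] or apex[2] == www[2]
```

Two hosts that neither redirect to a common URL nor serve the same bytes would need a closer look before being merged into one entry.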

cbarrick3y ago

In theory, sure.

In practice, how many orgs serve on both example.com and www.example.com yet operate each as entirely separate sites?

I cannot think of any example.

gumby3y ago

MIT was, for decades, though they seem to have changed.

zinekeller3y ago· 2 in thread

TLDR: Campbell's methodology is flawed, does not consider edge cases (one of which (equating apex-only and www-prefixed domains) I consider reckless), and didn't understand how Majestic collects and processes its data.

Longer version: This isn't comprehensive, but I can think of two main reasons why:

- The Majestic Million lists only the registrable part (with some exceptions), and this sometimes leads to central CDNs being listed. For example, the Majestic Million lists wixsite.com (which, for those who are unaware, is a CDN domain used by Wix.com with separate subdomains), but if you visit wixsite.com you won't get anything. The same goes for Azure: subdomains of azureedge.net and azurewebsites.net do exist (for example https://peering.azurewebsites.net/) but azureedge.net and azurewebsites.net themselves don't exist. Without similar filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...) would quickly lead you to this precise problem (mainly because the number one is "com", but phew at least http://ai./ does exist!)

- Also, shame on the author considering www-prefixed and apex-only as one and the same. For some websites, it isn't. Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) wouldn't respond at all but www.beian.gov.cn will. And for ncbi.nlm.nih.gov (National Center for Biotechnology Information) ? I can't blame Majestic: https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't redirect to a canonical domain, and unless you've compared the HTTP pages there's no way you would know that they are the same website!

Edit: I've downloaded the CSV to check my claims, and it shows:

  wixsite.com 0
  beian.gov.cn 0

Please, for the love of sanity, consider what the inclusion criteria of the Majestic Million (and similar lists) actually are. I can't believe I have to say it, but can we crowd-source "Falsehoods programmers believe about domains"?

Also addendum to crawling but I consider "probably forgivable":

- Some websites are only available in certain countries (internal Russian websites don't respond at all outside Russia for example). This can skew the numbers a little bit.

zepearl3y ago

> Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) wouldn't respond at all but www.beian.gov.cn will.

I can confirm stuff like that - I'm writing a crawler&indexer-program (prototype in Python, now writing the final version in Rust) and assuming anything while crawling is NOK. I ended up adding URLs to my "to-index"-list by considering only links explicitly mentioned by other websites (or by pages within the same site).

cratermoon3y ago

It even says right at the top of the Majestic Million site "The million domains we find with the most referring subnets", not implying anything about reachability for http(s) requests.

baby3y ago· 1 in thread

Free.fr, one of the biggest ISPs in France a while back, and perhaps still today, still runs all the old-school websites it was hosting for people (for free) today. It's quite insane, but a lot of the French web 1.0 is still alive today thanks to them. Truly an ISP run by passionate technical people.

ssl2323y ago

Good on them. Last year I randomly discovered an ancient email to my old Hotmail address from free website host Tripod, owned at the time by Lycos, that old search engine. As an 11 year old I had a website with them and wanted to dig it out to see what I had put there. I managed to convince them I was the owner and got my access back, only to discover nothing there. I guess at some point in the ~20 years since I made one they nuked their dormant sites.

macintux3y ago· 1 in thread

Title is misleading: that’s the outcome, but the bulk of the story is the data processing to reach that conclusion.

hinkley3y ago

It happens. Most of the stuff we do these days invokes a number of disciplines. I forget sometimes that maybe ten percent of us just play with random CS domains for “fun” and that most people are coming into big problems blind, even sometimes the explorers (though having comfort with exploring random fields is a skill set unto itself).

Before the Cloud, when people would ask for a book on distributed computing, which wasn’t that often, I would tell them seriously “Practical Parallel Rendering”. That book was almost ten years old by then. 20 now. It’s ostensibly a book about CGI, but CGI is about distributed work pools, so half the book is a whirlwind tour of distributed computing and queuing theory. Once they start talking at length about raytracing, you can stop reading if CGI isn’t your thing, but that’s more than halfway through the book.

I still have to explain some of that stuff to people, and it catches them off guard because they think surely this little task is not so sophisticated as that…

I think this is where the art comes in. You can make something fiddly that takes constant supervision, so much so that you get frustrated trying to explain it to others, or you can make something where you push a button and magic comes out.

allknowingfrog3y ago· 1 in thread

I don't have any particular opinions on the author's conclusions, but I learned a thing or two about the power of terminal commands by reading through the article. I had no idea that xargs had a parallel mode.
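For anyone who would rather stay in one language, the same fan-out idea as `xargs -P` can be sketched with a thread pool. This is only an illustration, not what the article does: `check_all`, the `check` callback and the `workers` default are made-up names and values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def check_all(
    domains: Iterable[str],
    check: Callable[[str], bool],
    workers: int = 64,  # arbitrary default, tune for your machine and network
) -> dict[str, bool]:
    """Run `check` over every domain using a pool of worker threads."""
    domain_list = list(domains)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so zip lines them up again.
        results = pool.map(check, domain_list)
        return dict(zip(domain_list, results))
```

Threads are a reasonable fit here because the work is almost entirely waiting on the network; `check` can be any function that takes a domain and returns a boolean.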

thelamest3y ago

Probably not news to anyone who works with big data™, but I learned, after additional searches, that using (something like) duckdb as a CSV parser makes sense, especially if the alternative is loading the entire thing to memory with (something like) base R. This was informative for me: https://hbs-rcs.github.io/large_data_in_R/.

noiv3y ago· 1 in thread

How does a dead site make it into the top million?

dredmorbius3y ago

Typically, during its pre-death phase.

ocdtrekkie3y ago

I've been working on trying to migrate sites I ran in 2008 or so into my new preferred hosting strategy lately: I know zero people look at them, since many were functionally broken at present, but I don't like the idea of actually removing them from the web. So I'm patching them up, migrating them to a more maintainable setting, and keeping them going. Maybe someday some historian will get something out of it.

terrycody3y ago

Nice work.

Just one thing: analyzing sites by total referring domains is not accurate, as your result showed. A backlink can easily be faked, and you can literally spam 1 million links within 1 day for any domain. Thus, this data source is not very useful.

For a more accurate result, try Ahrefs' top 1 million domains, ranked by traffic. Ahrefs ranks sites by their ranking keywords and infers the traffic numbers from those, meaning these websites are live and ranking for some keywords.

You will see the result is much more accurate then, maybe not even a single website will be offline, because they are earning good cash.

flas9sd3y ago

having the luxury of scrutinizing the method and retesting: "normalizing" domains by dropping the www skewed the results - not all websites redirect between apex and www (or between schemes). Some servers weren't answering the request with curl's default Accept header (*/*) and needed encouragement.

I retested the 000 class of the .de ccTLD (1227) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives - whether this is representative I cannot tell, just a hint to retest.

banana_giraffe3y ago

The takeaway from this is slightly off. There aren't 107776 sites that are dead, there are 107776 sites that don't run an HTTP server, or are otherwise dead.

If you try to connect via HTTP or HTTPS, then a quick run yields 91106 sites that are dead, or 9.11%

(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me)
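Roughly what "try HTTP or HTTPS" (plus the www variant others mentioned) could look like, as a sketch rather than the script actually used for these numbers. The function names, the User-Agent string and the 5 second timeout are all assumptions made for illustration.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def candidate_urls(domain: str) -> list[str]:
    """List the apex and www forms of a domain over both https and http."""
    hosts = [domain] if domain.startswith("www.") else [domain, "www." + domain]
    return [f"{scheme}://{host}/" for host in hosts for scheme in ("https", "http")]


def responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response at all comes back from the URL."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "liveness-check-sketch/0.1"}  # made-up UA
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # a 4xx/5xx status still means a server answered
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False  # DNS failure, refused connection, timeout, TLS error, ...


def is_alive(domain: str, probe: Callable[[str], bool] = responds) -> bool:
    """A domain counts as alive if any of its candidate URLs responds."""
    return any(probe(url) for url in candidate_urls(domain))
```

With this shape, a case like jma.go.jp (silent on the apex, fine on https://www.) is counted as alive. The `probe` parameter is there so the logic can be exercised without touching the network.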

zX41ZdbW3y ago

This looks surprisingly similar to the unfinished research that I did: https://github.com/ClickHouse/ClickHouse/issues/18842

nr2x3y ago

Majestic is a shit list. Mystery solved.

indigodaddy3y ago

Are there more cycles/cpu/work involved to `cat verylargefile | awk` vs `awk verylargefile` ?

winddude3y ago

No they're not.

kderbyma3y ago

wouldn't this imply that either the ranking system is broken.....or there are less than 1 million active sites.....

zzzeek3y ago

irony that the site is not responding?

spaceman_20203y ago

Not surprising. We're far away from the glory days of the vibrant, chaotic web.

In countries like India that onboarded most users through smartphones instead of computers, websites are not even necessary. There's a huge dearth of local-focused web content as well since there just isn't enough demand.

1 more reply


gojomo3y ago· 7 in thread

Many issues with this analysis, some others have already mentioned, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

If you want to probe if domains are still active:

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services

If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.

thematrixturtle3y ago

> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the search can't tell apart an actual website from a blank domain parking page.

gojomo3y ago

> Majestic promotes their list as the "top 1 million websites of the world"

> Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

zinekeller3y ago

> Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.

Most major search engines have dedicated blocks of addresses and use unique user agents. If you just send plain wget or curl requests, you will be identified as a "bad" crawler almost immediately.

Semaphor3y ago

> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

We use stage.www.domain.tld for the staging/testing site, but that's about it ;)

1 more reply

justsomehnguy3y ago

> Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

https://psnow.ext.hpe.com/doc/c04543743.pdf?jumpid=in_lit-ps...

https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04543743.pdf

https://h22204.www2.hpe.com/NEP

https://h30125.www3.hpe.com/hpcsn/?hpp

https://h41370.www4.hpe.com/quickspecs/overview.html

spc4763y ago

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12

useruserabc3y ago

As near as I can tell, these are the top 1,000,000 domains referred to by other websites they crawled.

The report is described as "The million domains we find with the most referring subnets"[1] and a referring subnet is a host with a webpage which points at the domain.

So to the grandparent, presumably if something is "linking" to these domains, they probably were meant to be websites.

[1] https://majestic.com/reports/majestic-million [2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet

bioemerl3y ago· 7 in thread

I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.

mike_hock3y ago

MonkeyMalarky3y ago

How many are placeholder pages thrown up by registrars like Network Solutions?

denton-scratch3y ago

If they're placeholder pages, they're not dead. Those 10% are not responding at all; the requests aren't reaching any HTTP server.

2 more replies

ehsankia3y ago

How is "top" defined here? If they were dead, wouldn't they fairly quickly stop being "top"?

nine_k3y ago

Once a particular SEO trick is understood and "deoptimized" by Google, these "sites" no longer make money, and get abandoned.

Swizec3y ago

Blows my mind that my blog is 210863rd on that list. That makes the web feel somehow smaller than I thought it was.

wincent3y ago

Eyeing you jealously from my position at 237,014 on the list... We're almost neighbors, I guess.

smugma3y ago· 6 in thread

I downloaded the file and looked at the second 000 in his file, which refers to wixsite.com.

It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.

code1234567893y ago

wixsite.com is a domain for free sites built on Wix, so if your username on Wix is smugma, and your site name is mysite, then you'll have a URL like smugma.wixsite.com/mysite for your Home page.

That's why this domain is in the top

smugma3y ago

Correct, that's why it's in the top. Your example further confirms why the author's methodology is broken.

zinekeller3y ago

> other sites I checked such as wixstatic.com don't appear to be valid

But docs.wixstatic.com is valid.

winddude3y ago

100% agree his methodology is broken. Another example like this is googleapis.com. If I remember correctly there are quite a number of domains like this in the Majestic Million.

Not to mention a number of his requests may have been blocked.

quickthrower23y ago

He takes this into account by generously considering any returned response code as “not dead”.

> there’s a longtail of sites that had a variety of non-200 reponse codes but just to be conservative we’ll assume that they are all valid

mort963y ago

1 more reply

gravitate3y ago· 6 in thread

> Domain normalization is a bitch

Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...

noizejoy3y ago

> I’m a no-www advocate.

I used to feel the same way. — Until the arrival of so many new TLDs.

Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band

gravitate3y ago

I hear you. But a redirect is a good solution in that case.

1 more reply

macintux3y ago

25 years ago I added a rule to my employer’s firewall to allow the bare domain to work on our web server.

Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.

A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.

Been much too long to have any clue now what we did wrong.

JackMcMack3y ago

You probably created a cname from the apex to www? This problem still exists today.

So if you're looking up the MX record for domain, but happen to find a CNAME from domain to www.domain, the resolver will follow that and won't find any MX records for www.domain.

The correct approach is to create a cname record from www.domain to domain, and have the A record (and MX and other records) on the apex.
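A toy model of why that breaks mail, in case it helps. This is not a real DNS implementation: `resolve`, the two example zones and the 192.0.2.10 documentation address are invented for illustration, and real servers will generally refuse to load a zone that puts a CNAME next to other records at the same name.

```python
Zone = dict[tuple[str, str], list[str]]


def resolve(zone: Zone, name: str, rtype: str, max_hops: int = 8) -> list[str]:
    """Look up (name, rtype), following any CNAME at that name first."""
    for _ in range(max_hops):
        cname = zone.get((name, "CNAME"))
        if cname and rtype != "CNAME":
            name = cname[0]  # the alias wins over every other record type
            continue
        return zone.get((name, rtype), [])
    return []  # give up on CNAME loops


# Apex aliased to www: the MX at the apex is never consulted.
BROKEN: Zone = {
    ("example.com", "CNAME"): ["www.example.com"],
    ("example.com", "MX"): ["10 mail.example.com"],
    ("www.example.com", "A"): ["192.0.2.10"],
}

# www aliased to the apex: A and MX both live on the apex.
FIXED: Zone = {
    ("example.com", "A"): ["192.0.2.10"],
    ("example.com", "MX"): ["10 mail.example.com"],
    ("www.example.com", "CNAME"): ["example.com"],
}
```

In the `BROKEN` zone, `resolve(BROKEN, "example.com", "MX")` comes back empty because the lookup is redirected to www.example.com, which has no MX; in `FIXED`, both the apex MX and the www address resolve.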

1 more reply

agraddy3y ago

account423y ago

1 more reply

the_biot3y ago· 5 in thread

By what possible criteria are these the "top" million sites, if 10% are dead? I'd start with questioning that data.

kjeetgill3y ago

Dude, it's the second sentence of the first paragraph:

MatthiasPortzel3y ago

And moreover, the author’s conclusion is that the dataset is bad.

> While I had expected some cleanliness issues, I wasn’t expecting to see this level of quality problems from a dataset that I’ve seen referenced pretty extensively across the web

the_biot3y ago

Yeah, but they're still providing a dataset that's just plain bad. It's hardly relevant how many sites link to some other site, if it's dead.

1 more reply

winddude3y ago

part of the problem is it's not the number of links, it's referring subnets. Fairly certain this includes script tags.

deltree73y ago

Exactly!

Garbage In == Garbage Out

softwaredoug3y ago· 5 in thread

My current beliefs about how people use and trust information on the Web.

First, trust is _everything_ on the Web, it is the thing people first think of when arriving on some information. But how people evaluate trust has changed dramatically over the last 10 years.

- Of course a lot of this trust is misplaced. There's a very thin line between influencers and cult leaders / snake oil salesmen. Our last President used this hack really effectively.

This of course means the fracturing of truth away from elites. Sometimes this could be a good thing, but in many cases cough Covid cough it might be pretty disastrous for misinformation

mountainriver3y ago

> How many of us now google `<topic> reddit` instead of just `<topic>`

I sure hope not, Reddit is a horrible place for information

romanhn3y ago

When I have a specific technical question, I append "stackoverflow" to my search queries. When I want to read a discussion, I add "reddit" (or "hacker news").

failTide3y ago

But in general I agree with you - reddit is full of misinformation, propaganda and astroturfing

wolverine8763y ago

> How many of us now google `<topic> reddit` instead of just `<topic>`?

Why would people in this community, who are aware of these issues, trust social media? How is that working out?

> This of course means the fracturing of truth away from elites. Sometimes this could be a good thing

I think that's mis/disinformation. 'Elite' is a loaded, negative (in this context) word. It makes the question about power and the conclusion inevitable.

In matters of science the authority of thousands is not worth the humble reasoning of one single person.

Brian_K_White3y ago

phkahler3y ago· 4 in thread

Read that again folks:

jayd163y ago

The flip side is anyone can run these kinds of tools against your site easily and cheaply.

stevemk14ebr3y ago

phkahler3y ago

Oh absolutely, but he also said this:

I found that my local system could easily handle 512 parallel processes, with my CPU @ ~35% utilization, 2GB of RAM usage, and a constant 1.5MB down on the network.

Because of this I think an alternative to github would be open source projects hosted on peoples home machines. CI/CD might require distributing work to those with the right hardware variants though.

cratermoon3y ago

> you don't need to rent or provision "big infrastructure" unless you're actually quite big.

superb-owl3y ago· 4 in thread

One of the few things I like about blockchain is the promise of a less ephemeral web.

bergenty3y ago

superb-owl3y ago

I'm thinking about things like Filecoin, a blockchain which is meant to power IPFS. To be fair though, IPFS itself is not a blockchain

matkoniecz3y ago

One of many things I dislike about cryptoscams is making promises which are lies.

deltree73y ago

spoken like someone who is clueless about Blockchain

altdataseller3y ago· 3 in thread

So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.

giantrobot3y ago

> where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

That's literally the phenomenon the article is describing.

altdataseller3y ago

1 more reply

TuringNYC3y ago

mouzogu3y ago· 3 in thread

whenever i go through my bookmarks, i tend to find maybe 5-10% are now 404.

this is why i like the archive.ph project so much and am using it more as a kind of bookmarking service.

syedkarim3y ago

What’s the benefit to using archive.ph instead of archive.org (Internet Archive)? Seems like the latter is much more likely to be around for awhile.

tropicalfruit3y ago

i find archive.ph does a better job of preserving the page as is (it also takes a screenshot) compared to internet archive which can be flaky at best.

i also find archive.ph much faster at searching, and the browser extension is really useful too.

the faq does a great job of explaining too https://archive.ph/faq

2 more replies

system23y ago

archive.ph = Russian federation website. Blocked by most firewalls by default.

tete3y ago· 2 in thread

If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.

Another worry is the ever-increasing size of these scripts making archiving more expensive.

Kye3y ago

nikisweeting3y ago

ArchiveWeb.page + ReplayWeb.page are the best I've found at handling ajax loaded content.

MonkeyMalarky3y ago· 2 in thread

wumpus3y ago

You have to run your own resolver. Crawling 101.

MonkeyMalarky3y ago

This is of course the correct answer. It just felt like shaving a big yak at the time.

2 more replies

yajjackson3y ago· 2 in thread

Tangential, but I love the format for your site. Any plans to do a "How I built this blog" post?

kerbersos3y ago

Likely using Hugo with the congo theme

SoupyOP3y ago

Yup, nailed it. Hugo with Congo theme (and a few minor layout tweaks). Hosted on cloudflare pages for free

kozziollek3y ago· 2 in thread

Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And cities have their websites, so I guess it's OK.

But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!

drdaeman3y ago

Those are called Public Suffixes or effective TLDs (eTLDs): https://en.wikipedia.org/wiki/Public_Suffix_List

And you're entirely correct that author should've referred to such list.
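A stripped-down sketch of how a suffix list gets used to find the registrable domain. The tiny `SAMPLE_SUFFIXES` set is hand-picked for this thread and is an assumption; a real implementation would load the full published list and also handle its wildcard and exception rules, which this ignores.

```python
from typing import Optional

# Illustrative subset only; the real Public Suffix List has thousands of entries.
SAMPLE_SUFFIXES = {"com", "pl", "info.pl", "biz.pl", "nz", "govt.nz", "jp", "go.jp"}


def registrable_domain(
    host: str, suffixes: set[str] = SAMPLE_SUFFIXES
) -> Optional[str]:
    """Return the public suffix plus one label, or None if there is none."""
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched
```

Under these assumptions info.pl and govt.nz come back as `None` (they are suffixes, not sites), while www.jma.go.jp normalizes to jma.go.jp.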

slyall3y ago

They even list govt.nz as the top site. In fact that doesn't exist (although www.govt.nz does, since it is a kinda government portal)

I see they list an old employer of mine who got bought 15 years ago and whose website has been redirecting for 10 years.

ghostly_s3y ago· 2 in thread

Wow, I would not have suspected `tee` is able to handle multiple processes writing to the same file. Doesn't seem to be mentioned on the man-page, either.

remram3y ago

What you're seeing is Linux being able to handle multiple processes writing to the same file.
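A small way to see the kernel-side behavior, assuming the writers open the file in append mode: with `O_APPEND`, each write goes to the current end of the file, so separate descriptors don't overwrite each other. The `demo` function and file name are made up, and this says nothing about very large writes or network filesystems.

```python
import os
import tempfile


def demo() -> list[str]:
    """Write through two independent append-mode descriptors to one file."""
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "results.txt")
        fd_a = os.open(path, flags, 0o644)
        fd_b = os.open(path, flags, 0o644)
        try:
            for i in range(3):
                # Each descriptor has its own offset, but O_APPEND moves it
                # to the end of the file before every write.
                os.write(fd_a, f"a{i}\n".encode())
                os.write(fd_b, f"b{i}\n".encode())
        finally:
            os.close(fd_a)
            os.close(fd_b)
        with open(path) as handle:
            return handle.read().splitlines()
```

Without `O_APPEND`, the two descriptors would each start at offset 0 and the second writer would clobber the first writer's lines.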

ghostly_s3y ago

1. https://stackoverflow.com/questions/7842511/safe-to-have-mul...

pahool3y ago· 2 in thread

zombo.com still kicking!

system23y ago

The png rotates with this:

.rotate {animation: rotation .5s infinite linear;}

I think it wasn't like this before. They must've updated it at one point.

blowski3y ago

Yes, when Flash went end of life, they were forced to adopt a new tech strategy.

gumby3y ago· 2 in thread

His 'www' logic is flawed: https://www.example.com and https://example.com need not return the same results, but his checking code sends the output straight to /dev/null so he has no way of knowing.

cbarrick3y ago

In theory, sure.

In practice, how many orgs serve on both example.com and www.example.com yet operate each as entirely separate sites?

I cannot think of any example.

gumby3y ago

MIT was, for decades, though they seem to have changed.

zinekeller3y ago· 2 in thread

Longer version: This isn't comprehensive, but I think of two main reasons why:

Edit: I've downloaded out the CSV to check my claims, and it shows:

  wixsite.com 0
  beian.gov.cn 0

Also addendum to crawling but I consider "probably forgivable":

- Some websites are only available in certain countries (internal Russian websites don't respond at all outside Russia for example). This can skew the numbers a little bit.

zepearl3y ago
