• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP
• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (Author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions – depending on what Majestic means by 'subnets'.)
• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can take the more aggressive form of a dropped connection.
• `curl` given a naked hostname likely attempts a plain HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on plain-HTTP port anymore.
• Author's burst of activity could've triggered other rate-limits/failures - either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.
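To make the `www.` and plain-HTTP points above concrete, here is a rough Python sketch of the variants I'd try for each listed domain before calling it dead. The function names and the `probe` callable are my own invention for illustration, not anything the article's author used.

```python
def candidate_urls(domain):
    """Return the URLs worth trying for one listed domain, HTTPS first."""
    host = domain.strip().lower().rstrip(".")
    # Avoid producing www.www.example.com for entries already prefixed.
    hosts = [host] if host.startswith("www.") else [host, "www." + host]
    # HTTPS first, since some sites may no longer listen on port 80 at all.
    return [f"{scheme}://{h}/" for scheme in ("https", "http") for h in hosts]


def first_live(domain, probe):
    """Return the first candidate URL for which probe(url) is truthy, else None.

    `probe` is a hypothetical callable supplied by the caller (for example a
    wrapper around an HTTP HEAD request); it is not defined here.
    """
    for url in candidate_urls(domain):
        if probe(url):
            return url
    return None
```

With something like this, a domain only counts as dead once all four combinations have failed, which is a much weaker claim than "the bare hostname didn't answer on port 80".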
If you want to probe if domains are still active:
• confirm they're still registered via a `whois`-like lookup
• examine their DNS records for evidence of current services
• ping them, or any DNS-evident subdomains
• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)
• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
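As a minimal sketch of the DNS step in that list, assuming you only care whether a name resolves to any address at all: this uses the system resolver via the standard library, so it won't show MX records or registration status (those need a DNS library or a `whois` client). The guessed labels `www` and `mail` are just my assumptions about where services tend to live.

```python
import socket


def has_address(host, resolver=socket.getaddrinfo):
    """Return True if `host` resolves to at least one address.

    `resolver` defaults to the system resolver; it is injectable so the
    function can be exercised without network access.
    """
    try:
        return len(resolver(host, None)) > 0
    except (socket.gaierror, UnicodeError):
        # Lookup failures (no such name, no address records) and
        # malformed hostnames both count as "does not resolve".
        return False


def resolving_hosts(domain, labels=("", "www", "mail"), resolver=socket.getaddrinfo):
    """Check the apex plus a few guessed subdomains (the labels are assumptions)."""
    hosts = [f"{label}.{domain}" if label else domain for label in labels]
    return [h for h in hosts if has_address(h, resolver)]
```

A name that resolves is only weak evidence of life, and one that doesn't resolve at the apex may still be fine under `www.`, which is why the helper checks several hosts.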
If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.
Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.
> `subdomain.www.domain`
Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.
> Many sites may block `curl` requests because they only want attended human browser traffic,
Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.
And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the search can't tell apart an actual website from a blank domain parking page.
Well, the source URL provided by the article author initially claims, “The million domains we find with the most referring subnets”. Then it makes a contradictory comment mentioning ‘websites’. At best we can say Majestic is vague and/or confused about what they’re providing – but given the author’s results, I suspect this list contains domains with no guarantee Majestic ever saw a live HTTP service on these domains.
> Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.
How about I cite HN user ~gojomo, who for nearly a decade wrote & managed web crawling software for the Internet Archive. He says: “Sites that don’t want to be crawled use every tactic you can imagine to repel unwanted crawlers, including unceremoniously instant-dropping open connections from disfavored IPs and User-Agents. Sadly, given Google’s dominance, many give a free pass to only Google IPs & User-Agents, and maybe a few other search-engines.”
Most major search engines have dedicated blocks of addresses and use unique user-agents. If you just literally send wget or curl requests, you will be identified as a "bad" crawler almost immediately.
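If you do probe from a script, one small courtesy is to send a User-Agent that says who you are rather than a tool default. A hedged Python sketch follows; the User-Agent string and contact URL are placeholders I made up, and this does nothing about IP-based blocking.

```python
import urllib.request

# Hypothetical identifying User-Agent; replace with something that names you.
USER_AGENT = "example-liveness-check/0.1 (+https://example.org/contact)"


def build_request(url, user_agent=USER_AGENT):
    """Build a HEAD request that states who is asking instead of a tool default."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent},
        method="HEAD",
    )
```

You would pass the result to `urllib.request.urlopen(build_request(url), timeout=10)`. Some servers reject HEAD outright, so a failure here still isn't proof that the site is down.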
We use stage.www.domain.tld for the staging/testing site, but that's about it ;)
https://psnow.ext.hpe.com/doc/c04543743.pdf?jumpid=in_lit-ps...
https://h20195.www2.hpe.com/v2/GetPDF.aspx/c04543743.pdf
https://h22204.www2.hpe.com/NEP
So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?
The report is described as "The million domains we find with the most referring subnets"[1] and a referring subnet is a host with a webpage which points at the domain.
So to the grandparent, presumably if something is "linking" to these domains, they probably were meant to be websites.
[1] https://majestic.com/reports/majestic-million [2] https://majestic.com/help/glossary#RefSubnets, https://majestic.com/help/glossary#RefIPs and also https://majestic.com/help/glossary#Csubnet
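For intuition about what counting "referring subnets" could look like, here is a small Python sketch. It assumes a subnet means a /24 (class C) block of IPv4 addresses, which is what the "Csubnet" glossary anchor suggests; I have not verified Majestic's exact definition, and the function name is mine.

```python
import ipaddress


def referring_subnets(referrer_ips, prefix=24):
    """Collapse referring IPv4 addresses into distinct /prefix networks."""
    return {
        # strict=False lets us pass a host address and get its network.
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in referrer_ips
    }
```

Under that assumption, a thousand linking pages hosted in the same /24 count once, so the metric rewards diversity of linking hosts and says nothing about whether the target itself ever served HTTP.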
EDIT: the article uses a list sorted by inlinks, and I guess other websites don't necessarily update broken links, but that may be less true in the modern age where we have tools and automated services to automatically warn us about dead links on our websites.
Once a particular SEO trick is understood and "deoptimized" by Google, these "sites" no longer make money, and get abandoned.
It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.
It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked such as wixstatic.com don't appear to be valid), and his methodology is bad too, or at least describing the sites as dead is misleading.
That's why this domain is in the top
But docs.wixstatic.com is valid.
Not to mention a number of his requests may have been blocked.
> there’s a longtail of sites that had a variety of non-200 reponse codes but just to be conservative we’ll assume that they are all valid
I’m a no-www advocate. All my sites can be accessed from the Apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache’s .HTACCESS to rewrite the www to the Apex.
Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...
I used to feel the same way. — Until the arrival of so many new TLDs.
Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band
Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.
A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.
Been much too long to have any clue now what we did wrong.
From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME record is present at a node, no other data should be present; this ensures that the data for a canonical name and its aliases cannot be different."
So if you're looking up the MX record for domain, but happen to find a CNAME from domain to www.domain, the resolver will follow that and won't find any MX records for www.domain.
The correct approach is to create a CNAME record from www.domain to domain, and have the A record (and MX and other records) on the apex.
Most DNS providers have a proprietary workaround to create dns-redirects on the apex (such as AWS Route53 Alias records) and serve them as A records, but those rarely play nice with external resources.
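The CNAME rule quoted above is easy to check mechanically. Here is a minimal Python sketch over a made-up in-memory list of `(name, type, value)` tuples; it is not a zone-file parser, and it ignores the DNSSEC record types that are allowed to sit next to a CNAME.

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return names that hold a CNAME alongside any other record type.

    `records` is an iterable of (name, rtype, value) tuples, a made-up
    in-memory stand-in for a zone file.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )
```

Since a real apex always carries SOA and NS records, a CNAME there conflicts by definition, which is why the MX lookup in the firewall story above had nothing to find.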
> For my purposes, the Majestic Million dataset felt like the perfect fit as it is ranked by the number of links that point to that domain (as well as taking into account diversity of the origin domains as well).
> While I had expected some cleanliness issues, I wasn’t expecting to see this level of quality problems from a dataset that I’ve seen referenced pretty extensively across the web
Garbage In == Garbage Out
First, trust is _everything_ on the Web, it is the thing people first think of when arriving on some information. But how people evaluate trust has changed dramatically over the last 10 years.
- Trust now comes almost exclusively from social proof. Searching reddit, youtube, etc and other extremely _moderated_ sources of information, where the most work is done to ensure content comes from actual human beings. How many of us now google `<topic> reddit` instead of just `<topic>`?
- Of course a lot of this trust is misplaced. There's a very thin line between influencers and cult leaders / snake oil salesmen. Our last President used this hack really effectively.
- Few trust Google's definition of trust anymore -- essentially page rank. This made more sense when the Web essentially was social, where inbound links were very organic. Now with the trust in general Web sites evaporated, the main 'inbound links' anyone cares about come from individuals or communities they trust or identify with. They don't trust Google's algorithm (it's too opaque, and too easily gamed).
This of course means the fracturing of truth away from elites. Sometimes this could be a good thing, but in many cases cough Covid cough it might be pretty disastrous for misinformation
I sure hope not, Reddit is a horrible place for information
But in general I agree with you - reddit is full of misinformation, propaganda and astroturfing
One of us lives in a bubble. I don't trust Reddit for anything, or YouTube or any social media. IME, it's mis/disinformation - not only a lack of information, but a negative; it leaves me believing something false. My experience is, and plenty of research shows, that we have no way to sort truth from fiction without prior expertise in the domain. The misinformation and disinformation on social media, and its persuasiveness, is very well known. The results are evident before us, in the madness and disasters, in dead people, in threats to freedom, prosperity, and stability.
Why would people in this community, who are aware of these issues, trust social media? How is that working out?
> This of course means the fracturing of truth away from elites. Sometimes this could be a good thing
I think that's mis/disinformation. 'Elite' is a loaded, negative (in this context) word. It makes the question about power and the conclusion inevitable.
Making it about power distracts from the core issue of knowledge, which is truth. I want to hear from the one person, or one of the few people, with real knowledge about a topic; I don't want to hear from others.
In matters of science the authority of thousands is not worth the humble reasoning of one single person.
"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”. "
This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.
I found that my local system could easily handle 512 parallel processes, with my CPU @ ~35% utilization, 2GB of RAM usage, and a constant 1.5MB down on the network.
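For anyone wanting to reproduce that kind of fan-out without shell tooling, here is a rough Python sketch. Note that it swaps the parallel processes described above for a thread pool, which is usually enough when the work is waiting on the network; `probe` is a hypothetical callable you would supply, and 512 simply mirrors the figure quoted.

```python
from concurrent.futures import ThreadPoolExecutor


def probe_all(domains, probe, workers=512):
    """Run probe(domain) for every domain on a pool of worker threads.

    `probe` is a caller-supplied callable (hypothetical here); exceptions it
    raises are captured per domain rather than aborting the whole run.
    """
    def safe(domain):
        try:
            return domain, probe(domain)
        except Exception as exc:  # keep going, record what went wrong
            return domain, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the dict keeps list order.
        return dict(pool.map(safe, domains))
```

Keeping the exception object per domain matters later: it is the only way to tell a DNS failure from a timeout or a refused connection.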
Another thing that happened in the early web days was Apache. People needed a web server and it did the job correctly. Nobody ever really noticed that it had terrible performance, so early on infrastructure went to multiple servers and load balancers and all that jazz. Now with nginx, fast multi-core, and speedy networks even at home, it's possible to run sites with a hundred thousand users a day at home on a laptop. Not that you'd really want to do exactly that but it could be done.
Because of this I think an alternative to github would be open source projects hosted on people's home machines. CI/CD might require distributing work to those with the right hardware variants though.
Or if you have hard response-time requirements. I really don't think it would be good to, for example, wait an hour to process the data from 800K earthquake sensors and send out an alert to nearby affected areas.
So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.
That's literally the phenomenon the article is describing.
this is why i like the archive.ph project so much and using it more as a kind of bookmarking service.
i also find archive.ph much faster at searching, and the browser extension is really useful too.
the faq does a great job of explaining too https://archive.ph/faq
If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.
I also wonder what will happen to text on websites that depend on AJAX and JavaScript, which breaks when a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.
Another worry is the ever-increasing size of these scripts making archiving more expensive.
But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!
And you're entirely correct that author should've referred to such list.
They even list .govt.nz as the top site. In fact that doesn't exist (although www.govt.nz does, since it is a kind of government portal).
I see they list an old employer of mine who got bought 15 years ago and whose website has been redirecting for 10 years.
What you're seeing is Linux being able to handle multiple processes writing to the same file.
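A minimal Python sketch of the append pattern that comment is talking about, assuming a local Linux filesystem: each writer opens the file with `O_APPEND` and emits one whole line per `write()` call. The helper name is mine. As far as I know, whether lines from different writers can ever interleave depends on the OS, the filesystem and the write size, so treat this as illustration rather than a guarantee.

```python
import os


def append_line(path, line):
    """Append one line with a single write() on an O_APPEND descriptor."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        # With O_APPEND the kernel positions each write at the current end
        # of file, so writers do not overwrite one another.
        os.write(fd, data)
    finally:
        os.close(fd)
```

If results matter, having each worker write its own file and concatenating afterwards sidesteps the question entirely.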
1. https://stackoverflow.com/questions/7842511/safe-to-have-mul...
Longer version: This isn't comprehensive, but I can think of two main reasons why:
- The Majestic Million lists only the registrable part of each domain (with some exceptions), and this sometimes leads to central CDN domains being listed. For example, the Majestic Million lists wixsite.com (for those who are unaware, a CDN domain used by Wix.com with separate subdomains), but if you visit wixsite.com you won't get anything. The same goes for Azure: subdomains of azureedge.net and azurewebsites.net do exist (for example https://peering.azurewebsites.net/) but azureedge.net and azurewebsites.net themselves don't exist. Without similar filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...) would quickly lead you to this precise problem (mainly because the number one entry is "com", but phew, at least http://ai./ does exist!)
- Also, shame on the author for treating www-prefixed and apex-only names as one and the same. For some websites, they aren't. Take this example: jma.go.jp (Japan Meteorological Agency), which doesn't respond (actually NODATA) on http://jma.go.jp/ but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP Licence Administrator) won't respond at all, but www.beian.gov.cn will. And for ncbi.nlm.nih.gov (National Center for Biotechnology Information)? I can't blame Majestic: https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't redirect to a canonical domain, and unless you've compared the HTTP pages there's no way you would know that they are the same website!
Edit: I've downloaded the CSV to check my claims, and it shows:
wixsite.com 0
beian.gov.cn 0
Please, for the love of sanity, consider what the inclusion criteria of the Majestic Million (and similar lists) actually are. I can't believe I have to say it, but can we crowd-source "Falsehoods programmers believe about domains"?
Also, an addendum on crawling, which I consider "probably forgivable":
- Some websites are only available in certain countries (internal Russian websites don't respond at all outside Russia for example). This can skew the numbers a little bit.
I can confirm stuff like that - I'm writing a crawler&indexer-program (prototype in Python, now writing the final version in Rust) and assuming anything while crawling is NOK. I ended up adding URLs to my "to-index"-list by considering only links explicitly mentioned by other websites (or by pages within the same site).
Before the Cloud, when people would ask for a book on distributed computing, which wasn’t that often, I would tell them seriously “Practical Parallel Rendering”. That book was almost ten years old by then. 20 now. It’s ostensibly a book about CGI, but CGI is about distributed work pools, so half the book is a whirlwind tour of distributed computing and queuing theory. Once they start talking at length about raytracing, you can stop reading if CGI isn’t your thing, but that’s more than halfway through the book.
I still have to explain some of that stuff to people, and it catches them off guard because they think surely this little task is not so sophisticated as that…
I think this is where the art comes in. You can make something fiddly that takes constant supervision, so much so that you get frustrated trying to explain it to others, or you can make something where you push a button and magic comes out.
Just one thing: analyzing sites by total referring domains is not accurate, as your results showed. A backlink can easily be faked, and you can literally spam 1 million links within 1 day for any domain. Thus, this data source is not very useful.
For a more accurate result, try Ahrefs' top 1 million domains, ranked by traffic. Ahrefs ranks sites by their ranking keywords and infers traffic numbers from those, meaning these websites are live and ranking for some keywords.
You will see the result is much more accurate then; maybe not even a single website will be offline, because they are earning good cash.
I retested the 000 class of the .de ccTLD (1227) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives - whether this is representative I cannot tell, it's just a hint to retest.
If you try to connect via HTTP or HTTPS, then a quick run yields 91106 sites that are dead, or 9.11%
(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me)
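Since "dead" here lumps together timeouts, refusals and DNS failures, it may help to tally them separately. Below is a hedged Python sketch that buckets the exceptions a `urllib`-based probe might raise; the category names are my own, and none of this reflects how the numbers above were actually produced.

```python
import socket
import urllib.error
from collections import Counter


def classify_failure(exc):
    """Map an exception from a probe to a coarse failure category."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, so the site is arguably alive.
        return f"http_{exc.code}"
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        # urllib wraps the low-level error; classify the cause instead.
        return classify_failure(exc.reason)
    if isinstance(exc, socket.gaierror):
        return "dns_failure"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, ConnectionResetError):
        return "reset"
    return "other"


def tally(failures):
    """Count failure categories for an iterable of exceptions."""
    return Counter(classify_failure(e) for e in failures)
```

A pile of `timeout` and `reset` results from a cloud IP hints at sites playing dead, while `dns_failure` is much stronger evidence that a domain is really gone.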
In countries like India that onboarded most users through smartphones instead of computers, websites are not even necessary. There's a huge dearth of local-focused web content as well since there just isn't enough demand.