Compiler Explorer and the promise of URLs that last forever

davidcollantes1y ago

How do you manage those? Do you have a way to search them, or a specific way to catalogue them, which will make it easy to find exactly what you need from them?

snthpy1y ago

Thanks. I didn't know about this and it looks great.

A couple of questions:

- do you store them compressed or plain?

- what about private info like bank accounts or health insurance?

I guess for privacy one could train oneself to use private browsing mode.

Regarding compression, for thousands of files don't all those self-extraction headers add up? Wouldn't there be space savings by having a global compression dictionary and only storing the encoded data?
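The shared-dictionary idea above can be sketched with zlib's preset dictionary support. This is only an illustration, not how SingleFile stores pages: the `SHARED` bytes and the sample page are made up, and a real dictionary would be built from boilerplate common to the actual archive.

```python
import zlib

def compress_with_dict(data: bytes, shared_dict: bytes) -> bytes:
    # The preset dictionary is NOT stored in the output; only the encoded data is.
    comp = zlib.compressobj(level=9, zdict=shared_dict)
    return comp.compress(data) + comp.flush()

def decompress_with_dict(blob: bytes, shared_dict: bytes) -> bytes:
    # The same dictionary must be supplied again to decode.
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical boilerplate shared by many saved pages.
SHARED = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
          b'<title></title></head><body></body></html>')

page = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
        b'<title>Hi</title></head><body>Hello</body></html>')

plain = zlib.compress(page, 9)
with_dict = compress_with_dict(page, SHARED)
print(len(page), len(plain), len(with_dict))
```

The trade-off is that every file now depends on the dictionary: lose it and none of the pages can be decoded.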

shwouchk1y ago

i was considering a similar setup, but i don’t really trust extensions. I’m curious:

- Do you also archive logged in pages, infinite scrollers, banking sites, fb etc?

- How many entries is that?

- How often do you go back to the archive? is stuff easy to find?

- do you have any organization or additional process (eg bookmarks)?

did you try integrating it with llms/rag etc yet?

nyarlathotep_1y ago

Are you automating this in some fashion? Is there another extension you've authored or similar to invoke SingleFile functionality on a new page load or similar?

dataflow1y ago

Have you tried MHTML?

90s_dev1y ago

You must have several TB of the internet on disk by now...

flexagoon1y ago

By the way, if you install the official Web Archive browser extension, you can configure it to automatically archive every page you visit

petethomas1y ago

This is a good suggestion with the caveat that entire domains can and do disappear: https://help.archive.org/help/how-do-i-request-to-remove-som...

internetter1y ago

recently I've come to believe even IA and especially archive.is are ephemeral. I've watched sites I've saved disappear without a trace, except in my selfhosted archives.

A technological conundrum, however, is the fact that I have no way to prove that my archive is an accurate representation of a site at a point in time. Hmmm, or maybe I do? Maybe something funky with cert chains could be done.

vitorsr1y ago

> you can configure it to automatically archive every page you visit

What?? I am a heavy user of the Internet Archive services, not just the Wayback Machine, including official and "unofficial" clients and endpoints, and I had absolutely no idea the extension could do this.

To bulk archive I would manually do it via the web interface or batch automate it. The limitations of manually doing it one by one are obvious, and doing it in batches requires, well, keeping batches (lists).

90s_dev1y ago

My solution has been to just remember the important stuff, or at least where to find it. I'm not dead yet so I guess it works.

TeMPOraL1y ago

It was my solution too, and I liked it, but over the past decade or so, I noticed that even when I remember where to find some stuff, hell, even if I just remember how to find it, when I actually try and find it, it often isn't there anymore. "Search rot" is just as big a problem as link rot.

As for being still alive, by that measure hardly anything anyone does is important in the modern world. It's pretty hard to fail at thinking or remembering so badly that it becomes a life-or-death thing.

mock-possum1y ago

I’ve found that whenever I think “why don’t other people just do X” it’s because I’m misunderstanding what’s involved in X for them, and that generally if they could ‘just’ do X then they would.

“Why don’t you just” is a red flag now for me.

mycall1y ago

Is there some browser extension that automatically goes to web.archive.org if the link times out?

theblazehen1y ago

I use the Resurrect Pages addon

account421y ago

It really is a travesty that browsers still haven't updated their bookmark feature based on this realization - all bookmarks should store not only the link but a full copy of the rendered page (not just the source which could rely on dynamic content that will no longer be available).

Also, open tabs should work the same way: I never want to see a network error while going back to a tab while not having an internet connection because the browser has helpfully evicted that tab from memory. It should just reload the state from disk instead of the network in this case until I manually refresh the page.

blurbleblurble1y ago

Use WARC: https://en.wikipedia.org/wiki/WARC_(file_format) with WebRecorder: https://webrecorder.net/

shwouchk1y ago

warc is not a panacea; for example, gemini makes it super annoying to get a transcript of your conversation, so i started saving those as pdf and warc.

turns out that unlike most webpages, the pdf version is only a single page of what is visible on screen.

turns out also that opening the warc immediately triggers a js redirect that is planted in the page. i can still extract the text manually - it’s embedded there - but i cannot “just open” the warc in my browser and expect an offline “archive” version - im interacting with a live webpage! this sucks from all sides - usability, privacy, security.

Admittedly, i don’t use webrecorder - does it solve this problem? did you verify?

andai1y ago

Is there some kind of thing that turns a web page into a text file? I know you can do it with beautiful soup (or like 4 lines of python stdlib), but I usually need it on my phone, where I don't know a good option.

My phone browser has a "reader view" popup but it only appears sometimes, and usually not on pages that need it!

Edit: Just installed w3m in Termux... the things we can do nowadays!
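For reference, the "few lines of python stdlib" approach mentioned above might look like the sketch below. It is a minimal example, assuming you already have the HTML as a string; the class and function names are made up here, and it does nothing clever about layout or whitespace inside a paragraph.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample = ("<html><head><style>p{}</style></head><body><h1>Title</h1>"
          "<p>Hello <b>world</b></p><script>x=1</script></body></html>")
print(html_to_text(sample))
```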

XorNot1y ago

You want Zotero.

It's for bibliographies, but it also archives and stores web pages locally with a browser integration.

https://www.w3.org/Provider/Style/URI

m-p-31y ago

I export text-based content I want to retain into Markdown files, and when I find something useful for work I also send the URL to the Wayback Machine.

nonethewiser1y ago

A reference is a bet on continuity.

At a fundamental level, broken website links and dangling pointers in C are the same.

jwe1y ago

I can recommend using Pinboard with the archive option

taeric1y ago

That assumption isn't true of any sources? Things flat out change. Some literally, others more in meaning. Some because they are corrected, but there are other reasons.

Not that I don't think there is some benefit in what you are attempting, of course. A similar thing I still wish I could do is to "archive" someone's phone number from my contact list. Be it a number that used to be ours, or family/friends that have passed.

rubit_xxx161y ago

> Before 2010 I had this unquestioned assumption that links are supposed to last forever

Any site/company whatsoever of this world (and most) that promises that anything will last forever is seriously deluded or intentionally lying, unless their theory of time is different than that of the majority.

90s_dev1y ago· 10 in thread

Some famous programmer once wrote about how links should last forever.

He advocated for /foo/bar with no extension. He was right about not using /foo/bar.php because the implementation might change.

But he was wrong, it should be /foo/bar.html because the end-result will always be HTML when it's served by a browser, whether it's generated by PHP, Node.js or by hand.

It's pointless to prepare for some hypothetical new browser that uses an alternate language other than HTML and that doesn't use HTML.

Just use .html for your pages and stop worrying about how to correctly convert foo.md to foo/index.html and configure nginx accordingly.

Sesse__1y ago

> Some famous programmer once wrote about how links should last forever.

You're probably thinking of W3C's guidance: https://www.w3.org/Provider/Style/URI

> But he was wrong, it should be /foo/bar.html because the end-result will always be HTML

20 years ago, it wasn't obvious at all that the end-result would always be HTML (in particular, various styled forms of XML were thought to eventually take over). And in any case, there's no reason to have the content-type in the URL; why would the user care about that?

90s_dev1y ago

There's strong precedent for associating file extensions with content types. And it allows static files to map 1:1 to URLs.

I agree though that I was too harsh, I didn't realize it was written in 1998 when HTML was still new. I probably first read it around 2010.

But now that we have hindsight, I think it's safe to say .html files will continue to be supported for the next 50 years.

esafak1y ago

If it's always .html, it's cruft; get rid of it. And what if it's not HTML but JSON? Besides, does the user care? Berners-Lee was right.

90s_dev1y ago

If it's JSON then name it /foo/bar.json, and as a bonus you can also have /foo/bar.html!

You say the extension is cruft. That's your opinion. I don't share it.

https://github.com/sdegutis/bubbles

90s_dev1y ago

Found it: https://www.w3.org/Provider/Style/URI

Why did I think Joel Spolsky or Jeff Atwood wrote it?

crackalamoo1y ago

I use /foo/bar/ with the trailing slash because it works better with relative URLs for resources like images. I could also use /foo/bar/index.html but I find the former to be cleaner

90s_dev1y ago

It's always bothered me in a small way that github doesn't honor this:

https://github.com/sdegutis/bubbles/

No redirect, just two renders!

It bothers me first because it's semantically different.

Second and more importantly, because it's always such a pain to configure that redirect in nginx or whatever. I eventually figure it out each time, after many hours wasted looking it up all over again and trial/error.

Dwedit1y ago

mod_rewrite means you can redirect the .php page to something else if you stop using php.

shakna1y ago

Unless mod_rewrite is disabled, because it has had a few security bugs over the years. Like last year. [0]

[0] https://nvd.nist.gov/vuln/detail/CVE-2024-38475

account421y ago

It also means you can internally redirect the extension-less version to .php in the first place so you never have to change your public URL in the future.

diggan1y ago· 9 in thread

URLs (uniform resource locator) cannot ever last forever, as it's a location and locations can't last forever :)

URIs however, can be made to last forever! Also comes with the added benefit that if you somehow integrate content-addressing into the identifier, you'll also be able to safely fetch it from any computer, hostile or not.
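A rough sketch of what "content-addressing in the identifier" buys you: the identifier is a hash of the bytes, so whatever comes back can be checked no matter who served it. The `sha256:` prefix and the function names are illustrative assumptions, not the identifier format of IPFS or any other real system.

```python
import hashlib

def content_address(data: bytes) -> str:
    # The identifier is derived purely from the content itself.
    return "sha256:" + hashlib.sha256(data).hexdigest()

def verify(address: str, data: bytes) -> bool:
    # Anyone, trusted or not, can hand us bytes; we re-hash and compare.
    return content_address(data) == address

original = b"int square(int x) { return x * x; }\n"
address = content_address(original)

print(address)
print(verify(address, original))           # bytes from an honest host
print(verify(address, original + b"// x")) # bytes tampered with in transit
```

What this does not provide is availability: the address stays valid forever, but somebody still has to keep a copy.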

90s_dev1y ago

I've been making websites for almost 30 years now.

I still don't know the difference between URI and URL.

I'm starting to think it doesn't matter.

Sesse__1y ago

It doesn't matter.

URI is basically a format and nothing else. (foo://bar123 would be a URI but not a URL because nothing defines what foo: is.)

URLs and URNs are thingies using the URI format; https://news.ycombinator.com is a URL (in addition to being a URI) because there's an RFC that specifies what https: means and how to go out and fetch them.

urn:isbn:0451450523 (example cribbed from Wikipedia) is an URN (in addition to being an URI) that uniquely identifies a book, but doesn't tell you how to go find that book.

Mostly, the difference is pedantic, given that URNs never took off.
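To see the "URI is just a format" point concretely, here is a small sketch using Python's `urllib.parse.urlsplit`, which splits any of the three examples above the same way regardless of whether the scheme means anything. The `describe` helper is a made-up name for illustration.

```python
from urllib.parse import urlsplit

def describe(ref: str) -> tuple[str, str, str]:
    # Purely syntactic: no lookup of what the scheme means or how to fetch it.
    parts = urlsplit(ref)
    return parts.scheme, parts.netloc, parts.path

for ref in ("https://news.ycombinator.com", "urn:isbn:0451450523", "foo://bar123"):
    print(ref, "->", describe(ref))
```

All three parse fine; only the `https` one comes with a defined way to go and fetch something.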

layer81y ago

URLs in the strict sense are a subset of URIs. They specify a mechanism (like HTTP or FTP) for how to access the referenced resource. The other type of URIs are opaque IDs, like doi:10.1000/182 or urn:isbn:9780141036144. These technically can’t expire, though that doesn’t mean you’ll be able to access what they reference.

However, “URL” in the broader sense is used as an umbrella term for URIs and IRIs (internationalized resource identifiers), in particular by WHATWG.

In practice, what matters is the specific URI scheme (“http”, “doi”, etc.).

immibis1y ago

A URL tells you where to get some data, like https://example.com/index.html

A URN tells you which data to get (usually by hash or by some big centralized registry), but not how to get it. DOIs in academia, for example, or RFC numbers. Magnet links are borderline.

URIs are either URLs or URNs. URNs are rarely used since they're less practical since browsers can't open them - but note that in any case each URL scheme (https) or URN scheme (doi) is unique - there's no universal way to fetch one without specific handling for each supported scheme. So it's not actually that unusual for a browser not to be able to open a certain scheme.

diggan1y ago

> I still don't know the difference between URI and URL.

One is a location, the other one is an ID. Which is which is referenced in the name :)

And sure, it doesn't matter as long as you're fine with referencing locations rather than the actual data, and aware of the tradeoffs.

marcosdumay1y ago

A URI is a standard way to write names of documents.

A URL is a URI that also tells you how to find the document.

postoplust1y ago

For example: IPFS URIs are content addresses

https://docs.ipfs.tech/

bowsamic1y ago

Does this have any actual grounding in reality or does your lack of suggestion for action confirm my suspicion that this is just a theoretical wish?

diggan1y ago

> Does this have any actual grounding in reality

Depends on your use case I suppose. For things I want to ensure I can reference forever (theoretical forever), then using location for that reference feels less than ideal, I cannot even count the number of dead bookmarks on both hands and feet, so "link rot" is a real issue.

If those bookmarks instead referenced the actual content (via content-addressing for example), rather than the location, then those would still work today.

But again, not everyone cares about things sticking around, not all use cases require the reference to continue being alive, and so on, so if it's applicable to you or not is something only you can decide.

olalonde1y ago· 6 in thread

> This article was written by a human, but links were suggested by and grammar checked by an LLM.

This is the second time today I've seen a disclaimer like this. Looks like we're witnessing the start of a new trend.

tester7561y ago

It's crazy that people feel that they need to put such disclaimers

actuallyalys1y ago

It makes sense to me. After seeing a bunch of AI slop, people started putting no AI buttons and disclaimers. Then some people using AI for little things wanted to clarify it wasn’t AI generated wholesale without falsely claiming AI wasn’t involved at all.

layer81y ago

It’s more a claimer than a disclaimer. ;)

psychoslave1y ago

This comment was written by a human with no check by any automaton, but how will you check that?

chii1y ago

i dont find the need to have such a disclaimer at all.

If the content can stand on its own, then it is sufficient. If the content is slop, then why does it matter that it is an ai generated slop vs human generated slop?

The only reason anyone wants to know/have the disclaimer is if they cannot themselves discern the quality of the contents, and is using ai generation as a proxy for (bad) quality.

For the author it matters. To which degree do they want to be associated with the resulting text.

And I differentiate between "Matt Godbolt" who is an expert in some areas and in my experience careful about avoiding wrong information and an LLM which may produce additional depth, but may also make up things.

And well, "discern the quality of the contents" - I often read texts to learn new things. On new things I don't have enough knowledge to qualify the statements, but I may have experience with regards to the author or publisher.

https://en.m.wikipedia.org/wiki/Uniform_Resource_Name

s17n1y ago· 5 in thread

URLs lasting forever was a beautiful dream but in reality, it seems that 99% of URLs don't in fact last forever. Rather than endlessly fighting a losing battle, maybe we should build the technology around the assumption that infrastructure isn't permanent?

nonethewiser1y ago

>maybe we should build the technology around the assumption that infrastructure isn't permanent?

Yes. Also not using a url shortener as infrastructure.

dreamcompiler1y ago

URNs were supposed to solve that problem by separating the identity of the thing from the location of the thing.

But they never became popular and then link shorteners reimplemented the idea, badly.

hoppp1y ago

Yes.

domain names often change hands and a URL that is supposed to last forever can turn into a malicious phishing link over time.

emaro1y ago

In theory a content-addressed system like IPFS would be the best: if someone online still has a copy, you can get it too.

jjmarr1y ago

URLs identify the location of a resource on a network, not the resource itself, and so are not required to be permanent or unique. That's why they're called "uniform resource locators".

This problem was recognized in 1997 and is why the Digital Object Identifier was invented.

jimmyl021y ago· 5 in thread

This is a great perspective about how assumptions play out over a longer period of time. I think that this risk is much greater for free third party services for critical infrastructure.

Someone has to foot the bill somewhere and if there isn't a source of income then the project is bound to be unsupported eventually.

tekacs1y ago

I think I would struggle to say that free services die at a higher rate consistently…

So many paid offerings, whether from startups or even from large companies, have been sunset over time, often with frustratingly short migration periods.

If anything, I feel like I can think of more paid services that have given their users short migration periods than free ones.

cortesoft1y ago

Nah, businesses go under all the time, whether their services are paid or not.

lqstuart1y ago

Counterexample: the Linux kernel

charcircuit1y ago

How? Big tech foots the bill.

iainmerrick1y ago

Linux isn't a service (in the SaaS sense).

creatonez1y ago· 4 in thread

There's something poetic about abusing a link shortener as a database and then later having to retrieve all your precious links from random corners of the internet because you've lost the original reference.

rs1861y ago

Shortening long URLs is the intended use case for a ... URL shortener.

The real abusers are the people who use a shortener to hide scam/spam/illegal websites behind a common domain and post it everywhere.

creatonez1y ago

These are not just "long URLs". These are URLs where the entire content is stored in the fragment suffix of the URL. They are blobs, and always have been.

nonethewiser1y ago

Didn't they just use the link shortener to compress the url? They used their url as the "database" (ie holding the compiler state).

Arcuru1y ago

They didn't store anything themselves since they encoded the full state in the urls that were given out. So the link shortener was the only place where the "database", the urls, were being stored.
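A minimal sketch of the "whole state lives in the URL fragment" pattern being described. This is not Compiler Explorer's actual encoding; the JSON + zlib + base64 scheme, the function names, and the `example.invalid` base URL are all assumptions chosen for illustration.

```python
import base64
import json
import zlib
from urllib.parse import urlsplit

def encode_state(state: dict) -> str:
    # Serialize deterministically, compress, then make it URL-safe.
    raw = json.dumps(state, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode("ascii")

def decode_state(fragment: str) -> dict:
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(fragment)))

def make_link(base: str, state: dict) -> str:
    # Nothing is stored server-side; the link itself is the only copy.
    return f"{base}#{encode_state(state)}"

state = {"compiler": "gcc", "options": "-O2", "source": "int f(int x) { return x * 2; }"}
link = make_link("https://example.invalid/", state)
print(len(link), link)
print(decode_state(urlsplit(link).fragment) == state)
```

Links like this grow with the source code, which is why a shortener looked attractive, and why the shortener ended up holding the only short handle to each blob.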

layer81y ago· 4 in thread

I find it somewhat surprising that it’s worth the effort for Google to shut down the read-only version. Unless they fear some legal risks of leaving redirects to private links online.

actuallyalys1y ago

Hard to say from the outside, but it’s possible the service relies on some outdated or insecure library, runtime, service, etc. they want to stop running. Although frankly it seems just as possible it’s a trivial expense and they’re cutting it because it’s still a net expense, goodwill and past promises be damned.

Scaevolus1y ago

Typically services like these are side projects of just a few Google employees, and when the last one leaves they are shut down.

mbac327681y ago

yeah but nobody wants to put "spent two months migrating goo.gl url shortener to work with Sisyphus release manager and Dante 7 SRE monitoring" in their perf packet

that's a negative credit activity

mmooss1y ago

Another possibility is that it's a distraction - whatever the marginal costs, there's a fixed cost to each system in terms of cognitive overhead, if not documentation, legal issues (which can change as laws and regulations change), etc. Removing distractions is basic management.

swyx1y ago· 4 in thread

idk man how can URLs last forever if it costs money to keep a domain name alive?

i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

Historians however would love to have more garbage from history, to get more insights on "real" life rather than just the parts one considered worth keeping.

If I could time jump it would be interesting to see how historians in a thousand years will look back at our period where a lot of information will just disappear without traces as digital media rots.

mrguyorama1y ago

I regularly wonder if modern educated people do not journal as much as previous century educated people who were kind of rare.

Maybe we should get a journaling boom going.

But it has to be written, because pen and paper is literally ten times more durable than even good digital storage.

[1] https://wiki.archiveteam.org/index.php/Goo.gl

swyx1y ago

we'd keep the curiosities around, like so much Ea Nasir Sells Shit Copper. we have room for like 5-10 of those per century. not like 8 billion. much of life is mundane.


internetter1y ago

> i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

agreed. formerly wrote some thoughts here: https://boehs.org/node/internet-evanescence

mananaysiempre1y ago· 3 in thread

May be worth cooperating with ArchiveTeam’s project[1] on Goo.gl?

> url shortening was a fucking awful idea[2]

[2] https://wiki.archiveteam.org/index.php/URLTeam

MallocVoidstar1y ago

IIRC ArchiveTeam were bruteforcing Goo.gl short URLs, not going through 'known' links, so I'd assume they have many/all of Compiler Explorer's URLs. (So, good idea to contact them)

tech234a1y ago

Real-time status for that project indicates 7.5 billion goo.gl URLs found out of 42 billion goo.gl URLs scanned: https://tracker.archiveteam.org:1338/status

mattgodbolt1y ago

Thanks! Someone posted on GitHub about that and I'll be looking at that tomorrow!

wrs1y ago· 3 in thread

I hate to say it, but unless there’s a really well-funded foundation involved, Compiler Explorer and godbolt.org won’t last forever either. (Maybe by then all the info will have been distilled into the 487 quadrillion parameter model of everything…)

mattgodbolt1y ago

We've done alright so far: 13 years this week. I have funding for another year and change even assuming growth and all our current sponsors pull out.

I /am/ thinking about a foundation or similar though: the single point of failure is not funding but "me".

badmintonbaseba1y ago

Well, that's true, but at least now compiler explorer links will stop working when compiler explorer vanishes, but not before that.

I think the most valuable long-living compiler explorer links are in bug reports. I like to link to compiler explorer in bug reports for convenience, but I also include the code in the report itself, and specify what compiler I used with what version to reproduce the bug. I don't expect compiler explorer to vanish anytime soon, but making bug reports self-contained like this protects against that.

layer81y ago

Thanks to the no-hiding theorem, the information will live forever. ;)

amiga3861y ago· 2 in thread

https://killedbygoogle.com/

> Google Go Links (2010–2021)

> Killed about 4 years ago, (also known as Google Short Links) was a URL shortening service. It also supported custom domain for customers of Google Workspace (formerly G Suite (formerly Google Apps)). It was about 11 years old.

zerocrates1y ago

"Killing" the service in the sense of minting new ones is no big deal and hardly merits mention.

Killing the existing ones is much more of a jerk move. Particularly so since Google is still keeping it around in some form for internal use by their own apps.

ruune1y ago

Don't they use https://g.co now? Or are there still new internal goo.gl links created?

Edit: Google is using a g.co link on the "Your device is booting another OS" screen that appears when booting up my Pixel running GrapheneOS. Will be awkward when they kill that service and the hard coded link in the phone's BIOS is just dead

sebstefan1y ago· 2 in thread

>Over the last few days, I’ve been scraping everywhere I can think of, collating the links I can find out in the wild, and compiling my own database of links1 – and importantly, the URLs they redirect to. So far, I’ve found 12,000 links from scraping:

>Google (using their web search API)

>GitHub (using their API)

>Our own (somewhat limited) web logs

>The archive.org Stack Overflow data dumps

>Archive.org’s own list of archived webpages

You're an angel Matt

mattgodbolt1y ago

Thanks! It's been a fun learning experience. I just found out the internet archive has a much more comprehensive effort going so it might have been in vain, but I tried :)

account421y ago

What really matters is caring about keeping the links going in the first place. Most website operators never really get that far. So, thanks for caring.

mbac327681y ago· 2 in thread

it seems a bit crazy to try to avoid storing a relatively small amount of data when a link is shared when storage costs and bandwidth costs are rapidly dropping with time

but perhaps I don't appreciate how much traffic godbolt gets

mattgodbolt1y ago

It was a simpler time and I didn't want the responsibility of storing other people's data. We do now though!

mattgodbolt1y ago

Oh and traffic: https://stats.compiler-explorer.com/

sdf4j1y ago· 2 in thread

> One of my founding principles is that Compiler Explorer links should last forever.

And yet… that was a very self-destructive decision.

mattgodbolt1y ago

I'm not sure why so?

MyPasswordSucks1y ago

Because URL shortening is relatively trivial to implement, and instead of just doing so on their own end, they decided to rely on a third-party service.

Considering link permanence was a "founding principle", that's just unbelievably stupid. If I decide one of my "founding principles" is that I'm never going to show up at work with a dirty windshield, then I shouldn't rely on the corner gas station's squeegee and cleaning fluid.

https://github.com/gildas-lormeau/SingleFile

rurban1y ago· 1 in thread

He missed the archive.org crawl for those links in the blog post. they have them stored also now. https://github.com/compiler-explorer/compiler-explorer/discu...

mattgodbolt1y ago

He didn't know at the time but he's definitely pleased this is happening and will get to looking at it tomorrow!

devnullbrain1y ago· 1 in thread

>despite Google solemnly promising that “all existing links will continue to redirect to the intended destination,” it went read-only a few years back, and now they’re finally sunsetting it in August 2025

It's become so trite to mention that I'm rolling my eyes at myself just for bringing it up again but... come on! How bad can it be before Google do something about the reputation this behaviour has created?

Was Stadia not an expensive enough failure?

iainmerrick1y ago

I'm very surprised, even though I shouldn't be, that they're actually shutting the read-only goo.gl service down.

For other obsolete apps and services, you can argue that they require some continual maintenance and upkeep, so keeping them around is expensive and not cost-effective if very few people are using them.

But a URL shortener is super simple! It's just a database, and in this case we don't even need to write to it. It's literally one of the example programs for AWS Lambda, intentionally chosen because it's really simple.

I guess the goo.gl link database is probably really big, but even so, this is Google! Storage is cheap! Shutting it down is such a short-sighted mean-spirited bean-counter decision, I just don't get it.
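To illustrate the "it's just a database, and we don't even need to write to it" point, here is a toy read-only resolver. It is a sketch under obvious assumptions: an in-memory SQLite table stands in for the real storage, and the short codes and targets are invented.

```python
import sqlite3

def open_readonly_shortener(rows):
    # Load the frozen code -> target mapping once; nothing is added afterwards.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE links (code TEXT PRIMARY KEY, target TEXT NOT NULL)")
    db.executemany("INSERT INTO links VALUES (?, ?)", rows)
    db.commit()
    return db

def resolve(db, code):
    # A redirect service would answer 301 with this target, or 404 for None.
    row = db.execute("SELECT target FROM links WHERE code = ?", (code,)).fetchone()
    return row[0] if row else None

db = open_readonly_shortener([
    ("abc123", "https://example.invalid/#state-one"),
    ("xyz789", "https://example.invalid/#state-two"),
])
print(resolve(db, "abc123"))
print(resolve(db, "nope"))
```

The lookup really is this small; whatever makes the real service costly to keep lies elsewhere (scale, abuse handling, internal upkeep), as other comments in the thread suggest.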

Ericson23141y ago· 1 in thread

The only type of reference that lasts forever is a content address.

We should be using more of them.

account421y ago

A content address doesn't guarantee that there is anyone still serving that content so it doesn't actually improve much over an URL + reference date.

2YwaZHXV1y ago

Presumably there's no way to get someone at Google to query their database and find all the shortened links that go to godbolt.org?

sedatk1y ago

Surprisingly, purl.org URLs still work after a quarter century, thanks to Internet Archive.

shepmaster1y ago

As we all know, Cool URIs don't change [1]. I greatly appreciate the care taken to keep these Compiler Explorer links working as long as possible.

The Rust playground uses GitHub Gists as the primary storage location for shared data. I'm dreading the day that I need to migrate everything away from there to something self-maintained.

[1]: https://www.w3.org/Provider/Style/URI

3cats-in-a-coat1y ago

Nothing lasts forever.

I've pondered that a lot in my system design which bears some resemblance to the principles of REST.

I have split resources in ephemeral (and mutable), and immutable, reference counted (or otherwise GC-ed), which are persistent while referred to, but collected when no one refers to them.

In a distributed system the former is the default, the latter can exist in little islands of isolated context.

You can't track references throughout the entire world. The only thing that works is timeouts. But those are not reliable. Nor can you exist forever, years after no one needs you. A system needs its parts to be useful, or it dies full of useless parts.

devrandoom1y ago

> despite Google solemnly promising ...

I'm pretty sure the lore says that a solemn promise from Google carries the exact same value as a prostitute saying she likes you.

nssnsjsjsjs1y ago

The corollary of URLs that last forever is that we must have both forever storage (costs money forever) and forever institutional care and memory.

Where URLs may last longer is where they are not used for the RL bit. But more like a UUID for namespacing. E.g. in XML, Java or Go.


kccqzy1y ago· 27 in thread

lappa1y ago

I use the SingleFile extension to archive every page I visit.

It's easy to set up, but be warned, it takes up a lot of disk space.

    $ du -h ~/archive/webpages
    1.1T /home/andrew/archive/webpages

internetter1y ago

storage is cheap, but if you wanted to improve this:

1. find a way to dedup media

2. ensure content blockers are doing well

4. add fts and embeddings over the pages

davidcollantes1y ago

How do you manage those? Do you have a way to search them, or a specific way to catalogue them, which will make it easy to find exactly what you need from them?

snthpy1y ago

Thanks. I didn't know about this and it looks great.

A couple of questions:

- do you store them compressed or plain?

- what about private info like bank accounts or health issuance?

I guess for privacy one could train oneself to use private browsing mode.

shwouchk1y ago

i was considering a similar setup, but i don’t really trust extensions. Im curious;

did you try integrating it with llms/rag etc yet?

nyarlathotep_1y ago

Are you automating this in some fashion? Is there another extension you've authored or similar to invoke SingleFile functionality on a new page load or similar?

dataflow1y ago

Have you tried MHTML?

90s_dev1y ago

You must have several TB of the internet on disk by now...

flexagoon1y ago

By the way, if you install the official Web Archive browser extension, you can configure it to automatically archive every page you visit

petethomas1y ago

This a good suggestion with the caveat that entire domains can and do disappear: https://help.archive.org/help/how-do-i-request-to-remove-som...

internetter1y ago

recently I've come to believe even IA and especially archive.is are ephermal. I've watched sites I've saved disappear without a trace, except in my selfhosted archives.

vitorsr1y ago

> you can configure it to automatically archive every page you visit

90s_dev1y ago

My solution has been to just remember the important stuff, or at least where to find it. I'm not dead yet so I guess it works.

TeMPOraL1y ago

mock-possum1y ago

“Why don’t you just” is a red flag now for me.

mycall1y ago

Is there some browser extension that automatically goes to web.archive.org if the link times out?

theblazehen1y ago

I use the Resurrect Pages addon

account421y ago

blurbleblurble1y ago

Use WARC: https://en.wikipedia.org/wiki/WARC_(file_format) with WebRecorder: https://webrecorder.net/

shwouchk1y ago

warc is not a panacea; for example, gemini makes it super annoying to get a transcript of your conversation, so i started saving those as pdf and warc.

turns out that unlike most webpages, the pdf version is only a single page of what is visible on screen.

Admittedly, i don’t use webrecorder - does it solve this problem? did you verify?

andai1y ago

My phone browser has a "reader view" popup but it only appears sometimes, and usually not on pages that need it!

Edit: Just installed w3m in Termux... the things we can do nowadays!

XorNot1y ago

You want Zotero.

It's for bibliographies, but it also archives and stores web pages locally with a browser integration.

https://www.w3.org/Provider/Style/URI

m-p-31y ago

I export text-based content I want to retain into Markdown files, and when I find something useful for work I also send the URL to the Wayback Machine.

nonethewiser1y ago

A reference is a bet on continuity.

At a fundamental level, broken website links and dangling pointers in C are the same.

jwe1y ago

I can recommend using Pinboard with the archive option

taeric1y ago

That assumption isn't true of any sources? Things flat out change. Some literally, others more in meaning. Some because they are corrected, but there are other reasons.

rubit_xxx161y ago

> Before 2010 I had this unquestioned assumption that links are supposed to last forever

90s_dev1y ago· 10 in thread

Some famous programmer once wrote about how links should last forever.

He advocated for /foo/bar with no extension. He was right about not using /foo/bar.php because the implementation might change.

But he was wrong, it should be /foo/bar.html because the end-result will always be HTML when it's served to a browser, whether it's generated by PHP, Node.js or by hand.

It's pointless to prepare for some hypothetical new browser that uses a language other than HTML.

Just use .html for your pages and stop worrying about how to correctly convert foo.md to foo/index.html and configure nginx accordingly.

Sesse__1y ago

> Some famous programmer once wrote about how links should last forever.

You're probably thinking of W3C's guidance: https://www.w3.org/Provider/Style/URI

> But he was wrong, it should be /foo/bar.html because the end-result will always be HTML

90s_dev1y ago

There's strong precedent for associating file extensions with content types. And it allows static files to map 1:1 to URLs.

I agree though that I was too harsh, I didn't realize it was written in 1998 when HTML was still new. I probably first read it around 2010.

But now that we have hindsight, I think it's safe to say .html files will continue to be supported for the next 50 years.

esafak1y ago

If it's always .html, it's cruft; get rid of it. And what if it's not HTML but JSON? Besides, does the user care? Berners-Lee was right.

90s_dev1y ago

If it's JSON then name it /foo/bar.json, and as a bonus you can also have /foo/bar.html!

You say the extension is cruft. That's your opinion. I don't share it.

https://github.com/sdegutis/bubbles

90s_dev1y ago

Found it: https://www.w3.org/Provider/Style/URI

Why did I think Joel Spolsky or Jeff Atwood wrote it?

crackalamoo1y ago

I use /foo/bar/ with the trailing slash because it works better with relative URLs for resources like images. I could also use /foo/bar/index.html but I find the former to be cleaner

90s_dev1y ago

It's always bothered me in a small way that github doesn't honor this:

https://github.com/sdegutis/bubbles/

No redirect, just two renders!

It bothers me first because it's semantically different.

Dwedit1y ago

mod_rewrite means you can redirect the .php page to something else if you stop using php.

shakna1y ago

Unless mod_rewrite is disabled, because it has had a few security bugs over the years. Like last year. [0]

[0] https://nvd.nist.gov/vuln/detail/CVE-2024-38475

account421y ago

It also means you can internally redirect the extension-less version to .php in the first place so you never have to change your public URL in the future.

diggan1y ago· 9 in thread

URLs (uniform resource locator) cannot ever last forever, as it's a location and locations can't last forever :)

90s_dev1y ago

I've been making websites for almost 30 years now.

I still don't know the difference between URI and URL.

I'm starting to think it doesn't matter.

Sesse__1y ago

It doesn't matter.

URI is basically a format and nothing else. (foo://bar123 would be a URI but not a URL because nothing defines what foo: is.)

urn:isbn:0451450523 (example cribbed from Wikipedia) is a URN (in addition to being a URI) that uniquely identifies a book, but doesn't tell you how to go find that book.

Mostly, the difference is pedantic, given that URNs never took off.
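To make the distinction concrete, here is a rough sketch using Python's `urllib.parse`. The `classify` helper and the `LOCATOR_SCHEMES` set are assumptions for illustration, not part of any standard: a real classification depends on which schemes your software knows how to dereference.

```python
from urllib.parse import urlparse

# Assumption: schemes this hypothetical client knows how to fetch.
LOCATOR_SCHEMES = {"http", "https", "ftp"}


def classify(uri: str) -> str:
    """Label a string as URN, URL, or plain URI based on its scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme == "urn":
        return "URN"   # names a thing, says nothing about where it is
    if scheme in LOCATOR_SCHEMES:
        return "URL"   # the scheme tells us how to go and fetch it
    return "URI"       # well-formed, but nothing defines how to resolve it


for example in ("https://example.com/index.html",
                "urn:isbn:0451450523",
                "foo://bar123"):
    print(example, "->", classify(example))
```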

layer81y ago

However, “URL” in the broader sense is used as an umbrella term for URIs and IRIs (internationalized resource identifiers), in particular by WHATWG.

In practice, what matters is the specific URI scheme (“http”, “doi”, etc.).

immibis1y ago

A URL tells you where to get some data, like https://example.com/index.html

A URN tells you which data to get (usually by hash or by some big centralized registry), but not how to get it. DOIs in academia, for example, or RFC numbers. Magnet links are borderline.

diggan1y ago

> I still don't know the difference between URI and URL.

One is a location, the other one is an ID. Which is which is referenced in the name :)

And sure, it doesn't matter as long as you're fine with referencing locations rather than the actual data, and aware of the tradeoffs.

marcosdumay1y ago

A URI is a standard way to write names of documents.

A URL is a URI that also tells you how to find the document.

postoplust1y ago

For example: IPFS URIs are content addresses

https://docs.ipfs.tech/

bowsamic1y ago

Does this have any actual grounding in reality or does your lack of suggestion for action confirm my suspicion that this is just a theoretical wish?

diggan1y ago

> Does this have any actual grounding in reality

If those bookmarks instead referenced the actual content (via content-addressing for example), rather than the location, then those would still work today.
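A toy sketch of what content-addressing means, assuming a plain SHA-256 hex digest as the address (IPFS uses its own identifier format, which this does not reproduce). The `ContentStore` class is hypothetical and in-memory only.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the address is the hash of the bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data and return its address (SHA-256 hex digest)."""
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Fetch data by address and verify it still matches that address."""
        data = self._blobs[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data
```

Because the address is derived from the content, any host holding the same bytes can answer the request, and the reader can check the answer without trusting the host.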

olalonde1y ago· 6 in thread

> This article was written by a human, but links were suggested by and grammar checked by an LLM.

This is the second time today I've seen a disclaimer like this. Looks like we're witnessing the start of a new trend.

tester7561y ago

It's crazy that people feel that they need to put such disclaimers

actuallyalys1y ago

layer81y ago

It’s more a claimer than a disclaimer. ;)

psychoslave1y ago

This comment was written by a human with no check by any automaton, but how will you check that?

chii1y ago

I don't find the need to have such a disclaimer at all.

If the content can stand on its own, then it is sufficient. If the content is slop, then why does it matter whether it is AI-generated slop or human-generated slop?

The only reason anyone wants to know/have the disclaimer is if they cannot themselves discern the quality of the contents, and are using AI generation as a proxy for (bad) quality.

For the author it matters. To which degree do they want to be associated with the resulting text.

https://en.m.wikipedia.org/wiki/Uniform_Resource_Name

s17n1y ago· 5 in thread

nonethewiser1y ago

>maybe we should build the technology around the assumption that infrastructure isn't permanent?

Yes. Also not using a url shortener as infrastructure.

dreamcompiler1y ago

URNs were supposed to solve that problem by separating the identity of the thing from the location of the thing.

But they never became popular and then link shorteners reimplemented the idea, badly.

hoppp1y ago

Yes.

domain names often change hands and a URL that is supposed to last forever can turn into a malicious phishing link over time.

emaro1y ago

In theory a content-addressed system like IPFS would be the best: if someone online still has a copy, you can get it too.

jjmarr1y ago

URLs identify the location of a resource on a network, not the resource itself, and so are not required to be permanent or unique. That's why they're called "uniform resource locators".

This problem was recognized in 1997 and is why the Digital Object Identifier was invented.

jimmyl021y ago· 5 in thread

This is a great perspective on how assumptions play out over a longer period of time. I think that this risk is much greater for free third-party services used as critical infrastructure.

Someone has to foot the bill somewhere and if there isn't a source of income then the project is bound to be unsupported eventually.

tekacs1y ago

I think I would struggle to say that free services die at a higher rate consistently…

So many paid offerings, whether from startups or even from large companies, have been sunset over time, often with frustratingly short migration periods.

If anything, I feel like I can think of more paid services that have given their users short migration periods than free ones.

cortesoft1y ago

Nah, businesses go under all the time, whether their services are paid or not.

lqstuart1y ago

Counterexample: the Linux kernel

charcircuit1y ago

How? Big tech foots the bill.

iainmerrick1y ago

Linux isn't a service (in the SaaS sense).

creatonez1y ago· 4 in thread

rs1861y ago

Shortening long URLs is the intended use case for a ... URL shortener.

The real abusers are the people who use a shortener to hide scam/spam/illegal websites behind a common domain and post it everywhere.

creatonez1y ago

These are not just "long URLs". These are URLs where the entire content is stored in the fragment suffix of the URL. They are blobs, and always have been.

nonethewiser1y ago

Didn't they just use the link shortener to compress the url? They used their url as the "database" (ie holding the compiler state).

Arcuru1y ago

They didn't store anything themselves since they encoded the full state in the urls that were given out. So the link shortener was the only place where the "database", the urls, were being stored.
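For readers unfamiliar with the pattern, a rough sketch of "the whole state lives in the URL fragment". This is not Compiler Explorer's actual encoding; the JSON + zlib + base64 scheme, the `BASE_URL` value, and the function names are all assumptions for illustration.

```python
import base64
import json
import zlib
from urllib.parse import urldefrag

BASE_URL = "https://example.com/"  # hypothetical site


def state_to_url(state: dict) -> str:
    """Serialize state to JSON, compress it, and put it in the fragment."""
    raw = json.dumps(state, separators=(",", ":"), sort_keys=True).encode("utf-8")
    packed = base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")
    return f"{BASE_URL}#{packed}"


def url_to_state(url: str) -> dict:
    """Recover the state from the fragment; no server-side lookup needed."""
    fragment = urldefrag(url).fragment
    raw = zlib.decompress(base64.urlsafe_b64decode(fragment))
    return json.loads(raw)
```

The server stores nothing, which is the appeal, but the resulting URLs grow with the content. Handing them to a third-party shortener makes that shortener the only place the state is actually kept.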

layer81y ago· 4 in thread

I find it somewhat surprising that it’s worth the effort for Google to shut down the read-only version. Unless they fear some legal risks of leaving redirects to private links online.

actuallyalys1y ago

Scaevolus1y ago

Typically services like these are side projects of just a few Google employees, and when the last one leaves they are shut down.

mbac327681y ago

yeah but nobody wants to put "spent two months migrating goo.gl url shortener to work with Sisyphus release manager and Dante 7 SRE monitoring" in their perf packet

that's a negative credit activity

mmooss1y ago

swyx1y ago· 4 in thread

idk man how can URLs last forever if it costs money to keep a domain name alive?

i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

Historians however would love to have more garbage from history, to get more insights on "real" life rather than just the parts one considered worth keeping.

mrguyorama1y ago

I regularly wonder if modern educated people do not journal as much as previous century educated people who were kind of rare.

Maybe we should get a journaling boom going.

But it has to be written, because pen and paper is literally ten times more durable than even good digital storage.

[1] https://wiki.archiveteam.org/index.php/Goo.gl

swyx1y ago

we'd keep the curiosities around, like so much Ea Nasir Sells Shit Copper. we have room for like 5-10 of those per century. not like 8 billion. much of life is mundane.

internetter1y ago

> i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.

agreed. formerly wrote some thoughts here: https://boehs.org/node/internet-evanescence

mananaysiempre1y ago· 3 in thread

May be worth cooperating with ArchiveTeam’s project[1] on Goo.gl?

> url shortening was a fucking awful idea[2]

[2] https://wiki.archiveteam.org/index.php/URLTeam

MallocVoidstar1y ago

IIRC ArchiveTeam were bruteforcing Goo.gl short URLs, not going through 'known' links, so I'd assume they have many/all of Compiler Explorer's URLs. (So, good idea to contact them)

tech234a1y ago

Real-time status for that project indicates 7.5 billion goo.gl URLs found out of 42 billion goo.gl URLs scanned: https://tracker.archiveteam.org:1338/status

mattgodbolt1y ago

Thanks! Someone posted on GitHub about that and I'll be looking at that tomorrow!

wrs1y ago· 3 in thread

mattgodbolt1y ago

We've done alright so far: 13 years this week. I have funding for another year and change even assuming growth and all our current sponsors pull out.

I /am/ thinking about a foundation or similar though: the single point of failure is not funding but "me".

badmintonbaseba1y ago

Well, that's true, but at least now compiler explorer links will stop working when compiler explorer vanishes, but not before that.

layer81y ago

Thanks to the no-hiding theorem, the information will live forever. ;)

amiga3861y ago· 2 in thread

https://killedbygoogle.com/

> Google Go Links (2010–2021)

zerocrates1y ago

"Killing" the service in the sense of minting new ones is no big deal and hardly merits mention.

Killing the existing ones is much more of a jerk move. Particularly so since Google is still keeping it around in some form for internal use by their own apps.

ruune1y ago

Don't they use https://g.co now? Or are there still new internal goo.gl links created?

sebstefan1y ago· 2 in thread

>Google (using their web search API)

>GitHub (using their API)

>Our own (somewhat limited) web logs

>The archive.org Stack Overflow data dumps

>Archive.org’s own list of archived webpages

You're an angel Matt

mattgodbolt1y ago

Thanks! It's been a fun learning experience. I just found out the internet archive has a much more comprehensive effort going so it might have been in vain, but I tried :)

account421y ago

What really matters is caring about keeping the links going in the first place. Most website operators never really get that far. So, thanks for caring.

mbac327681y ago· 2 in thread

it seems a bit crazy to try to avoid storing a relatively small amount of data when a link is shared when storage costs and bandwidth costs are rapidly dropping with time

but perhaps I don't appreciate how much traffic godbolt gets

mattgodbolt1y ago

It was a simpler time and I didn't want the responsibility of storing other people's data. We do now though!

mattgodbolt1y ago

Oh and traffic: https://stats.compiler-explorer.com/

sdf4j1y ago· 2 in thread

> One of my founding principles is that Compiler Explorer links should last forever.

And yet… that was a very self-destructive decision.

mattgodbolt1y ago

I'm not sure why so?

MyPasswordSucks1y ago

Because URL shortening is relatively trivial to implement, and instead of just doing so on their own end, they decided to rely on a third-party service.
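As a sketch of what "relatively trivial" could look like, assuming an in-memory table and sequential base-62 IDs. The `Shortener` class and `encode_base62` helper are hypothetical; a real service would also need persistent storage, abuse handling, and someone to keep it running.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters


def encode_base62(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class Shortener:
    """Toy shortener: maps sequential base-62 IDs to long URLs."""

    def __init__(self) -> None:
        self._by_id: dict[str, str] = {}
        self._by_url: dict[str, str] = {}

    def shorten(self, long_url: str) -> str:
        """Return the existing ID for a URL, or mint the next one."""
        if long_url in self._by_url:
            return self._by_url[long_url]
        short_id = encode_base62(len(self._by_id))
        self._by_id[short_id] = long_url
        self._by_url[long_url] = short_id
        return short_id

    def resolve(self, short_id: str) -> str:
        """Look up the long URL; raises KeyError for unknown IDs."""
        return self._by_id[short_id]
```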