It's easy to set up, but be warned, it takes up a lot of disk space.
$ du -h ~/archive/webpages
1.1T /home/andrew/archive/webpages
https://github.com/gildas-lormeau/SingleFile
1. find a way to dedup media
2. ensure content blockers are doing well
3. for news articles, put it through readability and store the markdown instead. if you wanted to be really fancy, instead you could attempt to programmatically create a "template" of sites you've visited with multiple endpoints so the style is retained but you're not storing the content. alternatively a good compression algo could do this, if you had your directory like /home/andrew/archive/boehs.org.tar.gz and inside of the tar all the boehs.org pages you visited are saved
4. add fts and embeddings over the pages
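For point 1, a rough sketch of what "dedup media" could look like, assuming the media ends up as separate files on disk (SingleFile normally inlines it into the page, so this only applies if you extract it first). `find_duplicates` is a made-up helper name, not part of any tool mentioned here; it just groups byte-identical files by SHA-256.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by SHA-256 and keep only groups with 2+ members."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file fully into memory; fine for a sketch,
            # chunked hashing would be kinder to very large media files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

What you do with each group (hard links, a shared blob directory, a dedup-aware filesystem) is a separate decision; this only tells you where the savings are.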
A couple of questions:
- do you store them compressed or plain?
- what about private info like bank accounts or health insurance?
I guess for privacy one could train oneself to use private browsing mode.
Regarding compression, for thousands of files don't all those self-extraction headers add up? Wouldn't there be space savings by having a global compression dictionary and only storing the encoded data?
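To make the shared-dictionary idea concrete, here is a minimal sketch using the preset dictionary support in Python's `zlib` (the `zdict` argument). The sample boilerplate and page are invented for illustration, and the helper names are mine. zstd's trained dictionaries are the heavier-duty version of the same idea.

```python
import zlib

# Hypothetical boilerplate shared by every saved page of one site.
SHARED = (
    b"<html><head><title>Example site</title>"
    b"<link rel='stylesheet' href='/style.css'></head>"
    b"<body><nav>Home | About | Archive</nav>"
)
PAGE = SHARED + b"<p>The only unique part of this page.</p></body></html>"


def compress_with_dict(data: bytes, shared: bytes) -> bytes:
    """Deflate data, letting it back-reference bytes in the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=shared)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, shared: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the exact same dictionary."""
    decompressor = zlib.decompressobj(zdict=shared)
    return decompressor.decompress(blob) + decompressor.flush()


blob = compress_with_dict(PAGE, SHARED)
print(len(PAGE), len(zlib.compress(PAGE, 9)), len(blob))
```

The catch is that the dictionary becomes part of the archive: lose or change it and every file compressed against it is unreadable.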
- Do you also archive logged in pages, infinite scrollers, banking sites, fb etc?
- How many entries is that?
- How often do you go back to the archive? is stuff easy to find?
- do you have any organization or additional process (eg bookmarks)?
did you try integrating it with llms/rag etc yet?
A technological conundrum, however, is the fact that I have no way to prove that my archive is an accurate representation of a site at a point in time. Hmmm, or maybe I do? Maybe something funky with cert chains could be done.
What?? I am a heavy user of the Internet Archive services, not just the Wayback Machine, including official and "unofficial" clients and endpoints, and I had absolutely no idea the extension could do this.
To bulk archive I would manually do it via the web interface or batch automate it. The limitations of manually doing it one by one are obvious, and the limitation of doing it in batches is that it requires, well, keeping batches (lists).
As for being still alive, by that measure hardly anything anyone does is important in the modern world. It's pretty hard to fail at thinking or remembering so badly that it becomes a life-or-death thing.
“Why don’t you just” is a red flag now for me.
Also, open tabs should work the same way: I never want to see a network error while going back to a tab while not having an internet connection because the browser has helpfully evicted that tab from memory. It should just reload the state from disk instead of the network in this case until I manually refresh the page.
turns out that unlike most webpages, the pdf version is only a single page of what is visible on screen.
turns out also that opening the warc immediately triggers a js redirect that is planted in the page. i can still extract the text manually - it’s embedded there - but i cannot “just open” the warc in my browser and expect an offline “archive” version - im interacting with a live webpage! this sucks from all sides - usability, privacy, security.
Admittedly, i don’t use webrecorder - does it solve this problem? did you verify?
My phone browser has a "reader view" popup but it only appears sometimes, and usually not on pages that need it!
Edit: Just installed w3m in Termux... the things we can do nowadays!
It's for bibliographies, but it also archives and stores web pages locally with a browser integration.
At a fundamental level, broken website links and dangling pointers in C are the same.
Not that I don't think there is some benefit in what you are attempting, of course. A similar thing I still wish I could do is to "archive" someone's phone number from my contact list. Be it a number that used to be ours, or family/friends that have passed.
Any site/company whatsoever of this world (and most) that promises that anything will last forever is seriously deluded or intentionally lying, unless their theory of time is different than that of the majority.
He advocated for /foo/bar with no extension. He was right about not using /foo/bar.php because the implementation might change.
But he was wrong, it should be /foo/bar.html because the end-result will always be HTML when it's served by a browser, whether it's generated by PHP, Node.js or by hand.
It's pointless to prepare for some hypothetical new browser that uses an alternate language other than HTML and that doesn't use HTML.
Just use .html for your pages and stop worrying about how to correctly convert foo.md to foo/index.html and configure nginx accordingly.
You're probably thinking of W3C's guidance: https://www.w3.org/Provider/Style/URI
> But he was wrong, it should be /foo/bar.html because the end-result will always be HTML
20 years ago, it wasn't obvious at all that the end-result would always be HTML (in particular, various styled forms of XML were thought to eventually take over). And in any case, there's no reason to have the content-type in the URL; why would the user care about that?
I agree though that I was too harsh, I didn't realize it was written in 1998 when HTML was still new. I probably first read it around 2010.
But now that we have hindsight, I think it's safe to say .html files will continue to be supported for the next 50 years.
You say the extension is cruft. That's your opinion. I don't share it.
Why did I think Joel Spolsky or Jeff Atwood wrote it?
https://github.com/sdegutis/bubbles
https://github.com/sdegutis/bubbles/
No redirect, just two renders!
It bothers me first because it's semantically different.
Second and more importantly, because it's always such a pain to configure that redirect in nginx or whatever. I eventually figure it out each time, after many hours wasted looking it up all over again and trial/error.
URIs however, can be made to last forever! Also comes with the added benefit that if you somehow integrate content-addressing into the identifier, you'll also be able to safely fetch it from any computer, hostile or not.
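A minimal sketch of why content-addressing makes the source irrelevant: if the identifier is derived from the bytes, whoever hands you the bytes can be checked against it. The `sha256:` prefix and both function names are assumptions made up for this example, not any particular system's format.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the content itself (assumed format: sha256:<hex>)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify(identifier: str, data: bytes) -> bool:
    """True only if data is exactly the content the identifier was made from."""
    return content_id(data) == identifier


original = b"<html>the page as I saw it</html>"
identifier = content_id(original)

# Bytes from any mirror can be accepted or rejected without trusting the mirror.
print(verify(identifier, original))
print(verify(identifier, original + b"<script>evil()</script>"))
```

The tradeoff is that the identifier can never point at an updated version; a change in content is a different identifier by definition.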
I still don't know the difference between URI and URL.
I'm starting to think it doesn't matter.
URI is basically a format and nothing else. (foo://bar123 would be a URI but not a URL because nothing defines what foo: is.)
URLs and URNs are thingies using the URI format; https://news.ycombinator.com is a URL (in addition to being a URI) because there's an RFC that specifies what https: means and how to go out and fetch them.
urn:isbn:0451450523 (example cribbed from Wikipedia) is a URN (in addition to being a URI) that uniquely identifies a book, but doesn't tell you how to go find that book.
Mostly, the difference is pedantic, given that URNs never took off.
However, “URL” in the broader sense is used as an umbrella term for URIs and IRIs (internationalized resource identifiers), in particular by WHATWG.
In practice, what matters is the specific URI scheme (“http”, “doi”, etc.).
A URN tells you which data to get (usually by hash or by some big centralized registry), but not how to get it. DOIs in academia, for example, or RFC numbers. Magnet links are borderline.
URIs are either URLs or URNs. URNs are rarely used since they're less practical since browsers can't open them - but note that in any case each URL scheme (https) or URN scheme (doi) is unique - there's no universal way to fetch one without specific handling for each supported scheme. So it's not actually that unusual for a browser not to be able to open a certain scheme.
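The "specific handling for each supported scheme" point can be sketched with the standard library's `urlsplit`, which happily parses all three examples from this thread but knows nothing about how to resolve them. The `RESOLVERS` table and `resolve` function are hypothetical; they only describe what a client would do rather than doing it.

```python
from urllib.parse import urlsplit

# Hypothetical per-scheme handlers; a real client would fetch or look up here.
RESOLVERS = {
    "https": lambda parts: f"fetch from host {parts.netloc}",
    "urn": lambda parts: f"look up {parts.path} in a registry",
}


def resolve(uri: str) -> str:
    """Parse any URI, then dispatch on its scheme if we have a handler for it."""
    parts = urlsplit(uri)
    resolver = RESOLVERS.get(parts.scheme)
    if resolver is None:
        # Syntactically a URI, but nothing here defines what the scheme means.
        return f"no handler for scheme {parts.scheme!r}"
    return resolver(parts)


for uri in ("https://news.ycombinator.com", "urn:isbn:0451450523", "foo://bar123"):
    print(uri, "->", resolve(uri))
```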
One is a location, the other one is an ID. Which is which is referenced in the name :)
And sure, it doesn't matter as long as you're fine with referencing locations rather than the actual data, and aware of the tradeoffs.
And a URL is a URI that also tells you how to find the document.
Depends on your use case I suppose. For things I want to ensure I can reference forever (theoretical forever), then using location for that reference feels less than ideal, I cannot even count the number of dead bookmarks on both hands and feet, so "link rot" is a real issue.
If those bookmarks instead referenced the actual content (via content-addressing for example), rather than the location, then those would still work today.
But again, not everyone cares about things sticking around, not all use cases require the reference to continue being alive, and so on, so if it's applicable to you or not is something only you can decide.
This is the second time today I've seen a disclaimer like this. Looks like we're witnessing the start of a new trend.
If the content can stand on its own, then it is sufficient. If the content is slop, then why does it matter whether it is ai generated slop or human generated slop?
The only reason anyone wants to know/have the disclaimer is if they cannot themselves discern the quality of the contents, and are using ai generation as a proxy for (bad) quality.
And I differentiate between "Matt Godbolt" who is an expert in some areas and in my experience careful about avoiding wrong information and an LLM which may produce additional depth, but may also make up things.
And well, "discern the quality of the contents" - I often read texts to learn new things. On new things I don't have enough knowledge to qualify the statements, but I may have experience with regards to the author or publisher.
Yes. Also not using a url shortener as infrastructure.
But they never became popular and then link shorteners reimplemented the idea, badly.
domain names often change hands and a URL that is supposed to last forever can turn into a malicious phishing link over time.
This problem was recognized in 1997 and is why the Digital Object Identifier was invented.
Someone has to foot the bill somewhere and if there isn't a source of income then the project is bound to be unsupported eventually.
So many paid offerings, whether from startups or even from large companies, have been sunset over time, often with frustratingly short migration periods.
If anything, I feel like I can think of more paid services that have given their users short migration periods than free ones.
The real abusers are the people who use a shortener to hide scam/spam/illegal websites behind a common domain and post it everywhere.
that's a negative credit activity
i also wonder if url death could be a good thing. humanity makes special effort to keep around the good stuff. the rest goes into the garbage collection of history.
If I could time jump it would be interesting to see how historians in a thousand years will look back at our period where a lot of information will just disappear without a trace as digital media rots.
Maybe we should get a journaling boom going.
But it has to be written, because pen and paper is literally ten times more durable than even good digital storage.
agreed. formerly wrote some thoughts here: https://boehs.org/node/internet-evanescence
> url shortening was a fucking awful idea[2]
I /am/ thinking about a foundation or similar though: the single point of failure is not funding but "me".
I think the most valuable long-living compiler explorer links are in bug reports. I like to link to compiler explorer in bug reports for convenience, but I also include the code in the report itself, and specify what compiler I used with what version to reproduce the bug. I don't expect compiler explorer to vanish anytime soon, but making bug reports self-contained like this protects against that.
> Google Go Links (2010–2021)
> Killed about 4 years ago, (also known as Google Short Links) was a URL shortening service. It also supported custom domain for customers of Google Workspace (formerly G Suite (formerly Google Apps)). It was about 11 years old.
Killing the existing ones is much more of a jerk move. Particularly so since Google is still keeping it around in some form for internal use by their own apps.
Edit: Google is using a g.co link on the "Your device is booting another OS" screen that appears when booting up my Pixel running GrapheneOS. Will be awkward when they kill that service and the hard coded link in the phone's BIOS is just dead
>Google (using their web search API)
>GitHub (using their API)
>Our own (somewhat limited) web logs
>The archive.org Stack Overflow data dumps
>Archive.org’s own list of archived webpages
You're an angel Matt
but perhaps I don't appreciate how much traffic godbolt gets
And yet… that was a very self-destructive decision.
Considering link permanence was a "founding principle", that's just unbelievably stupid. If I decide one of my "founding principles" is that I'm never going to show up at work with a dirty windshield, then I shouldn't rely on the corner gas station's squeegee and cleaning fluid.
It's become so trite to mention that I'm rolling my eyes at myself just for bringing it up again but... come on! How bad can it be before Google do something about the reputation this behaviour has created?
Was Stadia not an expensive enough failure?
For other obsolete apps and services, you can argue that they require some continual maintenance and upkeep, so keeping them around is expensive and not cost-effective if very few people are using them.
But a URL shortener is super simple! It's just a database, and in this case we don't even need to write to it. It's literally one of the example programs for AWS Lambda, intentionally chosen because it's really simple.
I guess the goo.gl link database is probably really big, but even so, this is Google! Storage is cheap! Shutting it down is such a short-sighted mean-spirited bean-counter decision, I just don't get it.
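For a sense of how little a frozen, read-only shortener has to do, here is a sketch. The table contents, the `lookup` name, and the choice of 301/404 status codes are my assumptions for illustration; this says nothing about how goo.gl is actually built or what operating it at that scale costs.

```python
from types import MappingProxyType
from typing import Optional

# Hypothetical frozen link table; MappingProxyType makes it read-only.
LINKS = MappingProxyType({
    "abc123": "https://example.com/some/very/long/path?with=query",
})


def lookup(short_code: str) -> tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) for a short code; never writes."""
    target = LINKS.get(short_code)
    if target is None:
        return 404, None
    return 301, target


print(lookup("abc123"))
print(lookup("does-not-exist"))
```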
We should be using more of them.
The Rust playground uses GitHub Gists as the primary storage location for shared data. I'm dreading the day that I need to migrate everything away from there to something self-maintained.
I've pondered that a lot in my system design which bears some resemblance to the principles of REST.
I have split resources in ephemeral (and mutable), and immutable, reference counted (or otherwise GC-ed), which are persistent while referred to, but collected when no one refers to them.
In a distributed system the former is the default, the latter can exist in little islands of isolated context.
You can't track references throughout the entire world. The only thing that works is timeouts. But those are not reliable. Nor can you exist forever, years after no one needs you. A system needs its parts to be useful, or it dies full of useless parts.
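A toy version of the "immutable, reference counted" half of that split, as it might look inside one isolated context where every reference really can be tracked. `RefCountedStore` and its method names are invented for this sketch and are not taken from the system described above.

```python
class RefCountedStore:
    """Immutable values that persist only while something still refers to them."""

    def __init__(self) -> None:
        self._values: dict[str, bytes] = {}
        self._refs: dict[str, int] = {}

    def put(self, key: str, value: bytes) -> None:
        """Store a new value with one initial reference; keys are never overwritten."""
        if key in self._values:
            raise KeyError(f"{key!r} already stored; values are immutable")
        self._values[key] = value
        self._refs[key] = 1

    def retain(self, key: str) -> bytes:
        """Register one more referrer and hand back the value."""
        self._refs[key] += 1
        return self._values[key]

    def release(self, key: str) -> None:
        """Drop one reference; the value is collected when the count hits zero."""
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._values[key]
            del self._refs[key]

    def __contains__(self, key: str) -> bool:
        return key in self._values
```

This only works because every `retain` and `release` goes through the same object. Across the open web there is no such chokepoint, which is where the timeouts come in.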
I'm pretty sure the lore says that a solemn promise from Google carries the exact same value as a prostitute saying she likes you.
Where URLs may last longer is where they are not used for the RL bit. But more like a UUID for namespacing. E.g. in XML, Java or Go.