Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before to check if my memory was correct, or if a page changed since I last saw it, or to figure out when I last saw something online.
You don't need a special viewer for it, as your operating system's file explorer can view the screenshots already, and you don't need to set up a crawl. Screenshots also compress well, as WebP or as PNG after crunching them.
The most efficient format to store a sequence of screenshots in is video, because most of them will have heavily overlapping data.
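As a rough sketch of that idea, the snippet below only builds an ffmpeg command line that would pack a folder of PNG screenshots into an H.264 video; it does not run it. The folder name, output name, and frame rate are made up for illustration, and it assumes all screenshots share the same (even) dimensions and that your ffmpeg build supports glob input patterns, which not every build does.

```python
import shlex
from pathlib import Path


def build_ffmpeg_command(screenshot_dir, output_file, fps=1):
    """Return an ffmpeg argument list that encodes PNG screenshots as video.

    Assumes every PNG in screenshot_dir has the same even width and height.
    """
    pattern = str(Path(screenshot_dir) / "*.png")
    return [
        "ffmpeg",
        "-framerate", str(fps),      # screenshots shown per second of video
        "-pattern_type", "glob",     # expand the *.png pattern inside ffmpeg
        "-i", pattern,
        "-c:v", "libx264",           # inter-frame codec exploits the overlap
        "-pix_fmt", "yuv420p",       # widely playable pixel format
        output_file,
    ]


# Hypothetical folder and output names, for illustration only.
cmd = build_ffmpeg_command("screenshots", "timeline.mp4")
print(shlex.join(cmd))
```

The trade-off is that you lose the "browse it in your file explorer" convenience: pulling a single screenshot back out means seeking in the video.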
Easy block list for sensitive things like banking, internal sites, email, etc.
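A block list like that could be as simple as a hostname check before anything gets captured. This is a minimal sketch, not how any existing tool does it; the domains and the `is_blocked` helper are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical block list: banking, email, internal sites.
BLOCKED_DOMAINS = {"mybank.example", "mail.example", "intranet.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


print(is_blocked("https://login.mybank.example/account"))
print(is_blocked("https://example.com/article"))
```

Matching on the full domain suffix (with the leading dot) avoids accidentally blocking unrelated hosts that merely end in the same letters.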
I currently use the Session Buddy Chrome plugin, which helps in some cases (I was able to find a hard-to-Google repo today, for example), but the historic context is largely missing.
archivebox oneshot --extract=screenshot 'https://example.com'
or archivebox add --extract=screenshot < ~/Desktop/browser_bookmarks.html
Wouldn't it be more useful and take less space to use SingleFile?
It makes a web archive from everything you browse, and lately I've been working on full-text search.
By extracting only text and article images you could go deep into an archive. If you skip images, even more so.
Saving an article-only view (images + text) would probably do better.
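To give a feel for the text-only end of that spectrum, here is a small standard-library sketch that keeps the visible text of a page and drops scripts and styles. It is an illustration under simple assumptions (well-formed HTML, no article detection), not what any of the tools in this thread actually do.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(extract_text(page))
```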
I suspect your numbers come from JavaScript, CSS, etc.? Is there a way for ArchiveBox to not download React 5000 times but share source files? Most likely the custom bundles that sites compile will make this impossible most of the time. Just thinking out loud here.
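One generic way to share identical files across snapshots is content-addressed storage: key each asset by a hash of its bytes, so a byte-identical file is stored once no matter how many pages reference it. The sketch below is purely hypothetical (the `AssetStore` class and URLs are invented) and says nothing about how ArchiveBox stores data today. It also shows the limit mentioned above: a site-specific bundle hashes differently and gets its own copy.

```python
import hashlib


class AssetStore:
    """Toy in-memory store that keeps one copy of each distinct asset."""

    def __init__(self):
        self.blobs = {}      # sha256 hex digest -> asset bytes
        self.snapshots = {}  # snapshot id -> {asset URL: digest}

    def add(self, snapshot_id, asset_url, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # store bytes only once
        self.snapshots.setdefault(snapshot_id, {})[asset_url] = digest
        return digest

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = AssetStore()
react = b"/* pretend this is a shared react.min.js */"
bundle = b"/* pretend this is a site-specific bundle */"
store.add("snap-1", "https://cdn.example/react.min.js", react)
store.add("snap-2", "https://cdn.example/react.min.js", react)
store.add("snap-2", "https://site.example/app.js", bundle)
print(len(store.blobs), store.stored_bytes())
```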
It seems like a lot of people in this thread have an interest in retaining a "replayable timeline" of their own browsing/reading history.
There's probably enough support here to gather a few contributors for an open source project.
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox
I'm wondering if this could be applied here.
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...
I don't want to officially endorse using the Google bot user agent, but you're welcome to try it on your own and see if it improves the experience.
If you want 1 page per URL I recommend SingleFile.
Lots more info here if you want to compare different software options: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
My solution, as a heavy TTS user, is Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions to auto-copy highlighted text and append URLs, which makes the entire flow straightforward. Each day's log is around 1-5 MB of text, saved in a big folder. The biggest limitation is running advanced searches over unstructured text files by complex keywords within date ranges. I'm sure I could set up each clip with delimiters so the logs can be imported into a searchable DB; I'm just too lazy.
If you like ArchiveBox, check out our new Twitter account for the project at https://twitter.com/ArchiveBoxApp (we just opened it). We'll be posting announcements and prerelease sneak peeks there in the future.
Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" than anything else. Google definitely ignores it. Dunno what DuckDuckGo does.
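For what it's worth, the delimiter idea doesn't need much code. The sketch below assumes a made-up log format where every clip starts with a header line of the form `=== <ISO timestamp> | <url> ===`; that format, the table layout, and the function names are all hypothetical. It loads the clips into SQLite and searches by keyword within a date range.

```python
import re
import sqlite3

# Assumed (hypothetical) clip header: === 2021-03-04T10:15:00 | https://... ===
HEADER = re.compile(r"^=== (\S+) \| (\S+) ===$")


def parse_clips(log_text):
    """Split a log into (timestamp, url, body) tuples using the header lines."""
    clips, current = [], None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {"ts": match.group(1), "url": match.group(2), "lines": []}
            clips.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [(c["ts"], c["url"], "\n".join(c["lines"]).strip()) for c in clips]


def load(conn, clips):
    conn.execute("CREATE TABLE IF NOT EXISTS clips (ts TEXT, url TEXT, body TEXT)")
    conn.executemany("INSERT INTO clips VALUES (?, ?, ?)", clips)


def search(conn, keyword, start, end):
    """Find clips containing keyword with start <= ts < end (ISO strings)."""
    rows = conn.execute(
        "SELECT ts, url FROM clips "
        "WHERE body LIKE ? AND ts >= ? AND ts < ? ORDER BY ts",
        (f"%{keyword}%", start, end),
    )
    return rows.fetchall()


SAMPLE = """=== 2021-03-04T10:15:00 | https://example.com/a ===
ArchiveBox saves snapshots of pages.
=== 2021-03-09T08:00:00 | https://example.com/b ===
Notes about full text search.
=== 2021-04-01T12:00:00 | https://example.com/c ===
More notes about search.
"""

conn = sqlite3.connect(":memory:")
load(conn, parse_clips(SAMPLE))
results = search(conn, "search", "2021-03-01", "2021-04-01")
print(results)
```

ISO-style timestamps sort correctly as plain strings, which is why the date filter works without any date parsing. For multi-keyword queries over years of logs, a proper full-text index would likely be the next step.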