Make Your Own Internet Archive with Archive Box

(nixintel.info)

257 points · adamhearn · 5y ago · 77 comments

59 comments · 17 top-level

lazyjeff · 5y ago · 10 in thread

I feel like a simple automatic capture of timestamp + url + screenshot would already be very useful. This gives you a visual memory of the things you've seen on the web. I've wanted to develop this for a while, as a browser plugin.

Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before to check if my memory was correct, or if a page changed since I last saw it, or to figure out when I last saw something online.

You don't need a special viewer for it, as your operating system's file explorer can view the screenshots already, and you don't need to set up a crawl. Screenshots also compress well, as webp or png after crunching them.
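A minimal sketch of the "timestamp + url + screenshot" idea, assuming each capture is stored as a single image file whose name carries the timestamp and a shortened form of the URL, so that a plain file explorer sorts the captures chronologically. The function name `capture_name` and the naming scheme are hypothetical; taking the screenshot itself (for example from a browser extension) is not shown.

```python
import re
from datetime import datetime, timezone


def capture_name(url: str, when: datetime) -> str:
    """Build a sortable file name from a capture time and a URL (hypothetical scheme)."""
    # A fixed-width UTC timestamp first, so alphabetical order equals time order.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Drop the scheme, then collapse every run of non-alphanumerics into one dash.
    without_scheme = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", without_scheme).strip("-")[:80]
    return f"{stamp}_{slug}.png"


when = datetime(2021, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(capture_name("https://example.com/a/b?x=1", when))
```

The slug is lossy (different URLs can collapse to the same name), so a real tool would likely keep the full URL in a sidecar file or an index.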

andai · 5y ago

A few years ago, in an attempt to increase productivity, I used a screen recorder that took a screenshot every 10 seconds and played it back at the end of every day. So I had a timelapse of how I was spending my time -- mostly online. It was very enlightening.

The most efficient format to store a sequence of screenshots in is video, because most of them will have heavily overlapping data.

nstart · 5y ago

Huh. That's a pretty nifty thing to do. Just wrote a python script to do that for me and it's running in the background right now. Shall be interesting to come back to it today in the evening. Do you recall how many seconds each screenshot would show itself in the final video (basically what was the framerate?). Currently considering about 4 frames per second but would love to get your take on it :)
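The framerate question comes down to simple arithmetic. A rough sketch, assuming an 8-hour recording day (that figure is an assumption, not something stated in the thread) and the 10-second capture interval mentioned above:

```python
def timelapse_seconds(hours_recorded: float, capture_interval_s: float, playback_fps: float) -> float:
    """Length of the playback video in seconds."""
    # Number of screenshots taken during the recording period.
    frames = hours_recorded * 3600 / capture_interval_s
    # Each second of video consumes `playback_fps` screenshots.
    return frames / playback_fps


# Compare the proposed 4 fps with two faster playback rates.
for fps in (4, 10, 30):
    minutes = timelapse_seconds(8, 10, fps) / 60
    print(f"{fps} fps -> {minutes:.1f} minutes of video")
```

Under those assumptions, 4 frames per second turns a day into roughly twelve minutes of video, so a higher framerate may be more comfortable for a daily review.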

jumploops · 5y ago

I've dreamed about this as well, basically a personalized FullStory that allows you to search and replay all of your sessions across sites.

Easy block list for sensitive things like banking, internal sites, email, etc.

I currently use the Session Buddy Chrome Plugin, which helps in some cases (I was able to find a hard to Google repo today, for example), but the historic context is largely missing.

nikisweeting · 5y ago

Sounds like this might be what you want if you only want screenshots + timestamps and nothing else:

    archivebox oneshot --extract=screenshot 'https://example.com'

    archivebox add --extract=screenshot < ~/Desktop/browser_bookmarks.html

smnrchrds · 5y ago

> screenshot

Wouldn't it be more useful and take less space to use SingleFile?

lazyjeff · 5y ago

You'd think so, but code (even web code) needs to be executed and is brittle. Some of it doesn't even work as an archive even right after saving. All my images from 10-20 years ago work perfectly today. None of my code does without some major effort.

bravura · 5y ago

This doesn't allow full text search easily, though.

rhizome · 5y ago

PDF with an image on one page, then the plain text of the page flowed over following pages.

BrianOnHN · 5y ago

+ (text minus stop words)
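As a rough illustration of "text minus stop words", here is a minimal sketch that reduces a page's plain text to the terms worth indexing. The stop-word list is a tiny illustrative sample and `index_terms` is a hypothetical helper name; a real setup would use a much longer list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def index_terms(page_text: str) -> list[str]:
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    return [word for word in words if word not in STOP_WORDS]


print(index_terms("The history of the Memex is on Wikipedia"))
```

Storing only these terms next to the screenshot keeps the archive small while still letting a plain text search find the page.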

ramraj07 · 5y ago

ocr?

zeckalpha · 5y ago · 6 in thread

How long until this is a feature baked into a mainstream web browser? Archive, prefetch, cache, all variants on a theme. History, bookmarks, local search engine, all the same.

andai · 5y ago

I often wish that I could do a full text search of every page I've already visited.

lloeki · 5y ago

Safari’s history used to crudely do that somewhat. You could open up the history and search for any word that appeared on a page you browsed and it would filter it to list the matching pages.

Pamar · 5y ago

Not exactly what you are asking here (if I understand correctly) but I have been using historio.us for a year or so and I am happy with it.

mail2merge · 5y ago

I'm working on that in my "self host the internet offline from your browsing history" project

https://github.com/c9fe/22120

It makes a web archive from everything you browse, and lately I've been working on the full text search

rhizome · 5y ago

Forever. Site owners will shit purple Twinkies and put the developers under a copyright cosh if the feature is released or not removed.

zeckalpha · 5y ago

It could respect caching headers, just make it visible in the UI and show “expired” content if current content is unavailable. I don’t see how this would be an issue for site owners, they could adjust their content headers if they want something different.
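A minimal sketch of how a browser-side archive might decide whether to label a saved copy as "expired", assuming it only looks at the `Cache-Control` response header and the time the copy was fetched. The function `is_expired` is hypothetical, and treating a missing `max-age` as expired is an assumption made for this illustration.

```python
import re
from datetime import datetime, timedelta, timezone


def is_expired(cache_control: str, fetched_at: datetime, now: datetime) -> bool:
    """Return True when a saved copy is older than its Cache-Control max-age."""
    directives = cache_control.lower()
    # The site asked for no reuse without revalidation: always flag the copy.
    if "no-store" in directives or "no-cache" in directives:
        return True
    match = re.search(r"max-age=(\d+)", directives)
    if match is None:
        return True  # assumption: no freshness information means "expired"
    return now - fetched_at > timedelta(seconds=int(match.group(1)))


fetched = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(is_expired("public, max-age=3600", fetched, fetched + timedelta(minutes=30)))
print(is_expired("public, max-age=3600", fetched, fetched + timedelta(hours=2)))
```

The expired copy would still be shown when the live page is unavailable; the flag only drives the label in the UI.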

evc · 5y ago · 5 in thread

You will need a lot of disk storage right?

LEARAX · 5y ago

There are different extractors/services, and you can toggle them pretty easily. By default it screenshots everything, exports a PDF, saves like 4 different HTML copies and submits the link to the wayback machine. It also tries to extract important text, and stores that separately. You could easily configure it to only extract text, turn off some HTML extractors, or disable the PDF and screenshot captures if you want to prioritize disk space.

flas9sd · 5y ago

it doesn't show in the Screenshot in the article, but ArchiveBox in Aug 2020 implemented the "readability article text extractor", see description in the release notes: https://github.com/pirate/ArchiveBox/releases/tag/v0.4.14 and the module that does the work https://github.com/pirate/readability-extractor

By only extracting text and article images you could go deep into an archive. If you skip images, much more so

Ace_Archer · 5y ago

That probably depends on the scope of what you're looking to archive. If you're looking to make a local backup of your bookmarks folder (as one of the intentions seems to be), it's probably not an unreasonable amount of storage. Maybe a few GB at most (if you have a moderate to large bookmarks folder), depending on how many sites there are and how heavy they are?

reefab · 5y ago

For reference, archivebox uses 250GB for 5000 links in my setup.
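To put that figure in per-link terms, a quick back-of-the-envelope calculation, assuming decimal units (1 GB = 1000 MB) and that the average holds as the archive grows:

```python
def mb_per_link(total_gb: float, links: int) -> float:
    """Average archive size per link in megabytes (decimal units)."""
    return total_gb * 1000 / links


def links_per_tb(avg_mb: float) -> float:
    """How many links fit in one terabyte at a given average size."""
    return 1_000_000 / avg_mb


average = mb_per_link(250, 5000)
print(average, "MB per link")
print(links_per_tb(average), "links per TB")
```

That works out to about 50 MB per link, or roughly 20,000 links per terabyte with every extractor left on.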

mosselman · 5y ago

That is an insane amount of storage for so few links. Is your setup somehow very greedy?

Saving article only view (images + text) should probably do better

I suspect your numbers come from JavaScript and css, etc? Is there a way for archivebox to not download react 5000 times but share source files? Most likely custom bundles that sites compile will not make this possible most of the time. Just thinking out loud here.
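The "share source files" idea is usually done with content-addressed storage: each distinct file is stored once under its hash, and every snapshot just points at the hash. The sketch below is only an illustration of that technique, not a description of how ArchiveBox stores files; `dedupe` and the sample paths are hypothetical.

```python
import hashlib


def dedupe(assets: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Store each distinct byte string once, keyed by its SHA-256 digest."""
    blobs: dict[str, bytes] = {}   # digest -> file content, stored once
    index: dict[str, str] = {}     # snapshot path -> digest of its content
    for path, data in assets.items():
        digest = hashlib.sha256(data).hexdigest()
        blobs.setdefault(digest, data)
        index[path] = digest
    return blobs, index


# Two snapshots that ship the identical library file, plus one custom bundle.
assets = {
    "site-a/react.min.js": b"/* identical library build */",
    "site-b/react.min.js": b"/* identical library build */",
    "site-b/app.bundle.js": b"/* site-specific bundle */",
}
blobs, index = dedupe(assets)
print(len(assets), "files,", len(blobs), "stored blobs")
```

As noted above, this only helps when files are byte-for-byte identical; a custom bundle that inlines the library hashes differently and is stored again.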

matt_f · 5y ago · 4 in thread

Interesting side note:

It seems like a lot of people in this thread have an interest in retaining a "replayable timeline" of their own browsing/reading history.

There's probably enough support here to gather a few contributors for an open source project.

hobo_mark · 5y ago

I seem to remember Google's Larry Page once proposed a similar thing in the early days, a product that would record all you read on your computer (to make it searchable later), but now I can't find it mentioned anywhere, am I imagining things?

nikisweeting · 5y ago

A "remember everything for me" tool is often called a "Memex" https://en.wikipedia.org/wiki/Memex

dgeiser13 · 5y ago

If you use Google Chrome as your primary browser this exists at https://myactivity.google.com/item

nikisweeting · 5y ago

There are a bunch of projects trying to do different flavors of this already, check out some of these:

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

throwawaysea · 5y ago · 4 in thread

Can you configure this tool to login to websites (for paid news subscriptions) and get past those paywalls?

nikisweeting · 5y ago

Yeah, it supports it but there are security considerations if you're doing it for anything more serious than news content. See here: https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

throwawaysea · 5y ago

Thanks, much appreciated. This is a very informative set of things to watch out for that I wouldn't have thought of otherwise.

ernesth · 5y ago

That is the default for the screenshot, pdf and one of the html archives: they use your chrome cookies.

frombody · 5y ago

Likely not without some modification, but you could try this:

https://www.jacoduplessis.co.za/bypass-paywall/

mikece · 5y ago · 3 in thread

This would be a nice thing to be able to run on a Synology NAS or other kind of device that typically has terabytes of storage.

blastro · 5y ago

that's what i do - there's a docker image, 1 line script + cron job. it archives an rss feed of links i gather
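For anyone wondering what the "RSS feed of links" step might look like, here is a minimal sketch that pulls the item URLs out of an RSS 2.0 document using only the standard library. The feed content and the helper name `item_links` are made up for illustration; the actual script referred to above is not shown in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content standing in for a real RSS file of gathered links.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Links to archive</title>
    <link>https://example.com/feed</link>
    <item><title>First</title><link>https://example.com/one</link></item>
    <item><title>Second</title><link>https://example.com/two</link></item>
  </channel>
</rss>"""


def item_links(rss_xml: str) -> list[str]:
    """Return the <link> of every <item>, skipping the channel's own link."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if url:
            links.append(url)
    return links


# One URL per line, ready to be piped to another tool from a cron job.
print("\n".join(item_links(SAMPLE_FEED)))
```

The printed list could then be handed to the archiver on standard input, in the same way the bookmarks file is fed to `archivebox add` earlier in the thread.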

tylorr · 5y ago

How do you generate that rss feed?

vorpalhex · 5y ago

It runs quite well in docker. I still feed my instance by hand but eventually need to write a firefox extension to push history semi-live.

remirk · 5y ago · 2 in thread

This article is blogspam.

The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox

Robotbeat · 5y ago

Disagree. I find links to repositories to be less accessible than blog posts.

klelatti · 5y ago

It has its own website too.

https://archivebox.io/

mikiem · 5y ago · 2 in thread

How can I use this to archive sites/pages that require logging in to see?

nikisweeting · 5y ago

https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

CodeWriter23 · 5y ago

From the blog comments, I think this is what you’re after https://github.com/c9fe/22120

ketamine__ · 5y ago · 2 in thread

How does archive.is trick news sites into showing content without the paywall? Is it pure user agent spoofing?

I'm wondering if this could be applied here.

nikisweeting · 5y ago

Yeup, just the reason why we expose the USER_AGENT options in ArchiveBox config ;)

https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

I don't want to officially endorse using the Google bot user agent, but you're welcome to try it on your own and see if it improves the experience.

mycall · 5y ago

How does ArchiveBox function compared to https://archivarix.com? I recently used Archivarix to back up a large website (93k pages), but it messed up the js/css.

blastro · 5y ago · 1 in thread

i use this every single day and think very highly of it. thanks for reminding me - i'm going to sponsor this developer on github...

m-s-sripati · 5y ago

It is the right thought, aligned to the spirit of open source.

jedimastert · 5y ago · 1 in thread

Is there a list of web page archive formats I could look at? There are a few things I'd love to do where it would be very handy to have one file per page

nikisweeting · 5y ago

The main archive formats for web content are WARC, ZIM, Memento, and static HTML (e.g. from a tool like wget or SingleFile).

If you want 1 page per URL I recommend SingleFile.

Lots more info here if you want to compare different software options: https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

greypowerOz · 5y ago · 1 in thread

so.. you CAN have a box that is "the internet"....

mosselman · 5y ago

Yes Jen

dirtyid · 5y ago · 1 in thread

Tried this a while ago, disappointed at HD usage.

My solution, as a heavy TTS user: I have Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions that auto-copy highlighted text and append URLs, which makes the entire flow straightforward. The log for each day is around 1-5 MB of text, saved in a big folder. The biggest limitation is advanced search: querying unstructured text files by complex keywords within date ranges. I'm sure I could set up each clip with delimiters so the logs can be imported into a searchable DB, I'm just too lazy.
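The delimiter idea could look roughly like the sketch below: each clip is written as a date line, a URL line, and the copied text, with a fixed separator between clips, and the log is then loaded into SQLite for keyword search within a date range. The delimiter, the table layout, and the function names are all hypothetical choices for this illustration.

```python
import sqlite3

DELIM = "\n=====\n"  # hypothetical separator written between clips


def load_clips(conn: sqlite3.Connection, log_text: str) -> None:
    """Parse 'date / url / text' clips out of a log and insert them as rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS clips (day TEXT, url TEXT, body TEXT)")
    for clip in log_text.split(DELIM):
        lines = clip.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed or empty clips
        day, url, body = lines[0], lines[1], "\n".join(lines[2:])
        conn.execute("INSERT INTO clips VALUES (?, ?, ?)", (day, url, body))


def search(conn: sqlite3.Connection, keyword: str, start_day: str, end_day: str) -> list[tuple[str, str]]:
    """Find clips containing a keyword between two ISO dates (inclusive)."""
    rows = conn.execute(
        "SELECT day, url FROM clips WHERE body LIKE ? AND day BETWEEN ? AND ? ORDER BY day",
        (f"%{keyword}%", start_day, end_day),
    )
    return rows.fetchall()


log = DELIM.join([
    "2021-01-05\nhttps://a.example\nText about WARC files",
    "2021-02-10\nhttps://b.example\nNotes on WARC and ZIM",
    "2021-02-11\nhttps://c.example\nUnrelated clip",
])
conn = sqlite3.connect(":memory:")
load_clips(conn, log)
print(search(conn, "WARC", "2021-02-01", "2021-02-28"))
```

ISO-formatted dates sort correctly as plain text, which is why a simple `BETWEEN` on the `day` column is enough here.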

nikisweeting · 5y ago

I think you tried a very old version ;) all that has long since changed. As of v0.5 ArchiveBox has everything in a SQLite3 DB and full-text search is implemented with Sonic.

nikisweeting · 5y ago

Hey all, @pirate (ArchiveBox maintainer) here, thanks for posting this @adamhearn.

If you like ArchiveBox check out our new Twitter account for the project, https://twitter.com/ArchiveBoxApp we just opened it and we'll be posting announcements and prerelease sneak-peeks on there in the future.

unnouinceput · 5y ago

Quote: "..even if you instruct it to begin archiving a site then it can easily fail if that site’s robots.txt prevents crawling"

Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" than anything else. Google definitely ignores it. Dunno what DuckDuckGo does.

0x426577617265 · 5y ago

I use this with an automated script that watches my Twitter activity. If I like a tweet it determines if it contains a URL then archives it.
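The "does this tweet contain a URL" step can be sketched with a simple regular expression. This is only an illustration of the check, not the script described above; the pattern is deliberately loose, `urls_in` is a hypothetical helper, and fetching liked tweets from Twitter is not shown.

```python
import re

# Loose pattern: anything starting with http(s):// up to whitespace or a quote.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def urls_in(text: str) -> list[str]:
    """Return the URLs found in a piece of text, minus trailing punctuation."""
    return [url.rstrip(".,;:!?)") for url in URL_RE.findall(text)]


liked_tweet = "Great read: https://example.com/post, worth archiving"
for url in urls_in(liked_tweet):
    # A real script would hand each URL to the archiver at this point.
    print("would archive:", url)
```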

egberts1 · 5y ago

A real OSINT archive box would also capture all non-inline JavaScript, CSS and blob: files.
