Archiving is important; we have already seen so much online history go down the drain, or survive only by accident.
Large institutions like the Internet Archive are doing an admirable job, but there is a lot of content that they cannot and will not cover. So we will definitely (also) need volunteer-based archival for the foreseeable future.
18TB drives are ~$300 apiece right now; go buy one and help our collective memory!
The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the ArchiveTeam servers. No need for large drives.
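For reference, running the Warrior is roughly a one-liner if you already have Docker. This is a sketch from memory of the command on the ArchiveTeam wiki (image name and flags included), so check the wiki page for the current version before running it:

```shell
# Run the Warrior in the background with its web UI on port 8001.
# (Image name as listed on the ArchiveTeam wiki; verify before use.)
docker run --detach \
  --name archiveteam-warrior \
  --publish 8001:8001 \
  --restart unless-stopped \
  atdr.meo.ws/archiveteam/warrior-dockerfile
# then open http://localhost:8001 to pick a project
```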
For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.
There's a surprising number of tools that can submit data to the Internet Archive (and retrieve data from it). Even wget can produce WARC archive files.
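As a sketch of the wget route: both WARC flags below are documented in wget's manual, and the URL is just a placeholder.

```shell
# Mirror a site and simultaneously write a compressed WARC plus a CDX index.
wget --mirror \
     --page-requisites \
     --warc-file=example \
     --warc-cdx \
     https://example.com/
# yields example.warc.gz (the archive) and example.cdx (an index into it)
```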
While the Warrior downloads content via your line (a bit like a residential proxy network), I do think it's important that we decentralize the storage as well.
Just without the crypto mafia/drug traders/investors.
The tools couldn't be built without additional knowledge that wasn't published anywhere -- because there had been drift from what was published versus what was working, and those changes never got folded back in. And there were multiple versions and variants of the tools, with different teams using different versions or variants.
And once you built the tools, you couldn't get your Warrior into the list to be used, although you could always run your systems separately.
It's not like you could sign up for a SETI@Home type initiative and just let your equipment run.
I understand why they work this way. It's a very insular crowd, and new people and resources seem to disappear as quickly as they showed up.
So, they let you watch.
If you stay around long enough (months? years?), then they might let you start participating. But I wasn't willing to wait that long.
In any case wonderful work.
They index WARC archives and can be used to quickly find records. You can build on top of this (and some systems do) to make a proper search front-end.
But in general, these archives are NOT geared towards full-blown search because it would be pretty expensive to keep the indexes in hot cache. Plus you would need to deal with historical versions of records, which is not normally done in search UX.
[1] https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#CD...
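To make the CDX idea concrete, here's a toy sketch of a lookup. The field layout and filenames below are illustrative only (real CDX flavors vary in field order); the key point is that the index is sorted plain text, so finding a record is cheap even when the WARC itself is huge:

```shell
# A CDX index: one line per capture, sorted by SURT-style URL key.
# (Illustrative layout: key, timestamp, original URL, MIME, status,
#  digest, offset, WARC filename.)
cat > index.cdx <<'EOF'
com,example)/ 20200101000000 http://example.com/ text/html 200 - 0 archive-00001.warc.gz
com,example)/page 20200102000000 http://example.com/page text/html 200 - 2048 archive-00001.warc.gz
EOF

# Sorted order means binary search (or grep, for a small file) finds every
# capture of a URL without reading the WARC at all; the offset and filename
# then let a reader jump straight to the record.
grep '^com,example)/page ' index.cdx
```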
If someone wants to dive into any topic in the archive 30 years from now they will have access to everything, not access to what some of us deem 'worthy' of curating.
I agree that it makes it harder to find things but I also see the value of IA as a time capsule.
I enjoy using Open Library to re-read obscure middle-grade books from the 1950s-1990s, and there are some obscure DOS games I want to revisit. It's hard to find what I want sometimes, but only having access to curated lists would change it from "hard" to "impossible" in many cases.
Hard to compare Bitcoin directly, but its market cap was around $1 billion in 2013 and cleared $1 trillion for the first time a little over a year ago.
I get that this article is about people using their personal computers to help archive things, but I don't think the Internet Archive is ever going to use resources even remotely as aggressively as cryptocurrencies do, unless they somehow turn all their archiving into a cryptocurrency.
Usually it's hard to say whether it's valuable now; only time can tell.
Crawled data is de-duplicated on the request level and response payloads can be individually gzipped as well as having per-archive-file compression. [1]
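The per-record gzipping works because gzip streams can simply be concatenated, so a reader can seek to one record's offset and decompress just that member. A toy sketch (filenames made up):

```shell
# A .warc.gz is a series of independently gzipped records concatenated
# together; standard gzip tools decompress all members in sequence.
printf 'record one\n' | gzip > rec1.gz
printf 'record two\n' | gzip > rec2.gz
cat rec1.gz rec2.gz > combined.warc.gz

# Decompressing the concatenation yields both records back to back.
gzip -dc combined.warc.gz
```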
I doubt it takes tremendous energy or resources. What percentage of the internet's overall energy/resources is used by the IA? An insignificant, minuscule amount.
The problem for the IA is that they are constantly pressured by institutions, corporations, etc. to remove content.
Libraries aren't just "a bunch of books piled up on shelves"; they're a historical invention, built and perfected over centuries, in which books are extensively coded and catalogued via a complex hierarchical system.

We are now dealing with far more data than in the past (not just books, but posts and comments from all over the world, as well as new kinds of media such as images and videos), and we also have conceptual and technological inventions that previous librarians didn't have access to (hyperlinks, databases, graph theory, machine learning, etc.). The current state of data management begs for a major overhaul. (For example, the best we currently have for querying and searching massive amounts of data is Google, and it is incredibly primitive! And even then we lament that its quality has decreased in favor of SEO-maximizing content.)

So much raw data is created every day, and we seem to fail to understand and interpret almost all of it; I see this as one of the major historical crises we face today. Instead of just storing data, we must find radically new methodologies and tools to search, filter, and explore it. This poses both a philosophical problem (of semiotics, linguistics, and hermeneutics) and a technological one.
Same reason why notes taken by random people 250 years ago are really valuable to historians today, even if it's just a to-do list.
Or zero... it depends on who wants to exchange it for real stuff.
Not sure why this 3rd party is the submission site rather than the official page, which is this: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
It lists a couple of different installation methods as well.
What the Archive Team does is on a much more massive scale, something like SETI@home-scale scraping of data across the internet. At almost every point we have had to make custom tools to ensure they meet the needs of our archival efforts.
There are some philosophical differences from ArchiveBox: 1) I'm more about automatically archiving every web page rather than the curation approach; 2) I prefer full-resolution screenshots over the actual source of the web page, so you save what you actually saw (which works with dynamic pages, pages behind logins, etc.); 3) I think full-text search is a key part of an archive, so I have implemented that.
Skip "--convert-links" if you want a "pure" copy, i.e. one that has not been rewritten for local browsing.
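In other words (a sketch; flags per wget's manual, URL is a placeholder):

```shell
# "Pure" copy: the saved HTML keeps its original URLs untouched.
wget --mirror --page-requisites --adjust-extension https://example.com/

# Locally browsable copy: --convert-links rewrites links in the saved
# pages to point at the downloaded local files instead.
wget --mirror --page-requisites --adjust-extension --convert-links https://example.com/
```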
That type of content has long since moved from the clearnet to the darknet. I would be extremely surprised if it can still be found on the clearnet. But I could be wrong.
However, if you're in the US, loli hentai is going to be a risk and a legal headache for sure: https://www.shouselaw.com/ca/blog/is-loli-illegal-in-the-uni...
As far as I'm aware, with the possible exception of Australia (?), the rest of the world does not classify that type of content as child pornography; you'll just get a few sketchy looks.
A fraction of it.
>I would be extremely surprised if it can still be found on the clearnet
That kind of content is a single internet search away.
Though if you join the Reddit archival project, all bets may be off; but that's not ArchiveTeam's fault, I guess.