Archiving is important; we have already seen so much online history go down the drain, or survive only by accident.
Large institutions like the Internet Archive are doing an admirable job, but there is a lot of content that they cannot and will not cover. So we will definitely (also) need volunteer-based archival for the foreseeable future.
18TB drives are ~$300 apiece right now; go buy one and help our collective memory!
The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the ArchiveTeam servers. No need for large drives.
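For reference, running the Warrior is roughly a one-liner if you already have Docker. This is a sketch from memory of the command on the ArchiveTeam wiki (image name and flags included), so check the wiki page for the current version before running it:

```shell
# Run the Warrior in the background with its web UI on port 8001.
# (Image name as listed on the ArchiveTeam wiki; verify before use.)
docker run --detach \
  --name archiveteam-warrior \
  --publish 8001:8001 \
  --restart unless-stopped \
  atdr.meo.ws/archiveteam/warrior-dockerfile
# then open http://localhost:8001 to pick a project
```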
For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.
There's a surprising number of tools that can submit data to the Internet Archive (and retrieve data from it). Even wget can produce WARC archive files.
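As a sketch of the wget route: both WARC flags below are documented in wget's manual, and the URL is just a placeholder.

```shell
# Mirror a site and simultaneously write a compressed WARC plus a CDX index.
wget --mirror \
     --page-requisites \
     --warc-file=example \
     --warc-cdx \
     https://example.com/
# yields example.warc.gz (the archive) and example.cdx (an index into it)
```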
While the Warrior downloads content via your line (a bit like a residential proxy network), I do think it's important that we decentralize the storage as well.
Just without the crypto mafia/drug traders/investors.
The tools couldn't be built without additional knowledge that wasn't published anywhere -- because there had been drift from what was published versus what was working, and those changes never got folded back in. And there were multiple versions and variants of the tools, with different teams using different versions or variants.
And once you built the tools, you couldn't get your Warrior into the list to be used, although you could always run your systems separately.
It's not like you could sign up for a SETI@Home type initiative and just let your equipment run.
I understand why they work this way. It's a very insular crowd, and new people and resources seem to disappear as quickly as they showed up.
So, they let you watch.
If you stay around long enough (months? years?), then they might let you start participating. But I wasn't willing to wait that long.
In any case wonderful work.
They index WARC archives and can be used to quickly find records. You can build on top of this (and some systems do) to make a proper search front-end.
But in general, these archives are NOT geared towards full-blown search because it would be pretty expensive to keep the indexes in hot cache. Plus you would need to deal with historical versions of records, which is not normally done in search UX.
[1] https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#CD...
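To make the CDX idea concrete, here's a toy sketch of a lookup. The field layout and filenames below are illustrative only (real CDX flavors vary in field order); the key point is that the index is sorted plain text, so finding a record is cheap even when the WARC itself is huge:

```shell
# A CDX index: one line per capture, sorted by SURT-style URL key.
# (Illustrative layout: key, timestamp, original URL, MIME, status,
#  digest, offset, WARC filename.)
cat > index.cdx <<'EOF'
com,example)/ 20200101000000 http://example.com/ text/html 200 - 0 archive-00001.warc.gz
com,example)/page 20200102000000 http://example.com/page text/html 200 - 2048 archive-00001.warc.gz
EOF

# Sorted order means binary search (or grep, for a small file) finds every
# capture of a URL without reading the WARC at all; the offset and filename
# then let a reader jump straight to the record.
grep '^com,example)/page ' index.cdx
```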
If someone wants to dive into any topic in the archive 30 years from now they will have access to everything, not access to what some of us deem 'worthy' of curating.
I agree that it makes it harder to find things but I also see the value of IA as a time capsule.
I enjoy using Open Library to re-read obscure middle-grade books from the 1950s-1990s, and there are some obscure DOS games I want to revisit. It's hard to find what I want sometimes, but only having access to curated lists would change it from "hard" to "impossible" in many cases.
Hard to compare Bitcoin directly, but its market cap was around $1 billion in 2013 and cleared $1 trillion for the first time a little over a year ago.
I get that this article is about people using their personal computers to help archive things, but I don't think the Internet Archive is ever going to use resources even remotely as aggressively as cryptocurrencies do, unless they somehow turn all their archiving into a cryptocurrency.
Usually it's hard to say whether it's valuable now; only time can tell.
Crawled data is de-duplicated on the request level and response payloads can be individually gzipped as well as having per-archive-file compression. [1]
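The per-record gzipping works because gzip streams can simply be concatenated, so a reader can seek to one record's offset and decompress just that member. A toy sketch (filenames made up):

```shell
# A .warc.gz is a series of independently gzipped records concatenated
# together; standard gzip tools decompress all members in sequence.
printf 'record one\n' | gzip > rec1.gz
printf 'record two\n' | gzip > rec2.gz
cat rec1.gz rec2.gz > combined.warc.gz

# Decompressing the concatenation yields both records back to back.
gzip -dc combined.warc.gz
```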
I doubt it takes tremendous energy or resources. What percentage of the internet's overall energy/resources is used by the IA? An insignificant, minuscule amount.
The problem for the IA is that they are constantly pressured by institutions, corporations, etc. to remove content.
Libraries aren't just "a bunch of books piled up on shelves"; they're a historical invention, built and perfected over centuries, in which books are extensively coded and catalogued via a complex hierarchical system.

We are now dealing with far more data than in the past (not just books, but posts and comments from all over the world, as well as new kinds of media such as images and videos), and we also have conceptual and technological inventions that previous librarians didn't have access to (hyperlinks, databases, graph theory, machine learning, etc.). The current state of data management begs for a major overhaul. (For example, the best we currently have for querying and searching massive amounts of data is Google, and it is incredibly primitive! And even then we lament that its quality has decreased in favor of SEO-maximizing content.)

So much raw data is created every day, and we seem to fail to understand and interpret almost all of it; I see this as one of the major historical crises we face today. Instead of just storing data, we must find radically new methodologies and tools to search, filter, and explore it. This poses both a philosophical problem (of semiotics, linguistics, and hermeneutics) and a technological one.
Same reason why notes taken by random people 250 years ago are really valuable to historians today, even if it's just a to-do list.
Or zero... it depends on who wants to exchange it for real stuff.
Not sure why this 3rd party is the submission site rather than the official page, which is this: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
It lists a couple of different installation methods as well.
What the Archive Team does is on a much more massive scale, something like SETI@home-scale scraping of data across the internet. At almost every point we have had to make custom tools to ensure they meet the needs of our archival efforts.
There are some philosophical differences from ArchiveBox: 1) I'm more about automatically archiving every web page rather than the curation approach; 2) I prefer full-resolution screenshots over the actual source of the web page, so you save what you actually saw (which works with dynamic pages, pages behind logins, etc.); 3) I think full-text search is a key part of an archive, so I have implemented that.
Skip "--convert-links" if you want a "pure" copy, i.e. one that has not been rewritten for local browsing.
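In other words (a sketch; flags per wget's manual, URL is a placeholder):

```shell
# "Pure" copy: the saved HTML keeps its original URLs untouched.
wget --mirror --page-requisites --adjust-extension https://example.com/

# Locally browsable copy: --convert-links rewrites links in the saved
# pages to point at the downloaded local files instead.
wget --mirror --page-requisites --adjust-extension --convert-links https://example.com/
```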
That type of content has long since moved from the clearnet to the darknet. I would be extremely surprised if it can still be found on the clearnet. But I could be wrong.
However, if you're in the US, loli hentai is going to be a risk and a legal headache for sure: https://www.shouselaw.com/ca/blog/is-loli-illegal-in-the-uni...
As far as I'm aware, with the possible exception of Australia (?), the rest of the world does not classify that type of content as child pornography; you'll just get a few sketchy looks.
A fraction of it.
>I would be extremely surprised if it can still be found on the clearnet
That kind of content is a single internet search away.
Though if you join the Reddit archival project, all bets may be off; but that's not ArchiveTeam's fault, I guess.