I've been active for over a year, steadily working the recommended project. It downloaded over 3TB in 6 days (the node rebooted, so the pod was restarted and stats aren't persistent), so a rough extrapolation is about 180TB per year. Happy to help the good cause of the ArchiveTeam!
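The extrapolation holds up; a quick sanity check (assuming the 3TB/6-day rate is sustained for a year):

```python
# Rough yearly extrapolation from 3 TB downloaded in 6 days
tb_per_day = 3 / 6
print(tb_per_day * 365)  # 182.5, so "about 180TB" checks out
```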
Edit: typo
https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...
I tried setting it up with /tmp as a tmpfs (ramdisk) but it then refused to start...
Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Modern SSDs are pretty good at things like wear levelling.
For example [1] reports that a bunch of 256 GB SSDs lasted to 2000+ terabytes written, and a handful up to 7000 terabytes written. So you could saturate a 100 megabit internet connection for 5 years before even a small SSD would wear out. And an SSD 4x the size has 4x the life.
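The arithmetic behind the "5 years" claim, as a back-of-envelope check (the ~2000 TB written figure is taken from the linked thread):

```python
# Saturating a 100 Mbit/s link vs. a drive rated for ~2000 TB written
mbit_per_s = 100
bytes_per_s = mbit_per_s * 1e6 / 8            # 12.5 MB/s
tb_per_year = bytes_per_s * 86400 * 365 / 1e12
print(tb_per_year)          # ~394 TB/year
print(2000 / tb_per_year)   # ~5.07 years to hit 2000 TB written
```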
If you're running on a raspberry pi with a microsd card for storage, feel free to keep worrying though :)
[1] https://www.reddit.com/r/chia/comments/mukiwz/are_we_overthi...
Right, that's basically the point: the Warrior downloads files, compresses them, and uploads them for archival. This necessarily requires staging the files somewhere between download and upload.
> Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Why would you want this? This sounds like a terrible footgun.
Demonstrated here https://stackoverflow.com/questions/39193419/docker-in-memor...
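For the warrior specifically, a narrower option than forcing the whole overlay into RAM may be to mount just its working directory as tmpfs. A compose sketch (the image name is the one commonly recommended by ArchiveTeam; the /data path and 2g size are assumptions — check the image's actual staging directories):

```yaml
# docker-compose sketch: put the warrior's staging area on a RAM-backed tmpfs
services:
  warrior:
    image: atdr.meo.ws/archiveteam/warrior-dockerfile
    ports:
      - "8001:8001"      # warrior web UI
    tmpfs:
      - /data:size=2g    # assumed staging path; lost on restart, like any tmpfs
```

Note that anything staged there is lost on container restart, which also means interrupted work items are discarded rather than resumed.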
The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.
https://web.archive.org/web/20250122000033/www.google.com
Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.
What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?
We pay half a billion in tax dollars for the National Archives, and nearly a billion to the Library of Congress to preserve these records. Others are managed as part of Presidential Libraries.
Thousands of employees, dozens of facilities, billions of dollars.
Meanwhile archive.org doesn't have air conditioning and preserves physical material within the blast radius of an oil refinery. They let vagrants sleep on their steps yet seem surprised when those same people set the utility pole outside on fire.
I didn't say it didn't need to be done. I said the whole process needs to be rethought with professional supervision. Setting up more volunteer K8s clusters so that more copies of the Google Home Page can be captured with the wrong user agent isn't going to save democracy.
https://www.archives.gov/presidential-records/research/archi...
There are other agencies and data sources to be monitored of course but I'm not seeing a lot of nuance in those efforts yet.