I've been active for over a year, steadily working the recommended project. It downloaded over 3TB in 6 days (the node rebooted, so the pod was restarted and stats aren't persistent), so a rough extrapolation is about 180TB per year. Happy to help the good cause of the ArchiveTeam!
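The extrapolation holds up; a quick sanity check (assuming the 3TB/6-day rate is sustained for a year):

```python
# Rough yearly extrapolation from 3 TB downloaded in 6 days
tb_per_day = 3 / 6
print(tb_per_day * 365)  # 182.5, so "about 180TB" checks out
```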
Edit: typo
https://github.com/ArchiveTeam/warrior-dockerfile/blob/maste...
I tried setting it up with /tmp as a tmpfs (ramdisk) but it then refused to start...
Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Modern SSDs are pretty good at things like wear levelling.
For example [1] reports that a bunch of 256 GB SSDs lasted to 2000+ terabytes written, and a handful up to 7000 terabytes written. So you could saturate a 100 megabit internet connection for 5 years before even a small SSD would wear out. And an SSD 4x the size has 4x the life.
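The arithmetic behind the "5 years" claim, as a back-of-envelope check (the ~2000 TB written figure is taken from the linked thread):

```python
# Saturating a 100 Mbit/s link vs. a drive rated for ~2000 TB written
mbit_per_s = 100
bytes_per_s = mbit_per_s * 1e6 / 8            # 12.5 MB/s
tb_per_year = bytes_per_s * 86400 * 365 / 1e12
print(tb_per_year)          # ~394 TB/year
print(2000 / tb_per_year)   # ~5.07 years to hit 2000 TB written
```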
If you're running on a raspberry pi with a microsd card for storage, feel free to keep worrying though :)
[1] https://www.reddit.com/r/chia/comments/mukiwz/are_we_overthi...
Right, that's basically the point: the Warrior downloads files, compresses them, and uploads them for archival. This necessarily requires staging the files somewhere between download and upload.
> Anyone know any broad-spectrum docker incantations to force all overlay writes to RAM, for a container?
Why would you want this? This sounds like a terrible footgun.
Demonstrated here https://stackoverflow.com/questions/39193419/docker-in-memor...
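For the warrior specifically, a narrower option than forcing the whole overlay into RAM may be to mount just its working directory as tmpfs. A compose sketch (the image name is the one commonly recommended by ArchiveTeam; the /data path and 2g size are assumptions — check the image's actual staging directories):

```yaml
# docker-compose sketch: put the warrior's staging area on a RAM-backed tmpfs
services:
  warrior:
    image: atdr.meo.ws/archiveteam/warrior-dockerfile
    ports:
      - "8001:8001"      # warrior web UI
    tmpfs:
      - /data:size=2g    # assumed staging path; lost on restart, like any tmpfs
```

Note that anything staged there is lost on container restart, which also means interrupted work items are discarded rather than resumed.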
The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.
https://web.archive.org/web/20250122000033/www.google.com
Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.
What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?
We pay half a billion in tax dollars for the National Archives, and nearly a billion to the Library of Congress to preserve these records. Others are managed as part of Presidential Libraries.
Thousands of employees, dozens of facilities, billions of dollars.
Meanwhile archive.org doesn't have air conditioning and preserves physical material within the blast radius of an oil refinery. They let vagrants sleep on their steps yet seem surprised when those same people set the utility pole outside on fire.
I didn't say it didn't need to be done. I said the whole process needs to be rethought with professional supervision. Setting up more volunteer K8s clusters so that more copies of the Google Home Page can be captured with the wrong user agent isn't going to save democracy.
https://www.archives.gov/presidential-records/research/archi...
There are other agencies and data sources to be monitored of course but I'm not seeing a lot of nuance in those efforts yet.