dylan604 · 4y ago

Serious question, but why?

6 comments · 2 top-level

sillysaurusx · 4y ago · 4 in thread

I think it's a historically significant dataset. We've seen other datasets be preserved, such as the GitHub Arctic Code Vault.

I agree that it's tenuous. I would give it 20% odds of hitting the 500-year mark at best. And I don't think all of the data will survive.

But if archive.org ever becomes unsustainable to run, the existing data will likely be preserved. Lots of companies will be incentivized to continue hosting the data, as it's excellent PR if nothing else. They don't need to continue gathering the data, just host it.

Hosting is only going to become cheaper as t -> infinity, and given the massive amount of compute I've seen Google wield, it's hard to imagine that an operation like archive.org can't find some way to be preserved.

All that said, the biggest threat is sudden data loss. This only works as long as the data doesn't get lost. Has archive.org posted their operations policies anywhere? It would be interesting reading.

londons_explore · 4y ago

Archive.org has substantial legal risks too.

Imagine a future GDPR-like policy that gives people's descendants ownership and copyright over everything they've said. Suddenly every word written into archive.org has an owner, who might come and sue archive.org or its managers. Soon every person alive has some grandparent who wrote something in the archive, and some of them want compensation for all the decades archive.org has been distributing grandpa's words for free.

dylan604 (OP) · 4y ago

It's less about the "getting it done" aspect. It's more about whether they're going to be around in 50/100/500 years. Will the tech be around that long? Will they keep up with the conversion of old tech into new tech? In my opinion, any kind of digital archive is just not a sound way to go about it. Analog all the way for long-term archival.

metalliqaz · 4y ago

Not sure I share your sentiment about companies hosting the data, considering what happened to Geocities and others.

sillysaurusx · 4y ago

Mm, you're right, but Geocities might be less interesting to historians than an archive of all internet history.

Also, as someone who has trained a few large GPT models, I think ML has a chance of preserving a lot of this data. Training datasets are only growing larger and larger, and although those aren't updated (yet), there's no reason to think they won't last for a long time.

I imagine that in 500 years, imagenet2012 might still be around as a historical curiosity, at least somewhere.

postingawayonhn · 4y ago

I think the chance of future generations having the motivation to continue preserving OP's specific website would be quite low, but there would be a much greater motivation to maintain a large, organised archive.
