
0 points · toyg · 3y ago

The problem with the Internet Archive, which does an amazing job, is that they do an amazing job despite the problem being fundamentally intractable. Web content expands too quickly and too massively.

I wonder if the answer is a network of topic-focused archives; like moving from a "Library of Alexandria" model to a modern nationwide system of libraries.

7 comments · 4 top-level

cheschire · 3y ago · 2 in thread

Okay, so you build a knowledge graph on top of the Internet Archive. Now you are struggling to prioritize the resources necessary to capture long-tail content that doesn't mesh easily into popular corpuses. I imagine this would lead to the library equivalent of an echo chamber.

toyg (OP) · 3y ago

I was thinking more of a federated "webring" structure, with some content being present in more than one node, and where maintenance and curation are distributed (and gathered independently) among nodes.

The nation of, say, Japan has limited interest in funding an American nonprofit today, but it would likely have a great deal of interest in funding an equivalent focused on Japanese content.

cheschire · 3y ago

Ah, so more like Mastodon or IPFS, but specifically for the purposes of federated archiving.

So now you get into the issue of haves and have-nots. Who is allowed to be considered an authorized archivist from a robots.txt perspective? What happens if an archivist is blacklisted for not crawling respectfully? How do national sanctions affect the Internet Archive of Russia? I imagine there would be a certification process, and it would probably cost some money.

It's an interesting topic and I'm simply looking at the weak spots. I'm not against the overall concept though.
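
To make the robots.txt question concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`example-archive-us`, `example-archive-jp`) and the URL are all made up for illustration. It only shows how a site could treat different archivist crawlers differently; it says nothing about how certification would actually work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archivist is mostly welcome, another is
# shut out entirely, and everyone else gets the default rules.
ROBOTS_TXT = """\
User-agent: example-archive-us
Disallow: /private/

User-agent: example-archive-jp
Disallow: /

User-agent: *
Disallow: /tmp/
"""


def allowed_archivists(robots_txt, agents, url):
    """Return {agent: may_fetch} for each crawler name, per the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_archivists(
    ROBOTS_TXT,
    ["example-archive-us", "example-archive-jp", "some-other-bot"],
    "https://example.org/articles/1",
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

In a federated setup every node would show up with its own user-agent, so each site owner ends up deciding node by node who counts as an archivist.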

sho_hn · 3y ago · 1 in thread

The curator being bandwidth-limited is not necessarily a problem if the problem you are solving is an overwhelmed audience in need of a curator. In other words, the Archive missing things may not really be a problem if the stuff that is not missing is, on average, of value.

It raises the issue of governance of the curator, but the IA is already more transparent than Google & co.

pixl97 · 3y ago

How do you measure the future value of something you don't keep?

fsloth · 3y ago

“Web content expands too quickly and too massively.”

If most of it is crap I would call not archiving it a feature.

There is a weird, convoluted analogue in the CERN particle detectors. They smash particles together and then image the resulting storm of particle trails with a detector that is basically a sandwiched CCD detector (like the one in a camera, but different) the size of a cathedral. This produces far too much data for any system to analyze, or even store in the first place. Hence they need, or needed, to filter the massive stream of particle-trail signals at runtime and pick out only the critical ones.

If there is too much data you simply need to drop the parts you are fairly confident you don’t need.
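
A minimal sketch of that kind of runtime filter, in Python. The page records, the `inbound_links` field and the threshold are invented for illustration. The scoring rule is only a stand-in for whatever "fairly confident you don't need" would mean in practice, and it is not how CERN or the Internet Archive actually select data.

```python
from typing import Callable, Iterable, Iterator


def trigger_filter(
    items: Iterable[dict],
    score: Callable[[dict], float],
    threshold: float,
) -> Iterator[dict]:
    """Yield only the items whose score reaches the threshold.

    Items are inspected one at a time and never buffered, so everything
    below the threshold is dropped without ever being stored.
    """
    for item in items:
        if score(item) >= threshold:
            yield item


# Hypothetical crawl stream; in reality this would be far too large to hold in memory.
pages = [
    {"url": "https://example.org/essay", "inbound_links": 42},
    {"url": "https://example.org/spam-1", "inbound_links": 0},
    {"url": "https://example.org/forum-thread", "inbound_links": 7},
    {"url": "https://example.org/spam-2", "inbound_links": 1},
]

# Made-up scoring rule: keep pages that at least a few other pages link to.
kept = list(trigger_filter(pages, lambda p: p["inbound_links"], threshold=5))
print([p["url"] for p in kept])
```

Parallel archives would simply run the same loop with different `score` functions, so what one node drops another might keep.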

There is no reason there should be only one internet archive; there might very well be parallel operations, each filtering for slightly different things.

I guess it’s a bit odd that UNESCO does not already have a parallel effort.

pelasaco · 3y ago

Would be nice if we had a way to navigate just the old web...
