Wayback Machine was down

(web.archive.org)

112 points · k-ian · 6y ago · 30 comments


nikisweeting · 6y ago · 6 in thread

In the meantime, distributed archiving ftw: run your own archives with Webrecorder.io, ArchiveBox.io, SingleFile, kiwix.org, etc!

trevyn · 6y ago

The kiwix Wikipedia-en full scrape with images has been broken for over 18 months; I think they could use some technical help. I tried running their scraper myself on a nice AWS instance and it just stalls after many days of downloading articles. Could probably use a rewrite. ;)

https://sourceforge.net/p/kiwix/discussion/604121/thread/1f2...

https://github.com/openzim/mwoffliner/issues/1020

https://github.com/openzim/mwoffliner

traverseda · 6y ago

The whole ZIM file infrastructure is pretty broken. I've been trying to put together a system for generating a WARC file by rendering all the wikitext content in a database dump, which is a much more reasonable approach.

Rendering wikitext is challenging though, since wikitext can include chunks of other wikitext, and wikitext can use some pretty complicated templating functionality.
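To illustrate why nested inclusion makes rendering awkward, here is a minimal sketch of recursive template expansion. It is a toy, not MediaWiki's parser: the `{{name}}` syntax is handled without arguments or parser functions, and the `expand` function, the template dictionary, and the depth limit are all assumptions made up for this example.

```python
import re

# Matches a bare {{name}} call with no arguments (a deliberate simplification).
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text, templates, depth=0, max_depth=10):
    """Replace each {{name}} with its body; bodies may contain further {{...}} calls."""
    if depth > max_depth:
        # Templates that include each other would otherwise recurse forever.
        raise RecursionError("template expansion too deep (possible loop)")

    def substitute(match):
        name = match.group(1).strip().lower()
        body = templates.get(name)
        if body is None:
            return match.group(0)  # leave unknown templates untouched
        return expand(body, templates, depth + 1, max_depth)

    return TEMPLATE_CALL.sub(substitute, text)


if __name__ == "__main__":
    # Hypothetical templates, keyed by lowercase name.
    demo = {"greeting": "Hello, {{name}}!", "name": "world"}
    print(expand("{{greeting}}", demo))
```

Even this stripped-down version needs a loop guard, which hints at how much more a real renderer has to handle once arguments and conditionals are involved.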

Oddly enough, where I've run into the biggest issues is in weird slowdowns of the Python warcio library that make dealing with large archives just about impossible. I haven't had time to really track that down, but if anyone wants to, it's pretty easy to reproduce: just try adding a few million lorem-ipsum articles and look at how far from linear time it's running.
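A rough way to check for that kind of non-linear behaviour is to time fixed-size batches of writes and see whether later batches take longer. The sketch below uses only the standard library: `write_minimal_record` is a hand-rolled stand-in that omits fields a real WARC record requires (such as the record ID and date), and `time_batches` is a hypothetical helper. To test an actual library, pass its record-writing call in place of the stand-in.

```python
import io
import time

LOREM = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20


def write_minimal_record(out, uri, payload):
    """Append one simplified WARC-style 'resource' record (not spec-complete)."""
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    out.write(header + payload + b"\r\n\r\n")


def time_batches(write_record, total=20_000, batch=5_000):
    """Return seconds spent on each batch; similar values suggest linear scaling."""
    out = io.BytesIO()
    timings = []
    for start in range(0, total, batch):
        t0 = time.perf_counter()
        for i in range(start, min(start + batch, total)):
            write_record(out, f"https://example.invalid/article/{i}", LOREM)
        timings.append(time.perf_counter() - t0)
    return timings


if __name__ == "__main__":
    for n, seconds in enumerate(time_batches(write_minimal_record)):
        print(f"batch {n}: {seconds:.3f}s")
```

If per-batch time climbs steadily as the archive grows, the writer is doing work proportional to what has already been written, which would match the slowdown described above.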

There are a lot of advantages to starting from a dump: you can provide much better tools for filtering articles, and probably even rudimentary document classification. You can also do things like re-compress and minify images; a dump intended for a cellphone probably doesn't need 4K images.
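As a sketch of what filtering from a dump could look like, the snippet below streams pages out of a MediaWiki XML export and applies a predicate to each one. It assumes the usual `<page>`, `<title>`, and `<text>` elements; the `keep` rule (skip redirects and very short pages) and its threshold are invented for illustration and are not part of any existing tool.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_file):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    for _event, elem in ET.iterparse(dump_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children to keep memory use low


def keep(title, text, min_chars=50):
    """Hypothetical filter: skip redirects and very short pages."""
    is_redirect = text.lstrip().upper().startswith("#REDIRECT")
    return not is_redirect and len(text) >= min_chars


if __name__ == "__main__":
    sample = (
        b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
        b"<page><title>Alpha</title><revision><text>"
        + b"x" * 60
        + b"</text></revision></page>"
        b"<page><title>Beta</title><revision><text>#REDIRECT [[Alpha]]</text></revision></page>"
        b"</mediawiki>"
    )
    for title, text in iter_pages(io.BytesIO(sample)):
        print(title, "kept" if keep(title, text) else "skipped")
```

Because the pages are streamed, the same loop is a natural place to hang classification or per-device image handling without loading the whole dump into memory.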

WARC is also probably a better tool for distributing web-archive type content, like Wikipedia dumps. You can distribute a package of text content and image content as separate files, for example. Generally I have not been very impressed with the quality of ZIM file tooling. One disadvantage is you need to provide separate search indexing, but that's doable.

I'd love to be able to get a Wikimedia grant to work on this, and take on less contract work, but so far their grant process is pretty hard to follow.


Rebelgecko · 6y ago

I'm not sure what your use case is, so maybe this isn't helpful, but Wikipedia has weekly or so database dumps that you can download, as well as static HTML (although that might be more out of date).

https://en.wikipedia.org/wiki/Wikipedia:Database_download


nikisweeting · 6y ago

I'm actually helping work on that right now: we're improving the node-libzim bindings that mwoffliner uses to write the ZIM files, and providing some additional server power to do larger archives and hopefully catch up on the backlog of wikipedia-en dumps.

Kaiyou · 6y ago

Is there a tool that downloads every website I visit locally and then, upon revisit, shows me my local copy for instant load, but does a diff in the background with the online version and asks me to show newer version only if there are differences?
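Roughly, the behaviour I have in mind could be sketched like this. It is an illustration only, not how any existing tool works: the `LocalMirror` class and its injected `fetch` callable are hypothetical, and the update check runs synchronously here, whereas a real tool would run it in the background.

```python
import difflib


class LocalMirror:
    """Serve a stored copy first, then compare it with a freshly fetched one."""

    def __init__(self, fetch):
        self.fetch = fetch  # callable: url -> page text (hypothetical, inject your own)
        self.pages = {}     # url -> last stored text

    def open(self, url):
        """Return the stored copy if there is one, otherwise fetch and store it."""
        if url not in self.pages:
            self.pages[url] = self.fetch(url)
        return self.pages[url]

    def check_for_update(self, url):
        """Return a unified diff against the live page, or None if nothing changed."""
        old = self.pages[url]
        new = self.fetch(url)
        if old == new:
            return None
        diff = difflib.unified_diff(
            old.splitlines(), new.splitlines(), "local", "live", lineterm=""
        )
        return "\n".join(diff)

    def accept_update(self, url):
        """Replace the stored copy with the live version once the user agrees."""
        self.pages[url] = self.fetch(url)


if __name__ == "__main__":
    live = {"https://example.invalid/": "line one\nline two"}
    mirror = LocalMirror(lambda url: live[url])
    mirror.open("https://example.invalid/")
    live["https://example.invalid/"] = "line one\nline 2"
    print(mirror.check_for_update("https://example.invalid/"))
```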

nikisweeting · 6y ago

Try https://github.com/WorldBrain/Memex, it has annotation and lets you review previously seen versions of a site as you're browsing.

Or check out some of the other options here:

- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...

- https://github.com/iipc/awesome-web-archiving

username2020 · 6y ago · 4 in thread

Massive layoffs today I heard.

wideasleep1 · 6y ago

Any kinks would be appreciated.

saagarjha · 6y ago

I think you may have had a Freudian slip ;)


progman32 · 6y ago

Source?

nvahalik · 6y ago

I don't believe GP means at Archive.org. But there are a number of companies letting people go.


dublinben · 6y ago · 3 in thread

Now would be a great time to donate to the Internet Archive if you're able to. They can surely use the help.

Kaiyou · 6y ago

Don't they generate any income?

dheera · 6y ago

If only shareholders would think the same way about PG&E and other companies that could use infrastructure upgrades ...

hinkley · 6y ago

PG&E is so far behind on deferred maintenance that people have been petitioning for California to socialize it so the state stops catching on fire from high-tension power lines.

You will be glad to know that they've protected executive bonuses, though.


jarfil · 6y ago · 1 in thread

The Internet Archive just announced no-waitlist book lending due to COVID-19; I'd guess their servers might not be too happy about the inrush of users.

http://blog.archive.org/2020/03/24/announcing-a-national-eme...

mirimir · 6y ago

Right, and https://archive.org/details/nationalemergencylibrary is up and responsive.

So I'm guessing that they've shifted resources.

thegeekpirate · 6y ago

I sent them a bug yesterday where I was being blocked for "Too Many Requests" regarding an endpoint I wasn't actually using (they thought I was attempting to submit URLs using the "Save Page Now" feature), so they've been having issues across the board.

This is good though, as they're now hopefully aware of some previously unknown deficiencies.

Best of luck to the Archive team to get things up and running again with minimal stress!

tgsovlerkhgsel · 6y ago

That would explain the random issues I saw recently (within the past ~12 hours) where I asked for a page version from 2019 and got one from 2018.

pcdoodle · 6y ago

It's been down for a few weeks for cnn.com (can't load Feb 1st to the current day). I wonder if they're getting pressure from somewhere. Check it out for yourself.
