https://sourceforge.net/p/kiwix/discussion/604121/thread/1f2...
Rendering wikitext is challenging though, since wikitext can include chunks of other wikitext, and wikitext can use some pretty complicated templating functionality.
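To make the "wikitext includes other wikitext" problem concrete, here is a toy sketch of recursive transclusion. It is not how MediaWiki actually renders pages: the `TEMPLATES` store, the `expand` helper, and the depth limit are all made up for illustration, and it ignores template parameters and parser functions, which are where the real complexity lives.

```python
import re

# Hypothetical in-memory template store; a real renderer would load these from a dump.
TEMPLATES = {
    "Greeting": "Hello, {{Name}}!",
    "Name": "world",
}

# Matches the simplest form only: {{Title}} with no parameters and no nesting inside.
TRANSCLUSION = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text, templates, depth=0, max_depth=10):
    """Recursively replace {{Title}} with the named template's expanded body."""
    if depth > max_depth:
        raise RecursionError("template nesting too deep (possible loop)")

    def substitute(match):
        title = match.group(1).strip()
        if title not in templates:
            return match.group(0)  # leave unknown templates untouched
        return expand(templates[title], templates, depth + 1, max_depth)

    return TRANSCLUSION.sub(substitute, text)


rendered = expand("{{Greeting}} This is wikitext.", TEMPLATES)
print(rendered)
```

Even this toy version needs a guard against templates that include themselves, which hints at why a full renderer is a lot of work.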
Oddly enough, where I've run into the biggest issues is in weird slowdowns of the Python WARCIO library that make dealing with large archives just about impossible. I haven't had time to really track that down, but if anyone wants to, it's pretty easy to reproduce: just try adding a few million lorem-ipsum articles and look at how far from linear time it's running.
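A rough sketch of the kind of measurement meant here, using only the standard library. The `write_articles` function is a stand-in that appends one gzip member per article to an in-memory buffer (similar in spirit to record-at-a-time compression); it does not call WARCIO, so to check the library itself you would replace its body with your own record-writing code. The names `write_articles` and `per_article_times` are made up for this example.

```python
import gzip
import io
import time

LOREM = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20


def write_articles(count):
    """Stand-in writer: append `count` gzip members to one in-memory archive.

    Assumption: to test WARCIO itself, swap this body for code that writes
    `count` records through that library.
    """
    buffer = io.BytesIO()
    for index in range(count):
        payload = b"article %d\n" % index + LOREM
        buffer.write(gzip.compress(payload))
    return buffer.tell()  # total bytes written


def per_article_times(writer, sizes):
    """Return (count, seconds per article) for each archive size."""
    results = []
    for count in sizes:
        start = time.perf_counter()
        writer(count)
        elapsed = time.perf_counter() - start
        results.append((count, elapsed / count))
    return results


timings = per_article_times(write_articles, [1_000, 2_000, 4_000])
for count, seconds in timings:
    print(f"{count:>8} articles: {seconds * 1e6:.1f} microseconds per article")
```

If the writer is linear, the per-article cost should stay roughly flat as the count doubles; a per-article cost that keeps climbing is the symptom described above.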
There are a lot of advantages to starting from a dump: you can provide much better tools for filtering articles, and probably even provide rudimentary document classification. You can also do things like re-compress and minify images; a dump intended for a cellphone probably doesn't need 4K images.
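As a minimal sketch of what filtering from a dump could look like, the snippet below streams `<page>` elements out of a MediaWiki-style XML export and keeps only main-namespace articles. The tiny `SAMPLE_DUMP` is invented and omits the XML namespace that real exports declare on the root element (the `local_name` helper is there to cope with that), and the `keep` predicate is a hypothetical placeholder for whatever filter or classifier you actually want.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample in the general shape of a MediaWiki XML export.
SAMPLE_DUMP = b"""<mediawiki>
  <page><title>Apple</title><ns>0</ns><revision><text>A fruit.</text></revision></page>
  <page><title>Talk:Apple</title><ns>1</ns><revision><text>Discussion.</text></revision></page>
  <page><title>Banana</title><ns>0</ns><revision><text>Another fruit.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, namespace, text) per <page>, clearing each page once consumed."""
    for _event, element in ET.iterparse(stream, events=("end",)):
        if local_name(element.tag) != "page":
            continue
        fields = {}
        for node in element.iter():
            name = local_name(node.tag)
            if name in ("title", "ns", "text"):
                fields[name] = node.text or ""
        yield fields.get("title", ""), fields.get("ns", ""), fields.get("text", "")
        element.clear()  # release the parsed subtree to keep memory bounded


def keep(title, namespace, text):
    """Hypothetical filter: keep main-namespace pages only."""
    return namespace == "0"


selected = [
    title
    for title, namespace, text in iter_pages(io.BytesIO(SAMPLE_DUMP))
    if keep(title, namespace, text)
]
print(selected)
```

The same loop is a natural place to hang other per-article decisions, such as a size cutoff or a topic classifier.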
WARC is also probably a better tool for distributing web-archive type content, like Wikipedia dumps. You can distribute a package of text content and image content as separate files, for example. Generally, I have not been very impressed with the quality of ZIM file tooling. One disadvantage is that you need to provide separate search indexing, but that's doable.
I'd love to be able to get a Wikimedia grant to work on this and take on less contract work, but so far their grant process is pretty hard to follow.
Or check out some of the other options here:
- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
You will be glad to know that they've protected executive bonuses, though.
http://blog.archive.org/2020/03/24/announcing-a-national-eme...
So I'm guessing that they've shifted resources.
This is good though, as they're now hopefully aware of some previously unknown deficiencies.
Best of luck to the Archive team to get things up and running again with minimal stress!