For actually ingesting the archives, dignifiedquire developed a Rust utility aptly named zim, which you can find here: https://github.com/dignifiedquire/zim
Both repos contain information (and code, of course) on how to extract data from Zim archives.
It took a while to get accustomed to the format, but after looking at the files and doing a bit of research in the documentation, Python with lxml made it relatively straightforward to do what I was interested in.
I'd recommend doing the same, if only because it worked for me: get the XML dump, manually inspect some files to understand what is going on, search for documentation on the file format (and maybe read a few blog posts), then convert the XML files into data structures suited to what you're interested in.
===
from lxml import etree

sInputFileName = "/my/input/wiki_file.xml"

# Stream <doc> elements as they are parsed, so the whole
# dump never has to fit in memory.
context = etree.iterparse(sInputFileName, events=('end',), tag='doc')
for event, elem in context:
    sPageContents = elem.text or ""  # text can be None for empty docs
    iThisArticleCharLength = len(sPageContents)
    sPageURL = (elem.get("url") or "")[0:4000]
    sPageTitle = (elem.get("title") or "")[0:4000]
    # <do what you want with these vars...>
    # Free the parsed element to keep memory usage flat.
    elem.clear()
===
And I've changed it a little bit to extract only the first n characters, which may be useful since Wikipedia dumps tend to be pretty large: https://github.com/mooss/ruskea/blob/master/make_wiki_corpus...
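As a rough sketch of that truncation idea (the function name and sample data here are my own, not taken from the linked repo), you can slice each article's text to the first n characters inside the parse loop. This version uses the stdlib ElementTree iterparse and filters on the tag by hand; with lxml you'd pass tag='doc' to iterparse directly, as in the snippet above:

```python
from io import BytesIO
from xml.etree import ElementTree as ET

def first_n_chars(xml_bytes, n):
    """Yield (title, first n characters of text) for each <doc> element."""
    for event, elem in ET.iterparse(BytesIO(xml_bytes), events=('end',)):
        if elem.tag == 'doc':
            # text can be None for empty docs, so default to ""
            yield elem.get("title"), (elem.text or "")[:n]
            elem.clear()  # release memory as we go

sample = b'<dump><doc title="A" url="u">abcdefghij</doc></dump>'
print(list(first_n_chars(sample, 5)))  # -> [('A', 'abcde')]
```

Slicing with [:n] never raises on short articles, so there's no need to special-case documents shorter than n.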