Does anyone know of a place to host other sqlite dumps, from other websites? Recently I dumped the whole Hacker News API and it got me thinking about it.
I keep meaning to do the same thing with Wikipedia. But the Wikipedia dumps are so inscrutably named and so poorly documented that it seems like the organization doesn't want me to pursue the idea.
Pulling useful content out of the dumps has been an exercise in frustration. I'm sure I could figure something out if I had a bunch of time to dedicate to the effort.
If I just had sqlite dumps they'd be trivial to work with and I'd be much happier with them.
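To illustrate the "trivial to work with" point: this is a minimal sketch assuming a hypothetical dump with a `pages(title, text)` table (the table name and columns are invented for illustration; a tiny in-memory database stands in for the real dump file).

```python
import sqlite3

def demo():
    # Stand-in for connecting to a real dump, e.g. sqlite3.connect("wikipedia.db")
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pages (title TEXT, text TEXT)")
    con.executemany("INSERT INTO pages VALUES (?, ?)",
                    [("SQLite", "..."), ("Wikipedia", "...")])
    # With an actual sqlite dump, exploring it is just connect-and-query:
    titles = [t for (t,) in con.execute("SELECT title FROM pages ORDER BY title")]
    con.close()
    return titles
```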
Fandom is usually the first example I think of whenever I hear the word "enshittification". First Wikia ate all the independent wikis because they offered free/managed MediaWiki hosting. Then they slowly made Wikia worse until the full Fandomization. Now the site is literally unusable without an ad blocker, and all of that GFDL content on the site is locked behind obfuscation and incompetence. I desperately miss the old Wookieepedia and Memory Alpha.
Presumably they have a script that does something similar to that process, and then writes the resulting data into a predefined table structure.
Yep, my process is similar. It goes...
- decompress (users|posts)
- split into batches of 10,000
- xsltproc the batch into sql statements
- pipe the batches of statements into sqlite in parallel using flocks for coordination
On my M1 Max it takes about 40 minutes for the whole network. Then I compress each database with brotli which takes about 5 hours.
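The batch-and-lock idea in the steps above can be sketched roughly like this. This is a hypothetical Python equivalent, not the author's actual shell pipeline: a `threading.Lock` stands in for the flock coordination, a generated list of INSERT statements stands in for the xsltproc output, and the table name `posts` and batch size are invented for the demo (the real pipeline batches 10,000 rows).

```python
import sqlite3
import threading

BATCH_SIZE = 3  # the real pipeline uses batches of 10,000

def make_statements(n):
    # Stand-in for the xsltproc step, which turns XML rows into SQL.
    return [f"INSERT INTO posts (id, title) VALUES ({i}, 'post {i}');"
            for i in range(n)]

def batches(items, size):
    # Stand-in for the split step.
    for i in range(0, len(items), size):
        yield items[i:i + size]

db_lock = threading.Lock()

def apply_batch(db_path, batch):
    # One writer at a time, like flock around each sqlite invocation.
    with db_lock:
        con = sqlite3.connect(db_path)
        con.executescript("".join(batch))
        con.commit()
        con.close()

def main(db_path):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS posts (id INTEGER, title TEXT)")
    con.commit()
    con.close()

    stmts = make_statements(10)
    workers = [threading.Thread(target=apply_batch, args=(db_path, b))
               for b in batches(stmts, BATCH_SIZE)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    con = sqlite3.connect(db_path)
    count = con.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    con.close()
    return count
```

In the real shell version the lock matters because SQLite only allows one writer at a time; flock makes the parallel batch jobs queue up instead of failing with "database is locked" errors.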