How Akka Streams can be used to process the Wikidata dump in parallel

(engineering.intenthq.com)

108 points · ArturSoler · 11y ago · 19 comments

frik · 11y ago · 5 in thread

> Process the whole Wikidata in 7 minutes with your laptop

Wikidata is several magnitudes smaller than Freebase (closed by Google in May), and it won't fit in your laptop's RAM.

xomateix · 11y ago

As you say, Freebase is bigger than Wikidata. It is 22GB compressed (250GB uncompressed), while Wikidata is 5GB compressed (49GB uncompressed) [1].

That said, I believe the process described in the blog post does not load the whole Wikidata dump into memory, and it would work just as well for processing Freebase or even larger data dumps on your laptop.

From the post: How Akka Streams can be used to process the Wikidata dump in parallel and using constant memory with just your laptop.

[1] https://developers.google.com/freebase/data http://dumps.wikimedia.org/other/wikidata/
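
A minimal Python sketch of the constant-memory idea discussed above. It is not the article's Akka Streams code, only a generator-based analogue. It assumes the dump is a gzip-compressed JSON array with one entity per line and a trailing comma on each entity line; the function names and the sample data are hypothetical.

```python
import gzip
import io
import json


def iter_entities(stream):
    """Yield one parsed entity per line, holding only the current line in memory."""
    for raw in stream:
        # Assumed layout: "[" on the first line, "]" on the last,
        # and one JSON object per line followed by a comma.
        line = raw.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the array brackets and blank lines
        yield json.loads(line)


def count_entities(path):
    """Count entities in a gzip-compressed dump without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        return sum(1 for _ in iter_entities(stream))


# Tiny in-memory stand-in for the real dump (hypothetical data).
sample = io.StringIO('[\n{"id": "Q1"},\n{"id": "Q2"}\n]\n')
ids = [entity["id"] for entity in iter_entities(sample)]
```

Because `iter_entities` is a generator, memory use depends on the longest line rather than on the size of the dump, which is why the same approach would scale to a larger file.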

thibaut_barrere · 11y ago

What are your favorite large, publicly available datasets?

Smerity · 11y ago

Biased reply (I'm a data scientist there): Common Crawl[1]. We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone completely free.

[1]: http://commoncrawl.org/

rcpt · 11y ago

This thread is pretty good

http://www.quora.com/Where-can-I-find-large-datasets-open-to...

nextos · 11y ago

The Cancer Genome Atlas, Ensembl, 1000Genomes.

mtrn · 11y ago · 4 in thread

On a related note: when I indexed the whole English Wikipedia last year, I was surprised that it was possible to have a JSON version of it indexed[1] and searchable within half an hour on my laptop.

[1] Using parallel bulk indexer for ES: https://github.com/miku/esbulk

mhuffman · 11y ago

How about ~17 minutes (including Wikipedia data download and extraction time)! Using json-wikipedia[1] and lbzip2.

[1] https://github.com/diegoceccarelli/json-wikipedia

mtrn · 11y ago

Thanks, JSON exports make Wikipedia data much more approachable.

andrewvc · 11y ago

I wrote a similar thing with an emphasis on multiple analyzed fields (making it slower to index) but much more flexible to query.

https://github.com/andrewvc/wikiparse/tree/java

That being said, when it comes to indexing Wikipedia, the indexing itself is already spread well across multiple threads internally by Elasticsearch. Multithreading the reading/parsing isn't a huge win. Doing decompression in a separate thread, however, is.
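
A rough Python sketch of the "decompress in a separate thread" idea from this comment. It is not taken from wikiparse; the function name, the sentinel, and the queue size are assumptions made for illustration. A background thread decompresses and feeds a bounded queue, so the consumer can parse or index while the next lines are being decompressed.

```python
import gzip
import io
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def decompress_in_background(fileobj, max_lines=1000):
    """Yield decompressed lines produced by a background thread."""
    buffer = queue.Queue(maxsize=max_lines)  # bounded, so memory stays flat

    def produce():
        try:
            with gzip.open(fileobj, "rt", encoding="utf-8") as stream:
                for line in stream:
                    buffer.put(line)  # blocks while the consumer is behind
        finally:
            buffer.put(_DONE)  # always signal the end, even after an error

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


# Hypothetical in-memory data standing in for a compressed dump.
compressed = io.BytesIO(gzip.compress(b"first\nsecond\nthird\n"))
lines = [line.strip() for line in decompress_in_background(compressed)]
```

The bounded queue is the design choice that matters here: if the consumer falls behind, the producer blocks instead of buffering the whole file.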

mtrn · 11y ago

Yes, ES uses multiple threads nicely. But as you move to 32 or 64 cores - in my experience - a single-threaded client won't keep ES/Lucene busy enough.

With Solr, it's similar:

> Sometimes you need to index a bunch of documents really, really fast. [...] The solution is two-fold: batching and multi-threading

From: http://lucidworks.com/blog/high-throughput-indexing-in-solr/
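
The "batching and multi-threading" combination quoted above could be sketched in Python as follows. This is a generic illustration and not the esbulk or Solr client code: `index_batch` is a hypothetical stand-in for one bulk request, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def index_batch(batch):
    """Hypothetical stand-in for one bulk request; returns the number of docs sent."""
    return len(batch)


def index_all(docs, batch_size=1000, workers=4):
    """Send batches from several threads and return the total number indexed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(index_batch, batched(docs, batch_size)))


# Hypothetical documents standing in for parsed Wikipedia pages.
total = index_all(({"id": n} for n in range(2500)), batch_size=1000, workers=4)
```

Note that `pool.map` submits every batch up front, so this sketch trades constant memory for simplicity; a real client would probably cap the number of batches in flight.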

cristianpascu · 11y ago · 3 in thread

From their video: The presenter: "Why would you (the assistant lady) be interested in cars?" The assistant: "I'm the perfect chick to be into Maserati."

It's a bit disturbing to see an employee presenting her personal life, kids, interests, and whatnot. Good job, IntentHQ!

The video: https://www.intenthq.com/resources/interest-fingerprint/

laumars · 11y ago

I wouldn't say it was disturbing, but it was definitely cringeworthy. A lot of their blog feels that way. They've gone for an informal corporate approach, like using puns[1] and memes[2] as headings. Even that video felt badly scripted; it was meant to sound like an informal pub conversation, but instead it came off as awkward and unprofessional.

I'm sure their products are of the highest quality, but their blog isn't a great advert in my opinion.

[1] http://engineering.intenthq.com/2015/06/for-those-about-to-c...

[2] http://engineering.intenthq.com/2015/06/wikidata-akka-stream...

cristianpascu · 11y ago

I have to say I have mixed feelings about the video. On one hand, I understand there's a whole world of people out there, and I don't mind openness and honesty. Big thumbs up to her for being honest and cool. On the other hand, it goes the other way when you're bragging about your awesome product that analyzes people's lives and sells that info to corporations.

dsabanin · 11y ago

Why is it disturbing to see someone who is open and not afraid of a bunch of kids and freaks on the Internet? :)

jimbokun · 11y ago · 1 in thread

Could this example have been accomplished with awk and xargs just as fast, with the same or less memory usage, in fewer lines of code?

Seems so to me after skimming the article, but maybe I missed an important advantage of using Akka Streams for this task?

thelastnode · 11y ago

Yes, the initial parts of the example could be accomplished with awk and xargs, but as the article goes on to demonstrate, even doing something like printing every nth element would be difficult.

I think the intent was for this to be more of a demonstrative example, and with a more complex, evolving, real-world processing pipeline, Akka Streams could be really useful.
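
For comparison, the "every nth element" step mentioned above can be written lazily in Python with the standard library. This is only a sketch of the idea, not the article's Akka Streams stage; `every_nth` is a hypothetical helper name.

```python
from itertools import islice


def every_nth(items, n):
    """Lazily yield every nth item of an iterable (the nth, the 2nth, ...)."""
    return islice(items, n - 1, None, n)


# Works on any iterator, so it can sit in the middle of a streaming pipeline.
result = list(every_nth(iter(range(1, 11)), 3))
```

Here `result` is `[3, 6, 9]`, and nothing beyond the current element is held in memory.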

MrDosu · 11y ago

Are streaming JSON parsers that rare?
