undefined | Better HN

0 pointsstupidcar9y ago0 comments

Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I'm did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java where you get thread pools and synchronized thread-safe collections out of the box.

And Lambda would have been totally impractical. As I said, I had hundred of gigs of data to process. If I'd been uploading this over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used in the first place, after all. But for this particular use-case, I didn't find it was very good fit.

0 comments

2 comments · 1 top-level

LunaSea9y ago· 1 in thread

Threading for a crawler is just a dirty way of not handling distribution. When you will need more than one server your threads won't save you. It has nothing to do with Node.js and thread support.

stupidcarOP9y ago

I wasn't creating a new search engine, I was doing a one-off scraping job in my spare time. Creating a fully distributed solution would have been total overkill. But threading could and would have helped.

Honestly, stupidly hostile and ignorant comments like this are the absolute worst thing about Hacker News.

j / k navigate · click thread line to collapse