However, 5 GB of data is essentially nothing, and that statement holds until your data size is at least 50-60 GB. Given that 64 GB RAM machines are now commodity hardware, I would just load the entire thing into RAM and write a multi-threaded program. Sounds old school, but regardless of how well documented Hadoop, Spark, and Storm are, there is still a learning curve and a maintenance cost, both of which are only worth paying if you see your data rapidly growing to the X TB range. Otherwise, it might just be easier to stick with a single machine and get stuff done.
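To make the "old school" approach concrete, here is a minimal sketch in Java (the record count and filter are hypothetical placeholders for your actual dataset and algorithm): materialize the data in memory once, then let a parallel stream fan the work out across all cores, with no cluster or job scheduler involved.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class InMemory {
    public static void main(String[] args) {
        // Hypothetical stand-in for a dataset that fits comfortably
        // in RAM: a few million records materialized as a List.
        List<Long> records = LongStream.range(0, 5_000_000)
                .boxed()
                .collect(Collectors.toList());

        // parallelStream() distributes the computation across all
        // available cores via the common fork/join pool.
        long evens = records.parallelStream()
                .filter(r -> r % 2 == 0)
                .count();

        System.out.println(evens); // 2500000
    }
}
```

On a commodity 64 GB box this pattern comfortably handles data in the tens of GB, which is exactly the regime the paragraph above is describing.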
You can stick to Scala/Java, and as long as you develop good abstractions around your core algorithms, you can always move to Spark/Hadoop when you need to. Feel free to send me an email if you want to talk more (email in profile).
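One hedged sketch of what such an abstraction might look like (the `Dataset` interface and `LocalDataset` class below are hypothetical names, not any real library's API): the algorithm is written against a small interface, so an in-memory backend can later be swapped for a Spark-backed one without touching the algorithm code.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical abstraction: the core algorithm only sees this
// interface, never the execution engine behind it.
interface Dataset<T> {
    <R> Dataset<R> map(Function<T, R> f);
    long count();
}

// Single-machine, in-memory backend. A Spark-backed implementation
// of the same interface could replace it later with no changes to
// code written against Dataset<T>.
class LocalDataset<T> implements Dataset<T> {
    private final List<T> data;

    LocalDataset(List<T> data) { this.data = data; }

    public <R> Dataset<R> map(Function<T, R> f) {
        return new LocalDataset<>(
                data.stream().map(f).collect(Collectors.toList()));
    }

    public long count() { return data.size(); }
}

public class AbstractionSketch {
    public static void main(String[] args) {
        Dataset<Integer> ds = new LocalDataset<>(List.of(1, 2, 3));
        System.out.println(ds.map(x -> x * 10).count()); // 3
    }
}
```

The point is only that the migration cost stays low: if the data does grow into the TB range, you reimplement the interface, not the algorithms.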