Most problems are not Big Data problems. The size a problem must reach before it qualifies as a Big Data problem grows every day, as machines with ever more cores and memory become available. `sed`, `awk`, `grep`, `sort`, `join`, and so forth are some of the least appreciated tools in the Unix toolbox.
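As a sketch of what those tools can do on a normal-data problem (hypothetical file names and columns, nothing from a real workload):

```shell
# Hypothetical example: join two whitespace-delimited files on their
# first column and sum a numeric field, using only classic Unix tools.
printf 'a 1\nb 2\n' > ids.txt
printf 'a 10\nb 20\na 5\n' > values.txt
sort -k1,1 values.txt \
  | join ids.txt - \
  | awk '{sum[$1] += $3} END {for (k in sum) print k, sum[k]}' \
  | sort
# → a 15
#   b 20
```

Note that `join` expects both inputs sorted on the join field, hence the leading `sort`.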
People want to think they have Big Data problems, but they probably just have plain old normal-data problems. I have had to unwind the ridiculous, heavyweight Big Data solutions to normal-data problems that "kids today" love.
If you don't work for Netflix or Google or Facebook or insert maybe a hundred other companies here, you probably do not have a Big Data problem.
Upside: I had one of the best-paying positions among the technical staff. I also got to play with expensive stuff.
Downside: it was soul-crushing. I was delivering no value whatsoever and had a really hard time looking my colleagues in the eye, as they were making a third of my salary (at best).
I got out, joined a new company with that in mind, and now have a very exciting job. They do have a dedicated R&D/Data Science team, which is a shit show: absolutely brilliant people completely wasted, since they can't build anything for lack of programming/technology experience, in an environment where their theoretical skills are mostly useless. I'm genuinely sad for them.
[edit: one of the companies still has a Hadoop/Spark cluster for their whopping 500 MB of data]
However, I'm really glad to find out about SnappyData.io; that's going to save me a lot of time waiting. My perfect dream would be if they allowed running any programming language inside an environment like Jupyter.org or BeakerNotebook.com, but with Pandoc.org Markdown, so that I could essentially program full time while also documenting the work and exporting that documentation to a good-looking LaTeX thesis.
I disagree here. I have worked across multiple scenarios that warranted big data solutions, and such solutions were not feasible before Apache Spark and its kin were available. Even our current startup (www.aihello.com) has 8.7 million products, and computing LDA + cosine similarity over them involves trillions of matrix operations, which is simply not feasible with traditional tools.
Telstra/Sensis, the telecom company in Australia I consulted for, went from month-delayed reporting to near-real-time reporting thanks to Apache Spark.
Also keep in mind that the scale of data is growing exponentially for all of us, since storage is getting cheaper and big data analysis is proving to be a game changer in many scenarios.
https://aadrake.com/command-line-tools-can-be-235x-faster-th...
I'll also take this opportunity to plug Make and Drake for manipulating data in a replicable way:
https://bost.ocks.org/mike/make/
https://github.com/Factual/drake
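As a sketch of the idea (hypothetical file names), a Makefile turns a data pipeline into a replicable, incremental build: each target reruns only when its inputs change.

```make
# Hypothetical pipeline: raw.csv -> clean.csv -> report.txt
report.txt: clean.csv
	awk -F, '{sum += $$2} END {print "total:", sum}' clean.csv > report.txt

clean.csv: raw.csv
	grep -v '^#' raw.csv | sort > clean.csv

clean:
	rm -f clean.csv report.txt
.PHONY: clean
```

Touch `raw.csv` and only the affected targets rebuild; Drake applies the same idea with a data-workflow-oriented syntax.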
If you're processing data using tools that cannot trace their ancestry directly to some time before 1985, you're probably wasting your own and your colleagues' time.
This reminds me of the time, way back when, that a coworker told me about how our customer was filling a rack with a terabyte of hard drives. My eyes bulged a little bit to think of it. Now I chuckle to think that the laptop I had two laptops ago had a terabyte drive in it.
> Consulting service: you bring your big data problems to me, I say "your data set fits in RAM", you pay me $10,000 for saving you $500,000.
Considering that https://www.supermicro.com/products/system/4U/8048/SYS-8048B... (a plain old 4U server, not some fancy, super-expensive NUMA machine) can take up to 12TB of memory, this quip and its parent have quite some merit.
6TB is not even horrible at $57,504 (https://memory.net/product/s26361-f3843-e618-fujitsu-1x-64gb...). That's about 48 engineering days if your engineer-related expenses are $150 an hour (and they're likely more).
> SGI UV 300 now scales up to 64 CPU sockets and 64TB of cache-coherent shared memory in a single system.
This is the current limit of Linux's hardware memory support, so going above it is tricky. But still, 64TB.
There are use cases other than mere size that can necessitate "big data" solutions, e.g. timeliness, resiliency, maintainability...
If you are building production data processing systems that have constraints on data size, latency, resiliency, scheduling, dependency management, etc., you might be better off with a "big data" system. Even if the data could all fit on a beefy box. This was a painful lesson for me to learn.
That leaves resiliency and "etc." I can't answer etc., but: how is resilience helped by a big data solution? That seems like Lampson's distributed system: more machines, but you need k-of-n, k > 1. Better to just mirror to two machines with the data in RAM.
If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.
At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.
> how is resilience helped with a big data solution?
The "R" in Spark's RDD abstraction stands for "Resilient". Node failures and replication failures can be recovered from without your even knowing it.
Sure, you can write all this stuff from scratch every time you encounter them (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.