By far the biggest factor influencing the success of an analytics project is that the company has a human who has the time and inclination to think and reason about the business. They figure out what questions are important to ask and then go look at the data to see what they find. Collecting the data is the easy part. There is no analytics product that asks & answers your most important business questions for you.
I enjoyed the jab at predictive modeling; it's almost comical how many companies dream about predictive when they haven't yet got basic tracking in place for what's _already_ happening in their business.
Love the post, thanks for sharing.
The effect of these marketing campaigns on would-be clients is terrible. They start going after crazy crackpot solutions to gain revenue while they haven't addressed the simplest, easy-to-reach, low-risk revenue gains. In a lot of cases, integrating complex side-effect data costs a lot and provides only marginal revenue gains.
The primary difference between a model and an insight is that insights require a human to process - anything more automatic is a model. Insights are easy to implement and are great for finding patterns and anomalies (the human mind is basically designed to pick these out). But the human element makes insights less scalable with significantly higher latency. For some problems these are unacceptable tradeoffs, and this has little to do with how stable a company's environment is. It's purely a product/strategy question, and about understanding all the tradeoffs.
Of course, that made no sense, so I dug a little deeper. You know what else people also buy when they buy DVD players? TVs. The DVD/furniture relationship was an artifact of the high degree of correlation between TVs and DVD players, which the visualization tool failed to account for.
I brought this up immediately, but received a tepid response. Of course, months later, I was still hearing about DVD players and furniture. It had become part of the institutional lore, and no facts were going to replace that.
And there was me thinking it made sense, because I'd done the same thing. A TV can stand on many surfaces, but once you've got a DVD player (some thin, wide rectangle) it makes a lot more sense to get a TV cabinet to put the DVD player in.
Perhaps not so much sense now, but 8-9 years ago I went through this logic.
What the business people don't appreciate is that the ML models don't know they're looking at DVD players and TV stands. They just know that vector elements 27 and 291 have the strongest correlation. It takes a human in the loop to say, "item 291 is technically just TV Stands, but we've done dimensional reduction to pool TVs, TV Stands, and Projectors all into cluster 14, which then correlates to cluster 12, consisting of DVD players and Xboxes".
If you look at the most "controversial" data science paper from 2013, where a study correlated intelligence with Liking the Facebook pages "Curly Fries" and "Thunderstorms" (here is a summary: http://www.wired.com/2013/03/facebook-like-research/), there were a lot of critics saying that there was no causation, that the correlation was unfounded, etc.
Of course, you would say the study "makes no sense". Intelligence can't be predicted by Facebook Likes. There is no correlation there, etc. But why not? If you read the paper (http://www.pnas.org/content/110/15/5802.full.pdf), their logic is sound. Are the marketing campaigns that the company bought based on the TV Stand<>DVD Player connection any different than other marketing campaigns? Facebook does all of its ad display based on similar data analysis, and it seems to be working for them.
Note: There is the not-so-hidden machine learning feedback loop now (explained better here: http://www.john-foreman.com/blog/the-perilous-world-of-machi...), where people Like the 'Curly Fries' and 'Thunderstorms' pages because of the research.
What? If a data scientist sees something that seems illogical, there is no reason not to investigate it and see if he/she can find a more logical explanation. Sure, if the effect seems real but unexplained, you can accept and use it, but advocating a kind of big data mysticism, "don't investigate, accept", seems to be buying into the senseless hype. And if you read the post, you'll notice the parent actually discovered the association was just an artifact of an easily explained correlation.
And, no, there's not much reason for companies to advertise just a TV stand and DVD player. Common sense tells one what the data actually means: those two items, by themselves, aren't and weren't the deep insight many people were dreaming about.
The article is very refreshing and I bookmarked the site. What I am more frustrated with is that a lot of people use this stupid term "big data" for things which do not fit the description. If it's structured, it's not big data. If it comes at 2MB/s it's not big data. If it fucking fits in your RAM, it most certainly is not big data.
(a) Association rules are big data when you are computing them on large data sets with many variables. I work at a company that sells tens of thousands of different products to tens of millions of customers. It definitely takes us a while to compute those rules.
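The computation itself is simple; the cost is in the combinatorics at scale. A brute-force sketch of pairwise rules over invented toy baskets (real systems at this scale use Apriori or FP-Growth to prune the pair space):

```python
from itertools import combinations
from collections import Counter

# Toy transaction log: each basket is a set of product IDs (invented).
baskets = [
    {"tv", "dvd_player", "tv_stand"},
    {"tv", "dvd_player"},
    {"tv", "tv_stand"},
    {"dvd_player", "tv_stand"},
    {"tv"},
    {"lamp", "tv_stand"},
]

def association_rules(baskets, min_support=0.3):
    """Brute-force pairwise rules: support(A,B) and confidence(A->B)."""
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue
        rules.append((a, b, support, c / item_counts[a]))  # conf(a -> b)
        rules.append((b, a, support, c / item_counts[b]))  # conf(b -> a)
    return rules

for a, b, sup, conf in sorted(association_rules(baskets),
                              key=lambda r: -r[3]):
    print(f"{a} -> {b}: support={sup:.2f} confidence={conf:.2f}")
```

With tens of thousands of products the candidate-pair space alone is in the hundreds of millions, which is why this takes "a while" on real catalogs.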
(b) The majority of big data is structured. For most big data projects the data is stored in old-school Oracle/Teradata/etc. data warehouses and shipped into a Hadoop cluster. It may not be consolidated, but it is definitely structured.
(c) The total RAM of our Hadoop cluster is 4TB, and ours is small. I would consider that big data in the sense that it overwhelms any application that tries to access the raw data directly.
I think that the hype and buzzwords around Big Data and data science cause more than just bad business decisions. I believe they are also damaging the industry and creating a larger sense of disillusionment (I'm mostly thinking of "deep learning"). Not sure what this means for data science in the long term though, just thinking out loud.
I'll also add that I frequently see sledgehammers being used to hang a picture frame. By that I mean using huge clusters to run algos that would actually run fine in Tableau, Excel, etc.
Secondly, the author seems to have conflated two different parts of the data science picture. Yes, great analysts who do amazing work are important. But that relies on (a) having data available and (b) having it in the right format. For those of us doing significant-volume ingestion, this is not trivial. Hadoop is painfully slow, and end-to-end data science tooling overall is slow, fragmented, and incomplete. Some of us do need vendors to be bold and come up with new technologies/approaches.
And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses? Weird that a data scientist would make predictions based on inadequate data.
I think you are doing Hadoop wrong, or confusing current technical reality with "Hadoop". Hadoop is very cheap, and it allows all the data to be in one place. This is huge for large-scale data science, because in the past we had to pull data across networks, fiddle, sample, and chuck. The business case for single enterprise data warehouses was difficult to make (because of the cost), and maintaining them when a CIO with vision did make the case was impossible: it took about 10 minutes for some genius to start running a tactical operational system on it, followed (in about 10 minutes more) by a howling call of rage from an MD about why his operational system was locked up due to someone running stupid queries, which was followed by a lockdown on queries in the warehouse.
If your Hadoop cluster is slow, then 1) move to CDH5, use Spark, use Impala, upgrade to 40GbE throughout, and make sure that you have balance in your architecture; for god's sake do not be telling people Hadoop is slow if you are using AWS. 2) Brew your own cluster with GPUs and the various crazy infrastructures supporting said architecture (good luck). 3) Go talk to an FPGA vendor or a supercomputer vendor and upgun (but you must be rich); Exalytics or Yark might work for you.
>And the point about IBM is just stupid. Did you ever think that maybe Watson DID help them slow their sales losses? Weird that a data scientist would make predictions based on inadequate data.
Every IBM rep I have met for the last 3 years has told me that Watson will deal with churn and provide better offer management. I have repeatedly tried to get POCs and always, always failed. Then we saw the Watson tools on Bluecloud, and all our suspicions of what Watson is and was were confirmed. Kudos to the Watson team: they spotted that Jeopardy questions can be rewritten as search queries, and that search responses can be rewritten as Jeopardy answers.
BTW, did anyone get far with DeepDive?
You're right, there is a lot of misinformation and hope about Hadoop out there, and I think there is a lot of value in Hadoop as a cheap data integration archive. But I think the parent poster's point still stands. A Hadoop-based infrastructure currently has a lot of impedance mismatch for full end-to-end advanced analytics involving stats, linear algebra, or graph computation in native, non-Java code.
I would love to see a TCO analysis on Hadoop+analytics versus buying a more traditional "supercomputer" stack with infiniband or one of the nifty Cray/SGI NUMA systems. Current data warehouse and BI folks are fixated on cost per PB of storage, and Hadoop is very cheap based on that single metric. I suspect that if enough human factors and accuracy/agility of modeling results are considered, the latter may be quite cost effective. It's just that the "big iron" vendors are still in the middle of retooling their marketing for the BI/DW/ETL crowd. When they finally figure it out, it's going to be a bloodbath.
For instance, SGI UVs can give me 24TB-64TB of RAM in a single "system". I still have to make sure I do multithreading/multiprocessing well, but the interconnects are lower latency than 40GbE. https://www.sgi.com/products/servers/uv/
HP ProLiants now can fit 48-60 cores and 6TB in a single 4U system: http://www8.hp.com/us/en/products/servers/proliant-servers.h...
Buying a few of these scale-up systems is a LOT cheaper than hundreds of Hadoop nodes sitting around maxing out I/O while their expensive Xeons sit at 10% CPU load. Especially given that you can hire anyone out of science/engineering grad school and they can program these scale-up systems, whereas writing a bunch of Java MR jobs for Hadoop is quite foreign to them.
Since slide decks get busy I moved my bibliography of links to a gist. So, while it didn't factor into my presentation I've now added this blog post. :-)
Maybe it's just not a good book for developers? shrug I would love to have a copy of that book that doesn't use Excel.
1) Excel is "visual" in the sense that you can watch the data change as you tweak things. There is no command line or program to execute; it's all happening live.
2) For programmers, there's no "well, I'm a Python guy and this book is written in Java, so it's not for me." None of us as coders really depend on Excel for writing code, so it's a way to take the technology decisions out of the equation. It's just the techniques.
All that being said, it's not trivial to port the logic of a spreadsheet over to code, and I think if anything that would make a great followup book.
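To give a feel for what that porting looks like, here's a hypothetical translation of one spreadsheet-style formula (a weighted average via SUMPRODUCT) into plain Python; the cell ranges and numbers are invented, not from the book:

```python
# Hypothetical Excel formula:  =SUMPRODUCT(B2:B5, C2:C5) / SUM(C2:C5)
values  = [4.0, 3.5, 5.0, 2.0]   # column B: e.g. ratings
weights = [10,  4,   1,   25]    # column C: e.g. counts

# SUMPRODUCT is an elementwise multiply-and-sum; SUM is just sum().
weighted_avg = sum(v * w for v, w in zip(values, weights)) / sum(weights)
print(round(weighted_avg, 3))  # -> 2.725
```

A single formula ports easily; the hard part is that a real workbook is dozens of such formulas wired together through cell references, and the dependency graph has to be reconstructed by hand.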
(disclaimer: I am in no way tied to John Foreman. Also, I work at a company that provides a data processing/collaboration SaaS...for big data! http://www.treasuredata.com)
A quote from the OP:
>If your business is currently too chaotic to support a complex model, don't build one. Focus on providing solid, simple analysis until an opportunity arises that is revenue-important enough and stable enough to merit the type of investment a full-fledged data science modeling effort requires.
This is consistent with what we see in our customers. The most common use case we see for processing big data boils down to generating reports.
Generating reports may sound really prosaic, but as I learned from our customers, most organizations are very, very far from providing access to their data in a cogent, accessible manner. Just to generate reports/summaries/basic descriptive statistics, incredibly complex enterprise architectures have been proposed, built by a cadre of enterprise architects and deployed with obscenely high maintenance subscription fees billed by various vendors. That's the reality at many companies.
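To be concrete, the bulk of that reporting work is grouped descriptive statistics; once the data is actually accessible, it's a few lines. A stdlib-only sketch over an invented event log:

```python
from collections import defaultdict
from statistics import mean, median

# Toy event log standing in for whatever the data pipeline lands.
events = [
    {"channel": "web",    "revenue": 120.0},
    {"channel": "web",    "revenue": 80.0},
    {"channel": "mobile", "revenue": 45.0},
    {"channel": "mobile", "revenue": 55.0},
    {"channel": "mobile", "revenue": 200.0},
]

# Group revenue figures by channel.
by_channel = defaultdict(list)
for e in events:
    by_channel[e["channel"]].append(e["revenue"])

# The whole "report": count, total, mean, median per group.
for channel, revs in sorted(by_channel.items()):
    print(f"{channel}: n={len(revs)} total={sum(revs):.2f} "
          f"mean={mean(revs):.2f} median={median(revs):.2f}")
```

The hard part is never these fifteen lines; it's getting clean, consolidated events into `events` in the first place, which is exactly where the enterprise architectures and vendor fees pile up.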
As bad and confusing as the buzzword "big data" is, one good byproduct is that it has forced slow-moving enterprises to rethink their data collection/storage/management/reporting systems.
Finally, I am starting to see folks do meaningful predictive modelling on top of large-ish data (in the order of terabytes). Some of them are our customers at Treasure Data, some aren't, but they are definitely not "build[ing] a clustering algorithm that leverages storm and the Twitter API" but actually doing the hard work of thinking through how (or if) the data they collect is meaningful and useful.
And that's a good thing.
In defense of the hype, many tools like storm are worth their hype many times over when used for the right application.
The author makes this distinction, but it can easily be lost in the post.
>A lot of vendors want to cast the problem as a technological one. That if only you had the right tools then your analytics could stay ahead of the changing business in time for your data to inform the change rather than lag behind it.
Many people, like the author, just don't get it, and that's fine. The same way people didn't get search before Google.
>But how do I feel good about my graduate degree if all I'm doing is pulling a median?
the graduate degree is what allows one to receive $Nx10e5/year (for a respectable value of N) for that pulling of a median
>If your goal is to positively impact the business, not to build a clustering algorithm that leverages storm and the Twitter API, you'll be OK.
on the other hand, if your goal is power(OK, OK) instead of just OK, then the clustering algorithm/storm/Twitter is the way to go.