But I'm really just a developer who's good at databases and ETL, along with my regular tasks of writing near-realtime background processing systems, web APIs, SQL, etc.
I think the data science industry seems to have been massively overhyped, and now they want people who can use AI and statistical learning methods and all this other stuff I don't know to do plain old data engineer work.
A sad outcome for a discipline that once held so much promise.
On the other side of things, this might be rarer than you think.
My experience is that a lot of newish programmers have very little database experience. What they do have is often centered on Mongo or other non-relational stores, used more for persistent storage than as interactive entities. The ability to get info out of a SQL database is pretty standard, obviously. But handling aggregated or joined tables is not entirely standard. (Interviewing for an entry-level backend dev job at a major company, I was pretty startled to have the databases section cap out at 'group by' and 'join'.) And anticipating error sources (e.g. MySQL's rollup handling), reading and responding to 'explain' plans, or knowing about backend issues like InnoDB settings is well outside a lot of developers' familiarity.
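For anyone wondering what the 'group by', 'join', and 'explain' level looks like in practice, here is a minimal sketch. It assumes SQLite through Python's built-in sqlite3 module rather than MySQL/InnoDB, and the customers/orders tables and the index name are made up for illustration. SQLite's EXPLAIN QUERY PLAN output is also not the same as MySQL's EXPLAIN, so treat this as the general idea only.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5);
""")

# Join orders to customers, then aggregate per region.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
totals = conn.execute(query).fetchall()


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Compare the plan before and after adding an index on the join key.
# Whether the planner actually uses the index is up to SQLite.
plan_before = plan(query)
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)

print(totals)
print(plan_before)
print(plan_after)
```

Reading the two plans side by side is the habit that matters. The specific index here is only an example and may or may not change what the planner does on a table this small.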
I assume part of this is the heavy focus of bootcamps and some college programs on building web apps, and the optional status of databases classes in many college CS programs. But I could imagine a lot of other factors stopping people from picking it up elsewhere, like the changing divisions among DBA/SysAdmin/DevOps/SRE.
So on one end, a data science boom turned out a lot of people with advanced skills in a field with lots of simple work, and at the other there's a gap in developer knowledge which makes it convenient to hire highly-trained people and dump them into roles that are a mix of analyst and DBA work.
I wouldn't have a job if this were true. And on a related note, I've found software engineers to be generally poor at writing queries (compared to DBAs/Analysts/Data Scientists).
Cathy O'Neil and Rachel Schutt's book, Doing Data Science (http://shop.oreilly.com/product/0636920028529.do) covers this almost immediately. Because the term developed sort of organically as a cross-discipline approach to solving certain challenging problems, students in their classes were often a mix of scientists, statisticians, and software/database engineers like yourself.
So, you may not consider yourself "a data scientist", and that's fine, but there's certainly a role for your specialization in data science, and that doesn't at all indicate trouble in the field. On the contrary, it's exciting that there's a marriage of these specializations underway.
If I had the time, I'd write a much more in-depth reply along the same lines to many of the criticisms in the article. The lack of a clear definition for "data science" or "data scientist" has caused some confusion, but at the same time, there is new technology available and new approaches to working with large-scale datasets that weren't available before, and that does represent new skillsets.
They, in turn, realize their predicament and set out building their own such infrastructure to the best of their abilities. Sometimes they do okay and squeeze out real value, but they're never going to produce the cutting-edge AI and prescriptive analytics that execs think they will.
I think (one of) the problems with the data science career field is that there are a lot of juniors who want to run sklearn and call it a day, following tutorials that seem to 'just work' in a way that real-world data never does without a fight.
To get value out of the work, you have to be methodical, careful, and really dig into the data. The observation that 85% of the time is cleaning doesn't eliminate the need to know what you're doing, what approaches to use, how to judge success, how to communicate results, etc.
Another thing to consider: I've found big, boring companies are usually better to do DS at than small ones. Big, boring companies have better discipline in collecting and managing data. Also, a 1% improvement to an existing process matters a lot at BigCo, and very little at a startup - and a lot of DS models are that sort of incremental progress over rules engines or heuristics.
1) 'The Old Guard' who are extremely skeptical. They tend to be extremely dismissive of models and predictions, and distrust anything but the most basic analysis. If they can't do the analysis in an Excel spreadsheet it's too complicated and "will never work". These people tend to be Engineers (mechanical and chem type) and Plant Operations roles. A lot of the time there is value in listening to their skepticism but they tend to be extremely conservative by nature (Fortran ought to be enough for anyone...).
2) 'The Optimists', people who think "big data" and "machine learning" are the panacea for every problem in our org. To these people a prediction is as good as a real measurement - they trust forecasting implicitly. They have probably read an article somewhere about machine learning but don't really grasp any of the intricacies. These people tend to be in logistics/accounting/finance type roles and a large part of my job tends to be spent in phone calls with these people explaining why their forecasts did not match the actual results.
3) 'The KPI guy' - usually a manager who is somewhat out of their depth who wants to distill everything he can into a single number which can be displayed on a dashboard. The end result is a Dilbert-esque situation where the 'KPI guy' decides that to make his mark in the org he needs to come up with a new metric. You end up with the bizarre situation where people are discussing a 'super metric' made by combining other metrics into a single number. I also spend a lot of time on the phone with these guys because they forget what underpins their super metrics and don't understand all the subtleties they've distilled out of the data by focusing so much on higher level metrics. They get angry when you question the value of their dashboard. Whenever someone starts talking about "Yield", "OEE", or "DIFOT", there's a good chance they are a 'KPI guy'.
Most of my job is balancing out interactions between the three 'customers': tempering the Optimists' enthusiasm, reining in the KPI guys and nudging the Old Guard.
Personally, getting stuff done with data in this environment is more satisfying than using the latest neural network. I presume you're the same?
There are other pathologies, too, but it's amazing how much worldview and the basic behavioral psychology elements of high/low trust and autonomy manifest themselves in what should be "objective" analytics projects.
I think it would make things much more interesting
This post rings extremely true to my experience, and largely aligns with what I've been telling people for the last couple of years. I see so many bootcamp or Masters grads with a wildly skewed understanding of what the job entails. I also see a lot of MBA types diluting the meaning of the DS term as a whole.
A "data science" curriculum as such will basically prepare you only for an analyst role. You're not going to be able to compete with the glut of science PhDs flooding every open role, either. DS may be your title but you will not be doing any of the exciting things you want to be doing. To differentiate yourself you need to specialize, and good engineering skills are a prime way to do that.
That's almost certainly diluting the term but it's much closer to the work I do than the title might imply. Since 90% of business problems can be solved with regression, typically logistic or decision trees, knowing what tools are appropriate to apply to a certain problem is more valuable than being able to actually write those tools. Bootcamps don't spend enough time on the why of what we do because I think it's just something you pick up through experience.
All these trendy terms eventually devolve into noisy marketing to attract talent.
Even at the companies that used to be famous for just making tires.
In my years in finance, there was a similar problem. One guy in particular I worked with reckoned himself an "ideas guy" and would simply spout gibberish that he expected the rest of us to implement. He could barely use Excel himself, let alone code.
The fact is the best coders I met never fancied themselves as specialists. They could certainly fit some models for you, but they could also write some SQL, set up replication and other maintenance, write cron jobs, set up ssh keys, merge some git branches, and write front and back end code in several different languages, declarative and imperative. I always put it down to a mix of curiosity and humility, giving these people a very good grasp of the fundamentals plus a foothold in almost every area of coding that I could think of.
As it should be. In order to have confidence in your ML you need to really understand your data and data processing.
Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.
Same in real science - for every minute you spend thinking about what nature might be doing, you spend tens of hours carrying things around, mixing things, checking things, repeating things, etc. This is how all real work is.
Most modern languages are procedural: Java, Python, Scala, R, Go, etc.
If someone has a friend who does Scala, can they read them this quote and film the reaction? Thanks.
> Isn’t SQL a programming language? It is, but it’s declarative. You specify the outputs you want (i.e. which columns from your table you want to pull), but not how those columns are actually returned to you. SQL abstracts a lot of what’s going on under the covers of a database.
You want a procedural language, one where you have to specify how and where the data is selected from. Most modern languages are procedural: Java, Python, Scala, R, Go, etc.
The author is trying to contrast fully Turing complete languages with a declarative domain specific language like SQL. (Yes, I know that some extensions provided by various database implementations make SQL Turing-complete.) Unfortunately, the word she chose to express this is already a term-of-art in the programming language world which means something different. Luckily, we're all charitable readers, so we can correct on the fly and understand what she meant.
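To make the distinction concrete, here is a small sketch of what the quoted author seems to be reaching for, which is usually called declarative versus imperative. It assumes SQLite via Python's sqlite3 module, and the staff table and its rows are invented for illustration. The SQL says what result is wanted; the loop spells out how to compute it.

```python
import sqlite3

# Hypothetical data: (name, dept, salary).
rows = [("ann", "eng", 120), ("bob", "eng", 100),
        ("cat", "ops", 90), ("dan", "ops", 70)]

# Declarative: describe the result; the database decides how to produce it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?)", rows)
declarative = conn.execute(
    "SELECT dept, MAX(salary) FROM staff GROUP BY dept ORDER BY dept"
).fetchall()

# Imperative: spell out every step of the same computation by hand.
best = {}
for _name, dept, salary in rows:
    if dept not in best or salary > best[dept]:
        best[dept] = salary
imperative = sorted(best.items())

print(declarative)
print(imperative)
```

Both give the highest salary per department. The difference is only in who decides the scan order, the grouping strategy, and so on: the query planner in the first case, the programmer in the second.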
The first generation of machine learning experts were proper scientists with proper Ph.D. degrees, academic track records, etc. They would typically be very opinionated on which algorithms to use (and had quite possibly written a few of their own) but were not necessarily experienced engineers. I saw a lot of clumsy engineering and convoluted testing and evaluation processes.
This explains a lot about the current state of the art which involves a lot of tools that are aimed at people who are not primarily engineers and need to be shielded from complex infrastructure and code but do know a lot about statistics, machine learning algorithms, and all the stuff that first generation machine learning experts would know.
The second generation of machine learning experts is basically riding an ongoing commoditization boom. They use toolkits from Google, Facebook and others pretty much as is. These tools are easy to use for them but not necessarily for non expert engineers that know a lot about pumping data around but not necessarily about machine learning algorithms. This is getting a lot easier. I've heard of high school kids getting ML jobs with no college training whatsoever and just high school math and a bit of online training. My impression is that you can get nice results with a little effort.
The next generation of machine learning engineers won't be scientists and they'll indeed mostly work on manipulating data. All the machine learning algorithms will be provided in the form of black box libraries and tools that will mostly work in a fully automated mode. IMHO the whole point of deep learning is that the algorithms figure things out by themselves. Even the job of picking the right algorithms and configuring them is ultimately going to be something that machine learning algorithms will be better at than a junior engineer with no relevant scientific background.
Or indeed an experienced software engineer with a classic computer science background, like myself. I have no clue what e.g. a tensor is. Articles on the topic seem to be very math heavy and tend to give me headaches. But should I even have to care to be able to configure some black boxes that process data and produce models that I can plug into my runtime? My pet theory is that we're already past that point and that lots of companies are getting decent results not having to care about the underlying algorithms already.
I went to a great meetup at Soundcloud last week about how they used off the shelf machine learning tooling to improve their search ranking in Elasticsearch. It was all about the training data, the parameters in the search query that they wanted to machine learn, their tooling for evaluating model performance in terms of being able to rank real queries against real data, tooling for annotating training data, integrating models with their software, the devops for retraining the models, etc.
My experience working with the machine learning team in the search group at Nokia Maps (now Here) eight years ago was that the tools were an obstacle to getting results fast and that iterations on model improvements were measured in months. A lot of engineering went into things like feature extraction, model tuning, and other stuff that scientists do as well as building essentially all of the tools from the ground up so that models could actually be generated, evaluated, and integrated. Only problem: many of these people weren't experienced engineers so the tools were kind of clunky and there were lots of integration headaches, insanely long integration cycles, and lots of missed opportunities to fix (rather obvious) data problems due to a bias towards endless tweaking of algorithms instead of applying pragmatic fixes to the data. It kind of worked and the search wasn't horrible but the biggest problem was that the underlying data wasn't great to begin with (mis-categorized, full of duplicates, incomplete/stale, etc.).
The people at Soundcloud got it down to iterating in hours with a few months of engineering. That's from idea to proof of concept to having code in production that outperformed a manually crafted query.
That sounds like something I could do but it also sounds like a greenfield for proper tools to emerge that make all of this a lot less painful than it currently is. The next generation hopefully won't have to build a lot of in house tooling and reinvent a lot of wheels while doing so.
Of course. Academic papers (and a disturbingly large number of Wikipedia pages) are not meant to explain things, they’re meant to emphasise just how smart the authors are.
> I have no clue what e.g. a tensor is.
Well, even Einstein struggled with Tensors. In the context of TensorFlow they’re just multi-dimensional arrays.
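To show what "just multi-dimensional arrays" means, here is a rough sketch using plain nested Python lists instead of TensorFlow itself. The shape helper is a made-up function for illustration, and it assumes the nesting is rectangular (every sub-list at a given depth has the same length). Real libraries store a typed, contiguous buffer plus a shape, but the mental model is the same.

```python
def shape(t):
    """Return the dimensions of a rectangular nested list, outermost first."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        # Descend into the first element; stop on an empty list.
        t = t[0] if t else None
    return tuple(dims)


scalar = 5.0                                   # rank 0: a single number
vector = [1.0, 2.0, 3.0]                       # rank 1: a list of numbers
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # rank 2: rows and columns
batch = [matrix, matrix]                       # rank 3: a stack of matrices

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("batch", batch)]:
    print(name, shape(value))
```

The "rank" is just how many indices you need to reach a single number, which is the length of the shape tuple.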
I think DS has been abused by some people as an umbrella to not produce quality code, yet they somehow hold themselves in higher regard in the value chain.
However I do see there is a real position for DS in the industry, but it should be a specialization of senior SDE when they decide to further their career, not its own job family. Otherwise it should be renamed as data analyst for clarity.
Hit the nail on the head here. I worked in a DevOps/ETL team across from the data science team. All they did was write SELECT * FROM sales and complain Teradata was slow, and when they got the result set they'd use "R" to SUM the column and display it with ggplot.
If you have a good academic background it can be possible to enter a DS role immediately but often you will be doing work far more towards the Business Intelligence end of things rather than deploying Deep Neural Nets in production or whatever.
I have friends who transitioned into Data Engineering and it does seem like the outlook is better there.
It's an excellent post.
Both are currently not transparent enough for the data science newbies, which is why on my end I try to be as transparent as possible whenever the topic comes up (I wrote a post similar to the OP last year: https://minimaxir.com/2018/10/data-science-protips/).
Reliably getting any data science analysis or model running in a real world setting is a demand that's naturally going to follow from the Data Science glut.
Yeah they were called quants (aka mathematics/statistics graduates).
I'd be amazed if even 10% of the people are able to do anything more than just import scikit-learn, and train a classifier through tutorials.
This is IMO no different than when the software dev. craze started, and people with 3 weeks of coding experience started applying for entry-level jobs. You start interviewing them, and they can't even explain the difference between a for and a while loop.
In the end, there's just more noise. You need to find a good way to cut through this noise, and that goes for both qualified candidates and employers.
1. Lesson: 355k
2. Lesson: 144k
7. Lesson: 34k
Surprisingly close to those 7%.
Data science is still a thing, and it's maturing in the way that applied sciences do when they get to the point of needing a little more engineering background. Tech. just is never that glamorous, but the dirty secret is that only people in tech. seem to really get that, so we have this hype cycle every few years.