But the best way to find a good "data scientist" is probably the best way to find a good programmer -- be one yourself; tap your professional network; and hire people as consultants/freelancers on non-critical projects before making a real commitment. Identifying someone with a deep skill that one doesn't possess oneself is pretty much impossible. And on the flip side, I have trouble imagining that someone who really knows what he or she is doing would want to work for some unknown.
If you want someone to scrape and clean data with Perl and generate some scatter plots and histograms, look for undergrads with good grades who worked as Research Assistants, or recent grads working as RAs at consulting firms, research centers, governmental agencies, or think tanks. They'll do great (by and large), they've had some informal training from a more senior researcher to help put everything in context, and faculty often steer their best students into those sorts of jobs, so there's a pretty strong quality screen. I'm sure there are other places to find people too.
I think most people do, but I've never heard a good term for the job. It's like "we want someone who can take large amounts of data and do something awesome with it". What do you call that?
>Unless you're hiring someone really junior, you want the "data scientist" to have a specialty -- anyone good will have one.
Not sure I agree with this, I want people who are well-rounded. I think it's great to find someone who specialized at something, but I'd want that person to be able to grow the rest of his abilities up to par. Example: let's say you specialized in machine learning. If you don't understand building scalable systems, you can't take a holistic view of a project; how will you know if your algorithms can scale to a production environment? Or, if you can't program well, you can't write code to actually get your algorithms into place. Or if you can't understand the business side of things, you won't be able to build trust with the rest of the company, and hence you won't be able to contribute.
Analyst
Yeah, that's kind of the problem; until "awesome" gets defined, it's awfully hard to be specific about what the company needs. But this is why it's going to be really hard for someone that couldn't do part of the job themselves to hire effectively.
As to the second point, I guess it's going to depend on the business and on whether the statistical component is crucial or just peripheral. It's nice if everyone is pretty broadly well trained. But if you're a hedge fund building algorithmic trading rules, you need different people than a marketing research firm or a litigation consulting agency.
It has always seemed to me just an excuse to run away from the "Artificial Intelligence 2.0" moniker and all the negative connotations that would imply. I dislike the label "Data Science" because there is not really much "Science" with a capital-S being done by the people I have had the chance to meet who adopt the moniker.
I have always thought that "Knowledge Engineer" was a more descriptive and useful term for what they actually do. The more abstract the work gets, the more it seems to fall into the field properly known as Computational Mathematics.
That quote is more than just humorous, it points out one compelling answer to the question in the title of the post -- Perhaps all scientists are data scientists, you simply have to lure them into a new domain of study.
I do think that people need to stop expecting to get a physicist, statistician, economist or applied mathematician (every graduate analysis job I've seen had that wonderful qualification) as most of those people already have really well paying jobs in finance (or satisfying careers in academia), and open their eyes to the fact that for many data science roles, social scientists are probably a better fit.
If you're dealing with numbers generated by people's interaction with a website, a handy background to have is in some form of quantitative social science. (I am of course horribly biased, by being a quantitatively trained social scientist).
In any case, I expect data science to go through a dot-com-like boom and then a horrible crash, so it may be a good idea to get the skills (and possibly qualifications) now, while the sun still shines and people are still hypnotised by the promise of big data, rather than the tedium and slog of extracting value from it.
FWIW, I agree with you that quantitative social science is a great background to have. My data science team lead has a background in Neuroscience and he seems to be having fun applying that intuition to recommendations.
Yes, I agree that on average, social science people tend to like maths less. However, you're not looking for the average social scientist, you're looking for the ones who weren't satisfied with SPSS (like me). The other, more common kind, are not likely to end up in a data scientist position (though perhaps this may change).
Hilariously enough, after I learned R, SPSS made way more sense to me, while before that it just seemed way too easy to generate reports without a shred of insight.
I've always taken the data scientist distinction to be the startup world's answer to the "quant" designation. Quant jobs are a lot better than "just a" programmer jobs, even at hot startups, so I see the "data scientist" title as an answer to that. It's a startup quant.
1. Go to a scientific journal that involves serious computational work.
2. Look at the last author on an article. This is usually the lab boss.
3. Google "$LASTNAME lab webpage".
4. Look for graduate student and post-doc profiles.
5. Email them and offer more than the 38k they make for working 60+ hrs/wk.
6. Now you have a "Data Scientist"
Alternatively, just Google "Argonne National Labs".
There is no shortage of Data Scientists with years of experience (every scientist I've talked to has guffawed heartily and asked "what other kind of scientist is there?"). Yes, I get that the term is just a very unfortunate choice of words (on the order of General Linear Model vs Generalized Linear Model), but the point is that no one knows to brand themselves this way. Recruiters are probably just frustrated by the lack of linked-in-ness, and the fact that most people competent at "Data Science" don't know what that means.
Here are things virtually every PhD in a computational discipline will have done:
1. Written code. It might be just Matlab, Python, R, etc.
2. Written up and communicated the results with compelling visualizations, both orally and in writing.
3. Published a paper at some point demonstrating they can do this.
4. Dealt with failure and hunches that didn't work out after weeks of work.
On the hiring part, after you've found these mythical creatures:
As someone who does science (with Data, even!), here is the best question I can think of to ascertain my competence: "Diagram an approach to answer a particular question, with emphasis on ruling out competing explanations and demonstrating whether a result is true. Sketch some hypothetical visualizations you'd use that show what your result is, how large the effect is, and how sure you are that you're right."
I should be able to do that. On the flip side, if an employer can reason with a post-doc about that process in an intelligent fashion, they would be excited about leaving academia. You might need to teach them to use a cluster, EC2, whatever, but you will not have to teach them to ask and answer questions.
I would add that hunting people in the relevant communities should work, as well, for example biostars.org or seqanswers.com for bioinformatics, and then check out the posts of the top-contributing members.
I also see more bioinformaticians storing their work on GitHub! So far I can name about 5 but I think that's going to grow in the next few years. There's no central point storing these (not sure how to structure that) but at some point there will be.
BTW: this is my reference point for the definition of 'Data Scientist':
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist...
It could be that the author's first criterion is an important predictor. But it seems to me that unless somebody is actually in academia, publishing papers is more akin to a hobby than a professional qualification, especially given the inherent bias against unaffiliated authors.
Edit: On re-reading, the author (hardtke) writes the above when talking specifically about weeding through post-doc applicants. So my quote is out of context and my criticism unfair.
I will also say that most people that will excel at data science may not ever be pushing the envelope of statistical methods enough to warrant writing a research paper. Being able to apply and understand the state of the art algorithms is inherently a great skill to have.
The world is a dirty place, and just like there are thousands of applications that just need a developer to implement a CRUD app to expose an API on the web, there are tens or hundreds of domain-specific problems where a 'data scientist' can implement state-of-the-art data cleaning and machine learning algorithms.
Understanding and applying graduate-level textbook statistical methods to a large dataset (bigger than an Excel file) might be boring to some research scientists, but it is cool as hell to see what the data is saying.
1. What is your favorite programming language and why?
2. Which is your least favorite programming language and why?
3. Explain how Bayesian spam filtering works.
4. How do you determine if a given data sample is statistically significant?
5. Suppose we ran a brand advertising campaign on radio and on television. Neither ad campaign uses special tracking codes or custom landing pages; both ads simply mention the web site address. What tools and methodology would you use to measure the response rate from radio ads vs. TV ads, and predict the total response that will be generated from each ad?
And while this is all good, this is certainly not enough.
In reality applied data science is not different from any other area. You get proficiency with experience, as usual.
That's a weird question - what am I to prove? Are there differences in the dataset? What does the data look like? Is there a second dataset against which to compare? What kind of data is it, ordinal or nominal? Is the dataset normally distributed?
I can't just say "that data sample is statistically significant" without a background against which to compare!
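To make the "compared against what?" point concrete, here is a minimal sketch of a two-sample permutation test in plain Python. It is only one of many ways to frame the question, and everything in it (the function name `permutation_test`, the choice of the difference in means as the test statistic, the number of shuffles) is an assumption for illustration, not something from the original interview question.

```python
import random

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Approximate two-sided p-value for the difference in means of a and b.

    Null hypothesis: both samples come from the same distribution, so the
    group labels are exchangeable. Note that this needs two samples; a
    single sample on its own has nothing to be "significant" against.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements, made up for the example.
control = [10.1, 10.3, 9.9, 10.2, 10.0, 10.4, 9.8, 10.2]
treated = [x + 2.0 for x in control]
print(permutation_test(control, treated))
```

A small p-value here only says the two groups are unlikely to differ this much by chance under the exchangeability assumption; it says nothing about whether the difference matters in practice.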
>3. Explain how Bayesian spam filtering works.
I would say the majority of non-web developer scientists don't know how this works, why should we?
I would expect any data science candidate to have (at least) a basic understanding of Bayesian statistics.
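For anyone wondering what a reasonable answer to the spam-filter question looks like, here is a rough sketch of a naive Bayes classifier in plain Python. It is a simplified illustration under stated assumptions: whitespace tokenization, only words present in the message are scored, add-k smoothing, and at least one spam and one non-spam training message. The function names `train` and `spam_probability` are made up for the example, and this is not any particular production filter.

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (text, is_spam) pairs.

    Returns, per class, how many messages contain each word and how many
    messages were seen.
    """
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        msg_counts[is_spam] += 1
        # Count each word once per message (presence, not frequency).
        word_counts[is_spam].update(set(text.lower().split()))
    return word_counts, msg_counts

def spam_probability(text, word_counts, msg_counts, k=1.0):
    """P(spam | words in text), treating words as independent given the class."""
    total = msg_counts[True] + msg_counts[False]
    # Start from the class priors, in log space to avoid underflow.
    log_spam = math.log(msg_counts[True] / total)
    log_ham = math.log(msg_counts[False] / total)
    for word in set(text.lower().split()):
        # Smoothed estimate of P(word appears | class).
        p_spam = (word_counts[True][word] + k) / (msg_counts[True] + 2 * k)
        p_ham = (word_counts[False][word] + k) / (msg_counts[False] + 2 * k)
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    # Bayes' rule: normalize the two joint scores into a posterior.
    m = max(log_spam, log_ham)
    e_spam, e_ham = math.exp(log_spam - m), math.exp(log_ham - m)
    return e_spam / (e_spam + e_ham)

# Toy training data, invented for the example.
training = [
    ("win cash now", True),
    ("cheap cash prize", True),
    ("lunch meeting tomorrow", False),
    ("project meeting notes", False),
]
counts, totals = train(training)
print(spam_probability("cash prize", counts, totals))
print(spam_probability("meeting notes", counts, totals))
```

The whole idea is Bayes' rule plus a (wrong but useful) independence assumption, which is why it makes a decent screening question for basic Bayesian reasoning.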
In a previous job I did lots of analysis on very large (at the time) data volumes: millions of structured or unstructured records, homogeneous and heterogeneous datasets. Lots of aggregation, sifting, sorting, simplifying, deduping, summarizing, etc. All in support of similar kinds of things that "Data Scientist" positions seem to be intended to support. But the output was not a statistical model, a machine learning exercise, or something similar. It was the distillation of gigabytes of data into a handful of slides and a report. Usually with a virtuous cycle of feedback directly into software development to improve and expand the next go-around.
But almost no statistics. Very very little, and what I did was very basic stuff.
What is that kind of job called? In my day we called it a "Data Analyst" but I don't see that around much.
I'm going to make a prediction: "Data Scientist" as "Senior Statistician" is going to be short-lived. I don't think they're going to provide the value companies think they will in most cases. "Data Analyst" is much more general purpose and useful across domains, except most Data Analysts don't have proper statistical training.
A Data Analyst with statistical training would be a much more useful tool to an organization seeking to make sense out of large volumes of data than a Senior Statistician as they'll have a much wider variety of tools at their disposal than just looking at the world through the statistics lens.
Bonus, jobs advertising "Data Analyst" can demand things like machine learning AND entity extraction AND automatic summarization AND data sanitation AND automatic correlation analysis AND automatic colocation analysis etc.
Most of the jobs I've seen looking for Data Scientists are for companies that are probably going to try and end up using them as high-priced Data Analysts, except the job reqs are all wrong and the candidates that get hired are way overqualified.
But this role is still evolving I suppose, IBM [1] views it as an evolution from the business/data analyst. So they definitely seem to be on the side of not so much statistics and more analysis.
1 - http://www-01.ibm.com/software/data/infosphere/data-scientis...
I've been what you could call a data scientist for over five years now, and worked with dozens of people you could also call data scientists with different degrees and varying experience. From my sample, I don't think PhDs add much, if any, value over masters degrees after a year or two of experience (I'm biased here, I don't have a PhD). I think industry experience can add a tremendous amount of value you can't get from a degree, but it comes at a cost premium. Not related to your article: I've also found the best people have physics or computer science + applied math backgrounds.
http://www.youtube.com/watch?v=-3dw09N5_Aw
I think there are too many ways to miss the mark when it comes to hiring data scientists. Looking for PhDs only is just as dumb as looking for people with 10 years of experience with Hadoop. There are some important things that they need to know, sure, but where they come from is next to meaningless.
When we interview someone I like to start talking/asking about matrix decomposition (eigenvectors, SVD, etc.) and see how excited they get.
I consider knowing about things like MDS and PageRank a bare minimum, and if someone can bring up a more recent or esoteric application (locally linear embedding, graph partitioning) they stand out.
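As a rough illustration of how the eigenvector question and PageRank connect: the PageRank vector is the dominant eigenvector of a damped link matrix, and power iteration is the simplest way to approximate it. The sketch below is plain Python with a hypothetical `pagerank` function and the commonly quoted damping factor of 0.85; it is a toy version for small graphs, not how anyone would do it at scale.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links: dict mapping each node to the list of nodes it links to.
    Nodes with no outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is shared uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                # Each node passes its rank on in equal shares to its targets.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Small made-up graph: three pages all linking to one hub.
graph = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

Each pass of the loop is one multiplication by the damped transition matrix, so a candidate who can explain why this converges is really explaining dominant eigenvectors.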
Asking about the nuances of estimating probability densities from data (bin/histogram vs kde etc) is another good one and something that stops a lot of the cooler theoretical statistics and information theory from getting used (or used well) in the real world.
Both of these questions get more at "do you understand the basic building blocks that come up over and over again" more than "is your research groundbreaking and new." Asking about techniques to do the above at scale also ups the difficulty of the interview.
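To show the kind of nuance the histogram-versus-KDE question is after, here is a minimal sketch in plain Python. The function names `histogram_density` and `kde_density`, the Gaussian kernel, and the tiny two-point dataset are all assumptions made up for the example. It shows that a histogram estimate at the same point can change just by shifting where the bins start, while a kernel estimate has no bin origin (though it still depends on the bandwidth).

```python
import math

def histogram_density(data, x, bin_width, origin=0.0):
    """Histogram density at x: share of points in x's bin, divided by bin width."""
    bin_index = math.floor((x - origin) / bin_width)
    count = sum(
        1 for v in data if math.floor((v - origin) / bin_width) == bin_index
    )
    return count / (len(data) * bin_width)

def kde_density(data, x, bandwidth):
    """Gaussian kernel density estimate at x: an average of bumps centred on the data."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

data = [0.9, 1.1]  # two points straddling x = 1.0

# Same data, same bin width, different bin origin: the estimate doubles.
print(histogram_density(data, 1.0, bin_width=1.0, origin=0.0))  # 0.5
print(histogram_density(data, 1.0, bin_width=1.0, origin=0.5))  # 1.0

# The kernel estimate has no origin to choose, only a bandwidth.
print(kde_density(data, 1.0, bandwidth=0.5))
```

Both estimators need a smoothing parameter (bin width or bandwidth), and picking it badly is usually what stops the downstream theory from working well on real data.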
As an aside, it isn't just the postdocs who are disgruntled. I was awarded tenure last year, and while there are aspects of the job I love, there are others that push me toward making a change.
By their definition I'm a scientist because I've built a few products! I certainly don't see myself as one.