Languages and libraries are just tools: knowing APIs doesn’t tell you at all how to solve a problem. They just give you things to throw at a problem. You need to know a few tools, but to be honest, they’re easy and you can go surprisingly far with few and relatively simple ones. Knowing how, when, and where to apply them is the hard part: and that often boils down to understanding the mathematics and domain you are working in.
And don’t overuse viz. Pictures do communicate effectively, but people often visualize without understanding. The result is pretty pictures that people eventually realize communicate little real domain insight. You’d be surprised how often simple, ugly pictures communicate more insight than beautiful ones do.
My arsenal of tools: Python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (e.g., CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.
(Context: Been doing data work for decades, before it got its recent “data science” name.)
I don't necessarily agree with this. Yes, a sound understanding of the domain and knowledge of the mathematics and statistics are vital to gaining insights. But. I would make a very clear distinction between exploratory data viz and explanatory data viz. Data visualization when presenting those insights is an important part of driving decision making.
I don't fully agree with this either. Especially for mathematical concepts, visualization can give insight into how theorems are constructed and combined. This can prove vital when applying concepts and theorems to new problems.
I would especially like to bring forth 3blue1brown[1]. He creates videos that beautifully visualize and explain complex mathematical problems. His efforts have given me an insight into math that theorems explained in text and variables never could.
However, I do see your point that visualizations without understanding can be misleading. Hence pure, written math is important to read and reason about, but I do believe some concepts need to be visualized to be fully understood.
With regards to gaining math skills, this upcoming MOOC from Microsoft on edX looks promising[1].
[1] https://www.edx.org/course/essential-mathematics-for-artific...
Basically, any problem where you can establish relations between elements can be treated as a graph. I've used graphs for image analysis before too: pixels are vertices, edges represent neighborhood relations - especially useful when you make nonlocal connections (e.g., nonlocal means; graph-cut methods for segmentation; etc...)
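A minimal sketch of that pixel-graph idea (pure Python, names invented for illustration; a real pipeline would more likely use networkx or an adjacency matrix):

```python
# Sketch: treating an image as a graph. Pixels are vertices; edges connect
# 4-neighbors. Segmentation or nonlocal-means methods would then operate
# on this structure (possibly with extra nonlocal edges added).

def pixel_graph(height, width):
    """Return an adjacency dict mapping (row, col) -> list of 4-neighbors."""
    graph = {}
    for r in range(height):
        for c in range(width):
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width:
                    neighbors.append((nr, nc))
            graph[(r, c)] = neighbors
    return graph

g = pixel_graph(3, 3)
print(len(g[(1, 1)]), len(g[(0, 0)]))  # 4 2 (center has 4 neighbors, corner 2)
```

From here, graph-cut segmentation is "just" a min-cut on this structure with edge weights derived from pixel similarity.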
I've worked with them in three of the above contexts: cybersecurity (my current projects), retail analytics, and image analysis. I've avoided social network stuff - never cared for that area much.
I definitely think a solid mathematical understanding helps to build quantitative and critical thinking skills, which are key in data science.
I use the `tidyverse` from R[0] for everything people use `pandas` for. I think the syntax is soooo much more pleasant to use. It's declarative and, because of pipes and "quosures", highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. Add to that that `ggplot` (+ any sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" [1] is a great free resource for getting started.
Snakemake [2] is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, they are self documented, and the cluster is abstracted away so the same pipeline can work on a cluster or a laptop.
An example is the janitor::clean_names function I like to use for standardizing the column names on a data.frame.
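For anyone who lives in Python instead, a rough analogue of what janitor::clean_names does (my own sketch, not janitor's actual algorithm):

```python
import re

def clean_names(columns):
    """Rough analogue of R's janitor::clean_names: lowercase names,
    collapse runs of non-alphanumerics to underscores, trim the edges."""
    cleaned = []
    for name in columns:
        name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
        cleaned.append(name)
    return cleaned

print(clean_names(["First Name", "% Profit", "ID#"]))
# ['first_name', 'profit', 'id']
```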
However, the tidyverse is really serious about API consistency and functional style, with pipes and purrr's functionality. The unixy style of base R is unproductive for fast iteration on an analysis. Also, the idea of "everything in a data frame" (or tibble, with list columns and whatnot) together with the tidy data principles really takes the cognitive load off just getting started.
As a half-solution, I ended up restricting myself to a very few libraries in this family (mainly dplyr, lubridate, stringr, broom) and to using packrat to consistently freeze the library versions for these.
There are definitely some issues if you have to reliably run scripts (not to mention the difficulties of putting into production)
The thing I really like about R over Python is that for SPECIFIC tasks like inspecting data and trying to get an answer out quickly, there really isn't a quicker or better tool to use. The ONLY reason I still use R is the ease of getting answers with the tidyverse.
I, on the other hand, find most R packages provide barely readable documentation. I can only hope that the vignette exists and actually explains the inputs/outputs.
Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.
I think memorizing basic stats knowledge is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays using a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
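To make that concrete: a two-sample t-test really is a one-liner with scipy, so the scarce skill is knowing whether its assumptions hold, not the mechanics. Data below is invented:

```python
# A statistical test as a single library call: two small samples with
# clearly different means. The hard part is judging the assumptions
# (independence, rough normality, equal variances), not the call itself.
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [5.8, 6.1, 5.9, 6.0, 6.2, 5.7, 6.1, 5.9]

t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```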
On this specific point, it's worth noting that up until now there's been a single massive repository of every Julia package ever published, regardless of its current state or utility. Starting with the upcoming 0.7 release, Julia will introduce the concept of "curated" repositories so that, going forward, if you stick just with the default curated repository of packages you should have much less chance of running into a broken or unmaintained package.
- Jupyter + Pandas for exploratory work, quickly define a model
- Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)
I echo ms013's comment very much. Everything is just tools; it's more important to understand the math and the domain.
Go is quite straightforward though - WYSIWYG for the most part, hence you probably won't find a lot of sexy tutorials. Almost everything is just a loop away, and in the next version of Gorgonia, even more native looping capability is coming.
... and in particular the resource lists at
- https://github.com/gopherdata/resources
Also, Dan's GopherCon talk on Go for data science is a great way to get yourself convinced enough to try it out:
Programming languages:
- python (for general purpose programming)
- R (for statistics)
- bash (for cleaning up files)
- SQL (for querying databases)
Tools:
- Pandas (for Python)
- RStudio (for R)
- Postgres (for SQL)
- Excel (the format your customers will want ;-) )
Libraries:
- SciPy (ecosystem for scientific computing)
- NLTK (for natural language)
- D3.js (for rendering results online)

It is worth understanding the concepts of numpy and pandas. Furthermore, try out IPython/Jupyter, especially for rapid publishing (people run their blogs on Jupyter notebooks).
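Two numpy concepts especially worth internalizing are vectorization and broadcasting; a minimal illustration with made-up numbers:

```python
import numpy as np

# Broadcasting: numpy pairs arrays of compatible shapes without explicit
# loops. A (3,) array times a (2, 1) array yields a (2, 3) result.
prices = np.array([10.0, 20.0, 30.0])   # shape (3,)
quantities = np.array([[1], [2]])       # shape (2, 1)

revenue = prices * quantities           # broadcasts to shape (2, 3)
print(revenue)
```

Once this clicks, most pandas operations (which sit on top of numpy) read naturally too.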
I think certain libraries depend very much on where you focus. Machine learning? Natural language processing? Visualization? Something in economics? Fundamental sciences? For instance, I never need NLTK in theoretical astrophysics ;-) Instead, I need powerful GPU based visualization, which is however very old school with VTK and Visit/Amira/Paraview (also very much pythonic).
If you're doing a lot of work with matrices, model fitting in production, then Python seems fine. However, a lot of data scientists I see are more like scrappy data analysis / visualization types, who are churning out small dashboards. In that case R's tidyverse and shiny are just incredibly fast to develop with.
For powerful GPU viz, have you considered vispy? Four authors of four independent Python science visualization libs got together to build it.
$ docker run -it --rm -p 8888:8888 jupyter/datascience-notebook
WRT bash, where to begin? Over the past 40 years, a better tool has appeared for pretty much everything people try to do with bash. It lives on mostly through inertia and pride.
1. Easier interpretation of results than frequentist methods for lay people (business strata, elected officials, or other decision makers)
2. Uncertainty can be quantified and visualized reasonably well, which helps decision makers not think of stats as a magic box that produces a single answer.
3. Sensitivity analysis can be placed right up front: selection of priors representative of the beliefs of differing opinions / ideologies can inform decision makers of when they should consider changing their minds, and when they might still hold out.
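A toy sketch of point 3, with all numbers invented: a Beta-Binomial model where a skeptic and an optimist hold different priors over a conversion rate and both observe the same 30-out-of-100 data. Conjugacy makes the update a one-liner:

```python
# Prior-sensitivity sketch (invented numbers). Two decision makers hold
# different Beta priors over a rate; both see 30 successes in 100 trials.
# Beta(a, b) prior + Binomial data gives a Beta(a + k, b + n - k) posterior.

def beta_posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Beta(a, b) prior after observing Binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

skeptic = beta_posterior_mean(2, 20, 30, 70)    # prior belief: rate is low
optimist = beta_posterior_mean(20, 2, 30, 70)   # prior belief: rate is high

print(f"skeptic: {skeptic:.3f}, optimist: {optimist:.3f}")
# Both posteriors are pulled toward the data (0.30); the remaining gap shows
# how much each stakeholder's prior still matters at n = 100.
```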
Downsides of Bayesian methods:
1. Conceptually more involved than typical maximum likelihood estimation methods
2. Computationally expensive
3. Methods might not be as well known to a nominally stats-savvy audience.
Also, get used to reading the Stan forums on Discourse. Happy Stanning!
A sound understanding of mathematics, in particular statistics.
It's amazing how many people will talk endlessly about the latest Python/R packages (with interactive charting!!!) who can't explain Student's t-test.
Libs:
- Dask for distributed processing
- matplotlib/seaborn for graphing
- IPython/Jupyter for creating shareable data analyses
Environment:
- S3 for data warehousing; I mainly use Parquet files with pyarrow/fastparquet
- EC2 for Dask clustering
- Ansible for EC2 setup
My problems usually can be solved by 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by days if you work with time series.
Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.
The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog based approximations.
To sum it up, Dask alone makes all the difference for me!
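For the curious, the HyperLogLog idea behind those approximate distinct counts can be sketched in a few lines. This is a simplified toy to show the principle, not Dask's actual implementation:

```python
# Toy HyperLogLog: estimate the number of distinct items using 2**p small
# registers. Each register remembers the maximum "rank" (position of the
# leftmost 1-bit) seen among hashes routed to it; rarer leading-zero runs
# imply more distinct items.
import hashlib
import math

def hyperloglog_estimate(items, p=10):
    m = 1 << p
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)                     # low p bits choose a register
        w = h >> p                            # remaining 64 - p bits
        rank = (64 - p) - w.bit_length() + 1  # leftmost 1-bit position
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # standard bias-correction constant
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:         # small-range (linear counting) fix
        estimate = m * math.log(m / zeros)
    return estimate

print(round(hyperloglog_estimate(range(10_000))))  # roughly 10000
```

The payoff is memory: 1024 tiny registers instead of a set of 10,000 (or 10 billion) keys, at the cost of a few percent error.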
I just see the term flung around so much recently, and applied to so many different roles, that it has all become a tad blurred.
Maybe we need a Data Scientist to work out what a Data Scientist is?
It means someone who can work with business stakeholders to break down a problem e.g. "we don't know why customers are churning", produce a machine learning model or some adhoc analysis (usually the former) and either communicate the results back or assist in deploying the model into production.
Typically there will be data engineers who will be doing acquisition and cleaning and so the data scientists are all about (a) understanding the data and (b) liaising with stakeholders.
As for technologies, it is typically R/Python with Spark/H2O on top of a data lake (i.e. HDFS, S3), every now and again on top of a SQL store (e.g. EDW, Presto) or a feature store (e.g. Cassandra).
https://towardsdatascience.com/data-is-a-stakeholder-31bfdb6...
(Disclaimer: I wrote the post at the above link).
If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.
There are resources out there for learning good design. This is a great introduction and points to many other good materials:
https://www.amazon.com/Design-Essays-Computer-Scientist/dp/0...
1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.
2. Familiarity with math (which? depends on what you are doing) to the point where you can read articles that may have a solution to your problem and the ability to propose changes, creating proprietary algorithms.
3. Some scripting language (Python, R, w/e)
4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.
Here are 3 questions I was recently asked in a bunch of DS interviews in the Valley.
1. The probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? By the next two hours?
2. In a closely contested election with 2 parties, what's the chance a single person will swing the vote, if there are n = 5 voters? n = 10? n = 100?
3. Difference between Adam and SGD.
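Worked sketches for questions 1 and 2, under one common set of assumptions (the interviewer may well intend different ones):

```python
from math import comb

# Q1: assume sightings in disjoint hours are independent with the same rate.
# P(see in one hour) = 0.8 means P(miss) = 0.2 per hour, so
# P(see within k hours) = 1 - 0.2**k.
p_2h = 1 - 0.2 ** 2   # 0.96
p_3h = 1 - 0.2 ** 3   # 0.992

# Q2: your vote swings the election exactly when the other n - 1 voters tie.
# With each of them voting 50/50 independently (and n odd so a tie is
# possible), P(tie) = C(n-1, (n-1)//2) / 2**(n-1).
def swing_probability(n):
    return comb(n - 1, (n - 1) // 2) / 2 ** (n - 1)

print(p_2h, p_3h, swing_probability(5), round(swing_probability(101), 4))
```

Note the n = 10 and n = 100 cases need care: with an even electorate there is no exact tie under this model, which is usually the follow-up discussion the interviewer wants.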
Numba for custom algorithms.
Dataiku (amazing tool for preprocessing and complex flows)
Amazon RDS (Postgres), but thinking about Redshift.
Spark
Tableau or plotly/seaborn
* statistical methods (more math)
* big, in-production model fitting (more python)
* quick, scrappy data analyses for internal use (more R)
For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand R's shiny lets you put up quick, interactive web dashboards (that I wouldn't trust in exposing to users).
Deep learning addresses it to some extent, but isn’t always the best choice if you don’t have image / text data (e.g. tabular datasets from databases, log files) or a lot of training examples.
I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools) which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn using some interesting datasets and problems: https://www.featuretools.com/demos
No it won't.
That combination can't handle the large datasets that are typical for most data science teams - you'd at least want to include PySpark. And then it's very limited as far as ML/DL technologies go.
Pandas and Spark are both DataFrame libraries, and seem to offer very similar functionality to me. Why do you prefer Spark over Pandas?
> very limited so far as ML/DL technologies
I mean, getting Tensorflow up and running with GPU support isn't trivial, but it's not exactly hard, and Keras[1] provides excellent support for a wide variety of other backends. What, in your experience, is less limited?
[1]: https://keras.io/
I would bet that the mean size of dataset people are dealing with is a lot bigger than the median size.
Pattern matching helps you write code faster (that is, spending less human time).
Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.
Coconut is an extension of Python that offers all of those.
Test driven development also helps you write more correct code.
It seems like getting into the upper echelons of Kaggle is a matter of refining your model, and I do wonder how much value these refinements offer over a more basic and general approach in a real world scenario. To be clear, when I say I wonder, I'm not saying I'm rejecting the value, I really do mean it, I'm uncertain about the value. I think it's probably very scenario specific.
Think of it this way - a predictive value of 90% vs 95% could be the difference between placing in the top 10% and the bottom third. Now, 5% isn't nothing, it could be very valuable. It really depends.
But Kaggle is an environment where the question is already posed, the data has been collected, the test and train sets are already split apart for you, and the winning model is the one that scores best on a hidden test set by a predefined goodness of fit score.
In a real world scenario, suppose someone does a great job figuring out the question to ask, gathering the data, and determining the most effective way to act on the results, but uses a fairly basic, unrefined model. Someone else does a middling job on those things, but builds a very accurate model as measured by the data that has been collected. I'd say the first scenario is likely to be more valuable, but again, it depends of course.
A couple of other things, since I am a fan of Kaggle and do highly recommend it. First, these things aren't necessarily exclusive - you can have a particularly well conceived and refined model as well as a thorough and excellent business and data collection process (though you may have to decide where to put your time and resources).
Also, refining a model with Kaggle can be an exceptional training opportunity to really understand what drives these things. So go for it! (I also find these things kinda fun).
A fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure.
That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.
If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.
Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.
A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.
IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.
Also - your model of asking questions before starting a new gig is very relevant to nearly every programming job. Could also be some of the questions a candidate asks in an interview.
Have you ever needed any Microsoft skills (MSSQL/C#) so far?
However, I can't seem to recall the name. Has anyone seen what I'm talking about?
Oh, I don't know about that. Programming languages are force multipliers, and each language has different force coefficients for different problem domains. They are not all equivalent. They have their different points of leverage, and simply being good in one does not mean you can solve problems in any domain with ease. In fact, the wrong programming language can often be harmful if it's ill-suited to the problem at hand, and especially if it contorts your mental model of what you can do with the data.
One example I encounter a lot in industry is Excel VBA. I'm fairly good at VBA and have seen very sophisticated code in VBA. I've also seen many basic operations implemented badly in VBA that should not have been written in VBA at all. By solving the problem in VBA, the solution is often "hemmed in" by the constraints of VBA.
For instance, unpivoting data is often done badly in VBA (with for-loops), but is trivial to do well in dplyr or pandas.
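To illustrate the unpivot point with toy (invented) data, it really is a single pandas call:

```python
import pandas as pd

# Wide data: one column per month. Unpivoting ("melting") turns the month
# columns into rows, which in VBA typically means nested for-loops.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 80],
    "feb": [120, 90],
})

long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)
```

The dplyr equivalent is a similar one-liner with pivot_longer.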
So I would say one has to choose one's programming language somewhat carefully. Not any language will do.
Every single large scale data science team (e.g. Google, Spotify, Airbnb) will be using Spark for most of their work. It is by far the de facto standard for working with large datasets, especially since it integrates so well with machine learning (H2O) and different languages (Scala, Python, R).
Would you use PySpark MLlib in a web service instead of scikit-learn?
However, if you use a lot of UDFs where Spark has to serialize your Python functions, you might consider rewriting those UDFs in a JVM language. Serialization overhead is still fairly substantial. Arrow is trying to address this by implementing a common in-memory format, but it's still early days.
I would still recommend PySpark to most people. It's more than good/fast enough for most data munging tasks. Scala does buy you two things - type safety and low serialization overhead (i.e. significant!) - which can be critical in some situations, but not all.
Also, the Python way has always been to prototype fast, profile, and rewrite bottlenecks in a faster language, and PySpark conforms to that pattern.
2) Spark MLlib is still fairly rudimentary in its coverage of major ML algorithms, and Spark's linear algebra support, while serviceable, is currently not very sophisticated. There are a few functions that are useful in the data prep stage (encoding, tokenizers, etc.) but overall, we don't really use MLlib very much.
Companies that have simple needs (e.g. a simple recommender) and that don't have a lot of in-house expertise, might use MLlib though -- I believe someone from a startup said that they did at a recent meetup.
Most of us need better algorithmic coverage and Scikit's coverage is currently much better, plus it is more mature. We also have Numpy at our disposal, which lets us do matrix-vector manipulation easily. There is some serialization cost, but we can usually just throw cloud computational power at it.
Also note that for most workloads, the majority of the cost is incurred in training. For models in production, one is typically processing a much smaller amount of data using a trained model, so less horsepower is required.
Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet
The "data scientist" title would apply only if you are applying the scientific method to discover new facts about the natural world exclusively through data analysis (as opposed to observation and experiments).
The analysis part is usually quite simple, often if it gets really complex then that's a sign that the data is being tortured. Sometimes the marginal gains that complex methods create (vs simple but good approaches) are not worthwhile even if they are valid - simply in terms of time spent and difficulty in communications.
Or maybe the humanities as a whole should be considered "not science".
Besides, a data analyst who doesn't use the scientific method is just a bad analyst. Some media outlets showcase blatantly misleading charts made by people who understand the technicals but get everything wrong about the concepts.
So this is my advice: focus on understanding the concepts before the tooling. That is what will really determine your value.