Languages and libraries are just tools: knowing APIs doesn’t tell you at all how to solve a problem. They just give you things to throw at a problem. You need to know a few tools, but to be honest, they’re easy and you can go surprisingly far with few and relatively simple ones. Knowing how, when, and where to apply them is the hard part: and that often boils down to understanding the mathematics and domain you are working in.
And don’t overuse viz. Pictures do communicate effectively, but people often visualize without understanding. The result is pretty pictures that people eventually realize communicate little real domain insight. You’d be surprised how often simple, ugly pictures communicate more insight than beautiful ones do.
My arsenal of tools: Python, scipy/matplotlib, Mathematica, Matlab, various specialized solvers (e.g., CPLEX, Z3). Mathematical arsenal: stats, probability, calculus, Fourier analysis, graph theory, PDEs, combinatorics.
(Context: Been doing data work for decades, before it got its recent “data science” name.)
I don't necessarily agree with this. Yes, a sound understanding of the domain and knowledge of the mathematics and statistics are vital to gaining insights. But. I would make a very clear distinction between exploratory data viz and explanatory data viz. Data visualization when presenting those insights is an important part of driving decision making.
I don't fully agree with this either. Especially for mathematical concepts, visualization can give insight into how theorems are constructed and combined. This can prove vital when applying concepts and theorems to new problems.
I would especially like to bring forth 3blue1brown[1]. He creates videos that beautifully visualize and explain complex mathematical problems. His efforts have given me an insight into math that theorems explained in text and variables never could.
However, I do see your point that visualizations without understanding can be misleading. Hence pure, written math is important to read and reason about, but I do believe some concepts need to be visualized to be fully understood.
With regards to gaining math skills, this upcoming MOOC from Microsoft on edX looks promising[1].
[1] https://www.edx.org/course/essential-mathematics-for-artific...
Basically, any problem where you can establish relations between elements can be treated as a graph. I've used graphs for image analysis before too: pixels are vertices, edges represent neighborhood relations - especially useful when you make nonlocal connections (e.g., nonlocal means; graph-cut methods for segmentation; etc...)
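A minimal sketch of that pixel-graph idea (pure Python, names invented for illustration; a real pipeline would more likely use networkx or an adjacency matrix):

```python
# Sketch: treating an image as a graph. Pixels are vertices; edges connect
# 4-neighbors. Segmentation or nonlocal-means methods would then operate
# on this structure (possibly with extra nonlocal edges added).

def pixel_graph(height, width):
    """Return an adjacency dict mapping (row, col) -> list of 4-neighbors."""
    graph = {}
    for r in range(height):
        for c in range(width):
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width:
                    neighbors.append((nr, nc))
            graph[(r, c)] = neighbors
    return graph

g = pixel_graph(3, 3)
print(len(g[(1, 1)]), len(g[(0, 0)]))  # 4 2 (center has 4 neighbors, corner 2)
```

From here, graph-cut segmentation is "just" a min-cut on this structure with edge weights derived from pixel similarity.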
I've worked with them in three of the above contexts: cybersecurity (my current projects), retail analytics, and image analysis. I've avoided social network stuff - never cared for that area much.
I definitely think a solid mathematical understanding helps to build quantitative and critical thinking skills, which are key in data science.
I use the `tidyverse` from R[0] for everything people use `pandas` for. I think the syntax is soooo much more pleasant to use. It's declarative and, because of pipes and "quosures", highly readable. Combined with the power of `broom`, fitting simple models to the data and working with the results is really nice. Add to that that `ggplot` (+ any sane styling defaults like `cowplot`) is the fastest way to iterate on data visualizations that I've ever found. "R for Data Science" [1] is a great free resource for getting started.
Snakemake [2] is a pipeline tool that submits steps of the pipeline to a cluster and handles waiting for steps to finish before submitting dependent steps. As a result, my pipelines have very little boilerplate, they are self documented, and the cluster is abstracted away so the same pipeline can work on a cluster or a laptop.
An example is the janitor::clean_names function I like to use for standardizing the column names on a data.frame.
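For anyone who lives in Python instead, a rough analogue of what janitor::clean_names does (my own sketch, not janitor's actual algorithm):

```python
import re

def clean_names(columns):
    """Rough analogue of R's janitor::clean_names: lowercase names,
    collapse runs of non-alphanumerics to underscores, trim the edges."""
    cleaned = []
    for name in columns:
        name = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
        cleaned.append(name)
    return cleaned

print(clean_names(["First Name", "% Profit", "ID#"]))
# ['first_name', 'profit', 'id']
```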
However, the tidyverse is really serious about API consistency and functional style, with pipes and purrr's functionality. The unixy style of base R is unproductive for fast iteration on an analysis. Also, the idea of "everything in a data frame" (or tibble, with list columns and whatnot) together with the tidy data principles really takes the cognitive load off just getting started.
As a half-solution, I ended up restricting myself to a very few libraries in this family (mainly dplyr, lubridate, stringr, broom) and to using packrat to consistently freeze the library versions for these.
There are definitely some issues if you have to reliably run scripts (not to mention the difficulties of putting into production)
The thing I really like about R over Python is that for SPECIFIC tasks like inspecting data and trying to get an answer out quickly, there really isn't a quicker or better tool to use. The ONLY reason I still use R is the ease of getting answers with the tidyverse.
I, on the other hand, find most R packages provide barely readable documentation. I can only hope that the vignette exists and actually explains the inputs/outputs.
Aside from programming languages, Jupyter notebooks and interactive workflows are invaluable, along with maintaining reproducible coding environments using Docker.
I think memorizing basic stats knowledge is not as useful as understanding deeper concepts like information theory, because most statistical tests can easily be performed nowadays using a library call. No one asks people to program in assembler to prove they can program anymore, so why would you memorize 30 different frequentist statistical tests and all of the assumptions that go along with each? Concepts like algorithmic complexity, minimum description length, and model selection are much more valuable.
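To make that concrete: a two-sample t-test really is a one-liner with scipy, so the scarce skill is knowing whether its assumptions hold, not the mechanics. Data below is invented:

```python
# A statistical test as a single library call: two small samples with
# clearly different means. The hard part is judging the assumptions
# (independence, rough normality, equal variances), not the call itself.
from scipy import stats

a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [5.8, 6.1, 5.9, 6.0, 6.2, 5.7, 6.1, 5.9]

t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
```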
On this specific point, it's worth noting that up until now there's been a single massive repository of every Julia package ever published, regardless of its current state or utility. Starting with the upcoming 0.7 release, Julia will introduce the concept of "curated" repositories so that, going forward, if you stick just with the default curated repository of packages you should have much less chance of running into a broken or unmaintained package.
- Jupyter + Pandas for exploratory work, quickly define a model
- Go (Gonum/Gorgonia) for production quality work. (here's a cheatsheet: https://www.cheatography.com/chewxy/cheat-sheets/data-scienc... . Additional write-up on why Go: https://blog.chewxy.com/2017/11/02/go-for-data-science/)
I echo ms013's comment very much. Everything is just tools; it's more important to understand the math and the domain.
Go is quite straightforward though - WYSIWYG for the most part, hence you probably won't find a lot of sexy tutorials. Almost everything is just a loop away, and in the next version of Gorgonia, even more native looping capability is coming.
... and in particular the resource lists at
- https://github.com/gopherdata/resources
Also, Dan's GopherCon talk on Go for data science is a great way to get yourself convinced enough to try it out:
Programming languages:
- python (for general purpose programming)
- R (for statistics)
- bash (for cleaning up files)
- SQL (for querying databases)
Tools:
- Pandas (for Python)
- RStudio (for R)
- Postgres (for SQL)
- Excel (the format your customers will want ;-) )
Libraries:
- SciPy (ecosystem for scientific computing)
- NLTK (for natural language)
- D3.js (for rendering results online)

It is worth understanding the concepts of numpy and pandas. Furthermore, try out IPython/Jupyter, especially for rapid publishing (people run their blogs on Jupyter notebooks).
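Two numpy concepts especially worth internalizing are vectorization and broadcasting; a minimal illustration with made-up numbers:

```python
import numpy as np

# Broadcasting: numpy pairs arrays of compatible shapes without explicit
# loops. A (3,) array times a (2, 1) array yields a (2, 3) result.
prices = np.array([10.0, 20.0, 30.0])   # shape (3,)
quantities = np.array([[1], [2]])       # shape (2, 1)

revenue = prices * quantities           # broadcasts to shape (2, 3)
print(revenue)
```

Once this clicks, most pandas operations (which sit on top of numpy) read naturally too.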
I think certain libraries depend very much on where you focus. Machine learning? Natural language processing? Visualization? Something in economics? Fundamental sciences? For instance, I never need NLTK in theoretical astrophysics ;-) Instead, I need powerful GPU based visualization, which is however very old school with VTK and Visit/Amira/Paraview (also very much pythonic).
If you're doing a lot of work with matrices, model fitting in production, then Python seems fine. However, a lot of data scientists I see are more like scrappy data analysis / visualization types, who are churning out small dashboards. In that case R's tidyverse and shiny are just incredibly fast to develop with.
For powerful GPU viz, have you considered vispy? Four authors of four independent Python science visualization libs got together to build it.
$ docker run -it --rm -p 8888:8888 jupyter/datascience-notebook
WRT bash, where to begin? Over the past 40 years, a better tool has appeared for pretty much everything people try to do with bash. It lives on mostly through inertia and pride.
1. Easier interpretation of results than frequentist methods for lay people (business strata, elected officials, or other decision makers)
2. Uncertainty can be quantified and visualized reasonably well, which helps decision makers not think of stats as a magic box that produces a single answer.
3. Sensitivity analysis can be placed right up front: selection of priors representative of the beliefs of differing opinions / ideologies can inform decision makers of when they should consider changing their minds, and when they might still hold out.
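A toy sketch of point 3, with all numbers invented: a Beta-Binomial model where a skeptic and an optimist hold different priors over a conversion rate and both observe the same 30-out-of-100 data. Conjugacy makes the update a one-liner:

```python
# Prior-sensitivity sketch (invented numbers). Two decision makers hold
# different Beta priors over a rate; both see 30 successes in 100 trials.
# Beta(a, b) prior + Binomial data gives a Beta(a + k, b + n - k) posterior.

def beta_posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Beta(a, b) prior after observing Binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)

skeptic = beta_posterior_mean(2, 20, 30, 70)    # prior belief: rate is low
optimist = beta_posterior_mean(20, 2, 30, 70)   # prior belief: rate is high

print(f"skeptic: {skeptic:.3f}, optimist: {optimist:.3f}")
# Both posteriors are pulled toward the data (0.30); the remaining gap shows
# how much each stakeholder's prior still matters at n = 100.
```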
Downsides of Bayesian methods:
1. Conceptually more involved than typical maximum likelihood estimation methods
2. Computationally expensive
3. Methods might not be as well known to a nominally stats-savvy audience.
Also, get used to reading the Stan forums on Discourse. Happy Stanning!
A sound understanding of mathematics, in particular statistics.
It's amazing how many people will talk endlessly about the latest Python/R packages (with interactive charting!!!) who can't explain Student's t-test.
Libs:
- Dask for distributed processing
- matplotlib/seaborn for graphing
- IPython/Jupyter for creating shareable data analyses
Environment:
- S3 for data warehousing; I mainly use Parquet files with pyarrow/fastparquet
- EC2 for Dask clustering
- Ansible for EC2 setup
My problems usually can be solved by 2 memory-heavy EC2 instances. This setup works really well for me. Reading and writing intermediate results to S3 is blazing fast, especially when partitioning data by days if you work with time series.
Lots of difficult problems require custom mapping functions. I usually use them together with dask.dataframe.map_partitions, which is still extremely fast.
The most time-consuming activity is usually nunique/unique counting across large time series. For this, Dask offers hyperloglog based approximations.
To sum it up, Dask alone makes all the difference for me!
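For the curious, the HyperLogLog idea behind those approximate distinct counts can be sketched in a few lines. This is a simplified toy to show the principle, not Dask's actual implementation:

```python
# Toy HyperLogLog: estimate the number of distinct items using 2**p small
# registers. Each register remembers the maximum "rank" (position of the
# leftmost 1-bit) seen among hashes routed to it; rarer leading-zero runs
# imply more distinct items.
import hashlib
import math

def hyperloglog_estimate(items, p=10):
    m = 1 << p
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)                     # low p bits choose a register
        w = h >> p                            # remaining 64 - p bits
        rank = (64 - p) - w.bit_length() + 1  # leftmost 1-bit position
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)          # standard bias-correction constant
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:         # small-range (linear counting) fix
        estimate = m * math.log(m / zeros)
    return estimate

print(round(hyperloglog_estimate(range(10_000))))  # roughly 10000
```

The payoff is memory: 1024 tiny registers instead of a set of 10,000 (or 10 billion) keys, at the cost of a few percent error.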
I just see the term flung around so much recently, and applied to so many different roles, that it has all become a tad blurred.
Maybe we need a Data Scientist to work out what a Data Scientist is?
It means someone who can work with business stakeholders to break down a problem e.g. "we don't know why customers are churning", produce a machine learning model or some adhoc analysis (usually the former) and either communicate the results back or assist in deploying the model into production.
Typically there will be data engineers who will be doing acquisition and cleaning and so the data scientists are all about (a) understanding the data and (b) liaising with stakeholders.
As for technologies, it is typically R/Python with Spark/H2O on top of a data lake (i.e. HDFS, S3), every now and again on top of a SQL store (e.g. EDW, Presto) or a feature store (e.g. Cassandra).
https://towardsdatascience.com/data-is-a-stakeholder-31bfdb6...
(Disclaimer: I wrote the post at the above link).
If you have a sound design you can still create a huge amount of value even with a very simple technical toolset. By the same token, you can have the biggest, baddest toolset in the world and still end up with a failed implementation if you have bad design.
There are resources out there for learning good design. This is a great introduction and points to many other good materials:
https://www.amazon.com/Design-Essays-Computer-Scientist/dp/0...
1. You need research skills that will allow you to ask the right questions, define the problem and put it in a mathematical framework.
2. Familiarity with math (which? depends on what you are doing) to the point where you can read articles that may have a solution to your problem and the ability to propose changes, creating proprietary algorithms.
3. Some scripting language (Python, R, w/e)
4. (optional) Software Engineering skills. Can you put your model into production? Will your algorithm scale? Etc.
Here are 3 questions I was recently asked in a bunch of DS interviews in the Valley.
1. The probability of seeing a whale in the first hour is 80%. What's the probability you'll see one by the next hour? By the next two hours?
2. In a closely contested election with 2 parties, what's the chance a single person will swing the vote, if there are n = 5 voters? n = 10? n = 100?
3. Difference between Adam and SGD.
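Worked sketches for questions 1 and 2, under one common set of assumptions (the interviewer may well intend different ones):

```python
from math import comb

# Q1: assume sightings in disjoint hours are independent with the same rate.
# P(see in one hour) = 0.8 means P(miss) = 0.2 per hour, so
# P(see within k hours) = 1 - 0.2**k.
p_2h = 1 - 0.2 ** 2   # 0.96
p_3h = 1 - 0.2 ** 3   # 0.992

# Q2: your vote swings the election exactly when the other n - 1 voters tie.
# With each of them voting 50/50 independently (and n odd so a tie is
# possible), P(tie) = C(n-1, (n-1)//2) / 2**(n-1).
def swing_probability(n):
    return comb(n - 1, (n - 1) // 2) / 2 ** (n - 1)

print(p_2h, p_3h, swing_probability(5), round(swing_probability(101), 4))
```

Note the n = 10 and n = 100 cases need care: with an even electorate there is no exact tie under this model, which is usually the follow-up discussion the interviewer wants.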
Numba for custom algorithms.
Dataiku (amazing tool for preprocessing and complex flows)
Amazon RDS (Postgres), but thinking about Redshift.
Spark
Tableau or plotly/seaborn
* statistical methods (more math)
* big, in-production model fitting (more python)
* quick, scrappy data analyses for internal use (more R)
For example, I would feel weird writing a robust web server in R, but it's straightforward in python. On the other hand R's shiny lets you put up quick, interactive web dashboards (that I wouldn't trust in exposing to users).
Deep learning addresses it to some extent, but isn’t always the best choice if you don’t have image / text data (e.g. tabular datasets from databases, log files) or a lot of training examples.
I’m the developer of a library called Featuretools (https://github.com/Featuretools/featuretools) which is a good tool to know for automated feature engineering. Our demos are also a useful resource to learn using some interesting datasets and problems: https://www.featuretools.com/demos
No it won't.
That combination can't handle the large datasets that are typical for most data science teams - you'd at least want to include PySpark. And then it's very limited as far as ML/DL technologies go.
Pandas and Spark are both DataFrame libraries, and seem to offer very similar functionality to me. Why do you prefer Spark over Pandas?
> very limited so far as ML/DL technologies
I mean, getting Tensorflow up and running with GPU support isn't trivial, but it's not exactly hard, and Keras[1] provides excellent support for a wide variety of other backends. What, in your experience, is less limited?
[1]: https://keras.io/
I would bet that the mean size of dataset people are dealing with is a lot bigger than the median size.
Pattern matching helps you write code faster (that is, spending less human time).
Algebraic data types, particularly sum types, let you represent complicated kinds of data concisely.
Coconut is an extension of Python that offers all of those.
Test driven development also helps you write more correct code.
It seems like getting into the upper echelons of Kaggle is a matter of refining your model, and I do wonder how much value these refinements offer over a more basic and general approach in a real world scenario. To be clear, when I say I wonder, I'm not saying I'm rejecting the value, I really do mean it, I'm uncertain about the value. I think it's probably very scenario specific.
Think of it this way - a predictive value of 90% vs 95% could be the difference between placing in the top 10% and the bottom third. Now, 5% isn't nothing, it could be very valuable. It really depends.
But Kaggle is an environment where the question is already posed, the data has been collected, the test and train sets are already split apart for you, and the winning model is the one that scores best on a hidden test set by a predefined goodness of fit score.
In a real world scenario, suppose someone does a great job figuring out the question to ask, gathering the data, and determining the most effective way to act on the results, but uses a fairly basic, unrefined model. Someone else does a middling job on those things, but builds a very accurate model as measured by the data that has been collected. I'd say the first scenario is likely to be more valuable, but again, it depends of course.
A couple of other things, since I am a fan of Kaggle and do highly recommend it. First, these things aren't necessarily exclusive - you can have a particularly well conceived and refined model as well as a thorough and excellent business and data collection process (though you may have to decide where to put your time and resources).
Also, refining a model with Kaggle can be an exceptional training opportunity to really understand what drives these things. So go for it! (I also find these things kinda fun).
A fantastic tree visualization framework; it's intended for phylogenetic analysis but can really be used for any type of tree/hierarchical structure.
That's fine, but when it comes time to create some customer segmentation models (or whatever) the data scientist they hire is going to need to know how to get the raw data. Questions become: how do I write code to talk to this API? How do I download 6 months of data, normalize it (if needed) and store it in a database? Those questions flow over into: how do I set up a hosted database with a cloud provider? What happens if I can't use the COPY command to load in huge CSV files? How do I tee up 5 TB of data so that I can extract from it what I need to do the modeling? Then you start looking at BigQuery or Hadoop or Kafka or NiFi or Flink and you drown for a while in the Apache ecosystem.
If you take a job at a place that has those needs, be prepared to spend months or even up to a year to set up processes that allow you to access the data you need for modeling without going through a painful 75 step process each time.
Case in point: I recently worked on a project where the raw data came to me in 1500 different Excel workbooks, each of which had 2-7 worksheets. All of the data was in 25-30 different schemas, in Arabic, and the Arabic was encoded with different codepages, depending on whether it came from Jordan, Lebanon, Turkey, or Syria. My engagement was to do modeling with the data and, as is par for the course, it was an expectation that I would get the data organized. Well - to be more straightforward, the team with the data did not even know that the source format would present a problem. There were ~7500 worksheets, all riddled with spelling errors and the type of things that happen when humans interact with Excel: added/deleted columns, blank rows with ID numbers, comments, different date formats, PII scattered everywhere, etc.
A data scientist's toolkit needs to be flexible. If you have in mind that you want to do financial modeling with an airline or a bank, then you probably can focus on the mathematics and forget the data wrangling. If you want the flexibility to move around, you're going to have to learn both. The only way to really learn data wrangling is through experience, though, since almost every project is fundamentally different. From that perspective, having a rock solid understanding of some key backend technologies is important. You'll need to know Postgres (or some SQL database) up and down; how to install, configure, deploy, secure, access, query, tweak, delete, etc. You really need to know a very flexible programming language that comes with a lot of libraries for working with data of all formats. My choice there was Python. Not only do you need to know the language well, you need to know the common libraries you can use for wrangling data quickly and then also for modeling.
IMO, job descriptions for "Data Scientist" positions cover too broad of a range, often because the people hiring have just heard that they need to hire one. Think about where you want to work and/or the type of business. Is it established? New? Do they have a history of modeling? Are you their first "Data Scientist?" All of these questions will help you determine where to focus first with your skill development.
Also - your model of asking questions before starting a new gig is very relevant to nearly every programming job. Could also be some of the questions a candidate asks in an interview.
Have you ever needed any Microsoft skills (MSSQL/C#) so far?
However, I can't seem to recall the name. Has anyone seen what I'm talking about?
Oh, I don't know about that. Programming languages are force multipliers, and each language has different force coefficients for different problem domains. They are not all equivalent. They have their different points of leverage, and simply being good in one does not mean you can solve problems in any domain with ease. In fact, the wrong programming language can often be harmful if it's ill-suited to the problem at hand, and especially if it contorts your mental model of what you can do with the data.
One example I encounter a lot in industry is Excel VBA. I'm fairly good at VBA and have seen very sophisticated code in VBA. I've also seen many basic operations implemented badly in VBA that should not have been written in VBA at all. By solving the problem in VBA, the solution is often "hemmed in" by the constraints of VBA.
For instance, unpivoting data is often done badly in VBA (with for-loops), but is trivial to do well in dplyr or pandas.
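To illustrate the unpivot point with toy (invented) data, it really is a single pandas call:

```python
import pandas as pd

# Wide data: one column per month. Unpivoting ("melting") turns the month
# columns into rows, which in VBA typically means nested for-loops.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 80],
    "feb": [120, 90],
})

long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)
```

The dplyr equivalent is a similar one-liner with pivot_longer.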
So I would say one has to choose one's programming language somewhat carefully. Not any language will do.
Every single large scale data science team (e.g. Google, Spotify, Airbnb) will be using Spark for most of their work. It is by far the de facto standard for working with large datasets, especially since it integrates so well with machine learning (H2O) and different languages (Scala, Python, R).
Would you use PySpark MLlib in a web service instead of scikit-learn?
However, if you use a lot of UDFs where Spark has to serialize your Python functions, you might consider rewriting those UDFs in a JVM language. Serialization overhead is still fairly substantial. Arrow is trying to address this by implementing a common in-memory format, but it's still early days.
I would still recommend PySpark to most people. It's more than good/fast enough for most data munging tasks. Scala does buy you two things - type safety and low serialization overhead (i.e. significant!) - which can be critical in some situations, but not all.
Also, the Python way has always been to prototype fast, profile, and rewrite bottlenecks in a faster language, and PySpark conforms to that pattern.
2) Spark MLlib is still fairly rudimentary in its coverage of major ML algorithms, and Spark's linear algebra support, while serviceable, is currently not very sophisticated. There are a few functions that are useful in the data prep stage (encoding, tokenizers, etc.) but overall, we don't really use MLlib very much.
Companies that have simple needs (e.g. a simple recommender) and that don't have a lot of in-house expertise, might use MLlib though -- I believe someone from a startup said that they did at a recent meetup.
Most of us need better algorithmic coverage and Scikit's coverage is currently much better, plus it is more mature. We also have Numpy at our disposal, which lets us do matrix-vector manipulation easily. There is some serialization cost, but we can usually just throw cloud computational power at it.
Also note that for most workloads, the majority of the cost is incurred in training. For models in production, one is typically processing a much smaller amount of data using a trained model, so less horsepower is required.
Language agnostic: XGBoost, LibLinear, Apache Arrow, MXNet
The "data scientist" title would apply only if you are applying the scientific method to discover new facts about the natural world exclusively through data analysis (as opposed to observation and experiments).
The analysis part is usually quite simple, often if it gets really complex then that's a sign that the data is being tortured. Sometimes the marginal gains that complex methods create (vs simple but good approaches) are not worthwhile even if they are valid - simply in terms of time spent and difficulty in communications.
Or maybe the humanities as a whole should be considered "not science".
Besides, a data analyst who doesn't use the scientific method is just a bad analyst. Some media outlets showcase blatantly misleading charts made by people who understand the technicals but get everything wrong about the concepts.
So this is my advice: focus on understanding the concepts before the tooling. That is what will really determine your value.