A Gentle Visual Intro to Data Analysis in Python Using Pandas

(jalammar.github.io)

195 points · jalammar · 7y ago · 51 comments

47 comments · 10 top-level

Bishonen887y ago· 11 in thread

A very gentle intro, indeed ;)

Pandas is such a vast monster, that even after going through the book from the original author of Pandas (https://www.safaribooksonline.com/library/view/python-for-da...), I was absolutely unprepared for doing real analysis.

Whilst I understood the basics, such as data loading, (simple) cleaning, selections, functions, groupby, indexes etc., I spent most of my time on Stack Overflow looking for solutions to actual problems I was facing. I reckon that many other users have had the same experience: there is lots of general info out there when it comes to pandas, but every dataset is different and the devil lies in the details. Long story short: learning pandas is all about trial and error, and it will take months (years, even) to become efficient in it.

samfriedman7y ago

As a daily user of pandas for a few years now, I really must suggest that anyone looking to use it for serious data analysis familiarize themselves with the Split/Apply/Combine paradigm [0].

Lots of data munging has been enabled or sped up by judicious application of those concepts.

[0] https://pandas.pydata.org/pandas-docs/stable/groupby.html
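To make the Split/Apply/Combine idea concrete, here is a minimal sketch. The `sales` frame and its column names are invented for illustration and are not from the linked docs.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [10, 20, 30, 40, 50],
})

# Split the rows by region, apply a sum to each group,
# and combine the per-group results into one Series.
totals = sales.groupby("region")["amount"].sum()

# transform() also works per group, but returns a result aligned
# with the original rows, which is handy for per-group ratios.
region_total = sales.groupby("region")["amount"].transform("sum")
sales["share_of_region"] = sales["amount"] / region_total

print(totals)
print(sales)
```

The aggregation shrinks the data to one row per group, while `transform` keeps the original shape; which one fits depends on whether you need a summary or a per-row value.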

gjreda7y ago

I agree. Hadley Wickham (a very prolific author of important R libraries) wrote a great paper about this method using one of his libraries. I'm a Python + pandas user, but his paper really helped me understand the approach better: https://vita.had.co.nz/papers/plyr.pdf

dorfsmay7y ago

It might take months to be able to use its full power and be efficient (delivering at a high velocity), but as someone who was proficient in Python but had never used pandas, it took me just over a week to write a process to clean data and produce graphs (with seaborn) to compare sets with boxplots. This includes anonymizing the data properly and playing with different Tukey's fence values for the graphs to make the most sense. This is after people spent weeks trying, and failing, to get a similar process working with Excel.

It's good to warn people, but let's not scare them.

istjohn7y ago

Serious question: I've tried to use Pandas for some data analysis for my small business. Data sets are on the order of 10,000 data points or less. After struggling for days with Pandas, I've begun to wonder if it wouldn't be easier to code the analyses in raw Python. I wouldn't mind taking longer to complete the task at hand if in the process I was acquiring skills that will pay off down the road, but I wonder if Pandas isn't so esoteric and difficult that I may never reach the point that I can cash in that investment of time and effort as long as I am only a casual user.

In contrast, while I'm not an expert in JS or Python, I find that time spent struggling with those technologies pays dividends since the lessons learned make everything I do in the future easier.

This is highly subjective of course, but in your opinion, should I keep fighting with Pandas? Is it worth it?

laichzeit07y ago

Make sure you learn what a Series is and how it relates to the things in the DataFrame and how selection works, specifically .loc and .iloc. Then your life will be much easier. Try starting with this article: https://medium.com/dunder-data/selecting-subsets-of-data-in-...
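A small sketch of those pieces, assuming a made-up frame with string row labels so that label-based and position-based selection are easy to tell apart:

```python
import pandas as pd

# Hypothetical data; the labels and columns are invented for illustration.
df = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "qty": [4, 5, 6]},
    index=["apple", "pear", "plum"],
)

col = df["price"]                     # one column is a Series sharing the frame's index
row = df.loc["plum"]                  # one row is also a Series, indexed by column names

by_label = df.loc["pear", "qty"]      # .loc selects by row label and column label
by_position = df.iloc[1, 1]           # .iloc selects by integer position

# A boolean Series can be used as a row selector inside .loc.
expensive_qty = df.loc[df["price"] > 1.75, "qty"]

print(by_label, by_position)
print(expensive_qty)
```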

rchaud7y ago

10,000 data points is well within the range of what Excel can handle without needing PowerQuery. What type of analyses are you attempting?

cuchoi7y ago

I would recommend that you "keep fighting with Pandas". Many of its features seem confusing at first, but later on you see the value of them.

geebee7y ago

I think some of the problem is that people who use pandas don't necessarily know how to drop one level in programming. Like, if the file isn't a nicely formatted csv, they don't know how to read and parse the file directly. If they can't use basic filtering or a boolean mask easily with pandas, they don't know how to use lists, loops, and conditionals directly. It's great to use pandas rather than reinventing the wheel, pandas is an excellent library, but if you're going to deal with data at a very intricate level, you do need to know how and when to punt and just write the code yourself.

I think this comes up particularly in the context of pandas, because it's a common entry point into a programming language for people who don't think of themselves as programmers and may resist the notion that this is actually what they're doing.
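As an illustration of "dropping one level", the sketch below filters the same invented records twice: once with a plain loop and conditional, once with a pandas boolean mask. The data and the threshold are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical records, invented for this example.
rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 8},
    {"name": "c", "score": 5},
]

# Plain Python: a list, a loop, and a conditional.
kept_loop = []
for r in rows:
    if r["score"] >= 5:
        kept_loop.append(r["name"])

# pandas: the comparison builds a boolean mask, and .loc applies it.
df = pd.DataFrame(rows)
mask = df["score"] >= 5
kept_mask = df.loc[mask, "name"].tolist()

print(kept_loop, kept_mask)
```

Both versions express the same filter; knowing the loop form makes it easier to fall back to hand-written code when the data does not fit a tidy table.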

manojlds7y ago

I still see myself going to spark and Spark SQL for some tasks like stratified sampling which I haven't been able to properly do with pandas. Somehow the spark DF API feels more intuitive and I was able to figure out a lot by myself.

mharrison7y ago

That's interesting as the Spark api was inspired by Pandas

xiao_haozi7y ago

Can you use groupby for your stratified sampling work?
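One way this could look, as a sketch rather than a complete answer: it assumes pandas 1.1 or newer, where grouped objects have a `sample` method, and uses an invented `stratum` column. On older versions, `groupby(...).apply(lambda g: g.sample(frac=0.5))` is a common substitute.

```python
import pandas as pd

# Hypothetical data with two strata of different sizes.
df = pd.DataFrame({
    "stratum": ["a"] * 6 + ["b"] * 4,
    "value": range(10),
})

# Draw the same fraction from every stratum, so each group keeps
# its relative weight in the sample.
sample = df.groupby("stratum").sample(frac=0.5, random_state=0)

counts = sample["stratum"].value_counts()
print(counts)
```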

Mefis7y ago· 10 in thread

How does Pandas compare to R's Tidyverse?

Tidyverse was super easy to pick up, and I can do almost anything I want with it. Why would I want to switch to Pandas?

Has anyone tried the Python tidyverse port? How does it compare to the original?

wenc7y ago

Echoing other comments, Tidyverse is somewhat more coherent (aided significantly by magrittr's %>% operator). Beginners might get tripped up by Non-Standard Evaluation (NSE), which is a little unintuitive, but there are packages to help with that.

The Pandas API is a generalized solution to complicated, variegated use cases, and its syntax reflects that (it was also hemmed in by the strictures of Python). There are several indexing methods, several ways to slice, and several ways to do applies, all of which behave slightly differently. Even expert Pandas users have trouble remembering the syntax for all of these, so they typically have a Pandas API browser window open or a printed cheat sheet pasted on some corkboard. Pandas definitely takes longer to get used to than Tidyverse, but the payoff is that you get to use Python, which is a somewhat "deeper" language than R.

R is great for interactive work, and for data munging jobs that don't interact too much with non-R libraries. However, Python is simply more versatile end-to-end.

I used to start my interactive analysis in R and port to Python for production, but these days I start in Python straight away so there's no impedance mismatch. I've personally found writing production code in Python (and by extension Pandas) to be much more pleasant than in R, even with Tidyverse.

peatmoss7y ago

The Tidyverse is more coherent and is generally bigger than what’s just in Pandas (R’s Tidyverse; I haven’t used the Python port).

If you already have a good grasp of Python, sure why not learn Pandas too? In my case, I’m reasonably ambidextrous in Python and R but find myself not reaching for Python unless there are colleague / deployment considerations that remove R as an option. The reason? R’s Tidyverse is pretty awesome, and reflects one of the better parts of the R language, namely the meta programming that is a holdover from Scheme’s influence on R.

Now, if you don’t already know Python and don’t have some other reason (such as specific deployment considerations or a team of Python collaborators) to learn? I don’t think so. Python is a fine language, just as R is a fine language. You’re already getting things done in R.

If you want a mental challenge, or to get in on the ground floor of something that might be the future, learn Julia, or F#, or (my favorite) Racket. Or heck, learn Spark, or a new modeling method.

danso7y ago

Pandas' syntax and conventions are significantly more cumbersome than R, but it does pretty well given the Python syntax and convention that it has to work with. I haven't done a lot with pandas because of how difficult it is to remember the syntax and API, but I feel it's good enough that if you're already a Python user, you can stick to doing your data work in pandas rather than move over to R.

jalammarOP7y ago

I haven't used tidyverse myself, but I know that pandas is heavily influenced and inspired by R. Most analysis tasks are doable in both platforms. If later stages of your pipeline involve deep learning (or machine learning, generally), then it could pay to be in the python universe given the wide adoption of python ML/DL tools. I generally wouldn't advise switching unless you have a certain pain point, though.

wenc7y ago

> pandas is heavily influenced and inspired by R.

Is it? How so?

minimaxir7y ago

I use both: Python/Pandas for working with production code and pipelining TensorFlow/Keras code, and R/tidyverse/ggplot2 for ad hoc data reports and visualizations. They both have their advantages and disadvantages and it doesn't hurt to know both workflows.

_Wintermute7y ago

I find pandas far easier to actually program with, whereas the tidyverse is better for quick one-off scripts. The tidyverse's obsession with non-standard evaluation makes writing functions more difficult than it should be, and readability goes out the window when using tidyeval.

alexcnwy7y ago

Neural net universe is in Python and you can use Python to build production pipelines.

thanatropism7y ago

Pandas is inspired by R's dataframes, which I'm told are native.

minimaxir7y ago

Native doesn't necessarily mean it's the best option. (tidyverse/dplyr leverages Rcpp for data transformation, which makes it a lot faster at common ETL tasks)

jalammarOP7y ago· 8 in thread

Hello HN, author here. If you've ever wanted to get into data analysis, this is my best attempt at getting you past that first hump. A lot of these concepts are easier than you might think.

qwerty4561277y ago

Thank you very much. I would certainly love to read more and more about Pandas (or anything) written in this style and go deeper into the subject.

Are you going to write more? Can you (or anybody) recommend where (a book, a YouTube channel, a website or whatever) I should continue from the point where your intro ends? As for now, all I use of Pandas is a datetime-indexed array of real numbers plus simple vector operations on its columns, but I feel I would like to take a learning/career path to becoming a Pandas expert.

happy-go-lucky7y ago

You may want to check this out:

> Short hands-on challenges to perfect your data manipulation skills

https://www.kaggle.com/learn/pandas

Also this:

> Things in Pandas I Wish I'd Known Earlier

http://nbviewer.jupyter.org/github/rasbt/python_reference/bl...

skadamat7y ago

Hey there, I'm involved with Dataquest and we have a Pandas and NumPy fundamentals course where we dive into more intermediate concepts like vectorization, key data structures, and the key functions.

https://www.dataquest.io/course/pandas-fundamentals

We use a similar approach to the OP. Lots of diagrams and visual aids and you always work with a real dataset.

jalammarOP7y ago

A reasonable next step would be to pick up a dataset that interests you (in a domain you're comfortable with) and explore it with pandas. Kaggle has a bunch of data sets (https://www.kaggle.com/datasets) in various domains. You can look at the "Kernels" where other users often use pandas to uncover insights and show you their process.

Thanks for the kind words!

lesss3657y ago

This is a perfect out-of-tutor-session reference for my novice data analysis pupils. Will be sharing with them later today. Thank you!

bussiere7y ago

Thanks for the tutorial, it's really what I was looking to give to some friends.

reacharavindh7y ago

Very pleasant to read. I will pass this to my wife who is an accounting professor trying to break ground into using Python/Pandas/Numpy instead of Stata.

I really enjoyed your style of writing and use of visual examples. I wish for such an explainer for SQL. If you made that as a book, you can run with my money.

pyrenan7y ago

Thanks for this. This is very useful for a beginner.

happy-go-lucky7y ago· 4 in thread

Note to author:

> We can select one or multiple rows using their numbers (inclusive of both bounding row numbers):

> df[1:3]

That will slice beginning from the row with integer location 1 up to 3, exclusive of the last element. So, just two rows, not three as shown.

jalammarOP7y ago

Thanks for the heads up! Indeed it's df.loc[1:3] that would return three rows, not straight-up df[1:3] which indeed returns two rows.

Edit: Corrected in the post. Thanks again!
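The difference discussed above can be checked on a tiny frame. This is a sketch with invented data and the default integer index, so that positions and labels happen to coincide:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

positional = df[1:3]     # integer slice: positions 1 and 2, end excluded (like a list)
by_iloc = df.iloc[1:3]   # explicit positional slice, same two rows
by_label = df.loc[1:3]   # label slice: labels 1, 2 and 3, end included

print(len(positional), len(by_iloc), len(by_label))
```

Label slices include the end point because labels are not necessarily integers: with an index of strings or dates there is no obvious "one past the end" label to name, so `.loc` treats both bounds as inclusive.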

rapfaria7y ago

Genuine question, did you run the original before publishing it on the website?

yufeng667y ago

But why design it that way? Seems to be a sure way to confuse new users.

happy-go-lucky7y ago

(＾＾)ｂ

NightlyDev7y ago· 3 in thread

What kind of syntax is var['string'] in these examples?

Haven't really used python for anything and I'm just wondering, since it looks like an array or map, but clearly seems to have some logic behind it as it seems to reference the specified column at each row. What is this functionality, something that's built in to python or use of some sort of magic functions?

cgriswald7y ago

var['string'] means get the item of object var with key 'string'. Any Python class can define the magic method __getitem__ to define the behavior of/overload the [] operator.
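A minimal sketch of that mechanism in plain Python. The `Table` class is a toy invented for this example and is not how pandas is implemented:

```python
class Table:
    """Toy container (hypothetical, not pandas) that stores columns by name."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # Python calls this method whenever code evaluates table[key].
        return self._columns[key]


table = Table({"name": ["ann", "bob"], "age": [31, 45]})

ages = table["age"]                   # square brackets dispatch to __getitem__
same = table.__getitem__("age")       # the explicit spelling of the same call

print(ages, same)
```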

NightlyDev7y ago

Thanks! Didn't know python had magic functions so I was really confused.

xapata7y ago

It's key-lookup syntax, like for a dict (map or hashmap in other languages). A Pandas DataFrame can be thought of as a dict of columns.
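The dict-of-columns analogy can be seen directly. A short sketch with invented data:

```python
import pandas as pd

# A plain dict mapping column names to lists of values (made up for illustration).
data = {"name": ["ann", "bob"], "age": [31, 45]}
df = pd.DataFrame(data)

column_names = list(df.keys())        # keys() returns the column labels
has_age = "age" in df                 # membership tests column labels, like a dict
ages = df["age"]                      # key lookup returns the column as a Series

# Assigning to a new key adds a column, much like adding a dict entry.
df["age_next_year"] = df["age"] + 1

print(column_names, has_age)
print(df)
```

The analogy stops at rows: unlike a dict of lists, every column shares one index, which is what makes row selection and alignment possible.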

Scarbutt7y ago· 1 in thread

Curious: for stuff like this, why not just use SQL (SQLite)?

joelschw7y ago

There are some things which Pandas is just better at, such as: extracting content via RegEx and pivoting... However, there are also some situations where you should use SQL such as UPSERT or date-range joins.
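A rough sketch of the two pandas strengths mentioned here, regex extraction and pivoting. The log lines and field names are invented for illustration:

```python
import pandas as pd

# Hypothetical text lines; the format is an assumption made up for this example.
logs = pd.DataFrame({
    "line": [
        "2018-10-01 user=ann status=200",
        "2018-10-01 user=bob status=404",
        "2018-10-02 user=ann status=200",
        "2018-10-02 user=ann status=404",
    ]
})

# str.extract turns named regex groups into columns.
pattern = r"^(?P<day>\S+) user=(?P<user>\w+) status=(?P<status>\d+)$"
parts = logs["line"].str.extract(pattern)

# pivot_table reshapes long data into a user-by-status grid of counts.
counts = parts.pivot_table(
    index="user", columns="status", values="day", aggfunc="count", fill_value=0
)

print(parts)
print(counts)
```

The extracted columns are strings, so the pivoted column labels are `"200"` and `"404"` rather than integers; convert with `astype(int)` first if numeric labels are wanted.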

skadamat7y ago

Love this intro! The popular dataframe-oriented tools (tidyverse, pandas, etc.) all require familiarity with vectorization and the related mental models. I'm involved with Dataquest Labs and we teach data science interactively in the browser. We're pretty big believers in using diagrams and visual aids to help people learn these concepts as well.

We've had a pandas course (https://www.dataquest.io/course/pandas-fundamentals) for a while and we just launched some R courses that teach a lot of vectorization (https://www.dataquest.io/path/data-analyst-r).

BrandoElFollito7y ago

I wanted to try to analyze application logs (usually a timestamp and some text) but all examples in pandas deal with numbers.

Is this useful for the analysis of such data (with a mid-term machine learning goal of clustering and anomaly detection)?

catacombs7y ago

This is great. I hope the author posts more Pandas visual guides.

pleasecalllater7y ago

The last time I wanted to use Pandas, it ate 32GB of RAM and then I just killed it, and did all the analysis in Postgres.
