Pandas is such a vast monster that even after going through the book by the original author of Pandas (https://www.safaribooksonline.com/library/view/python-for-da...), I was absolutely unprepared for doing real analysis.
Whilst I understood the basics, such as data loading, (simple) cleaning, selections, functions, groupby, indexes etc., I spent most of my time on Stack Overflow looking for solutions to the actual problems I was facing. I reckon that many other users have had the same experience: there is lots of general info out there when it comes to pandas, but every dataset is different and the devil is in the details. Long story short: learning pandas is all about trial and error, and it will take months (years even) to become efficient in it.
Lots of data munging has been enabled or sped up by judicious application of those concepts.
[0] https://pandas.pydata.org/pandas-docs/stable/groupby.html
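For anyone following the link, here is a minimal sketch of the split-apply-combine pattern that the groupby docs cover. It assumes that is what "those concepts" refers to; the `sales` frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the thread.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 200, 50],
})

# Split the rows by region, apply a sum to each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()

# Several aggregations at once give one column per aggregation.
summary = sales.groupby("region")["amount"].agg(["mean", "max"])

print(totals)
print(summary)
```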
It's good to warn people, but let's not scare them.
In contrast, while I'm not an expert in JS or Python, I find that time spent struggling with those technologies pays dividends since the lessons learned make everything I do in the future easier.
This is highly subjective of course, but in your opinion, should I keep fighting with Pandas? Is it worth it?
I think this comes up particularly in the context of pandas, because it's a common entry point into a programming language for people who don't think of themselves as programmers and may resist the notion that this is actually what they're doing.
Tidyverse was super easy to pick up, and I can do almost anything I want with it. Why would I want to switch to Pandas?
Has anyone tried the Python tidyverse port? How does it compare to the original?
The Pandas API is a generalized solution to complicated, variegated use cases, and its syntax reflects that (it was also hemmed in by the strictures of Python). There are several indexing methods, several ways to slice, and several ways to do applies, all of which behave slightly differently. Even expert Pandas users have trouble remembering the syntax for all of these, so they typically have a Pandas API browser window open or a printed cheat sheet pinned to a corkboard. Pandas definitely takes longer to get used to than Tidyverse, but the payoff is that you get to use Python, which is a somewhat "deeper" language than R.
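To make "several ways that behave slightly differently" concrete, here is a small sketch of a few of the indexing and apply variants. The frame and its labels are invented for illustration, and this is far from an exhaustive list.

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=["x", "y", "z"],
)

col = df["a"]               # [] with a column name returns that column as a Series
row_by_label = df.loc["y"]  # .loc looks rows up by index label
row_by_pos = df.iloc[1]     # .iloc looks rows up by integer position
filtered = df[df["a"] > 1]  # [] with a boolean Series filters rows instead

# apply also changes meaning depending on what it is called on and with which axis.
col_range = df.apply(lambda s: s.max() - s.min())        # once per column
row_sums = df.apply(lambda r: r["a"] + r["b"], axis=1)   # once per row
doubled = df["a"].apply(lambda v: v * 2)                 # once per element
```

The same square brackets select a column in one case and filter rows in another, which is a big part of why a cheat sheet helps.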
R is great for interactive work, and for data munging jobs that don't interact too much with non-R libraries. However, Python is simply more versatile end-to-end.
I used to start my interactive analysis in R and port to Python for production, but these days I start in Python straight away so there's no impedance mismatch. I've personally found writing production code in Python (and by extension Pandas) to be much more pleasant than in R, even with Tidyverse.
If you already have a good grasp of Python, sure, why not learn Pandas too? In my case, I’m reasonably ambidextrous in Python and R but find myself not reaching for Python unless there are colleague / deployment considerations that remove R as an option. The reason? R’s Tidyverse is pretty awesome, and reflects one of the better parts of the R language, namely the metaprogramming that is a holdover from Scheme’s influence on R.
Now, if you don’t already know Python and don’t have some other reason (such as specific deployment considerations or a team of Python collaborators) to learn? I don’t think so. Python is a fine language, just as R is a fine language. You’re already getting things done in R.
If you want a mental challenge, or to get in on the ground floor of something that might be the future, learn Julia, or F#, or (my favorite) Racket. Or heck, learn Spark, or a new modeling method.
Is it? How so?
Are you going to write more? Can you (or anybody) recommend where (a book, a YouTube channel, a website or whatever) I should continue from the point where your intro ends? As of now, all I use of Pandas is a datetime-indexed array of real numbers plus simple vector operations on its columns, but I feel like I would like to take a learning/career path to becoming a Pandas expert.
> Short hands-on challenges to perfect your data manipulation skills
https://www.kaggle.com/learn/pandas
Also this:
> Things in Pandas I Wish I'd Known Earlier
http://nbviewer.jupyter.org/github/rasbt/python_reference/bl...
https://www.dataquest.io/course/pandas-fundamentals
We use a similar approach to the OP. Lots of diagrams and visual aids and you always work with a real dataset.
Thanks for the kind words!
I really enjoyed your style of writing and use of visual examples. I wish there were such an explainer for SQL. If you made that into a book, you could run with my money.
> We can select one or multiple rows using their numbers (inclusive of both bounding row numbers):
> df[1:3]
That will slice beginning from the row with integer location 1 up to 3, exclusive of the last element. So, just two rows, not three as shown.
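A quick sketch to check this yourself, assuming a frame with the default integer index (the `value` column is made up):

```python
import pandas as pd

# Hypothetical frame with the default 0..4 integer index.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Plain [] slicing with integers is positional and excludes the end, like a list slice.
positional = df[1:3]

# .iloc is the explicit positional form and gives the same two rows.
explicit = df.iloc[1:3]

# .loc slices by label and includes both endpoints, so it returns three rows here.
by_label = df.loc[1:3]

print(len(positional), len(explicit), len(by_label))
```

The inclusive behaviour described in the post is what `.loc` does, which may be where the mix-up came from.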
Edit: Corrected in the post. Thanks again!
Haven't really used Python for anything and I'm just wondering: it looks like an array or map, but it clearly has some logic behind it, as it seems to reference the specified column at each row. What is this functionality? Is it something that's built into Python, or the use of some sort of magic functions?
We've had a pandas course (https://www.dataquest.io/course/pandas-fundamentals) for a while and we just launched some R courses that teach a lot of vectorization (https://www.dataquest.io/path/data-analyst-r).
Is this useful for the analysis of such data, with a mid-term machine learning goal (clustering and anomaly detection)?