The general strategy is to build the core of any dataset as a SQL query that handles joins and performance-sensitive parts of the query, then polish/plot/yeet into weird shapes with Pandas since it offers much greater expressivity.
.astype({
'central_air': bool,
'ms_subclass': 'uint8',
...
})
Now if, say, ms_subclass and overall_qual need different types, that's an easy diff to read. Ah, but I suppose that wouldn't be as Twitter-friendly.It’s just a bad programming habit thing.
The responses seem out of context, too:
>David: What's the elevator pitch for writing pandas code the way that you do?
>Matt: One common thing that you'll see in the data science world is this notion that there's like Untitled1.ipynb and Untitled2.ipynb[...]. My goal is to help with that so (...) you have Analysis_for_ClientA.ipynb and that's the only notebook you have. And you can come back to it tomorrow and pick it up where you left off and you're going to be productive. Your code will be easier to read[...].
This is a tweet. Filenames aren't even argued. This doesn't answer the interviewer's question either. Writing code != naming files.
>David: What is it that separates beginner pandas code from professional pandas code?
>Matt: I would say that if you want to write good pandas code (...) you should know how to write lambdas. You should know how to do list and dictionary comprehensions. Dictionary unpacking (...) is super useful in pandas world.
Absolutely. But professionals use variables, too. Possibly even more so.
Oh boy, do i care about every single intermediate step though!
Especially in pandas, where we play "where's the NaN" all the damn time.
I think is one area where pandas and Polaris can be improved. How do you write long chains and slot in breaks and testing?
I prefer assigning lists like that to informatively name variable rather than have leave them the subject of speculation. It’s easier yo add add clarifying comments that way too.
In sql or pandas, long lists of values not broken up are hard to read. It’s easy to scan down a single value on each row, not random length values spread randomly across the screen.
Also That is chaining far too much in a single go
Edit: Here is a library that brings pipes to pandas https://github.com/pwwang/datar
Another good example of this style is tf.data pipelines. Also a very nice API.
He doesn't articulate any of the virtues of it either, aside from some hand waving about 'memory' that doesn't get fleshed out.