I've recently started working on some projects where I need to do a lot of data visualization, storytelling, and investigation "into the data". As a programmer, getting into this stuff is far worse than I expected. Nothing works the way I would think makes sense. My biggest problem is that I'm thinking like a programmer, not like a mathematician. I expect objects, segregation or elimination of state, application and reduction, reusability, and algorithms.
Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?
What follows, below this line, is my grumbling about the things that have bothered me. Be warned if you don't like rambling and complaining. -------
Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table, but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is unintuitive and poorly documented for anyone who isn't a mathematician (.plot(lons, lats, latlons=True) is correct). Dealing with anything more than 100,000 data points is a pain to iterate on. State is everywhere it shouldn't be (matplotlib.pyplot).
While I've been working on this project, on each spin I probably spend an hour or two getting the data out of a format that doesn't make sense from a programmer's perspective, another 5 to 10 minutes writing an application/reduction, then another hour converting back into the strange data formats that matplotlib will take. All the while I'm re-running expensive computations and waiting, because I have no good persistence layer for my project.
There are things that are common in this community that I'd never dream of doing. What follows is a list of them.
1. Functions with 20-40 arguments are the norm for some reason. They also love to throw in a few insane defaults, undocumented options, and even magical flags (not enums).
Things like "draw a line, connect the dots" require you to know which 5 to 7 arguments of a massive function to set. In C/Java, when I need some flags, they probably look like this:
some_operation(some_data, DO_A | DO_C | DO_Z)
Or, if someone was feeling really nice and defined an enum & used varargs, it looks more like this:
some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)
All of these have appropriate documentation. My IDE plays nice and can complete these things. My compiler likes them and can typecheck them. I like it because I know all of the options available to me (SomeOperationFeatures.). With matplotlib you have things like `linestyle=""`. You have to go to a webpage, look through the docs, and figure out what you want. It's worth reading the docs [1] if you never have. This could very easily have been LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played nice. The 3.6 runtime's typechecking would have played nice. You would be able to see what your options are (LineStyle.).
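To make the complaint concrete, here's a sketch of the enum-style API being described. `LineStyle` and `plot_line` are made-up names for illustration, not part of matplotlib:

```python
from enum import Enum

class LineStyle(Enum):
    """Hypothetical enum standing in for matplotlib's linestyle strings."""
    SOLID = "-"
    DASHED = "--"
    DOTTED = ":"
    BLANK = ""

def plot_line(xs, ys, style: LineStyle = LineStyle.SOLID):
    """Made-up wrapper: an IDE can complete LineStyle.<TAB>, and a type
    checker can reject a typo'd string before runtime."""
    # In real matplotlib this would forward style.value as linestyle=...
    return style.value
```

With this shape, `LineStyle.` autocompletes to the full list of options, and passing anything that isn't a `LineStyle` is a type error rather than a silently wrong plot.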
2. Non-standard ways of treating python-isms
Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...
if not df: # Check if DF is empty
    return ...

for row in df: # Iterate through the rows of a DF
    row.date = datetime(row.year, row.month, row.day, ...) # Create a new column in the row based on the row's data.

subset = [a for a in df if some_condition(a)] # Do simple filtering
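For what it's worth, pandas does have spellings for each of these operations; they are just different from the Python-isms above. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({"year": [2016, 2017], "month": [1, 2], "day": [3, 4]})

# Emptiness check: "if not df:" raises a ValueError; .empty is the idiom.
if df.empty:
    raise SystemExit("no data")

# New column from other columns: vectorized, instead of row-by-row mutation.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Simple filtering: a boolean mask instead of a list comprehension.
subset = df[df["year"] > 2016]
```

None of this excuses the inconsistency being complained about, but knowing the three idioms (`.empty`, vectorized column assignment, boolean masks) covers a large share of day-to-day work.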
Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.
3. All these libraries separate logically grouped concepts.
Let's say I have time series data from 10 sensors.
class SomeMagicalSample:
    def __init__(self, a, b, c, d, ..., occurred):
        self.a = a
        ...
        self.occurred = occurred
With this code I can generate very complex filtering, combinations, and whatnot. Things like extracting "real" meaning from measured values become easy to express.
    def get_magical_scalar(self): return ... some interpolation ...
    def is_some_magical_type(self): return ... some check ...
Now I can use my already tried and true reduction and application.
sum(map(SomeMagicalSample.get_magical_scalar,
    filter(SomeMagicalSample.is_some_magical_type, samples)))
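The pattern above runs as plain Python with no libraries at all. A self-contained toy version (the class, field names, and formulas are placeholders for the "magical" ones sketched above):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    value: float
    kind: str

    def get_scalar(self):
        return self.value * 2.0  # stand-in for "some interpolation"

    def is_interesting(self):
        return self.kind == "magic"  # stand-in for "some check"

samples = [Sample(1.0, "magic"), Sample(2.0, "noise"), Sample(3.0, "magic")]

# Unbound methods work directly as the map/filter callables.
total = sum(map(Sample.get_scalar, filter(Sample.is_interesting, samples)))
```

The appeal is that `get_scalar` and `is_interesting` live next to the data they interpret, so any pipeline that consumes `Sample` objects can reuse them.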
Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid this style of organization. I'm instead forced to do something like this.
a = [...]
b = [...]
c = [...]
d = [...]
....
occurred = [...]
Then I have to jump through hoops to keep all of this data in the same order and shift it around together.
4. Because everything is meaningless lists of numbers there are no ways to reuse code.
Most of the code I have written to show off a single value over time, or to pull some data out of other data and visualize it, is never going to be used again. Unless I want to look at this exact same thing, this code will not be useful. If there were some way to pass objects around, hide the internals, and process them independently of their meaning, then this would not be the case.
The one case where this was not true in the past few days was when I rendered a model's prediction into a pcolormesh and drew it onto a basemap. By passing it a basemap, it will automatically find the place to generate data for with the model. This was an undocumented feature that I had to read the source of basemap to discover was possible (pulling the top-left and bottom-right lat/lons from a basemap regardless of projection).
Maybe these warts just hurt for a little while? Do these go away? Are there alternatives that can handle >10 million data points? I don't have a good analysis framework setup for the work I'm doing. Maybe this is the issue. I don't even know what a good analysis framework would look like.
[1] - https://matplotlib.org/api/lines_api.html#matplotlib.lines.L...
You might like [Agate](http://agate.readthedocs.io/) better.
I haven't done a ton of Jupyter in the newsroom yet, but what I've found myself doing is abstracting out the stuff I want to do in normal Python into one or more utility modules and having those return dataframes into my notebook. That way I can mostly write normal Python but have access to some of the nicer pandas features and get to do more exploratory work.
I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.
I recently started a project that I got to write from the ground up by myself. I was happy with the processing side of things. I was very sad with the data I was getting in and putting out. There's some impedance mismatch that doesn't need to exist.
> You might like [Agate](http://agate.readthedocs.io/) better.
I looked at the front page and definitely wasn't enjoying what I was seeing. It, at first, looked like more complexity piled up on top of things that don't need it. Then I saw this link: http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html#l...
This is definitely worth a try. Much closer to what I was thinking.
> I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.
Sadly, in my field matplotlib is the professional tool (hah!). The end goal is the matplotlib plots. I'd be all for tweaking things in a design program and putting it up, but I'd be upset with myself.
My end goal is to have a single script in a repository that installs, runs, and then compiles my papers. I don't want anyone to need to look at sub-standard copies of my plots. I want anyone to be able to jump in and check my work and create derivative works.
Sadly this is not common in science today, so there aren't really good tools for this sort of thing on the composition side. Even worse, plotting isn't common in the computer world, so tools for that don't exist either.
I use SAS for this in my day job. It's not a free program, but it's powerful for this type of stuff.
I typically use SQL queries (via SAS's proc sql command) to manipulate and process my data, but you can also programmatically manipulate your data sets using SAS's "datastep" language.
SAS has support for macro expansions, which make some of your examples (like manipulating 10 sensors at once) pretty trivial. But this is getting into programming-language territory; I would not expect someone new to or unfamiliar with programming to grasp all of this intuitively.
edit: Here's some code I have in production that counts how many (of 8) sensors are reading high in a given time frame.
array aads (*) TP_AD1_TOP_STACK_TC1 -- TP_AD1_TOP_STACK_TC8;
NO_AD1_TEMPERATURES_HIGH = 0;
do j = 1 to dim(aads);
    if aads(j) gt 160 then NO_AD1_TEMPERATURES_HIGH = NO_AD1_TEMPERATURES_HIGH + 1;
end;
The downside is that SAS is a commercial package and it is not free. I have heard a lot of good things about "R", which is supposedly quite similar, but have not had the opportunity to use it myself.
Case in point, your production SAS code could be replaced with this Pandas code (and the R code would look very similar):
temperatures[TEMPERATURE_COLUMNS].apply(lambda t: (t > 160).sum(), axis=1)
or, if your data is in proper long form:
data.groupby('time').temperature.apply(lambda t: (t > 160).sum())
SAS looks good though. I've looked at it many times, and it is a clean solution if you really are in the "big leagues".
In my former team, we used SAS for a while and once I introduced the team to Pandas, they happily ditched SAS.
This part is a gotcha, but it's also a reflection of the fact that allowing `if` checks for anything other than emptiness leads to subtle bugs (there are long mailing-list posts about it and about the bugs that were uncovered). See here for some explanation of why numpy does it: https://github.com/numpy/numpy/issues/8622
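Concretely, the ambiguity is that a multi-element object has no single obvious truth value, so pandas makes you say which one you mean. A small demonstration:

```python
import pandas as pd

s = pd.Series([True, False, True])

# bool(s) / "if s:" raises ValueError: the truth value is ambiguous --
# did you mean "any element is True", "all elements are True", or "non-empty"?
try:
    bool(s)
    ambiguous = False
except ValueError:
    ambiguous = True

# Instead, each meaning has an explicit spelling:
any_true = s.any()    # at least one element is True
all_true = s.all()    # every element is True
is_empty = s.empty    # the Series has no elements
```

Forcing the explicit form means a mask that happens to contain one `False` can't silently flip the meaning of an `if` check.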
> Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators").
What makes Pandas so great is that you can apply arbitrary functions to rows and columns, with the full expressivity of Python. In some cases it might be clunkier (though you should almost never need `.loc` and other indexing methods) but mostly it's just `df.groupby(...).apply(...)` or vectorized methods like `df.column + df.other_column`. This is a huge improvement over having half of your analysis in database queries and half in a programming language.
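A small illustration of both styles mentioned above, on toy sensor data:

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "reading": [1.0, 3.0, 10.0, 20.0],
})

# Vectorized arithmetic over a whole column, no indexing gymnastics needed.
df["doubled"] = df["reading"] * 2

# Split-apply-combine: run an arbitrary Python function once per group.
means = df.groupby("sensor")["reading"].apply(lambda r: r.mean())
```

The `lambda` here could be any Python function, which is the claimed advantage over pushing half the analysis into SQL.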
> Matplotlib is unintuitive and poorly documented
Try https://seaborn.pydata.org/ for statistical graphics.
> Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.
This sucks but is unavoidable, because Python does not have fast data types with support for missing values built in, so all your columns would have to be of mixed type (the actual type + None) and everything would slow down and simple things like computing the mean of a column with missing values would not work.
Note that you don't actually "need to go back and forth" because Pandas will happily convert plain Python objects to their Numpy equivalents for you.
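You can see the trade-off directly: a numeric column containing a missing value is promoted to a NumPy float column with NaN, rather than being stored as a slow object column of int-or-None:

```python
import pandas as pd

with_missing = pd.Series([1, 2, None])

# pandas stores this as float64 with NaN, so vectorized ops stay fast,
# and aggregations skip the missing value by default.
dtype_name = str(with_missing.dtype)
mean = with_missing.mean()  # NaN is excluded from the mean
```

So the "two versions of everything" cost buys fast columnar storage plus missing-data-aware aggregation, which plain Python lists of `None` can't offer.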
> 3. All these libraries separate logically grouped concepts.
It's not functional, you're just going to have to deal with that. But split-apply-combine and similar patterns are quite elegant in Pandas: http://pandas.pydata.org/pandas-docs/stable/groupby.html
> 4. Because everything is meaningless lists of numbers there are no ways to reuse code.
A lot of data analysis is throw-away code. Some of it can be abstracted into reusable code, some of it can't.
Lastly, don't forget that Python does have a lot of things going for it when it comes to data analysis, from geospatial tools (http://toblerity.org/shapely/) to Bayesian modeling (http://pymc-devs.github.io/pymc3/index.html), as well as interactive coding with Jupyter and Hydrogen for the Atom editor (https://github.com/nteract/hydrogen).