I use Python (NumPy/SciPy) for most of the data preprocessing, and perhaps that's why. I used to do this in R, and I realized that it's just a lot easier to get done in Python (and it ends up being faster anyway). The problem is that Python/NumPy/SciPy still doesn't lend itself quite as well as R does to certain aspects of the statistician's use case. It's possible that things have changed since the last time I evaluated the two, but I still find it easier to prototype various models in R, even if I do all of the prerequisite data munging in a different environment.
I understand that R, like Perl, is 'blessed' (pun intended) with two different, incompatible type systems - in fact, this is the reason I avoid using R's type system, and whenever I'm advising newcomers, I always recommend the same. I don't write statistical packages, so this doesn't come up, but when I find myself needing to write a method in R, I ask myself if this would actually be done more easily another way instead. Generally, I find the answer is 'yes, yes it would'.
I really do think the problem is the type system. The kind of type system that lends itself well to data manipulation is not the same type system that lends itself well to model manipulation - when I think about it, I've unconsciously segregated my workflow into two parts, doing everything naturally done with Python's type system in Python, and likewise for R. Maybe that's just the way that I happen to approach data manipulation, but I think it's non-coincidental. R's relative homoiconicity (compared to Python) makes it really nice for some things, but there are other warts with its typing that are just too annoying to work around when a Python shell is just a few keystrokes away.
I guess the answer is (as always!) to use a purely homoiconic Lisp dialect, so you get the best of both worlds, but that's asking a lot of statisticians.
I really have come to love R for what it does do, though. Of all the statistical software packages I've seen (comparable: SAS, SPSS, Stata, MATLAB), it's far and away the best (and the GNU license makes it very, very attractive to broke students looking to avoid the still-absurdly-priced student licenses for the alternatives). That said, I still sigh every time I realize that I'm essentially gluing together two separate runtime environments for something that should really be easily integrated. I do what I do now because it ends up being faster than using either Python or R for everything, but it still strikes me as weird that a language so perfect for munging data (Python) can still be so awkward for analyzing it, and vice versa.
An example, from this week: I have a bunch of CSV data files from various trials of an experiment. I want to combine them into one data frame with a new column that holds an id for the trial. This took me about a half hour to figure out in R, and five minutes to write in Ruby.
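For anyone curious what that task looks like outside R, here is a minimal sketch in Python using only the standard library. It is not the Ruby I actually wrote; the function name `combine_trials`, the column names, and the in-memory "files" are all made up for illustration.

```python
import csv
import io


def combine_trials(trial_files):
    """Concatenate several CSV tables into one list of rows.

    trial_files maps a trial id to an open text file (or any iterable
    of CSV lines). Each output row gains a "trial" key holding that id.
    """
    combined = []
    for trial_id, handle in trial_files.items():
        for row in csv.DictReader(handle):
            row["trial"] = trial_id  # the new id column
            combined.append(row)
    return combined


# Hypothetical stand-ins for two trial files on disk.
trial_1 = io.StringIO("subject,score\na,1\nb,2\n")
trial_2 = io.StringIO("subject,score\na,3\n")

rows = combine_trials({"trial_1": trial_1, "trial_2": trial_2})
```

With real files you would pass handles from `open(path, newline="")` instead of `io.StringIO`. Note that `csv.DictReader` leaves every value as a string, so numeric columns still need converting.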
I think the main problem with R is that there's a different way to do everything. It feels like a language that was not so much designed as gradually evolved. In a functional-ish language like Ruby or Python you have a few workhorse data manipulation tools: map, fold, etc. But in R everything is different depending on whether you're dealing with row vectors, column vectors, data frames, or arrays. It makes it hard to generalize over slightly different problems to find common solutions.
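To make the "few workhorse tools" point concrete, here is a small Python sketch (the data and variable names are invented for illustration). The same `map` and `reduce` calls work on a flat list and, unchanged, on each column of a table stored as a dict of lists.

```python
from functools import reduce

values = [1.5, 2.5, 4.0]

# map: apply one function to every element.
squared = list(map(lambda x: x * x, values))

# fold (reduce): collapse a sequence into one value.
total = reduce(lambda acc, x: acc + x, values, 0.0)

# The same fold, reused per column of a hypothetical table.
columns = {"height": [1.0, 2.0], "weight": [3.0, 5.0]}
column_sums = {
    name: reduce(lambda acc, x: acc + x, col, 0.0)
    for name, col in columns.items()
}
```

In everyday Python you would reach for `sum()` or a comprehension rather than an explicit `reduce`, but the shape of the solution stays the same across container types, which is the property I miss in R.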
Julia looks really awesome, though, and I'm excited to see something that might be able to replace R and bring all of this comfortably into one language.
I don't know how much you know about the history of R, but you're spot on about that.
For what it's worth, Julia is homoiconic and, "underneath" the Matlab-like syntactic exterior, quite a lot like Scheme.
Licensing isn't just a minor thing - getting Matlab to run on non-Debian Linux is a painful ordeal. I never actually got it working, because I never bothered to debug its cryptic error messages, and since it's distributed as a precompiled binary, I wasn't going to sit around trying to patch it. A corollary is that R is easier to integrate into other toolkits, and there are a ridiculous number of freely available R libraries that make your life easier.
My issues with Matlab may be things that someone familiar with the language would care less about. That said, I find Matlab to be incredibly, incredibly irritating, and I think that's because its design is tailored towards people with minimal experience with other programming languages (like research scientists), whereas R's design is simply based off of S - so I find it violates the Principle of Least Surprise less. Matlab is not like Lisp or Haskell (where the journey of understanding the language is valuable in itself) - it's really just a means to an end (number crunching), so the POLS is especially important.
R, unlike Matlab, imposes almost no restrictions on the structure of a program. The way I see it, Matlab makes Java's broken one-class-per-file model even worse, by imposing more filesystem-level restrictions on my program.
R, unlike Matlab, uses a type system that's more familiar to someone used to programming with multiple datatypes, as opposed to someone used to thinking in terms of strictly numerical structures. I never got the hang of when I should index with () or {} or [] in Matlab ... I'd have to look it up to tell you. R, on the other hand, is more like Python in this regard - even if it's not quite as clean as Python, it makes basic things like importing/manipulating CSVs much easier than Python (or even Excel, which is designed around that exact purpose).
R, unlike Matlab, returns the last value computed, not the last values with the same local names as the return value names.
R, unlike Matlab, uses a more intuitive (to me) definition of dimensions (and of row- vs. column-vectors). I spent 80% of my time in Matlab figuring out how to get dimensions to match in a robust manner, and I've never had to do that in R.
You get the idea - my frustrations with the language itself are mostly with the fact that it's so unlike most other languages, and it's too much of a hassle to learn. My frustrations with the language environment are that the free alternative (R) is much easier to work with, and much more cross-platform.
Thanks!
I am a heavy Python user, but when I use NumPy/SciPy I don't feel like I'm using Python much anymore, so at that point I switch to R (or Fortran)... though I'm quite optimistic that at some point the pandas DataFrame can become my default storage structure from which I can parse out R tasks through Rpy, SQLite, HDF5, or possibly Redis.
matplotlib is very verbose though; I almost prefer Matlab's graphics model... though less so than R's basic and lattice graphics.
As much as R may be capable of, I just can't get past how inconsistent and complicated its basic types are.
vector: this one is clear based on the name; it's a homogeneous sequence (with very aggressive type conversion). A sequence of strings, a sequence of numerics, etc. One thing worth knowing is that there are no scalar types, so c(1) == 1. That is, the value 1 is identical to the singleton vector containing 1. Also the empty vector c() is identical to NULL! is.null(c()) == TRUE. Weird.
list: the name is confusing, but I think of it basically like a dict in Python. And the syntax is the same: list(a=1, b=2) vs dict(a=1, b=2). I think you can use it like a sequence as you are saying, but I never use them that way. Lists are for ad hoc composite types -- if I want to return 2 values from a function, I return a list() of them. I think you can convert lists to environments easily, or they are the same -- also similar to Python's dicts.
data frame: This is the core type AFAICT, it is basically a collection of named column vectors of the same length. e.g. data.frame(name=c("a", "b", "c"), value=c(1,2,3)). This seems pretty intuitive. A row has different types (like a DB relation) but the columns have the same type since a column is vector.
matrix: I don't use these too much, but it basically seems like a homogeneous type like vector, except you specify the dimensions.
array: I don't use this, but the R documentation says "A 2-dimensional array is the same thing as a matrix". So I think I am confused and what I typed above is an "array", and matrix is the special 2D case. Yes, the names are bad. I think of a matrix as having an arbitrary number of dimensions (e.g. in Matlab).
I think where it gets confusing is that there are all these arbitrary conversions. And you can use things in more than the prescribed ways, so you might stumble across code that uses them wrong. But after a fair amount of R programming, that is my mental model, whether right or wrong :)
I think a lot of the mess comes from the fact that dealing with real data is just messy. R takes the mess and makes the common case convenient, and people like that. But it's like Perl in that it's a "Do what I mean" language and tries to guess a lot, rather than "Do what I say" like Python. And when it's guessing your intent wrong it can leave you very frustrated, as with Perl.
Two things:
1) A data.frame is in fact a list of vectors of the same length "compacted" together.
2) I find the types very "sensible" for a person doing statistics. But I guess (almost) everything makes sense once you get used to it...
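Point 1) can be pictured with a rough Python analogy (a sketch only, not how R implements it; `make_frame` and `row` are hypothetical helper names): a data frame behaves like a mapping from column names to equal-length sequences, and a "row" is just the i-th element of every column.

```python
def make_frame(**columns):
    """Bundle named columns, insisting they all have the same length."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return dict(columns)


def row(frame, i):
    """Read one row by taking element i from every column."""
    return {name: col[i] for name, col in frame.items()}


# Mirrors data.frame(name=c("a", "b", "c"), value=c(1, 2, 3)).
frame = make_frame(name=["a", "b", "c"], value=[1, 2, 3])
second = row(frame, 1)
```

Unlike an R vector, a Python list does not force its elements to share a type, so the per-column homogeneity is only a convention in this sketch.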