R, the master troll of statistical languages (2012) (opens in new tab)

(talyarkoni.org)

184 pointsGoldenromeo10y ago148 comments

148 comments

99 comments · 29 top-level

mziel10y ago· 22 in thread

The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.

For example complaining that R is slow and then writing iterative solution instead of using vectorization. When I saw the example the author gave my first thought was "sapply/lapply". Lapply is essential to the R use, and is being taught early on in every book/course on R I've ever saw.

"In 2012, I’m the kind of person who uses apply() a dozen times a day, and is vaguely aware that R has a million related built-in functions like sapply(), tapply(), lapply(), and vapply(), yet still has absolutely no idea what all of those actually do. "

todd810y ago

It's been a few years since I really looked at R, but I don't think the problems with R are simply that people don't learn the language. Some languages are simply not as good as others. We can all learn more about the tools we use when programming, I know that I certainly could. But this doesn't make it our fault that a language is tricky or hard to debug or hard to understand. If we worked at it, I suppose we could all write more efficient programs by using assembler, but that doesn't mean that assembler is the best possible programming language for, say, statistical programming.

Someone, Ross Ihaka, that knows a thing or two about R wrote a short post 6 years ago and said "simply start over and build something better". Take a look:

http://www.r-bloggers.com/“simply-start-over-and-build-somet...

My hope is that Julia will eventually be adopted as a basis for a future statistical programming language.

sdenton410y ago

From your link, hilariously relevant to the blog post at hand:

"First, scalar computations in R are very slow. This in part because the R interpreter is very slow, but also because there are a no scalar types. By introducing scalars and using compilation it looks like its possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler."

todd810y ago

I should point out that Ross Ihaka along with Robert Gentleman created R.

TheLogothete10y ago

The push for R is enormous. It's not getting replaced.

jlebar10y ago

> The problem is people using R without trying to learn about the language itself

It's not the user's fault.

Like, congratulations on being better at R than the author of TFA. Maybe you're smarter than him, maybe you've put in more time learning, maybe you've just spent your time more intelligently, maybe you lucked out and bought better books...who knows.

But this line of reasoning completely misses the author's point, which is that despite having used the language for years, he still finds it inscrutable. "It would be easier if you were better at R" is a tautology, and unhelpful. The issue is that the author finds it hard to become better at R.

We can disagree as to whether or not it's objectively hard to become better at R, but this is a perfectly valid criticism to make. It's not the user's fault.

baldfat10y ago

Its a 4 year old article and R has changed a TON with new things BUT.... R has grown a ton in users as well as features.

R really is a functional programming language that people don't take advantage on. All languages have strength and weaknesses and YET the complaint is R has too many ways to do any one thing which allows us to have data.table, dplyr, ggplot2, magrittr (pipping %>%). [EDIT RStudio and RServer are also a big example of R growth in features and quality]

As I learned R my code has changed dramatically and I think R has one of the largest gap between the code you start with and when you are proficient. My starting R code is really embarrassing.

jcizzle10y ago

I think the person you are responding to is simply saying some things are more complex than others and require more understanding and experience. R apparently falls into that category. If the author wants to gain that experience, I think the time spent on this blog post may be better used to reading a book on R. Fault might be a strong word, but the author has certainly made a decision on what they spend their time on and the results are as expected.

1 more reply

x0x010y ago

It's kind of amazing to see someone admit to spending hundreds or thousands of hours using R, yet refuse to spend a couple hours learning the language a little better. Whining that your tools are hard without investing any effort in them is just dumb.

The R help even comes with code samples that you can run!

Fomite10y ago

R is, I think, an interesting language because it's heavily used by people who would not otherwise learn a programming language. If you compare R not with other programming languages, but with other ways of working with statistical data, this makes far more sense. I don't actually "know" SAS in the way I know a programming language - I know the commands I invoke to do what I want it to do.

Similarly, I encounter lots of people using R who don't actually know what a function is, just that lm(x~y) gets them what they want.

3 more replies

copperx10y ago

Surprisingly, this is true for many programmers of general-purpose languages.

kbenson10y ago

> The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.

I think you just explained Perl in a nutshell (err, the idiom, not the book). It seems whenever a language supports enough idioms of the usual C-like languages people will gravitate towards those, likely due to the high population of people that know those idioms and can fall-back on them without having to think too hard. I doubt Lisp has as much of a problem of people trying to write C in Lisp.

crispyambulance10y ago

I sympathize with the OP and also feel frustrated with R (and I say that as a regular R "practitioner").

Part of the problem, I think, is the built-in documentation. The typical R user is a domain-expert just trying to get some work done. Occasionally, they'll get stuck and try something like "?sapply". What appears is usually a terse, confusing mess that takes a VERY LONG TIME to digest and is the LAST THING you want to read when you're trying to make a living solving a problem other than understanding R documentation.

Below is the "Description" for Apply (which you get when you try ?sapply). Does it _really_ explain the essentials of what you need to use "apply"?

"... lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).

vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use. ..."

mziel10y ago

YMMV, but I think this is pretty clear:

"lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X" == lapply is a map() construct that takes a list and a function

"sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array(). sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f)." == "sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f)"

1 more reply

chrisseaton10y ago

> For example complaining that R is slow and then writing iterative solution instead of using vectorisation.

But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?

YeGoblynQueenne10y ago

>> But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?

R is interpreted and dynamically typed, so when you declare a variable, the interpreter has to do some bookkeeping to figure out the type of the variable, allocate memory for it and so on.

If you write a loop by hand, the interpreter has to do this bookkeeping once for each iteration.

If you write your code in vectorised form, the interpreter can sort out the bookkeeping once and then hand over to the lower-level code (C or Fortran) the vectorised functions are interpreted in.

This can also be further optimised to take advantage of processor vector instructions, parallel processing etc.

So I'm afraid we can't have our pie and eat it. If we want an interpreted language with somewhat intuitive notation, then it has to have crappy slow loops. If we want a language with fast loops we have to rely on C or Fortran and forget about vectorised notation.

3 more replies

bachmeier10y ago

Keep in mind just how old the language is, as it started as S in 1976. It was intended to be a glue language for Fortran and C.

Keep in mind also that it's easy to rewrite the bottlenecks (which are only a small part of most programs) in C, C++, Fortran and other languages including D. That may not be what any particular person is looking for, but that's traditionally the way things have been done.

Thriptic10y ago

Yea I feel like sapply, lapply, and mapply will cover most of what people need to do. Hell I've personally only really ever needed sapply as I don't work much with lists.

TheLogothete10y ago

In my experience a lot of the claims that R is slow are greatly exaggerated and made by people who don't actually use it. Kind of an echo chamber. Every time I see someone say chose Python instead because of speed, I roll my eyes.

Fomite10y ago

My usual take on this: "Between R and Python, the faster language is likely whichever library author actually wrote most of their code in C or FORTRAN."

1 more reply

existencebox10y ago

It's funny you bring up python; I say this not as a comment on your thesis, but related, since I often hear the "python is slow" trope but that's only half true, you can typically write python that is plenty "fast enough" (As a day-jobbing data pipeline engineer) if you're implementing with an understanding of what things will drive you into the mud. This goes beyond just understanding the tool you're using, fundamentally writing an O(N^N) or something is going to hurt even if you're in C#/C. Have seen that plenty, frankly, more than I've seen "legitimately slow python"

Anyway, this was just a thought rolling around in my head given the discussion.

1 more reply

50CNT10y ago

I for one write all my statistical code in baremetal assembly. I manage about 5 a year, but they all run very quickly. There is no such thing as premature optimization.

bachmeier10y ago

> The problem is people using R without trying to learn about the language itself

I see a lot of blub when I read posts about R. So much so that I start with the assumption that any post about R is a blub post.

louden10y ago· 10 in thread

R is a language with a lot of gotcha's. I usually get burned by characters being converted to factors in read.csv() and converting factors to numeric (it works, but not how you intend). The R Inferno (http://www.burns-stat.com/documents/books/the-r-inferno/) has a lot of other gotcha's and is worth a read for people who use the language.

That said, the power, flexibility and user community make it my go-to for any first crack at an analysis of data.

VLM10y ago

What makes R infuriating is most of the complexity and gotchas aren't inherent to the problem.

So you learn Clojure and you inevitably meet the collections. And it takes like 5 minutes to tell you how to map and how to reduce and then the lecture ends with "and it just works". And in fact it does just work.

Then you learn functional R and the first five minutes are the same as the Clojure experience. Then the slow motion train wreck starts and "And R likes mapping so much, we have nine microscopically different apply statements for list and tables and they input some things and output other things and if you pick the wrong one the failure looks like the Trinity nuclear test but more impressive". Every R language lecture is like that, five minutes of how real languages do it, then the rest of the 45 minutes is endless pitfalls and accidents. Its like a 45 minute long fever dream or nightmare "... and if you accidentally tapply, table apply, to a list, then it coerces the input to ..." and drift back to Cthulhu, or maybe away from, whatever.

Pragmatically if you teach R as a statistical analysis language what looks weird often enough turns out to be super convenient. But if you try to teach and learn R as a general purpose computational language, you wonder if its a joke and nobody would actually use Intercal or BF to run analysis, would they?

Its a very powerful system in spite of the language. Think of PC hardware architecture going back to the old XT days, its sinfully ugly, but its quite capable. R is no PDP-11 or VAX, thats for sure.

mziel10y ago

Factor is one of the worst thing of R-world. I don't recall ever needing factors, yet they creep in with many functions (read.csv, cut).

Btw there's a nice readr package (from Hadleyverse) that has a read_csv method that does away with factors by default.

VLM10y ago

You should use factor for data cleaning and verification.

So you have "sex" on the questionnaire, and factor will very quickly identify contamination such as "often", "not yet", various mis-spellings, etc.

stewbrew10y ago

How would you represent categorical data then? R's primary use case isn't text processing. And HW isn't always right.

1 more reply

th0br010y ago

You will (inevitably?) run into factors when importing data from SPSS files... sure, you can discard them upon reading... but are you sure you don't want access to the value labels in the future?

lottin10y ago

Factors are weird because no other language has anything like it, but they are actually a quite clever way to group data. It just takes a while to get used to them.

Fomite10y ago

I actually use factors a fair amount, and having factor-like data shoved into numeric values gets you to some bad places statistically.

grayclhn10y ago

You must not do a lot of regression with categorical data, then. I use commands like `lm(y ~ (x1 + x2) * factor_variable, data = d)` and `xyplot(y ~ x1 | factor_1, groups = factor_2, data = d)` all the time.

1 more reply

gbrown10y ago

Factors are great, and surprisingly powerful even outside of statistical computing. With that being said, I prefer to create them on purpose rather than having read.csv attempting to be helpful.

stewbrew10y ago

A factor already is a vector of numeric values, which happen to have names.

wanderfowl10y ago· 6 in thread

I'm a Post-Doc in a small social sciences department in a major university, and am probably the department's ranking R-geek. I did my dissertation, and much of my current work, doing modeling, analysis, and even machine learning in R.

In many ways, I owe much of my success to the power that R has allowed me to wield. Multicore lapplys and ggplot2 are my life these days. But even with this, R drives me absolutely batty, and the documentation, even battier.

I may be competent relative to most, but R feels so taped-together and idiosyncratic that even on my best days, I just feel like a newbie who's built up an army of ugly hacks.

Someday, I'll learn more about the python stats tools and do my stats there. But for now, R it is. Troll on, you crazy bastard.

mojoe10y ago

My data science group is currently transitioning to Python-based development from R. It's actually amazing how much faster Python development is, because of two things:

1. R libraries (and even rarely the R interpreter itself!) tend to have really weird corner case bugs that crop up every couple months, and

2. It's REALLY easy to write unmaintainable code in R, and so strange cruft creeps into the code over time.

The Python interpreter and Python statistical libs are rock solid in comparison, and with it we don't spend weeks debugging things caused by unnecessary idiosyncrasies. I just wish we'd started switching sooner and saving our time.

nothis10y ago

I just wonder if there's rationality behind not changing this (by either switching to a different language, fixing it or, hardcore mode, create a new one).

I know from a very, very different field that you often have to deal with decade(s) old technology because your employee/professor/etc is just used to it and, 15 years ago, it simply was the best option. I guess it's an equation that puts time spent learning its quirks vs time saved using a more sensible tool. While tiresome, "learning" might actually be the faster way to get things done. But it also carries so much ballast that, if there's a better alternative, must waste millions of frust-hours (and actual errors), especially for newcomers who could just as well learn a new tool - faster.

wanderfowl10y ago

I started in SPSS. This is so much better. And although it drives me nuts, I still spend the other half of the time working in R feeling like a goddamned wizard. So even if it takes an hour and a half some days to figure out how to turn a list of unique factors into a list for lapplying (or something else stupid), it's not bad enough.

Also, remember that I'm the R geek around. Whether it's the best tool for the job or not, in my field, R is the lingua franca for stats. I could swear off R and move to Python Stats, but I'd still be supporting R among colleagues and friends. It's hard enough to convince folks who grew up in SAS to move to R, let alone to learn Python.

Finally, I'll have a hard time convincing editors that some weird-ass python implementation of GAMs or LMER is kosher when they're barely OK with the idea of GAMs. Reviewer two is, shall we say, technologically conservative.

avs73310y ago

I am in an almost identical situation, just still in the PhD stage.

I moved from Matlab (originally a mechanical engineer) and the biggest shock was documentation and just the internal help stuff in general. The help files on a regular basis require you to understand how something works to understand the thing explaining how it works.

I am often just shocked at little quirks I find trying to do things in R, not that it is worse than SATA of SAS but the goals was to be better. I am all for FOSS, and R provides many extensive capabilities not available in Matlab, but in terms of being user friendly Matlab is so superior it is honestly sad. Oh and '<-' just drives me nuts...I will never understand the choice of two characters where one is entirely sufficient.

S4M10y ago

> Oh and '<-' just drives me nuts...I will never understand the choice of two characters where one is entirely sufficient.

It's true that '<-' is a strange choice, but you can use '=' for variable assignments as well.

1 more reply

jghn10y ago

Iirc it comes from the old APL keyboard

2 more replies

nathell10y ago· 5 in thread

I have a love-hate relationship with R, being a predominantly Clojure (and Ruby these days) programmer who only occassionally dabbles in data crunching.

The apply/sapply dichotomy that the article mentions (actually a hexachotomy, there are also mapply, sapply, tapply and vapply) is one example of a gazillion warts that the language has.

Another random one: R has a useful function, paste, that concatenates strings together. Only it takes varargs, not a character vector, so if you have a vector v of strings, you have to use do.call(paste, v). Only not, because do.call insists that its second argument be a list, not a vector, so you do do.call(paste, as.list(v)). And if you want to separate the strings, say, by commas, you have to affix the named argument sep, obtaining do.call(paste, c(as.list(v), sep=",")).

And R's three mutually incompatible object systems. And so on and so on and so on.

There are things to love. The packaging system works really well. I like the focus that R puts on documentation: hardly anywhere is it so comprehensive, with vignettes and all. There are things plainly inspired by Lisp (R is just about the only non-Lisp I know that has a condition and restart system akin to CL). And ggplot2 is one hell of a gem of API.

In many ways, R is the PHP of data science. (Though the core language's still nowhere near as abysmal as PHP.) Despite all the warts, there are all sorts of statistical analyses that are just a package.install() away. Put another way, R is to data science what LaTeX is to typesetting. It's a heavy pile of ducttape, but it's here to stay because it's just so damn useful.

jonchang10y ago

See, this is another interesting example of the kind of behavior described here: https://news.ycombinator.com/item?id=11113042

People who don't take the time to learn the language are having to go through these contortions to make R work the way their favorite language works, rather than just taking the time to learn how R works!

    R has a useful function, paste, that concatenates strings together.
    Only it takes varargs, not a character vector, so if you have a
    vector v of strings, you have to use do.call(paste, v).

But the help for the `paste` function literally goes over this exact situation:

    > v <- 1:5
    > paste(v, collapse = ",")
    [1] "1,2,3,4,5"

I'm often super baffled by the lengths people will go to not figure out how to use R and insist on writing <X> language in R.

nathell10y ago

Thanks for pointing this out. I overlooked it, presumably because it's in the last paragraph of "Details" and not illustrated in any example.

I still maintain there's a wart in what I'd described, which is `do.call` not accepting vectors as the second argument. Also, `collapse` is idiosyncratic: I have to remember a special knob for every function that has a vararg and non-vararg flavour.

You raise the point of taking the time to learn the language, and I acknowledge this. Yet, as an occassional user, this is precisely what I'd like to avoid. When working with R, I'm pragmatic: what I'm after is a working solution to the problem at hand, rather than its most succinct or elegant formulation. When I find one, I move on. In production code this would incur a technical debt, but due to R's exploratory nature, this is typically not much of a problem. Had the language been more consistent, it would take less time to learn it thoroughly.

1 more reply

sooheon10y ago

I wonder what disagreements or counterarguments downvoters might have, and if they might share them.

For the parent, given what you've observed, do you still go to R for data crunching, or have you found anything in Clojure land that measures up?

nathell10y ago

R. Or a mixed approach, with Clojure for data preprocessing and R for the analysis proper. Case in point: I wrote the Clojure scraping library, Skyscraper [1], and made it output CSV by default so as to be able to easily drop the resulting files to R.

For statistics, Clojure has Incanter, but it's very basic in comparison. There are easily usable Java libraries for certain tasks (MALLET comes to mind), but these are few and far between.

[1]: https://github.com/nathell/skyscraper/

tanvach10y ago

You can also use paste( v, collapse =",")

tmalsburg210y ago· 4 in thread

Writing a variant of this article has become a rite of passage for all serious users of R. There are two issues that contribute to the difficulties people experience with R. First, yes, R can be confusing at times. Tal explains this really well, but only scratches the surface. There is so much more confusing and counter-intuitive stuff, for example with regards to factors that only very few people seem to understand fully. However, there is a second issue, and this is less often acknowledged: People expect R to be immensely powerful and at the same time easy to use, which is really not a very reasonable thing to expect. This attitude is fairly specific to the R community. No C++ developer would dare to write a long rant about the shortcomings of C++ while at the same time nonchalantly admitting that they never made a serious attempt to learn it. One symptom of this problem is that hardly any self-proclaimed R hacker has read Matloff's book "The Art of R Programming" which was for a long time, and perhaps still is, the only book on R programming. The mere fact that there is (or was) only one such book speaks volumes.

hudibras10y ago

I agree with everything here, including the praise for Matloff's book; it should be the very first book any serious R user picks up.

But Matloff is no longer alone: Hadley Wickham's Advanced R is now also a must-read for R programmers.

http://www.amazon.com/Advanced-Chapman-Hall-CRC-Series/dp/14...

hadley10y ago

And Advanced R is also available for free at http://adv-r.had.co.nz/

kgwgk10y ago

There are other books on R programming: John Chambers published "Programming with Data: A Guide to the S Language" in 1998 and "Software for Data Analysis: Programming with R" in 2008.

tmalsburg210y ago

No better way to elicit information from people than making outrageous claims. Thanks for those references!

th0ma510y ago· 4 in thread

My own personal rant, I think the specific feeling I get is the conceptual idea of R has long since outpaced the reality of R.

People like to fetishize data, and R sure lets you do that. The data science landscape however is growing such that R is really just a one-trick pony, however, that one trick is for better or worse being the gold standard of statistics and modeling, somehow.

But everything else wants to sugar coat the software surrounding the statistics, and leaves you no room to grow.

This is a very bad over-simplified example, but you sort of can't learn much about graphic design or good communication skills by using ggplot2 ... you can make something look very very nice, hopefully, in the general case, sure. And you can definitely do all kinds of hacks and crazy code to make it do whatever you want, but by doing that you produce ever more fragile and environment dependent code. You'd be better off learning just about anything else for graphics (Straight SVG, D3, Processing, Cairo directly, etc) because it is of course a bit more of a problem starting up, but a generalized skill set that could allow you to grow.

You also learn pretty much nothing about web development from Shiny. Shiny is a wonderful idea, but ultimately prevents a statistician from implementing what it promises, which is an analytic application. At some point, you have to ditch it and learn more traditional web stacks. It is also something of a sales funnel into a server solution that's a DDOS or security nightmare just waiting to happen.

So instead of just griping, I guess I have some ideas... it would be nice to have a Ruby/JS/Java/Python service generator. It would be nice to have a D3/React/whatver based generator. It would be nice for there to be a data munging solution (or even whole models, more like more PMML type stuff) that can be generalized into something that could be compiled or generates Python/Java/Bash/JS/Whatever code.

Ultimately you start thinking along those lines, and you realize that the promises R is making about empowering the analyst are just teasing them rather than helping.

R could do with less magic and more concentration on being simply a great statistics engine that integrates better. I guess it is that to some degree, but it sure fails the rest of the technology world that tries to live with it.

JasonCEC10y ago

I disagree on multiple fronts!

1) ggplot does exactly what it is supposed to do: create data visualizations. It made no promises for interactivity or display, and in fact, it was originally designed for creating publication quality charts, which it continues to do well.

1.5) ggvis is a D3 API wrapper on ggplot and allows for interactive graphics. Do you want to pay your data scientists for creating production ready graphics or let them focus on what they're best at?

2) R has been growing - outside of neural networks (which R needs to catch up on), R gets almost every pre-processing and modeling algorithm first, and distributes it for free. Furthermore, it has better sampling options, metric options, augmentation options, and model ensabling tools (stack or meta-model) VS any other language or framework - it is the gold standard.

3) I don't think there's any "magic" in R. It's just a language with a learning curve and lack of opinions.

4) Last point: R is really not built for the web (it's older than Python!) - its built for data science. There's no reason you need to run your modeling stack in the same language as your application server. R is perfectly capable of writing to databases or sending API responces in JSON or PMML.

/endrant

Not trying to start a flame-war - but this type of difference in opinion is important to see when thinking about hiring data scientists or deploying models.

th0ma510y ago

I sincerely appreciate your thorough and well reasoned response, and thank you for taking the time with it.

1) I agree, perhaps I was trying to allude to visualizations being more than charts. ggplot's charts are absolutely gorgeous and simpler to make than even I remember them being in Lotus 123 for DOS. However, there are some things in the periodic table of visualizations (http://www.visual-literacy.org/periodic_table/periodic_table...) that it can't do. And what about hybrid combinations in the same chart? Could I have a bar chart where the bars are also mini-spectrograms? I can instantly think of how to do this in SVG or Processing, but I'm not sure where to begin with ggplot ... maybe it is possible. Of course why would you?

1.5) I guess I don't want to pay someone else to do the custom thing in D3 that the statistician can almost do with their code, or try to get a regular web stack developer familiar with D3 to actually get it working the way the statistician says.

2) Yup, totally agree!

3) I think Shiny takes a lot of liberties and makes a lot of assumptions that users of it can't even express to me are important to them because using Shiny completely hides the underlying concepts of how it is implemented. I guess I would definitely call that magic. You're right there's quite a lack of magic in most of the language and packages, however.

4) I think data science can/should/does embrace the web. I think the modeling stack shouldn't be on the application server, but a trained model perhaps should be? I also wouldn't trust the stability of R for performance critical API calls without a lot of redundant instances and a lot of load balancing.

Anyway... the real problem is that you're also absolutely right. There is quite a bit difference in opinion between the tooling of an analysis effort, and the robustness expected by IT.

Thanks again!

mattkrause10y ago

I'd divide my work into three categories - Exploratory plots just for me. - Plots that I'm going to show to my boss or coworkers. - Things we're going to distribute to the whole world.

Plots in category #1 are often quick and dirty--I just want to see if an idea worked and don't really care about communicating that idea cleanly.

I could show these plots internally, but it often helps to clean them up a bit first. This avoids us getting bogged down in whether we should be comparing the red/blue lines here or the circle/diamond points there.

This is where ggplot shines--I can go from #1 to #2 with minimal effort. The final version usually still needs some tweaking, but only a small fraction of the plots ever get this far and some of this customization really needs a human in the loop (e.g., in Illustrator or something).

Similarly, while you can use Shiny as final product, it's actually great for letting moderate-sized groups play "what-if" with the data. It's certainly easier than sending them a huge powerpoint deck with "choose your own adventure" style instructions.

TheLogothete10y ago

Except that you miss a very important point - nobody cares about the stuff you listed. I build models and use shiny to create a front end for clients to interact with them. They are very happy and pay me very handsomely. I can assure you that this is the case across the board. R is for analysts, not for programmers. It seems like programmers feel intimidated, because analysts now code their solutions themselves.

superuser210y ago· 3 in thread

Most of my university classmates' first exposure to programming is using R in a statistics class. It's awful. I wish they'd make Python or something a prerequisite, so that giant swaths of people don't get turned off of computing or start with the strange ideas it teaches.

bachmeier10y ago

It sounds like you are arguing for imperative programming over functional programming.

superuser210y ago

R is a fine tool, but (like Java or C) not the best window for a beginner into the joy that programming can be. The syntax is pretty weird and its semantics don't align that neatly with broadly-useful ideas for reasoning about programs.

We run 3 different intro sequences in Python (for non-majors), Scheme (for most majors), and Haskell (for those who are already strong imperative programmers). They're all great.

GFK_of_xmaspast10y ago

What does that have to do with r vs python?

elcapitan10y ago· 3 in thread

Not intending to start a language war here, but if somebody who has experience with both R and Python/pandas/etc could answer - how's the current state of the emerging Python data/statistics ecosystem compared to R? (not counting all the other differences like R being allegedly weird or Python more general purpose and so on).

Fomite10y ago

Also assuming you don't want to count things like "You can call R from Python, and Python from R", here's my take on it, as someone who uses both languages:

- Pandas has helped Python tremendously, but I don't think it's quite to where the R data frame is.

- For 90% of what someone who wants to do statistics wants to do, it honestly doesn't matter at all. You can do nice data visualization in both. You can fit most generalized linear models in both.

- At the cutting edge, R still takes the cake. Odds are if someone has developed a new method (especially outside machine learning), it's in R before it's in Python. Your local university's statistics department is likely running R (or SAS), not Python.

TheLogothete10y ago

We did an evaluation recently. Not even close.

elcapitan10y ago

Is that evaluation somehow public, or could you share some details?

capnrefsmmat10y ago· 2 in thread

My biggest complaint about R isn't the inconsistency and obtuseness -- I've been using it long enough to get familiar with the documentation and the zillions of varieties of apply. My problem is the data structures.

R has only a few core data structures: vectors, lists, arrays, and matrices. Data frames are built on top of lists, and admittedly data frames are incredibly useful for statistics -- there's a reason pandas exists, and a reason data analysis is much more tedious in other languages.

But there are no hash maps or sets (lists have named elements, but with O(n) indexing; the only hash tables available use environments and accept limited types of keys), no tuples, no structural or record types, stacks and queues only recently became available on CRAN (through C), and so on.

This leads to the folk belief that the only way to optimize R is to vectorize code or to write in in C or C++ (with Rcpp, for instance). No statistical programmer ever thinks about choosing the right data structure for the job, since you basically only ever use lists and data frames. Fast operations on data structures (like graph algorithms) have to be written in C. There's just no way to do it in R.

When I co-taught a statistical computing course, covering the basics of data structures and algorithms, I included some homework assignments where the difference between a fast and a slow algorithm was the choice of data structure. R users struggled because they had very little available to them. If their code wasn't fast because they were doing O(n) list lookups in a loop, there wasn't anything they could do to fix it.

I hope Python and Julia can eat R's lunch. Some day I'll have to get around to trying Julia for a serious project...

hadley10y ago

The lack of data structures in R is a totally fixable problem.

capnrefsmmat10y ago

Sure, through packages, but you'd need to adapt the entire standard library to take advantage of them, so you could pass new data structures to built-in functions and get meaningful results.

Generic iterators would also be extremely useful to build in, so it's easy to work with a wide variety of structures.

1 more reply

evandev10y ago· 2 in thread

My thoughts (a little of a Rant) on R as the Lead Engineer at a data science focused company is that R is a great statistical language, but a poor programming language. I use the term programming language as a language which is very versatile for a variety of needs (web app, commandline app) such as python, ruby, etc. R has the capabilities to use as a programming language like a climbing rope can be used as a belt. It can, but shouldn't because of some points I have below.

It is great for exploratory analysis, as it is forgiving and easy to use in the console for testing things; but once it needs to be put into practice, it has issues. For a non-programmer, grasping R isn't too hard thanks to some great developers in the community.

There is a lot of good in the R community, but people are focused on making it isn't. Just look at deploying R into production, that can be a nightmare. I've spent days looking over code to figure out where an error in production lies. One of the errors was a package of a package which was updated for the first time in years. That package depended on another package which my package called another function that called the first one; basically it was a mess of dependencies. And there are some misconceptions, while doing the engineering work in R and learning I learned not to use for loops. Then one day I timed it and the for loop was 10x+ faster than any apply/plyr function including using a gpu.

The things that separate a programming language from a statistical language are a programming language have more than one of these:

* Good dependency management

* Easy deployment into production environment

* A clear way to setup environment (e.g. naming, folder conventions)

* Ability to do most of the things you want with the base packages

* Good documentation about the above.

Basically, I believe a good data scientist is someone who can use R (or something else) to explore data and then create the algorithm in a compiled language to be put in production. And for someone who just needs to create analysis for research or a paper, R is the perfect use case. R is an excellent language for its use cases, just don't think about using it for general programming. It has caused a lot of extra dev hours working on issues with it.

Little plug, we wrote a piece on hiring data scientists.[0]

[0]: https://gastrograph.com/blogs/gastronexus/interviewing-data-...

JasonCEC10y ago

To contrast this - I'm the lead data scientist at the same company, and head over heels in love with R....

It is the only language I can quickly and efficiently jump from algebraic topology for novel pre-processing, straight into model building and validation - with just about every potential variation of every major algorithm freely available and packaged on a well curated package manager (CRAN), and then ensemble them.

I _agree_ that it's a bit difficult to use in production, and that dependency management needs work (Packrat is trying to do that), and that blindly trusting packages on CRAN can cause errors - but 98% of the time - it just works. Graphics, models, crazy niche things that are currently only used by one post-doc locked away in a top secret research lab... it all just works.

Of course, take this with a grain of salt: this is coming from a guy who's built web-servers (HTTP responses and all) in R.

makeset10y ago

R's server sockets can't select(), so unless you've reimplemented that as well, don't count on handling concurrent requests with that web server.

numlocked10y ago· 2 in thread

At my previous job we used to play "Guess what R does" over lunch. Someone would write a few R statements and we'd have to guess the output. Extremely difficult!

    >> a=c(1,2,3,4)
    >> b=c(1,2)
    >> a+b

Any guesses?

epistasis10y ago

+ is a vector operation, and does it elementwise. R recycles vectors. You get warnings if there are elements left over. This is something you should know within the first 5 minutes of learning the language, hopefully?

jmcphers10y ago

2, 4, 4, 6. The second vector is recycled along the first one:

1 + 1, 2 + 2, 3 + 1, 4 + 2.

makeset10y ago· 2 in thread

Yes, I'm well aware of R's many faults, I have my own long list of R caveats I hand to new hires, but not bothering to learn the damn language is no reason to complain about it. First, RTFM.

avs73310y ago

look, I am all for RTFM but my experience with R has been a different case in my experience. Can I make it do powerful things? yes. can I take code that wont run, copy it to a new file and have it run? yes.

1) There are just quirks in implementation that are nonintuitive. I constantly find myself doing things I would do in other languages, to do simple tasks, which fail for no good reason, albeit an obvious one once I find the right documentation.

2) The manual you describe is flat out unhelpful in many cases. The suggestions that constantly come about "check stack / google" are suggestive of just how poor that documentation is

mikeskim10y ago

You wouldn't find it intuitive if your first language was Scheme.

1 more reply

noelwelsh10y ago· 1 in thread

I think the general consensus is that R is a terrible language with a lot of useful libraries. I especially like that R claims to be inspired by Scheme, but the memo seems to have been "Make sure we f*ck this all up" taped to the front of the "Lambda the Ultimate" papers[1]. In particular, lexical scoping was one of the key innovations in Scheme and R has pervasively buggered up their implementation, from not distinguishing between defining and mutating a variable to making the default save/load procedures mutate the environment. OMG does R drive me insane (as a programming language person.)

[Saying "there is package on CRAN that fixes this" is not a solution. A language shouldn't require extensive knowledge of the ecosystem to get the basics working properly.]

[1] Scheme was introduced to the world in the "Lambda the Ultimate" series of papers. See http://library.readscheme.org/page1.html

hadley10y ago

No, that is not the consensus. R is (at its heart) a beautiful language that is extremely well suited to its domain. Most people who use R are not professional programmers (or even identify as programmers) so it's not surprising that there's a lot of bad code written in R.

justin_oaks10y ago· 1 in thread

Way too many languages are trolls, or have troll features. Or in other words, too many languages have features that don't do what a reasonable person would expect them to do.

I've long considered implicit type conversion to be a troll feature, especially how Javascript does it. Another one is how differently Java treats primitives and Object types. Oracle databases treat nulls and empty strings the same.

At times like this, all I can do is lament and search in vain for a language with no troll features.

mackwic10y ago

You can take a look at Rust. The authors are really careful to design an elegant C++ replacement, with no troll features and zero-cost abstractions.

chollida110y ago· 1 in thread

I think this article nails exactly what's right and wrong with R.

This in particular sums up the learning curve of R.

> Thankfully, I’m long past the point where R syntax is perpetually confusing. I’m now well into the phase where it’s only frequently confusing, and I even have high hopes of one day making it to the point where it barely confuses me at all.

Warning personal opinion ahead...

R, the language can get you up and running alot faster than other languages for statistics like say python with Pandas or scipy but even people who use it on a daily basis will curse the languages "quirks". I find most of the confusion comes from R trying to be too friendly to the user via type conversions. The ease in which the R's type system will convert values has probably caused me more grief when first learning the language than any other issue I ran into.

And this illustrates the down side of using R

> library(Hmisc) apply(ice.cream, 2, all.is.numeric)

> …which had the desirable property of actually working. But it still wasn’t very satisfactory, because it requires loading a pretty large library (Hmisc) with a bunch of dependencies just to do something very simple that should really be doable in the base R distribution.

Since R is rarely a programmer's most used language, I find there tends to be an above average use of google and paste type code that pulls in 50 different packages, each of which is used on 1-2 lines of a 1000 line script. Perhaps this is just a function of most programmers not really understanding the mathematical domain and hence they slowly google and iterate their way towards a solution.

Often I'll see people pull in 5 different time series libraries just because each of them operate on a ts object, so they all can work on the same object, and each one provides one additional method the other's don't and the programmer needs to create their solution.

You'll hear people talk about writing R in the Hadley universe or the basic R universe but there isn't much talk about what a canonical R solution looks like. R is a great language in the sense that Perl and C++. It allows you to do anything but there often isn't an agreed upon way of writing it and two different programmers can come up with wildly different but valid solutions to the same problem.

mziel10y ago

I think it's also due the fact that installing R package is just one command away, install.package("abc") and library(abc) and you're done. It a blessing, but encourages loading swaths of libraries.

DangerousPie10y ago· 1 in thread

I have my fair share of problems with R, but that first example (4 ways to select a column) seems a bit silly. Just off the top of my head, I could think of plenty of ways to do the same thing in Python/pandas:

    ice_cream.icol[0]
    ice_cream['col']
    ice_cream.iloc[:, 0]
    ice_cream.loc[:, 'col']
    ice_cream.ix[:, 'col']

And if you wanted to make things more convoluted, you could also wrap things into lists like the author did in the R example. So this is definitely not a problem that is unique to R or any reasonably flexible language.

mziel10y ago

For R at least some of it is due to R's flexibility

x$name syntax stems from data frames being really lists in disguise x[["name"]] ditto, plus because it's useful to access by string (see reflection in other languages)

x[,"name"] and x[,1] because we can also apply the matrix syntax to data frames

dmlorenzetti10y ago· 1 in thread

The upshot is that unless you carefully read the apply() documentation..., you’re hosed.

One thing that jumps out at me, having returned to R after several years in the Python world, is how obtuse its documentation can be.

The standard format for R documentation does a few things that I find impede understanding. First, the help pages are organized into sections giving the high-level description, the arguments, the details, and the results ("values"). The "details" generally are organized by argument keyword, and the arguments section draws on the language laid down-- usually in a vague, high-level way-- by the description section. Finally the practical effects of the details are deferred till the results section. That means unless you already know what's going on, you end up having to jump around among sections, trying to synthesize everything.

This is particularly a problem for those help pages-- and there are a lot of them-- that describe a raft of related functions all at the same time. Describing a bunch of related functions in the same place sounds like a good idea (it should help you figure out `apply` vs `sapply`, right?). Yet this is exactly when the documentation organization results in the most scattershot reading, because in addition to having to synthesize between sections, you have to mentally prune away text that, for one reason or another, doesn't apply to your particular case (for example, because different functions don't all share the same arguments, or because you want to read about the values for just one variation on the function).

Another idiom I dislike in the standard R documentation is how the examples don't actually show any sample output. There are generally some attempts at comments to explain what the sample code should or shouldn't do, but they are very much written in the style of programmer's comments, not in the style of documentation or learning points. So you end up having to run the code, and sometimes puzzle over the results for a while.

Here's an example, from the help page that I happen to have open right now, `help(sample)`:

     # sample()'s surprise -- example
     x <- 1:10
         sample(x[x >  8]) # length 2
         sample(x[x >  9]) # oops -- length 10!
         sample(x[x > 10]) # length 0

The comments alert me that there's a "surprise" in store, and they even allude to the (apparently surprising) fact that the second line produces a 10-vector. Notably lacking is any explanation of what's meant to be surprising here, how that relates to the internal logic of `sample`, or how to avoid falling into the trap.

Overall, I feel like R's documentation is a bit like a conversation among experts, with a rather sink-or-swim attitude towards newcomers.

Documentation is far from the first thing that stands out about R vs Python, but it's the most salient, I think, in the context of the original article.

masklinn10y ago

> This is particularly a problem for those help pages-- and there are a lot of them-- that describe a raft of related functions all at the same time. Describing a bunch of related functions in the same place sounds like a good idea (it should help you figure out `apply` vs `sapply`, right?). Yet this is exactly when the documentation organization results in the most scattershot reading, because in addition to having to synthesize between sections, you have to mentally prune away text that, for one reason or another, doesn't apply to your particular case (for example, because different functions don't all share the same arguments, or because you want to read about the values for just one variation on the function).

This reminds me very strongly of man pages. Man pages group either similar (man 3 printf) or closely related (man 3 malloc) functions, and intersperses bits about each of the functions documented by the page, which ranges from difficult to read to mind-boggling (when you have half a dozen near-identical functions being documented at the same time). Reading an lapply documentation page[0] it looks very similar in organisation, and similarly difficult to parse/use.

> The comments alert me that there's a "surprise" in store, and they even allude to the (apparently surprising) fact that the second line produces a 10-vector. Notably lacking is any explanation of what's meant to be surprising here, how that relates to the internal logic of `sample`, or how to avoid falling into the trap.

On http://www.inside-r.org/r-doc/base/sample the surprise is explained by the first paragraph of the details, with the hell of an understatement that "this convenience feature may lead to undesired behaviour" but without the big red blinking box it would definitely deserve.

[0] https://stat.ethz.ch/R-manual/R-devel/library/base/html/lapp...

Gatsky10y ago

Ok, but this seems pretty trivial compared to the many exclusive advantages R has. I've had minimal problems using and extending other people's software packages written in R (for bioinformatics). This has definitely not been the case with Java, R or perl, where just installing said software package is often unusually painful or impossible.

I think R is a prime example how useful a domain specific language can be. As such, I see Julia as the most viable replacement, although that will take a long, long time.

stevetrewick10y ago

So, as a code person rather than a stats one, my first reaction was that in the first example there is in fact only a single way to access a column but multiple ways to specify which one, all of which made immediate intuitive sense to me.

So I wonder if this is less about R specifically and more a feature of people approaching a language (any language) without that code geek intuition for the underlying affordances ?

pak10y ago

There are certain languages that are good for a first-time programmer.

R, despite being one of the first languages a budding "data scientist" might want to use, is probably not one of them for the many reasons given, among them:

- there are way too many ways to do everything

- implicit iteration (although great for statistics) makes performance issues hard to spot

- the data structures are a bit too flexible (it is Lisp-y in places), and you really need to understand them all to deploy the *apply and plyr functions effectively

- 3+ object-oriented programming systems

- non-standard evaluation. It's all over popular libraries like ggplot2, because it increases terseness, but it just looks like magic to beginners.

Basically, all the chapters listed here [0] -- which happens to be a great guide for experienced programmers to really understand R as a language -- happen to be the same reasons beginners give up too quickly.

Python, although it sufficiently nags me with its one-way-to-do-it motto and its many warts [1] to not want to use it regularly, is just well-rounded enough that it is a much better language for beginners. With Anaconda and iPython installed, I've found that a total programming beginner can actually get productive pretty quickly, even on stats and math problems.

[0]: http://adv-r.had.co.nz/

[1]: https://wiki.python.org/moin/PythonWarts

mikeskim10y ago

This could change your life: http://adv-r.had.co.nz/

joelberman10y ago

I think if you are a programmer or have some programming language experience, R is not very weird. But if you are a financial analyst or a social scientist, or a statistician, and only want to get your work done; it depends on your first programming language. If it was S3 you are golden. If it was Basic, you are not so golden. Mine was LISP.

Bluestrike210y ago

I just introduced a friend of mine to R. He's working on his PhD in microbiology and beside himself once he started working with R. Personally, I can't believe he hadn't used it before. It really is a beautiful language to work with once you get a handle on it.

stevehiehn10y ago

The only issue I have with R is when exposing it as a web service R is not great. For example if using R you will need one container for "dployr" and another for your web service. It's not the end of the world but more moving parts means more problems.

rcthompson10y ago

Maybe R should have optional "training wheels" that produce a warning every time an implicit conversion happens. In the OP's case, it would warn that a data frame was implicitly being converted to a matrix, and maybe also warn that the numeric vectors within were being converted to character in order to get slotted into that matrix.

haddr10y ago

R is great language but at the same time it can be a real pain.

Sometimes I imagine that some very wise guy designs a language much more consise and coherent, that could at the same time take advantage of the huge number of existing libraries written in R and C++... Maybe it's a dream but so many times I wonder if that's even be possible.

gradstudent10y ago

ITT: people who don't understand R complain about R.

tempodox10y ago

Ah, the joys of a dynamic language and its implicit conversions.

misiti378010y ago

previous discussion: https://news.ycombinator.com/item?id=5450097

j / k navigate · click thread line to collapse

148 comments

99 comments · 29 top-level

mziel10y ago· 22 in thread

The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.

todd810y ago

Someone, Ross Ihaka, that knows a thing or two about R wrote a short post 6 years ago and said "simply start over and build something better". Take a look:

http://www.r-bloggers.com/“simply-start-over-and-build-somet...

My hope is that Julia will eventually be adopted as a basis for a future statistical programming language.

sdenton410y ago

From your link, hilariously relevant to the blog post at hand:

todd810y ago

I should point out that Ross Ihaka along with Robert Gentleman created R.

TheLogothete10y ago

The push for R is enormous. It's not getting replaced.

jlebar10y ago

> The problem is people using R without trying to learn about the language itself

It's not the user's fault.

We can disagree as to whether or not it's objectively hard to become better at R, but this is a perfectly valid criticism to make. It's not the user's fault.

baldfat10y ago

Its a 4 year old article and R has changed a TON with new things BUT.... R has grown a ton in users as well as features.

As I learned R my code has changed dramatically and I think R has one of the largest gap between the code you start with and when you are proficient. My starting R code is really embarrassing.

jcizzle10y ago

1 more reply

x0x010y ago

The R help even comes with code samples that you can run!

Fomite10y ago

Similarly, I encounter lots of people using R who don't actually know what a function is, just that lm(x~y) gets them what they want.

3 more replies

copperx10y ago

Surprisingly, this is true for many programmers of general-purpose languages.

kbenson10y ago

> The problem is people using R without trying to learn about the language itself, just assuming it works like their favourite language.

crispyambulance10y ago

I sympathize with the OP and also feel frustrated with R (and I say that as a regular R "practitioner").

Below is the "Description" for Apply (which you get when you try ?sapply). Does it _really_ explain the essentials of what you need to use "apply"?

"... lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use. ..."

mziel10y ago

YMMV, but I think this is pretty clear:

"lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X" == lapply is a map() construct that takes a list and a function

1 more reply

chrisseaton10y ago

> For example complaining that R is slow and then writing iterative solution instead of using vectorisation.

But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?

YeGoblynQueenne10y ago

>> But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?

R is interpreted and dynamically typed, so when you declare a variable, the interpreter has to do some bookkeeping to figure out the type of the variable, allocate memory for it and so on.

If you write a loop by hand, the interpreter has to do this bookkeeping once for each iteration.

If you write your code in vectorised form, the interpreter can sort out the bookkeeping once and then hand over to the lower-level code (C or Fortran) the vectorised functions are interpreted in.

This can also be further optimised to take advantage of processor vector instructions, parallel processing etc.

3 more replies

bachmeier10y ago

Keep in mind just how old the language is, as it started as S in 1976. It was intended to be a glue language for Fortran and C.

Thriptic10y ago

Yea I feel like sapply, lapply, and mapply will cover most of what people need to do. Hell I've personally only really ever needed sapply as I don't work much with lists.

TheLogothete10y ago

Fomite10y ago

My usual take on this: "Between R and Python, the faster language is likely whichever library author actually wrote most of their code in C or FORTRAN."

1 more reply

existencebox10y ago

Anyway, this was just a thought rolling around in my head given the discussion.

1 more reply

50CNT10y ago

I for one write all my statistical code in baremetal assembly. I manage about 5 a year, but they all run very quickly. There is no such thing as premature optimization.

bachmeier10y ago

> The problem is people using R without trying to learn about the language itself

I see a lot of blub when I read posts about R. So much so that I start with the assumption that any post about R is a blub post.

louden10y ago· 10 in thread

That said, the power, flexibility and user community make it my go-to for any first crack at an analysis of data.

VLM10y ago

What makes R infuriating is most of the complexity and gotchas aren't inherent to the problem.

Its a very powerful system in spite of the language. Think of PC hardware architecture going back to the old XT days, its sinfully ugly, but its quite capable. R is no PDP-11 or VAX, thats for sure.

mziel10y ago

Factor is one of the worst thing of R-world. I don't recall ever needing factors, yet they creep in with many functions (read.csv, cut).

Btw there's a nice readr package (from Hadleyverse) that has a read_csv method that does away with factors by default.

VLM10y ago

You should use factor for data cleaning and verification.

So you have "sex" on the questionnaire, and factor will very quickly identify contamination such as "often", "not yet", various mis-spellings, etc.

stewbrew10y ago

How would you represent categorical data then? R's primary use case isn't text processing. And HW isn't always right.

1 more reply

th0br010y ago

You will (inevitably?) run into factors when importing data from SPSS files... sure, you can discard them upon reading... but are you sure you don't want access to the value labels in the future?

lottin10y ago

Factors are weird because no other language has anything like it, but they are actually a quite clever way to group data. It just takes a while to get used to them.

Fomite10y ago

I actually use factors a fair amount, and having factor-like data shoved into numeric values gets you to some bad places statistically.

grayclhn10y ago

1 more reply

gbrown10y ago

Factors are great, and surprisingly powerful even outside of statistical computing. With that being said, I prefer to create them on purpose rather than having read.csv attempting to be helpful.

stewbrew10y ago

A factor already is a vector of numeric values, which happen to have names.

wanderfowl10y ago· 6 in thread

I may be competent relative to most, but R feels so taped-together and idiosyncratic that even on my best days, I just feel like a newbie who's built up an army of ugly hacks.

Someday, I'll learn more about the python stats tools and do my stats there. But for now, R it is. Troll on, you crazy bastard.

mojoe10y ago

My data science group is currently transitioning to Python-based development from R. It's actually amazing how much faster Python development is, because of two things:

1. R libraries (and even rarely the R interpreter itself!) tend to have really weird corner case bugs that crop up every couple months, and

2. It's REALLY easy to write unmaintainable code in R, and so strange cruft creeps into the code over time.

nothis10y ago

I just wonder if there's rationality behind not changing this (by either switching to a different language, fixing it or, hardcore mode, create a new one).

wanderfowl10y ago

avs73310y ago

I am in an almost identical situation, just still in the PhD stage.

S4M10y ago

> Oh and '<-' just drives me nuts...I will never understand the choice of two characters where one is entirely sufficient.

It's true that '<-' is a strange choice, but you can use '=' for variable assignments as well.

1 more reply

jghn10y ago

Iirc it comes from the old APL keyboard

2 more replies

nathell10y ago· 5 in thread

I have a love-hate relationship with R, being a predominantly Clojure (and Ruby these days) programmer who only occassionally dabbles in data crunching.

The apply/sapply dichotomy that the article mentions (actually a hexachotomy, there are also mapply, sapply, tapply and vapply) is one example of a gazillion warts that the language has.

And R's three mutually incompatible object systems. And so on and so on and so on.

jonchang10y ago

See, this is another interesting example of the kind of behavior described here: https://news.ycombinator.com/item?id=11113042

    R has a useful function, paste, that concatenates strings together.
    Only it takes varargs, not a character vector, so if you have a
    vector v of strings, you have to use do.call(paste, v).

But the help for the `paste` function literally goes over this exact situation:

    > v <- 1:5
    > paste(v, collapse = ",")
    [1] "1,2,3,4,5"

I'm often super baffled by the lengths people will go to not figure out how to use R and insist on writing <X> language in R.

nathell10y ago

Thanks for pointing this out. I overlooked it, presumably because it's in the last paragraph of "Details" and not illustrated in any example.

1 more reply

sooheon10y ago

I wonder what disagreements or counterarguments downvoters might have, and if they might share them.

For the parent, given what you've observed, do you still go to R for data crunching, or have you found anything in Clojure land that measures up?

nathell10y ago

For statistics, Clojure has Incanter, but it's very basic in comparison. There are easily usable Java libraries for certain tasks (MALLET comes to mind), but these are few and far between.

[1]: https://github.com/nathell/skyscraper/

tanvach10y ago

You can also use paste( v, collapse =",")

tmalsburg210y ago· 4 in thread

hudibras10y ago

I agree with everything here, including the praise for Matloff's book; it should be the very first book any serious R user picks up.

But Matloff is no longer alone: Hadley Wickham's Advanced R is now also a must-read for R programmers.

http://www.amazon.com/Advanced-Chapman-Hall-CRC-Series/dp/14...

hadley10y ago

And Advanced R is also available for free at http://adv-r.had.co.nz/

kgwgk10y ago

There are other books on R programming: John Chambers published "Programming with Data: A Guide to the S Language" in 1998 and "Software for Data Analysis: Programming with R" in 2008.

tmalsburg210y ago

No better way to elicit information from people than making outrageous claims. Thanks for those references!

th0ma510y ago· 4 in thread

My own personal rant, I think the specific feeling I get is the conceptual idea of R has long since outpaced the reality of R.

But everything else wants to sugar coat the software surrounding the statistics, and leaves you no room to grow.

Ultimately you start thinking along those lines, and you realize that the promises R is making about empowering the analyst are just teasing them rather than helping.

JasonCEC10y ago

I disagree on multiple fronts!

1.5) ggvis is a D3 API wrapper on ggplot and allows for interactive graphics. Do you want to pay your data scientists for creating production ready graphics or let them focus on what they're best at?

3) I don't think there's any "magic" in R. It's just a language with a learning curve and lack of opinions.

/endrant

Not trying to start a flame-war - but this type of difference in opinion is important to see when thinking about hiring data scientists or deploying models.

th0ma510y ago

I sincerely appreciate your thorough and well reasoned response, and thank you for taking the time with it.

2) Yup, totally agree!

Anyway... the real problem is that you're also absolutely right. There is quite a bit difference in opinion between the tooling of an analysis effort, and the robustness expected by IT.

Thanks again!

mattkrause10y ago

I'd divide my work into three categories - Exploratory plots just for me. - Plots that I'm going to show to my boss or coworkers. - Things we're going to distribute to the whole world.

Plots in category #1 are often quick and dirty--I just want to see if an idea worked and don't really care about communicating that idea cleanly.

TheLogothete10y ago

superuser210y ago· 3 in thread

bachmeier10y ago

It sounds like you are arguing for imperative programming over functional programming.

superuser210y ago

We run 3 different intro sequences in Python (for non-majors), Scheme (for most majors), and Haskell (for those who are already strong imperative programmers). They're all great.

GFK_of_xmaspast10y ago

What does that have to do with r vs python?

elcapitan10y ago· 3 in thread

Fomite10y ago

Also assuming you don't want to count things like "You can call R from Python, and Python from R", here's my take on it, as someone who uses both languages:

- Pandas has helped Python tremendously, but I don't think it's quite to where the R data frame is.

- For 90% of what someone who wants to do statistics wants to do, it honestly doesn't matter at all. You can do nice data visualization in both. You can fit most generalized linear models in both.

TheLogothete10y ago

We did an evaluation recently. Not even close.

elcapitan10y ago

Is that evaluation somehow public, or could you share some details?

capnrefsmmat10y ago· 2 in thread

I hope Python and Julia can eat R's lunch. Some day I'll have to get around to trying Julia for a serious project...

hadley10y ago

The lack of data structures in R is a totally fixable problem.

capnrefsmmat10y ago

Sure, through packages, but you'd need to adapt the entire standard library to take advantage of them, so you could pass new data structures to built-in functions and get meaningful results.

Generic iterators would also be extremely useful to build in, so it's easy to work with a wide variety of structures.

1 more reply

evandev10y ago· 2 in thread

The things that separate a programming language from a statistical language are a programming language have more than one of these:

* Good dependency management

* Easy deployment into production environment

* A clear way to setup environment (e.g. naming, folder conventions)

* Ability to do most of the things you want with the base packages

* Good documentation about the above.

Little plug, we wrote a piece on hiring data scientists.[0]

[0]: https://gastrograph.com/blogs/gastronexus/interviewing-data-...

JasonCEC10y ago

To contrast this - I'm the lead data scientist at the same company, and head over heels in love with R....

Of course, take this with a grain of salt: this is coming from a guy who's built web-servers (HTTP responses and all) in R.

makeset10y ago

R's server sockets can't select(), so unless you've reimplemented that as well, don't count on handling concurrent requests with that web server.

numlocked10y ago· 2 in thread

At my previous job we used to play "Guess what R does" over lunch. Someone would write a few R statements and we'd have to guess the output. Extremely difficult!

    >> a=c(1,2,3,4)
    >> b=c(1,2)
    >> a+b

Any guesses?

epistasis10y ago

jmcphers10y ago

2, 4, 4, 6. The second vector is recycled along the first one:

1 + 1, 2 + 2, 3 + 1, 4 + 2.

makeset10y ago· 2 in thread

Yes, I'm well aware of R's many faults, I have my own long list of R caveats I hand to new hires, but not bothering to learn the damn language is no reason to complain about it. First, RTFM.

avs73310y ago

2) The manual you describe is flat out unhelpful in many cases. The suggestions that constantly come about "check stack / google" are suggestive of just how poor that documentation is

mikeskim10y ago

You wouldn't find it intuitive if your first language was Scheme.

1 more reply

noelwelsh10y ago· 1 in thread

[Saying "there is package on CRAN that fixes this" is not a solution. A language shouldn't require extensive knowledge of the ecosystem to get the basics working properly.]

[1] Scheme was introduced to the world in the "Lambda the Ultimate" series of papers. See http://library.readscheme.org/page1.html

hadley10y ago

justin_oaks10y ago· 1 in thread

Way too many languages are trolls, or have troll features. Or in other words, too many languages have features that don't do what a reasonable person would expect them to do.

At times like this, all I can do is lament and search in vain for a language with no troll features.

mackwic10y ago

You can take a look at Rust. The authors are really careful to design an elegant C++ replacement, with no troll features and zero-cost abstractions.

chollida110y ago· 1 in thread

I think this article nails exactly what's right and wrong with R.

This in particular sums up the learning curve of R.

Warning personal opinion ahead...

And this illustrates the down side of using R

> library(Hmisc) apply(ice.cream, 2, all.is.numeric)

mziel10y ago

I think it's also due the fact that installing R package is just one command away, install.package("abc") and library(abc) and you're done. It a blessing, but encourages loading swaths of libraries.

DangerousPie10y ago· 1 in thread

    ice_cream.icol[0]
    ice_cream['col']
    ice_cream.iloc[:, 0]
    ice_cream.loc[:, 'col']
    ice_cream.ix[:, 'col']

mziel10y ago

For R at least some of it is due to R's flexibility

x$name syntax stems from data frames being really lists in disguise x[["name"]] ditto, plus because it's useful to access by string (see reflection in other languages)

x[,"name"] and x[,1] because we can also apply the matrix syntax to data frames

dmlorenzetti10y ago· 1 in thread

The upshot is that unless you carefully read the apply() documentation..., you’re hosed.

One thing that jumps out at me, having returned to R after several years in the Python world, is how obtuse its documentation can be.

Here's an example, from the help page that I happen to have open right now, `help(sample)`:

     # sample()'s surprise -- example
     x <- 1:10
         sample(x[x >  8]) # length 2
         sample(x[x >  9]) # oops -- length 10!
         sample(x[x > 10]) # length 0

Overall, I feel like R's documentation is a bit like a conversation among experts, with a rather sink-or-swim attitude towards newcomers.

Documentation is far from the first thing that stands out about R vs Python, but it's the most salient, I think, in the context of the original article.

masklinn10y ago

[0] https://stat.ethz.ch/R-manual/R-devel/library/base/html/lapp...

Gatsky10y ago

I think R is a prime example how useful a domain specific language can be. As such, I see Julia as the most viable replacement, although that will take a long, long time.

stevetrewick10y ago

So I wonder if this is less about R specifically and more a feature of people approaching a language (any language) without that code geek intuition for the underlying affordances ?

pak10y ago

There are certain languages that are good for a first-time programmer.

R, despite being one of the first languages a budding "data scientist" might want to use, is probably not one of them for the many reasons given, among them:

- there are way too many ways to do everything

- implicit iteration (although great for statistics) makes performance issues hard to spot

- the data structures are a bit too flexible (it is Lisp-y in places), and you really need to understand them all to deploy the *apply and plyr functions effectively

- 3+ object-oriented programming systems

- non-standard evaluation. It's all over popular libraries like ggplot2, because it increases terseness, but it just looks like magic to beginners.

[0]: http://adv-r.had.co.nz/

[1]: https://wiki.python.org/moin/PythonWarts

mikeskim10y ago

This could change your life: http://adv-r.had.co.nz/

joelberman10y ago

Bluestrike210y ago

stevehiehn10y ago

rcthompson10y ago

haddr10y ago

R is great language but at the same time it can be a real pain.