I get that R is not for everyone, but used correctly it is a beast.
Now this is anecdotal, but in the insurance industry we have what we call on-level premium calculators. It is basically a program that rerates all policies with the current set of rates.
Our current R program can rate 41,000 policies a second, fully vectorized, on a user laptop with an i5 from 2015.
In contrast, the previous SAS program could do 231 policies a minute on a 64-core Xeon processor from 2017.
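To make "fully vectorized" concrete, here's a toy sketch of what rerating a whole book in one pass looks like. The rate table and rating factors below are entirely made up for illustration; a real on-level calculator would obviously be far richer.

```python
import numpy as np

# Hypothetical rating formula: premium = base_rate * territory_factor * age_factor.
# All rates and factors here are invented for illustration.
rng = np.random.default_rng(0)
n = 41_000
base_rate = 500.0

territory = rng.integers(0, 3, size=n)              # territory code per policy
territory_factor = np.array([0.9, 1.0, 1.25])[territory]

age = rng.integers(18, 80, size=n)
age_factor = np.where(age < 25, 1.5, 1.0)           # surcharge for young drivers

# One vectorized pass rerates every policy at once -- no per-policy loop.
premiums = base_rate * territory_factor * age_factor
```

The point is that the whole portfolio is rated as a handful of array operations, which is the style that makes this kind of throughput possible in R (or NumPy) on a laptop.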
For our workload and type of work, R has been a godsend.
Bonus: we can put what our data scientists develop in R directly into production. (After peer review, testing, etc., no different from any other production code.)
Back when I started in 2005, we modeled in some proprietary software like Emblem, used Excel to build a first-draft premium calculator, rebuilt the computation in SAS for the on-level program, and sent specs to IT to rebuild the program yet again for production. All three had to produce the same results.
I've tried Python, Go, Rust, and Julia. I'd say Python could be a good alternative, but the speed of data.table, the RStudio IDE, and the ease of package management make R an obvious choice for us. I believe Julia to be the future, but so far the adoption rate in house has been low.
But... and here's the big but... I almost never meet anyone these days capable of putting all these steps together in SAS who actually understands the SAS computation model end to end.
And SAS's strength, a computation model that by default is not limited by memory, becomes a performance weakness when everyone reads and writes every step out to disk and programs without understanding all those little intricacies. SAS hasn't helped any of this by trying to move its ecosystem away from "programmers" toward "application users", so now "programmers" can pick up an interpreted language like R, with in-memory vectorised operations by default, and beat SAS.
Of course, I'd still recommend places move to Python/R these days because of the broader ecosystems, the university talent pool, and avoiding the extensive lock-in of proprietary software, but I still feel I have to reflexively respond to "R faster than SAS" claims :p
And yes, technically SAS is faster than R, but part of the equation is how many people can actually make SAS code faster than R/Python. I had maybe 1-2 people who could write efficient SAS code.
One version we had was a bunch of macros producing hash merges, plus the whole "how can I do this without ever leaving the data step" mindset. Just horrible. The limit on the number of characters in a line of code? Forget a quote somewhere and now you have to run the magic line to recover.
I hope I'm not too emotional when I say I hope SAS disappears from my industry and we embrace less adversarial licensing.
I have been trying to help with exactly this (and your breadcrumbs help), but it is tricky for me since I am used to an open-source/*nix environment, where you can use many different tools and where information and tutorials are distributed much more widely.
It uses dplyr and data.table syntax to manipulate data on disk
Why do you believe it will be the future, and what do you see as the barriers to roll-out? I ask as someone who is curious about when/whether to start investing in Julia competence.
https://julialang.org/blog/2012/02/why-we-created-julia
I've been playing around with it. As a Python/MATLAB guy, the syntax is very friendly. I can see it displacing Python in production code where you need speed and might avoid some of the heavy Python DS libraries. Overall it seems like a thoughtful combo of a lot of good numerical programming features.
But so far we have seen great development. Flux is a truly beautiful ML library. Being a compiled language removes a lot of headaches when building production images. The syntax, the full UTF support in variable names. Package management is great. Having that abstraction layer between CPU and GPU so you don't have to rewrite code. Dispatch based on signature, type management. I don't see it going away soon. It took me 13 years to make them transition out of SAS; good thing cloud computing came around and someone realised the clusterfuck of having to manage SAS licences in the cloud.
Right now the only way to do that without significant performance costs is to drop down into C or avoid the problem completely by using Julia.
Having worked with both R and Python on large datasets, I think both languages are really easy until they aren’t. Eventually you hit a performance wall.
I recommend folks looking to start with R check out: https://r4ds.had.co.nz/
It is also available for free.
However, a couple of years ago, my wife tried to transition from business consulting to a data analytics / data science role. She started with taking an R course. She was put off by R's complexity and the course's early focus on the details of R syntax, function definitions, closures etc. and abandoned it.
The year after, she decided to try again and enrolled in a course that used Python (with numpy+pandas+scipy as data science stack) and she reported it to be much simpler, more intuitive and easier to learn compared to her previous experience with R. Now she has successfully completed the program and is employed as a data analyst.
Here's a useful post, comparing the classic approach you mention to an alternative
The tidyverse is incredibly controversial in parts of the R community; it's an opinionated set of packages that essentially comes with its own "standard" library. But I think that wholeheartedly embracing it, and hiding the way you would do things in R without the affordances the tidyverse offers, is absolutely the right way to teach R these days. Unfortunately, a lot of courses and books haven't caught up to that yet.
Great documentations and tutorials go a long way.
Doing the exact same thing we did before!
We have a new library called "dtplyr" (no, seriously!). It is designed to save users from the arcane and obtuse sides of R by combining the power of "dplyr" and "data.table", the two libraries that were themselves designed to save users from the arcane and obtuse sides of "data.frame" and ....
I wish I were kidding. There is the absurd contention in the R world that by introducing yet another weirdly named package people can avoid having to learn and suffer through the "real" R.
A huge pain point for us is the packaging system. It is absolutely awful. Packages constantly get overridden, so we have to install packages in a specific order. Whenever I have reached out to the community (including prominent members who have written R books), I have always been told to just use the latest version of all packages and get on with it, which, as anybody knows, isn’t always possible, especially as there are constantly breaking API changes.
I understand R’s history and that in general it is a lot better than it used to be, but I would only recommend R for notebook-style work, and would keep it well away from production.
We have migrated to Python, which isn’t perfect, but the difference in logging and packaging has been night and day.
So instead of defining our app to use version 1.4.5 of a package, we would use “latest version from 3rd of May”.
A lot of packages/functionality are not available in Python, however.
R-Shiny is a full stack platform for web apps, and it’s how I leveraged my data science background to get into web development. It’s incredibly powerful in my opinion, with the only obvious limitation being the speed of R itself.
And Plumber. It’s become the de facto method for deploying R code as a REST API. It too is still maturing, but I see it eventually becoming the Flask of R.
Truth be told, however, after developing quite a few projects on the Shiny/Plumber stack, I wouldn’t recommend anyone do it.
If for some reason you can only have an R interpreter, go for it. But learning multiple languages really is the best solution if you want to manage efficient applications. I say this, however, realizing that all of my colleagues writing R don’t have engineering backgrounds.
I can’t help but feel that R is like JavaScript in many ways. Ease of use and the ease of publishing packages mean the repository gets cluttered very quickly.
R will always have a special place in my heart, after all it’s the language that made me discover programming. However, I can’t help but feel that my thirst for efficiency is making me outgrow it as a language quickly.
After learning a fair bit of web-development, I feel R should focus on an analytics oriented path.
R just isn't designed for web apps. Web apps are much better and faster to develop in more focused languages/frameworks (Node/Python/Django/Express, etc.), which can then seamlessly integrate R modules/scripts.
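One common shape of that integration, sketched here under assumptions: a Python backend shells out to an R script via `subprocess`, passing JSON in and reading JSON back. `score.R` and the payload format are hypothetical; the sketch only runs R when `Rscript` and the script are actually present.

```python
import json
import os
import shutil
import subprocess

def run_r_model(payload: dict) -> dict:
    """Call a hypothetical R script (score.R) from a Python web backend.

    The payload is passed as a JSON command-line argument, and the R
    script is expected to print a JSON result to stdout.
    """
    if shutil.which("Rscript") is None or not os.path.exists("score.R"):
        # R (or the script) isn't available; a real service would raise here.
        return {"error": "Rscript or score.R not available"}
    out = subprocess.run(
        ["Rscript", "score.R", json.dumps(payload)],  # score.R is hypothetical
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)
```

A real deployment would more likely keep an R process warm (e.g. behind Plumber) rather than paying R's startup cost per request; the subprocess version is just the simplest form of the pattern.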
It was really useful to be able to apply most theory I was learning to actual research datasets. This is what I miss the most since moving to Python.
What I don't miss is R's terrible packaging system and how it made collaborating with colleagues near impossible. I can't count the number of times I had to debug dependencies in others' scripts just to be able to move forward with a team project.
Historically, the conventional way to write R code was one that tended to result in shadowed names (and hence brittle code).
Working with data frames in R is much, much more convenient than in pandas (loc, iloc, etc.??).
Plotting is an obvious win for R. matplotlib is horrible; powerful, yes, but an absolute pain compared to ggplot.
Scikit is definitely unmatched but caret is not so far behind. Also, R has a plethora of implemented models that Python lacks (from something as basic as decent quantile regression to time series analysis tools).
As for building a complete application, Python is indeed the go-to.
Syntax-wise, using magrittr's pipes is an absolute pleasure. Good luck doing that with Python.
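For contrast, the closest pandas gets to a magrittr pipeline is method chaining; workable, but arguably less fluid than `%>%`. A toy example (the data and column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"grp": ["a", "a", "b"], "x": [1, 2, 3]})

# R equivalent:
#   df %>% filter(x > 1) %>% group_by(grp) %>% summarise(total = sum(x))
# pandas method chaining, the closest built-in analogue:
result = (
    df[df["x"] > 1]
    .groupby("grp", as_index=False)
    .agg(total=("x", "sum"))
)
```

Each step returns a new frame, so the chain reads top to bottom much like a pipe, though filtering with `df[df["x"] > 1]` repeats the frame name in a way `filter(x > 1)` does not.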
I use R everyday for statistical analysis due to it having certain interfaces and I still hate it every day.
But behind the scenes, R is just a lisp with some data structures that are adapted to statistics and data science.
All base data structures are immutable by default. And the vector type, for example, is extremely performant, as it's just a thinly wrapped C array. In Python you need to reach for NumPy for anything similar, and you do feel some pain converting between native Python types and NumPy types for the various functions that support one or the other.
The data frame is immensely powerful, and it has excellent performance characteristics because it's built upon vectors. A list of objects, like you'd make in Python, is just a lot slower and more unwieldy to deal with, and much harder to write generalizable functions over.
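A rough Python sketch of the contrast being described: the same column-wise operation over a NumPy array (column-oriented, like an R data frame built on vectors) versus a list of per-record dicts. The field names are invented; the point is the shape of the code, not a benchmark.

```python
import numpy as np

n = 100_000

# Column-oriented, like an R data frame built on typed vectors:
amounts = np.arange(n, dtype=np.float64)
taxed = amounts * 1.05                               # one vectorized operation

# Row-oriented list of objects, as one might naively do in plain Python:
records = [{"amount": float(i)} for i in range(n)]
taxed_list = [r["amount"] * 1.05 for r in records]   # per-element loop
```

The vectorized version dispatches the whole column to compiled code in one call, while the list version pays Python-level overhead per element, which is the gap the commenter is pointing at.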
Hadley Wickham's Tidyverse[2] is exactly an attempt to hide away the arcane details and create a modern, coherent and consistent language on top of R, keeping the power of all the great statistics R libraries. The fact that R behind the scenes is a Lisp, with support for macros, makes this possible. For doing data-transformations and statistics, I can't think of anything currently as powerful as CRAN + Tidyverse.
In R it's slightly different. The vector (generally without dimensions) is the base data type, and n-dimensional arrays are made of a vector plus dimensions. A matrix is then a 2d array. Also, vectors/arrays are type-specific by default.
> support for macros
From what I've seen, R does not support macros, but functions which can retrieve/generate code at runtime. That's an early mechanism which got replaced by macros in Lisp. Macros in Lisp are source code transformers and can be compiled - thus they are not a runtime mechanism like in R or earlier Lisps with so-called FEXPRs.
This is something I wish there was more progress on. A serious limitation in some contexts.
> A vector is what is called an array in all other programming languages except R
Vectors are called vectors in several "wispy" languages: Common Lisp, Scheme, Clojure...
> An array with two dimensions is (almost) the same as a matrix.
I think it's the same, not "almost" the same. At least in the current version of R:
> class(array(1, c(2,3)))
[1] "matrix"
> identical(array(1, c(2,3)), matrix(1, nrow=2, ncol=3))
[1] TRUE
In 4.0 there will be a change and the class of a matrix will be both "matrix" and "array", but I think the fact that there is no difference between a 2-dimensional array and a matrix remains.

That said, the real value in R seems to be the libraries. Has anyone looked at a shim that could make those libraries available to Python in a reasonably natural way? If that existed, the R language itself could be allowed to finally rest in peace.
Being vector-aware and having data frame support built in makes R much more elegant to me than Python's add-on libraries. It's like Scala building on top of Java and trying to bolt on an actor paradigm, versus Erlang being built from the get-go around concurrency and choosing actors as its main concurrency paradigm. You can see this in other languages too: PHP and C++ let you do OOP, but it's an afterthought compared to Ruby or Python.