I thought that Octave was an ugly little language at first, but now I really like it: it's a great tool for doing linear algebra, data visualization, machine learning, neural networks, etc.
Matlab is awesome above all else because the design is coherent, in both the syntax and the standard libraries.
It is extremely quick to whip up anything, then turn that into a script, and then into a program with functions (since functions can return many variables, and they also have zero overhead: you don't need any includes or requires, you just call them). Type conversions are practically never a problem, since they are sane and automatic. None of this 1+1.5 giving a syntax error. Real booleans. Data input and output libraries simply work like you would expect them to. (A=imread('/home/gravityloss/abc.png') creates a height x width x 3 matrix with all the RGB values. No requires, includes, plugins, hunting and compiling libraries.) You don't need libraries to do a huge amount of stuff, but if you need them for something experimental, they work extremely easily.
You also rarely need stuff like loops since mass operations on data are native. If you as a newbie create a custom function for a scalar, there's a good chance it will work for vectors or n-dimensional matrices automatically. This immensely reduces the amount of error-prone housekeeping code for indices and lengths. It's also much, much faster than looping in another scripting language. As a result, the code is often very readable as well.
There's help which actually returns something sensible when you type help, you can type help help or help command or search this or that, the help texts are actually very thoughtful and helpful too and not at all like Linux man pages... I could go on for hours on features that don't really exist anywhere else, even though everything's been in plain sight for decades in Matlab.
Julia's an awesome thing though, I hope it gets more traction...
Python is better generally than matlab/octave, and R I view as roughly on par.
*Source: 7 year veteran, octave and matlab user, python lover
If Octave's only claim to fame is being the poorer cousin of Matlab, I wonder why universities still use Octave to teach anything. I would much rather they use R, which pretty much ensures that the students have an open source path to use it in the future.
I'm doing Coursera's Stat One, and R is pretty easy. It reminds me of PHP, syntax-wise. I don't know why, it's just a feeling. This article made it a bit clearer. OOP was an afterthought...
I'm getting more and more into R now. Hopefully one day Python.
So I guess what I'm saying is that R is the one I wouldn't be surprised by, and that I have to respectfully disagree with your Octave statement.
That would be a really cool project to work on: design a minimal language for expressing most types of data analysis at a higher level. If the language is sufficiently small and simple, I could see some very powerful tooling being possible for it.
Perhaps it might make sense to go even more specific: have a small language designed not just for data analysis but for analysis in a very specific vertical (say finance or bioinformatics). It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.
1) Most data analysis tasks boil down to roughly the same things: accessing the data source --> data cleaning --> simple transformations --> (optional) stats/fitting/ML/specialized procedures --> pretty pictures and reporting.
2) Not everyone wants programming to be the main component of their job.
People who can take advantage of the flexibility that programming offers can usually take advantage of existing technologies. People who don't enjoy coding will always look for off-the-shelf solutions that have pretty GUIs with magic buttons that solve all their problems. I just don't think there is a huge market in between to be filled... in the domains that I've been exposed to anyway.
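To make the stages in point 1) concrete, here is a minimal sketch in plain Python using only the standard library. The data, the column names and the function names (`load`, `clean`, `transform`, `summarize`, `report`) are all made up for illustration, and a one-line text summary stands in for the "pretty pictures" stage.

```python
import csv
import io
import statistics

# Hypothetical raw data; in practice this would come from a file or a database.
RAW = """name,height_cm
alice,170
bob,
carol,165
dave,not_a_number
erin,180
"""


def load(text):
    """Access the data source: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Data cleaning: keep only rows whose height parses as a number."""
    cleaned = []
    for row in rows:
        try:
            height = float(row["height_cm"])
        except (TypeError, ValueError):
            continue  # missing or malformed value, skip the row
        cleaned.append({"name": row["name"], "height_cm": height})
    return cleaned


def transform(rows):
    """Simple transformation: add a column with the height in metres."""
    return [{**row, "height_m": row["height_cm"] / 100} for row in rows]


def summarize(rows):
    """Stats step: count, mean and sample standard deviation of the heights."""
    heights = [row["height_m"] for row in rows]
    return {
        "n": len(heights),
        "mean": statistics.mean(heights),
        "stdev": statistics.stdev(heights),
    }


def report(summary):
    """Reporting: format the summary as a single line of text."""
    return "n={n} mean={mean:.2f}m stdev={stdev:.3f}m".format(**summary)


rows = transform(clean(load(RAW)))
summary = summarize(rows)
print(report(summary))
```

Each stage is a small function over a list of rows, so any one of them could be swapped for a more specialized tool without touching the others.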
For instance, consider illustrator products or d3? Both of these are specialized ("deep") tools for creating pictures that I've used extensively in the "pretty pictures and reporting" stage you outlined.
Also of serious note are BUGS[1], JAGS[2] and (recently) Stan[3] as small semi-declarative languages for MCMC model building, fitting, and checking.
SQL is an obvious example of a component of the "simple transformations" step.
[1] BUGS http://www.mrc-bsu.cam.ac.uk/bugs/ [2] JAGS http://mcmc-jags.sourceforge.net/ [3] Stan http://mc-stan.org/
> It would be awesome to let people express their ideas in terms of the domain and not worry about low-level details like loops.
Yes! I think you want to build this functionality on top of a powerful language to easily handle the dirty ETL work too. This is the reason lots of financial companies use python with scipy, numpy, pandas, etc, on top of it.
I think there's a lot of power in certain kinds of non-Turing completeness. Email me if you want to talk about it.
The part about Python not being as fast as Julia jumped out at me. Wes McKinney's benchmarks show that Python is faster than Julia for numerics: http://wesmckinney.com/blog/?p=475
EDIT: should not have said "python faster than Julia". They are comparable because the slow bits get done in BLAS anyway.
Cython is actually what is faster than Julia in Wes' comparison, not Python. Cython looks kinda, sorta like Python, but it is actually a static language with C-like types (but quite different syntax for those types), no polymorphism, and, afaict, ill-defined semantics. The best answer I seem to get about Cython's semantics is that Cython's semantics are whatever it does. I'm not alone in this complaint – Travis Oliphant expressed a similar concern at this year's SciPy (in this panel [http://www.youtube.com/watch?v=7i2vhoQY-K4], if I recall correctly), which is part of his motivation to work on Numba [https://github.com/numba/numba].
If you look at the comments on Wes' post, when I used the dot(x,y) function, which ships with Julia and uses a BLAS to compute the inner product just like the fastest "Python" version does, Julia is equally fast. That stands to reason – they're both just calling a BLAS.
Finally, that blog post is months old. Since then, Julia passed the milestone of being no slower than 2x C++ on its microbenchmarks suite [http://julialang.org/]. That's not a guarantee that all code is that fast, but most things we see can be pretty easily tweaked to get there (counterintuitively for those coming from Matlab, Python or R, usually by devectorizing the code rather than vectorizing it). And of course, there's a lot of room for improving Julia's performance: the compiler is still quite young and there are many optimizations that we haven't implemented. Basically, there's nothing but work standing in the way of reaching C or Fortran's speed across the board.
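For readers unsure what "devectorizing" means here, the sketch below writes the same inner product two ways in plain Python. The function names are made up for illustration. Plain Python has no built-in array type, so the second form only stands in for a whole-array library call such as dot(x,y), and nothing here says anything about relative speed in Julia, Python or anywhere else.

```python
def dot_devectorized(x, y):
    """Inner product as an explicit loop over indices (the devectorized style)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    total = 0.0
    for i in range(len(x)):
        total += x[i] * y[i]
    return total


def dot_whole_sequence(x, y):
    """Inner product as one expression over whole sequences (the vectorized style)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a * b for a, b in zip(x, y))


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(dot_devectorized(x, y))    # 32.0
print(dot_whole_sequence(x, y))  # 32.0
```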
Even with thousands of hours of experience in Matlab, R and Python... I'm not sure what "obvious advantage" Matlab and R share over Python.
But you can just type "R", do read.table(), and very quickly slice and dice your data. In Python just evaluating what package to use, then getting the packages, dealing with versioning issues, etc. kind of breaks the whole thing. Then you need to figure out what plotting library to use, etc. Having stuff built-in as a common base which all your coworkers share is important. I know there are common distributions like SciPy but they are not as common as R is.
Probably the bigger issue, as mentioned above, is that R has higher-level stuff like time series libraries that Python doesn't.
The main thing that's needed is a shell to glue all these languages together, to ease integration pain. Everybody wants the "one true language", but that's a pipe dream. Python's close but not quite. Julia is kind of falling prey to this fallacy too. The programming world is becoming more heterogeneous, and the solution is to have tools to make multiple languages work nicely together. Not to pretend that heterogeneity doesn't exist.
You can work really hard to get homogeneity on your one little project. Maybe that's why language wars are so heated. But the second you have to borrow code from another lab, or you acquire a company, or get acquired, you have a heterogeneous mix. Matlab, R, Python, or Julia will never suffice for all tasks. Non-trivial problems will always require a mix of them. You have to pick the solution according to the problem, and Matlab and R definitely are superior to Python for certain problems.
That would be a pretty weak argument in my opinion.
I wish every language would have such a built-in object type, I definitely feel its loss when I manipulate data in other languages such as Javascript or Mathematica.
The performance is terrible though. For data of more than ~10,000 observations, SQL performs much better, is more robust, and is just as good at subsetting, although it may not fit everyone's definition of elegant.
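As a rough illustration of the kind of subsetting meant here, this is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name `obs`, its columns and the rows are invented for the example; it shows the shape of the queries, not a performance comparison.

```python
import sqlite3

# In-memory database with a small, made-up table of observations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO obs (grp, value) VALUES (?, ?)",
    [("a", 1.5), ("b", 2.0), ("a", 3.5), ("b", 0.5), ("a", 2.5)],
)

# Subsetting: rows in group "a" whose value is above 2.
subset = conn.execute(
    "SELECT id, value FROM obs WHERE grp = ? AND value > ? ORDER BY id",
    ("a", 2.0),
).fetchall()
print(subset)  # [(3, 3.5), (5, 2.5)]

# A grouped aggregate over the same table.
means = dict(conn.execute("SELECT grp, AVG(value) FROM obs GROUP BY grp"))
print(means)

conn.close()
```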
As much as I hate R and love Python, this is not entirely true (unless you count rpy2 as part of "Python"). R has many more statistical models and better plotting capability compared with Python. It also has a lot of domain-specific packages (for example, Bioconductor) that are not available in Python.
But the basic data analysis is fine. The IDE has awful code completion and lacks more refinement in the editor.
I think it would be interesting to see breakdowns of different software, and where they are used. Oftentimes it seems to me that people just use the tools their peers and co-workers use, and people tend to learn to like whatever they use most.
Because the analysis is often the quickest part of being a data scientist. Coursera, as I recall, apparently cleans its data, and also lets you easily import it.
In real life, data is messy, and messed up. You looking at birthdays from some website? Expect a spike for whatever the default is... but that doesn't mean you can eliminate that data completely, because some people were presumably born Jan 1st.
You looking at birth years? I recall dealing with them in SAS... remember, if it's four-digit, that you check for births occurring in the current and past century.
And hey... do you have two or more elements of data for an individual? 2% to 5% will probably be missing some element, and some will have wrong data. A zip code off by one, an address not in the city you are looking to geocode for, whatever. If you are lucky, it will be obvious stuff like that.
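A minimal sketch of what checks like these might look like in plain Python. Everything specific is an assumption made for illustration: the records, the field names, Jan 1st as the site's default birthday, the 120-year age cutoff, and the fixed "today" used to keep the output reproducible. Note that suspicious rows are flagged rather than dropped, since some people really were born on Jan 1st.

```python
from datetime import date

# Hypothetical records; field names and values are invented for the example.
RECORDS = [
    {"id": 1, "birth": "1980-01-01", "zip": "94110"},
    {"id": 2, "birth": "1975-06-14", "zip": ""},
    {"id": 3, "birth": "1890-03-02", "zip": "10001"},
    {"id": 4, "birth": "2999-12-31", "zip": "60614"},
    {"id": 5, "birth": None, "zip": "73301"},
]

DEFAULT_MONTH_DAY = (1, 1)  # assumed default the site fills in for birthdays
MAX_AGE_YEARS = 120         # assumed plausibility cutoff


def flag_record(record, today):
    """Return a list of warnings for one record; never drops the record."""
    flags = []
    for field in ("birth", "zip"):
        if not record.get(field):
            flags.append("missing_" + field)
    if record.get("birth"):
        born = date.fromisoformat(record["birth"])
        if (born.month, born.day) == DEFAULT_MONTH_DAY:
            flags.append("possible_default_birthday")
        if born > today:
            flags.append("birth_in_future")
        elif today.year - born.year > MAX_AGE_YEARS:
            flags.append("implausibly_old")
    return flags


today = date(2013, 1, 15)  # fixed date so the example is reproducible
flagged = {r["id"]: flag_record(r, today) for r in RECORDS}
for record_id, flags in flagged.items():
    print(record_id, flags)
```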
The life of the data scientist is mostly cleaning, formatting, and transferring data, with the occasional sweeeet analysis. Of course your analysis will probably give you nothing useful, because despite several thousand usable records, it's not clear if any element has a significant effect on the dependent variable you are looking at. If you are smart, maybe you can finagle an analysis based on a nonparametric distribution or logistic regression.
Oh, and often the speed of your analysis running is inversely correlated with how easy it is to code and enter your data. There is a reason people use SAS, and it's not because of its amazing IDE.
https://metacpan.org/module/PDL
PDL is the Perl Data Language, a perl extension that [...] includes fully vectorized, multidimensional array handling, plus several paths for device-independent graphics output.
PDL is fast, comparable and often outperforming IDL and MATLAB in real world applications. PDL allows large N-dimensional data sets such as large images, spectra, etc to be stored efficiently and manipulated quickly.
For integration with R, there are Statistics::R (https://metacpan.org/module/Statistics::R) and Statistics::useR (https://metacpan.org/module/Statistics::useR)