While we are trying to minimize the disruptions, there is one thing project maintainers should do right now: pin the maximum NumPy version to <2.0 in their `pyproject.toml` project dependencies. This will ensure they do not inadvertently upgrade before they are ready to do so. Once NumPy 2.0 is released, you can check that your code works with it, and then release the pin.
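For projects using PEP 621 metadata, the pin looks like this (the lower bound here is illustrative, not part of the advice):

```toml
[project]
dependencies = [
    # Cap NumPy below 2.0 until the project is verified against it
    "numpy>=1.22,<2.0",
]
```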
Definitely good advice on pinning the upper range for major versions. Pandas 2.0 inadvertently broke Spark's toPandas for (at the time) all existing Spark versions, which some of our users lean on. The pandas response was apparently to suggest upgrading Spark to a version that had not yet been released. We lost about half a day's work from a small handful of devs trying to identify why some of our users were seeing sudden failures.
I think that ndarray is the most successful abstraction I've come across. Numerical computing is a domain ripe for terrible code, but multi-indexing, broadcasting, masked arrays, .reshape(), .where(), linspace(), and all that are so well made and useful that they are now the standard grammar of data science. Yes, you've seen horrible numpy code before, but how much worse would it be if it had been written by the same person in raw C with only malloc and pointer arithmetic?
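A few lines are enough to sketch that "standard grammar" (the values here are arbitrary, just to show the shapes):

```python
import numpy as np

# linspace: 6 evenly spaced samples on [0, 1]
x = np.linspace(0.0, 1.0, 6)

# reshape: view the same buffer as a 2x3 grid
grid = x.reshape(2, 3)

# broadcasting: subtract a per-column mean with no explicit loop
centered = grid - grid.mean(axis=0)

# where: elementwise select, again loop-free
clipped = np.where(centered < 0, 0.0, centered)
print(clipped.shape)  # (2, 3)
```

Each of these would be a hand-rolled loop (and an opportunity for an off-by-one bug) in raw C.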
Just FYI, numpy by no means pioneered this concept. There was Fortran before, as a language built around n-dimensional arrays. And even Fortran was not the first one, as some array-oriented programming languages (such as APL, IIRC) have similar concepts. Languages such as R and eventually Matlab (as a kind-of nicer frontend to Fortran libs) also popularized this concept. However, Numpy was the first library to bring this into a general-purpose language like Python.
Fortran, Matlab and all array-based languages (APL, J, K, Nial, etc.) have these constructs at their core. Ndarray may be a great implementation, but the ideas have all been there. I've been exploring J recently and the flexibility and compactness is tremendous.
Numeric Python, the original numpy, was based mainly on the MATLAB array object. Of course, it was also influenced by FORTRAN and other languages, but ultimately, it was taken from the mental model of MATLAB.
I remember implementing a sliding, overlapping 2D windowing by playing with the strides. I don't know how common that trick is, but at the time it felt like magic.
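That trick now has a safe wrapper in NumPy itself: `numpy.lib.stride_tricks.sliding_window_view` (added in 1.20). A small sketch of overlapping 2D windows, using an arbitrary 4x4 array:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(16).reshape(4, 4)

# Every overlapping 3x3 window, with no data copied:
# the result is a strided view into the original buffer.
windows = sliding_window_view(a, (3, 3))
print(windows.shape)  # (2, 2, 3, 3)
```

The sharper tool underneath is `numpy.lib.stride_tricks.as_strided`, which reinterprets the same memory with hand-built strides (and happily reads out of bounds if you get them wrong), which is presumably what the manual version of this trick used.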
numpy is a great library, but I find the pytorch take on it superior. It allows for much more method chaining, which I find to be very readable in longer computations. I frequently want to reach for it in numpy, only to see that the operation exists only as an np.* function. In functional languages I would use an infix pipeline combinator like (|>), but that's not possible in python.
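The asymmetry the comment describes can be seen in numpy alone (the computation here is arbitrary, chosen only to contrast the two styles):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 5)

# Chaining works where NumPy exposes ndarray methods...
y = x.clip(0).reshape(5, 1).sum()

# ...but many operations exist only as np.* functions
# (there is no x.sqrt() or x.abs() method), so the chain
# breaks and the expression must be read inside-out:
z = np.sqrt(np.abs(x)).sum()
```

In PyTorch, `x.abs().sqrt().sum()` would keep the left-to-right reading order throughout.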
I’m working on adding missing data support for strings as part of adding a UTF-8 variable-width string type to NumPy. Not a general solution but should help with a lot of use-cases. https://numpy.org/neps/nep-0055-string_dtype.html
np.nan? Not trying to be funny, but hoping to learn whether I'm missing something about limitations of np.nan which would be solved by some other kind of missing value indicator.
np.nan is only for floats; it doesn't help with integer, boolean, string, etc. Also, datetimes have NaT, but it's troublesome to, e.g., call different checks (np.isnan() or np.isnat()) depending on the data type. And we don't even have np.nat, but need np.datetime64("NaT"), so it's just confusing.
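A small demonstration of that asymmetry (dates are arbitrary):

```python
import numpy as np

f = np.array([1.0, np.nan])
print(np.isnan(f))   # [False  True]

# Integer arrays simply cannot hold a missing value:
# np.array([1, np.nan], dtype=np.int64) raises ValueError.

# Datetimes need a different sentinel and a different check:
d = np.array(["2024-01-01", "NaT"], dtype="datetime64[D]")
print(np.isnat(d))   # [False  True]

# And np.isnan rejects datetime arrays with a TypeError,
# so generic "is this missing?" code needs dtype dispatch.
```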