People who really care about performance call matrix multiplication libraries (often written in assembly) from every language. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.
> It also means that adding performance to an existing Python program requires dropping into a different language
As stated above, this applies to all languages. BLAS routines used for serious numerical work are hand-vectorized assembly, fine-tuned for each processor architecture and written by a few hyper-experts who do nothing else.
Nobody who needs performant matrix multiplication from C thinks "hey, let me just write three nested loops".
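For reference, the textbook nested-loop version looks roughly like the sketch below. It is written in Python for brevity, but the loop structure is the same in C. The name `naive_matmul` and the list-of-rows representation are my own choices for illustration, not anything from a real library.

```python
def naive_matmul(a, b):
    """Textbook O(n*m*p) product of two matrices stored as lists of rows.

    Hypothetical helper for illustration only: no blocking, no
    vectorization, no cache awareness.
    """
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            for k in range(m):  # shared inner dimension
                c[i][j] += a[i][k] * b[k][j]
    return c
```

This is the thing nobody ships when speed matters. From Python you would normally write `a @ b` on NumPy arrays instead, which for dense floating-point matrices hands the work to whatever BLAS NumPy was built against; from C you would call that BLAS directly.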