The current workarounds to make this happen in Python are quite ugly imho. E.g. PyTorch spawns multiple Python processes and then pushes data between them through shared memory, which incurs considerable overhead. TensorFlow, on the other hand, requires you to stick to its tensor DSL so that everything can run inside its graph engine. If native concurrency were a thing, data loading would be much more straightforward to implement, without such hacks.
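To make the overhead concrete, here's a minimal stdlib sketch of the worker-process pattern (not PyTorch's actual implementation): each worker "loads" samples and ships them back over an IPC queue, so every batch gets pickled and copied across the process boundary.

```python
# Illustrative stdlib version of the multi-process loader pattern:
# workers produce samples and push them through a queue, which means
# each sample is serialized and copied back to the parent process.
import multiprocessing as mp

def worker(indices, out_queue):
    for i in indices:
        sample = [i] * 4           # stand-in for "load and decode sample i"
        out_queue.put(sample)      # pickled + copied to the parent
    out_queue.put(None)            # sentinel: this worker is done

def load_all(num_workers=2, num_samples=8):
    queue = mp.Queue()
    chunks = [range(w, num_samples, num_workers) for w in range(num_workers)]
    procs = [mp.Process(target=worker, args=(c, queue)) for c in chunks]
    for p in procs:
        p.start()
    samples, done = [], 0
    while done < num_workers:
        item = queue.get()
        if item is None:
            done += 1
        else:
            samples.append(item)
    for p in procs:
        p.join()
    return samples
```

With native threads, the workers could just append to a shared list; here every sample pays the pickle-and-copy tax.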
A few places where Python's weak parallelism bites:

1. Loading data
2. Running algorithms that benefit from shared memory
3. Serving the model (if it's not being output to some portable format)
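Point 2 is where native threads would help most; today you can reach for `multiprocessing.shared_memory` (Python 3.8+), which avoids the copying but pushes all the bookkeeping onto you. A minimal sketch, using a hypothetical "double every byte" task:

```python
# Sharing a buffer between processes without pickling or copying,
# via the stdlib shared_memory module (Python 3.8+).
from multiprocessing import Process, shared_memory

def double_in_place(shm_name, n):
    # Attach to the same memory block by name; no data is copied.
    shm = shared_memory.SharedMemory(name=shm_name)
    for i in range(n):
        shm.buf[i] = shm.buf[i] * 2
    shm.close()

def run():
    shm = shared_memory.SharedMemory(create=True, size=8)
    for i in range(8):
        shm.buf[i] = i
    p = Process(target=double_in_place, args=(shm.name, 8))
    p.start()
    p.join()
    result = list(shm.buf[:8])
    shm.close()
    shm.unlink()   # manual lifetime management: forget this and you leak
    return result
```

Note the manual `close()`/`unlink()` dance — the kind of ceremony that native shared-memory concurrency would make unnecessary.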
There are also general benefits to using one language across a project. Because Python is weak in these areas, we end up using multiple languages.
If I really had the use case and needed threads, I'd much rather use C++ bindings in a Python package than rebuild the whole thing. I guess it depends on the scale we're talking about.
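That route works because CPython extensions can release the GIL while running native code, so plain Python threads get real parallelism for that portion. The stdlib's `zlib` (a C extension that releases the GIL during compression) can stand in for a custom C++ binding in a sketch:

```python
# Threads + a GIL-releasing C extension: the compression calls run
# concurrently because zlib drops the GIL inside its native code.
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_chunks(chunks):
    # pool.map dispatches each chunk to a thread; the heavy work
    # overlaps across cores despite the GIL.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(zlib.compress, chunks))

chunks = [bytes([i]) * 1_000_000 for i in range(4)]
compressed = compress_chunks(chunks)
```

A custom pybind11/C++ binding would follow the same shape: release the GIL around the hot loop, keep the Python API thin.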
Go and Elixir provide some parallelism, but the primary focus of both languages is concurrency.