The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.
It's entirely valid to trade performance for a more straightforward design or less development time, and just throw hardware at the problem as needed... companies do it all the time.
Completely agree that almost none of the SoTA GitHub repos are really ready for production, and making this stuff work can be pretty hard.
Getting this done on C++ and moving up to the next level of performance is the focus of my next article :)
Too bad such great ecosystems evolved around a language that can't fully utilize the amazing hardware we have today.
Do you have any experience with that?
All the deep learning libraries are Python wrappers around C/C++ (which then call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).
This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.
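As a hedged sketch of the "reduce copies and transfers" advice (my own illustration, not from the article): in PyTorch you can pin host memory and issue asynchronous host-to-device copies, then keep the follow-up math on the GPU instead of round-tripping back to the CPU.

```python
import torch

# Sketch only: cut CPU<->GPU transfer cost by pinning host memory and
# overlapping the copy with GPU work. Shapes are arbitrary examples.
pin = torch.cuda.is_available()                   # pinning requires CUDA
frames = torch.rand(4, 3, 224, 224, pin_memory=pin)

if torch.cuda.is_available():
    # non_blocking=True lets the H2D copy overlap with other GPU work
    gpu_frames = frames.to("cuda", non_blocking=True)
    normalized = (gpu_frames / 255.0 - 0.5) * 2.0  # stays on the GPU
else:
    normalized = (frames / 255.0 - 0.5) * 2.0      # CPU fallback
```

The point is that each avoided copy or device hop is pure savings; profiling usually shows more of them than you expect.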
To me, seeing the GIL held for 40% of time and significant time spent waiting on GIL by other threads was a fairly strong indicator. Keen to hear your thoughts/experience on it.
I know a number of Python frameworks (e.g. Detectron) that are fast.
I'd like to see the evidence that the performance bottleneck is python, esp. when asynchronous dispatch exists.
At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (with GIL and all), while when you run it in TorchScript, the intermediates are only C++ PyTorch tensors, all without the GIL. We have a small comment about this in our PyTorch book, in the section on what improvements to expect from the PyTorch JIT, and it seems rather relevant in practice.
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
When you have multithreaded setups, this typically is more significant than the Python overhead itself (which comes in at 10% for the PyTorch C++ extension LLTM example, but would be less for convnets).
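A minimal sketch of that TorchScript route (`TinyNet` is a made-up toy module): once scripted, the forward pass's intermediate tensors live in C++, and the scripted model can be saved and loaded from C++ without any Python.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def forward(self, x):
        h = torch.relu(x)       # eager mode: h is a Python-level tensor
        return h * 2.0 + 1.0    # TorchScript: intermediates stay in C++

scripted = torch.jit.script(TinyNet())   # compile to TorchScript
out = scripted(torch.ones(2, 3))
# scripted.save("tinynet.pt")  # loadable from C++ via torch::jit::load
```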
Long story short: that's good, you've used a neural net to avoid needing a human or an animal as a pose estimation datum. But how do you correlate that with the rest of the sensor suite?
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
Alternatively, there are some quite fast OSS libraries for object detection. Nvidia's retinanet will export to a TensorRT engine which can be used with DeepStream.
But Paul's situation is multithreaded, and his analysis has numbers that seem to indicate that something is up with the GIL. We know this is a limitation in multithreaded PyTorch, since any Tensor creation at the Python level needs the GIL, and these models typically create quite a few of them.
It's always easier to know what the performance impact of something is when you have an experiment that removes it. Maybe using the JIT or moving things to C++ gives us that; I look forward to seeing a sequel.
The advantage of involving something like TensorRT or TVM is that they apply holistic optimizations - they may eliminate writing to memory and reading back (which would not show up as an underutilized GPU, but can be a big win; see e.g. the LSTM speedup with the PyTorch JIT fuser). The current disadvantage of TVM is that it's a bit of an all-or-nothing affair, so you can't give it a JITed model and say "optimize the bits that you can do well". TensorRT with TRTorch is a bit ahead there.
Of course, PyTorch itself is getting better too, with the new profiling executor and new fusers for the PyTorch JIT, so we might hope that you can have good perf for more workloads with just PyTorch.
Seems like Xavier NX is more realistic for my needs right now personally though. Of course it's much more expensive etc.
This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything just on the CPU, greatly reducing cost & deployment complexity. Or, at the very least, improve SLAs and cut the GPU (or CPU) cost 10x.
I've seen this over and over again. Glad someone's documenting this publicly :-) If any of you readers have more questions about this, I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
Are there some CNN libraries that have way less overhead for small batch sizes? TensorFlow (GPU accelerated) seems to go down from 10000 fps on large batches to 200 fps on single frames for a small CNN.
1. How many times is the data being copied, or moved between devices?
2. Are you recomputing data from previous frames that you could just be saving? For example, some tracking algorithms apply the same CNN tower to the last 3-5 images, and you could just save the results from the last frame instead of recomputing. (Of course, you also want to follow hint #1 and keep these results on the GPU).
3. Change the algorithm or network you're using.
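Hint #2 can be sketched roughly like this (a hypothetical cache around a backbone CNN, names invented for illustration): keep the features from recent frames so a sliding window of frames is computed only once each.

```python
import torch

# Sketch: cache per-frame CNN features so overlapping windows of frames
# are not recomputed. `backbone` is any callable producing features;
# cached tensors stay wherever the backbone put them (ideally the GPU).
feature_cache = {}  # frame_index -> feature tensor

def features_for(frame_index, frame, backbone, window=4):
    if frame_index not in feature_cache:
        with torch.no_grad():
            feature_cache[frame_index] = backbone(frame)
        # evict entries older than the tracker's window
        for old in [k for k in feature_cache if k < frame_index - window]:
            del feature_cache[old]
    return feature_cache[frame_index]
```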
Really you should read the original article carefully. The article is showing you the steps for profiling what part of the runtime is slow. Typically, once you profile a little you'll be surprised to find that time is being wasted somewhere unexpected.
Everything lostdog says. I've had experience speeding up tracking immensely using the same big hammer I talk about in the article - moving the larger parts of tracking compute to GPU.
Also, in a tracking pipeline you'll generally have the big compute on pixels done up front. Object detection and ReID take the bulk of the compute and can be easily batched and run in parallel. The results (metadata) can then be fed into a more serial process (but still doing the N<->N ReID comparisons on GPU).
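The shape of that pipeline might look like the sketch below - `detector` and `associate` are hypothetical stand-ins for the heavy batched stage and the light serial tracking stage, not any particular library's API.

```python
import torch

def process_chunk(frames, detector, associate, tracks):
    batch = torch.stack(frames)          # [N, C, H, W]: one batched pass
    with torch.no_grad():
        detections = detector(batch)     # big compute on pixels, up front
    for per_frame in detections:         # metadata only: cheap, serial
        tracks = associate(tracks, per_frame)
    return tracks
```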
What about using pytorch multiprocessing[1]?
[1] https://pytorch.org/docs/stable/notes/multiprocessing.html
Multiprocessing could be a pain if you need to pass frames of a single video stream. Traditionally you'd need to pickle/unpickle them to pass them between processes.
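One mitigation (a sketch, with the queue/worker wiring omitted): `torch.multiprocessing` can move a frame tensor's storage into shared memory once, so worker processes see the same pixels instead of pickling every frame.

```python
import torch
import torch.multiprocessing as mp  # drop-in for stdlib multiprocessing;
                                    # use it when spawning the workers

frame = torch.zeros(3, 240, 320)    # an example frame buffer
frame.share_memory_()               # storage now lives in shared memory
shared = frame.is_shared()          # safe to hand to mp.Process workers
```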
The ability for Julia to compile directly to PTX assembly[3][4] means that you can even write the GPU kernels in Julia and eliminate the C/C++ CUDA code. Unfortunately, there is still a lot of work to be done to make it as reliably fast and easy as TensorFlow/PyTorch so I don't think it is usable for production yet.
I hope it will be production ready soon, but it will likely take some time to highly tune the compute stacks. They are already working on AMD GPU support with AMDGPU.jl[5], and given that the latest NVIDIA GPU release has (IMHO purposefully) decreased capability for scientific compute applications (onboard RAM, power), I would love to be able to develop on my AMD GPU workstation and easily deploy on whatever infrastructure, in the same language.
I do have some gripes with Julia but the biggest of them are mostly cosmetic.
[1]: https://fluxml.ai/
[2]: https://github.com/pjreddie/darknet
[3]: https://developer.nvidia.com/blog/gpu-computing-julia-progra...
https://github.com/streamlit/demo-self-driving
It uses Streamlit.