The article does skip the most important step for getting great inference speeds: Drop Python and move fully into C++.
It's entirely valid to trade performance for a more straightforward design or less development time, and just throw hardware at the problem as needed... companies do it all the time.
Completely agree that almost none of the SoTA GitHub repos are really ready for production, and making this stuff work can be pretty hard.
Getting this done on C++ and moving up to the next level of performance is the focus of my next article :)
Too bad such great ecosystems evolved around a language that can't fully utilize the amazing hardware we have today.
Do you have any experience with that?
All the deep learning libraries are Python wrappers around C/C++ (which then call into CUDA). If you call the C++ layers directly, you have control over the memory operations applied to your data. The biggest wins come from reducing the number of copies, reducing the number of transfers between CPU and GPU memory, and speeding up operations by moving them from the CPU to the GPU (or vice versa).
This is basically what the article does, but if you want to squeeze out all the performance, the Python layer is still an abstraction that gets in the way of directly choosing what happens to the memory.
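As a hedged sketch of the "reduce copies and transfers" advice (my own illustration, not from the article): in PyTorch you can pin host memory and issue asynchronous host-to-device copies, then keep the follow-up math on the GPU instead of round-tripping back to the CPU.

```python
import torch

# Sketch only: cut CPU<->GPU transfer cost by pinning host memory and
# overlapping the copy with GPU work. Shapes are arbitrary examples.
pin = torch.cuda.is_available()                   # pinning requires CUDA
frames = torch.rand(4, 3, 224, 224, pin_memory=pin)

if torch.cuda.is_available():
    # non_blocking=True lets the H2D copy overlap with other GPU work
    gpu_frames = frames.to("cuda", non_blocking=True)
    normalized = (gpu_frames / 255.0 - 0.5) * 2.0  # stays on the GPU
else:
    normalized = (frames / 255.0 - 0.5) * 2.0      # CPU fallback
```

The point is that each avoided copy or device hop is pure savings; profiling usually shows more of them than you expect.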
To me, seeing the GIL held for 40% of time and significant time spent waiting on GIL by other threads was a fairly strong indicator. Keen to hear your thoughts/experience on it.
I know a number of Python frameworks (e.g. Detectron) that are fast.
I'd like to see the evidence that the performance bottleneck is python, esp. when asynchronous dispatch exists.
At least for the PyTorch bits of it, using the PyTorch JIT works well. When you run PyTorch code through Python, the intermediate results are created as Python objects (with GIL and all), while when you run it in TorchScript, the intermediates are only C++ PyTorch tensors, all without the GIL. We have a small comment about this in our PyTorch book, in the section on what improvements to expect from the PyTorch JIT, and it seems rather relevant in practice.
PyTorch will happily let you export your model, even with Python code in it, and run it in C++ :)
When you have multithreaded setups, this typically is more significant than the Python overhead itself (which comes in at 10% for the PyTorch C++ extension LLTM example, but would be less for convnets).
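A minimal sketch of that TorchScript route (`TinyNet` is a made-up toy module): once scripted, the forward pass's intermediate tensors live in C++, and the scripted model can be saved and loaded from C++ without any Python.

```python
import torch
from torch import nn

class TinyNet(nn.Module):
    def forward(self, x):
        h = torch.relu(x)       # eager mode: h is a Python-level tensor
        return h * 2.0 + 1.0    # TorchScript: intermediates stay in C++

scripted = torch.jit.script(TinyNet())   # compile to TorchScript
out = scripted(torch.ones(2, 3))
# scripted.save("tinynet.pt")  # loadable from C++ via torch::jit::load
```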
Long story short: that's good, you've used a neural net to avoid needing a human or an animal as a pose estimation datum. But how do you correlate that with the rest of the sensor suite?
I would love an alternative that is reasonably simple to implement. I dislike having to handle raw bits.
Alternatively, there are some quite fast OSS libraries for object detection. Nvidia's retinanet will export to a TensorRT engine which can be used with DeepStream.
But Paul's situation is multithreaded, and his analysis has numbers that seem to indicate that something is up with the GIL. We know this is a limitation in multithreaded PyTorch, since any Tensor creation at the Python level needs the GIL, and these models typically create quite a few of them.
It's always easier to know what the performance impact of something is when you have an experiment that removes it. Maybe using the JIT or moving things to C++ gives us that; I look forward to seeing a sequel.
The advantage of involving something like TensorRT or TVM is that they apply holistic optimizations - they may eliminate writing to memory and reading back (which would not show up as an underutilized GPU, but can be a big win; see e.g. the LSTM speedup with the PyTorch JIT fuser). The current disadvantage of TVM is that it's a bit of an all-or-nothing affair, so you can't give it a JITed model and say "optimize the bits that you can do well". TensorRT with TRTorch is a bit ahead there.
Of course, PyTorch itself is getting better too, with the new profiling executor and new fusers for the PyTorch JIT, so we might hope that you can have good perf for more workloads with just PyTorch.
Seems like Xavier NX is more realistic for my needs right now personally though. Of course it's much more expensive etc.
This is a fascinating space, and there are tons of speed-up opportunities. Depending on the type of workload you're running, you might be able to ditch the GPU entirely and run everything just on the CPU, greatly reducing cost & deployment complexity. Or, at the very least, improve SLAs and cut the GPU (or CPU) cost 10x.
I've seen this over and over again. Glad someone's documenting this publicly :-) If any of you readers have more questions about this, I'm happy to discuss in the comments here. Or you can reach out to me at victor at onspecta dot com.
Are there some CNN libraries that have way less overhead for small batch sizes? TensorFlow (GPU accelerated) seems to go down from 10000 fps on large batches to 200 fps on single frames for a small CNN.
1. How many times is the data being copied, or moved between devices?
2. Are you recomputing data from previous frames that you could just be saving? For example, some tracking algorithms apply the same CNN tower to the last 3-5 images, and you could just save the results from the last frame instead of recomputing. (Of course, you also want to follow hint #1 and keep these results on the GPU).
3. Change the algorithm or network you're using.
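Hint #2 can be sketched roughly like this (a hypothetical cache around a backbone CNN, names invented for illustration): keep the features from recent frames so a sliding window of frames is computed only once each.

```python
import torch

# Sketch: cache per-frame CNN features so overlapping windows of frames
# are not recomputed. `backbone` is any callable producing features;
# cached tensors stay wherever the backbone put them (ideally the GPU).
feature_cache = {}  # frame_index -> feature tensor

def features_for(frame_index, frame, backbone, window=4):
    if frame_index not in feature_cache:
        with torch.no_grad():
            feature_cache[frame_index] = backbone(frame)
        # evict entries older than the tracker's window
        for old in [k for k in feature_cache if k < frame_index - window]:
            del feature_cache[old]
    return feature_cache[frame_index]
```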
Really you should read the original article carefully. The article is showing you the steps for profiling what part of the runtime is slow. Typically, once you profile a little you'll be surprised to find that time is being wasted somewhere unexpected.
Everything lostdog says. I've had experience speeding up tracking immensely using the same big hammer I talk about in the article - moving the larger parts of tracking compute to GPU.
Also, in a tracking pipeline you'll generally have the big compute on pixels done up front. Object detection and ReID take the bulk of the compute and can be easily batched and run in parallel. The results (metadata) can then be fed into a more serial process (but still doing the N<->N ReID comparisons on GPU).
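The shape of that pipeline might look like the sketch below - `detector` and `associate` are hypothetical stand-ins for the heavy batched stage and the light serial tracking stage, not any particular library's API.

```python
import torch

def process_chunk(frames, detector, associate, tracks):
    batch = torch.stack(frames)          # [N, C, H, W]: one batched pass
    with torch.no_grad():
        detections = detector(batch)     # big compute on pixels, up front
    for per_frame in detections:         # metadata only: cheap, serial
        tracks = associate(tracks, per_frame)
    return tracks
```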
What about using pytorch multiprocessing[1]?
[1] https://pytorch.org/docs/stable/notes/multiprocessing.html
Multiprocessing could be a pain if you need to pass frames of a single video stream. Traditionally you'd need to pickle/unpickle them to pass them between processes.
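One mitigation (a sketch, with the queue/worker wiring omitted): `torch.multiprocessing` can move a frame tensor's storage into shared memory once, so worker processes see the same pixels instead of pickling every frame.

```python
import torch
import torch.multiprocessing as mp  # drop-in for stdlib multiprocessing;
                                    # use it when spawning the workers

frame = torch.zeros(3, 240, 320)    # an example frame buffer
frame.share_memory_()               # storage now lives in shared memory
shared = frame.is_shared()          # safe to hand to mp.Process workers
```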
The ability for Julia to compile directly to PTX assembly[3][4] means that you can even write the GPU kernels in Julia and eliminate the C/C++ CUDA code. Unfortunately, there is still a lot of work to be done to make it as reliably fast and easy as TensorFlow/PyTorch so I don't think it is usable for production yet.
I hope it will be production ready soon, but it will likely take some time to highly tune the compute stacks. They are already working on AMD GPU support with AMDGPU.jl[5], and given that the latest NVIDIA GPU release has (IMHO purposefully) decreased capability for scientific compute applications (onboard RAM, power), I would love to be able to develop on my AMD GPU workstation and easily deploy on whatever infrastructure, in the same language.
I do have some gripes with Julia but the biggest of them are mostly cosmetic.
[1]: https://fluxml.ai/
[2]: https://github.com/pjreddie/darknet
[3]: https://developer.nvidia.com/blog/gpu-computing-julia-progra...
https://github.com/streamlit/demo-self-driving
It uses Streamlit.