There's also a blog post that has more detail: https://eng.uber.com/introducing-neuropod/
Super excited to open-source it!
We actually do use TensorRT with several of our models, but our approach is generally to do all TRT related processing before the Neuropod export step. For example, we might do something like
TF model -> TF-TRT optimization -> Neuropod export
or PyTorch model
-> (convert subset of model to a TRT engine)
-> PyTorch model + custom op to run TRT engine
-> TorchScript model + custom op to run TRT engine
-> Neuropod export
Since Neuropod wraps the underlying model (including custom ops), this approach works well for us.
The blog post I linked above goes into more detail, but here's a relevant quote about usage within Uber:
> Neuropod has been instrumental in quickly deploying new models at Uber since its internal release in early 2019. Over the last year, we have deployed hundreds of Neuropod models across Uber ATG, Uber AI, and the core Uber business. These include models for demand forecasting, estimated time of arrival (ETA) prediction for rides, menu transcription for Uber Eats, and object detection models for self-driving vehicles.
- I no longer work at $company, and their stuff sucks
- ergo, they fired me, or I left on bad terms
- I clearly didn't get on well with my coworkers, as I'm happy to shit on their work from across the pond
- ergo, I have some deep attitude problem I'm likely to bring to my next placement
- Neuropod is an abstraction layer, so it can do useful things on top of just running models locally. For example, we can transparently proxy model execution to remote machines. This can be super useful for running large-scale jobs with compute-intensive models. Including GPUs in all our cluster machines doesn't make sense from a resource-efficiency perspective. If we instead proxy model execution to a smaller cluster of GPU-enabled servers, we get higher GPU utilization while using fewer GPUs. The "Model serving" section of the blog post ([1]) goes into more detail on this. We can also do interesting things with model isolation (see the "Out-of-process execution" section of the post).
- ONNX converts models while Neuropod wraps them. We use TensorFlow, TorchScript, etc. under the hood to run a model. This is important because we have several models that use custom ops, TensorRT, etc. We can use the same custom ops that we use at training time during inference. One of the goals of Neuropod is to make experimentation, deployment, and iteration easier so not having to do additional "conversion" work is useful.
- When we started building Neuropod, ONNX could only do trace-based conversions of PyTorch models. We've generally had lots of trouble with correctness of trace-based conversions for non-trivial models (even with TorchScript). Removing intermediate conversion steps (and their corresponding verification steps) can save a lot of time and make the experimentation process more efficient.
- Being able to define a "problem" interface was important to us (e.g. "this is the interface of a model that does 2d object detection"). This lets us have multiple implementations that we can easily swap out because we concretely defined an interface. This capability is useful for comparing models across frameworks without doing a lot of work. The blog post ([1]) talks about this in more detail.
The blog post ([1]) goes into a lot more detail about our motivations and use cases so it's worth a read.
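To make the "problem interface" idea concrete, here's a rough sketch in plain Python. To be clear, this is not Neuropod's API: `TensorSpec`, `Problem`, `check`, and both toy detectors are hypothetical names made up for illustration, and tensors are modeled as rectangular nested lists to keep it dependency-free. The point is only that once the inputs and outputs of "2D object detection" are pinned down, any implementation that satisfies the spec can be swapped in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical spec: ints are fixed sizes, strings are symbolic dims, None is any size."""
    name: str
    shape: Tuple[object, ...]


def shape_of(value) -> Tuple[int, ...]:
    """Shape of a nested list, assuming it is rectangular: [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def check(specs: List[TensorSpec], tensors: Dict[str, list], symbols: Dict[str, int]) -> None:
    """Raise ValueError if `tensors` does not satisfy `specs`."""
    for spec in specs:
        if spec.name not in tensors:
            raise ValueError(f"missing tensor: {spec.name}")
        actual = shape_of(tensors[spec.name])
        if len(actual) != len(spec.shape):
            raise ValueError(f"{spec.name}: expected rank {len(spec.shape)}, got {len(actual)}")
        for dim, expected in zip(actual, spec.shape):
            if isinstance(expected, str):
                # A symbolic dim must resolve to the same size everywhere it appears.
                if symbols.setdefault(expected, dim) != dim:
                    raise ValueError(f"{spec.name}: inconsistent size for '{expected}'")
            elif expected is not None and expected != dim:
                raise ValueError(f"{spec.name}: expected dim {expected}, got {dim}")


@dataclass(frozen=True)
class Problem:
    """A problem definition that any number of model implementations can target."""
    inputs: List[TensorSpec]
    outputs: List[TensorSpec]

    def run(self, model: Callable[[Dict[str, list]], Dict[str, list]], inputs: Dict[str, list]):
        symbols: Dict[str, int] = {}  # shared so symbolic dims agree across inputs and outputs
        check(self.inputs, inputs, symbols)
        outputs = model(inputs)
        check(self.outputs, outputs, symbols)
        return outputs


# "This is the interface of a model that does 2D object detection."
DETECTION_2D = Problem(
    inputs=[TensorSpec("image", ("height", "width", 3))],
    outputs=[
        TensorSpec("boxes", ("num_detections", 4)),
        TensorSpec("scores", ("num_detections",)),
    ],
)


def whole_image_detector(inputs):
    """Toy implementation A: one box covering the whole image."""
    image = inputs["image"]
    h, w = len(image), len(image[0])
    return {"boxes": [[0, 0, w, h]], "scores": [1.0]}


def quadrant_detector(inputs):
    """Toy implementation B: two boxes covering opposite quadrants."""
    image = inputs["image"]
    h, w = len(image), len(image[0])
    return {
        "boxes": [[0, 0, w // 2, h // 2], [w // 2, h // 2, w, h]],
        "scores": [0.5, 0.5],
    }


if __name__ == "__main__":
    image = [[[0, 0, 0] for _ in range(4)] for _ in range(2)]  # height 2, width 4
    for model in (whole_image_detector, quadrant_detector):
        print(model.__name__, DETECTION_2D.run(model, {"image": image}))
```

Both detectors go through the same `DETECTION_2D.run` call, so the calling code doesn't care which one it gets. In the real system the two implementations would be models from different frameworks rather than toy functions.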
Thanks for this, btw!
Have you had a chance to try running your models on bare-metal devices such as an ARM Cortex-M4?
Is there a list of ops that are supported or, crucially, unsupported?