Do you have any experience with how single-frame processing compares between Python and C++? I see that batched processing in Python gives me a huge speed boost, which hints at a fixed per-call inefficiency somewhere, but I don't know whether it comes from Python, TensorFlow, or CUDA itself. (Or just bad resource management that requires re-initialization of some costly things between evaluations.)
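One way to narrow this down, as a rough sketch rather than a proper benchmark: time the same call at several batch sizes and fit the per-call time to `overhead + per_item * n`. A large `overhead` compared to `per_item` would point at a fixed cost per evaluation rather than at the computation itself. The names `time_per_item`, `fit_overhead`, `infer` and `make_batch` are made up for illustration, and the stand-in model in the demo is not TensorFlow.

```python
import time


def time_per_item(infer, make_batch, batch_sizes, repeats=20, warmup=3):
    """Return {batch_size: mean seconds per item} for the callable `infer`.

    `infer` and `make_batch` are hypothetical placeholders: `infer` runs one
    evaluation on a batch, `make_batch(n)` builds a batch of n items.
    """
    results = {}
    for n in batch_sizes:
        batch = make_batch(n)
        # Warm-up runs so one-time setup costs are not counted.
        for _ in range(warmup):
            infer(batch)
        start = time.perf_counter()
        for _ in range(repeats):
            infer(batch)
        elapsed = time.perf_counter() - start
        results[n] = elapsed / (repeats * n)
    return results


def fit_overhead(per_item_seconds):
    """Least-squares fit of time_per_call = overhead + per_item * n.

    Takes the dict produced by time_per_item and returns (overhead, per_item).
    """
    sizes = sorted(per_item_seconds)
    if len(sizes) < 2:
        raise ValueError("need at least two batch sizes to fit a line")
    # Convert seconds per item back into seconds per call.
    call_times = [per_item_seconds[n] * n for n in sizes]
    mean_n = sum(sizes) / len(sizes)
    mean_t = sum(call_times) / len(call_times)
    var_n = sum((n - mean_n) ** 2 for n in sizes)
    cov = sum((n - mean_n) * (t - mean_t) for n, t in zip(sizes, call_times))
    per_item = cov / var_n
    overhead = mean_t - per_item * mean_n
    return overhead, per_item


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with the real model call.
    def fake_infer(batch):
        return [x * x for x in batch]

    timings = time_per_item(fake_infer, lambda n: list(range(n)), [1, 4, 16, 64])
    overhead, per_item = fit_overhead(timings)
    print(f"fixed overhead per call: {overhead:.3e} s")
    print(f"cost per item:           {per_item:.3e} s")
```

With a GPU backend, `infer` should only return once the result is actually available on the host (for example after converting the output to a NumPy array), otherwise asynchronous execution can make single calls look cheaper than they are.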