Python bindings are in there :) This needs a tensor library that is callable from Python and works on GPUs. One direction we are heading in is PyTorch via ATen / Torch tensors; we already use the C++ parts of ATen. Of course, any other CUDA tensor library that offers minimal alloc/copy/synchronize would work too. Send a PR? ;)
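
To make "minimal alloc/copy/synchronize" a bit more concrete, here is a rough sketch of the surface such a library might need to expose. Everything in it is an assumption for illustration: `TensorBackend`, `HostBackend` and `roundtrip` are hypothetical names, not part of this project, ATen or PyTorch, and the host backend is a CPU-only stand-in so the snippet runs without a GPU.

```python
from typing import Protocol


class TensorBackend(Protocol):
    """Hypothetical minimal surface a GPU tensor library would need to expose."""

    def alloc(self, nbytes: int) -> object: ...
    def copy(self, dst: object, src: object) -> None: ...
    def synchronize(self) -> None: ...


class HostBackend:
    """CPU-only stand-in backed by bytearray, only here to exercise the interface."""

    def alloc(self, nbytes: int) -> bytearray:
        # bytearray(n) returns a zero-initialized buffer of n bytes.
        return bytearray(nbytes)

    def copy(self, dst: bytearray, src: bytes) -> None:
        if len(dst) != len(src):
            raise ValueError("size mismatch between source and destination")
        dst[:] = src

    def synchronize(self) -> None:
        # Nothing is asynchronous on the host, so there is nothing to wait for.
        # A real CUDA backend would block here until pending work finishes.
        pass


def roundtrip(backend: TensorBackend, payload: bytes) -> bytes:
    """Allocate a buffer, copy the payload in, synchronize, and read it back."""
    buf = backend.alloc(len(payload))
    backend.copy(buf, payload)
    backend.synchronize()
    return bytes(buf)
```

A backend built on a real CUDA tensor library would implement the same three methods on top of its own allocator, copy routine and stream/device synchronization.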