Yes, it uses ONNX Runtime / MLAS (its native BLAS library) under the hood. And yes, there is copy overhead, but you can eliminate it within a single function/graph by compiling everything down to one ONNX model. The end result is within ~15% of the run time of PyTorch with MKL when training a reasonably sized MLP. ORT also supports CUDA and a number of other "execution providers".