In any case, yeah, you could just not download the model to the device at all, but then you have to deal with the other angle - making sure the endpoint isn't abused.
Maybe a hybrid approach would work - run just part of the model (the first few layers?) in the cloud, and then continue the inference on the device? I'm not familiar with exactly how AI models are structured and executed, but I suspect that withholding even a small portion of the model would make whatever does get downloaded unusable on its own in practice.
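
To make the idea concrete, here's a minimal sketch of what split ("hybrid") inference could look like with PyTorch. The toy 5-layer model and the particular split point are made up purely for illustration, not taken from any real deployment:

```python
import torch
import torch.nn as nn

# A toy model; in practice this would be a large pretrained network.
full_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Keep the first layers "in the cloud" and ship only the rest to the device.
cloud_part = full_model[:2]    # stays on the server, never distributed
device_part = full_model[2:]   # downloaded to the client

x = torch.randn(1, 16)                 # raw input, originating on the device
with torch.no_grad():
    activations = cloud_part(x)        # server computes intermediate activations
    # ...activations would be sent over the network to the device here...
    output = device_part(activations)  # device finishes the forward pass

print(output.shape)  # torch.Size([1, 10])
```

The catch is exactly the point above: without the server-side layers, the downloaded half only accepts intermediate activations it can't produce itself, so it's useless standalone - but every inference still requires a round trip to the cloud, so the endpoint-abuse problem doesn't fully go away either.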