This seems extremely inefficient if the model is distributed, given the data transfer between model layers. I found a project called Petals that claims up to 4 tok/s for a 180B model, although its repository hasn't been updated in two years.
https://petals.dev/
It would work for prompt processing, though, since many tokens pass through each network hop at once, and it could work for diffusion LLMs as well.
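
Here is a rough back-of-the-envelope sketch of why I think that. It assumes a simple pipeline where each server holds a slice of the layers and every pass pays one network latency per hop. The function name and all the numbers (5 hops, 50 ms latency, 5 ms compute per token per stage) are made up for illustration, not measurements from Petals, and the model ignores bandwidth and any batching speedup on the GPU.

```python
def tokens_per_second(n_tokens, hops, latency_s, compute_s_per_token):
    """Tokens handled per second for ONE pass through a pipeline of `hops` servers.

    Hypothetical toy model: each hop costs a fixed network latency plus
    compute that grows linearly with the number of tokens in the pass.
    """
    pass_time = hops * (latency_s + n_tokens * compute_s_per_token)
    return n_tokens / pass_time


# Illustrative values only (assumptions, not benchmarks).
HOPS = 5
LATENCY_S = 0.05             # per-hop network latency
COMPUTE_S_PER_TOKEN = 0.005  # per-stage compute for one token

# Autoregressive decoding: one token per pass, so latency is paid per token.
decode_tps = tokens_per_second(1, HOPS, LATENCY_S, COMPUTE_S_PER_TOKEN)

# Prompt processing: the whole prompt goes through in one pass,
# so the per-hop latency is shared by all of its tokens.
prefill_tps = tokens_per_second(512, HOPS, LATENCY_S, COMPUTE_S_PER_TOKEN)

print(f"decode:  {decode_tps:.1f} tok/s")
print(f"prefill: {prefill_tps:.1f} tok/s")
```

With these made-up numbers, decoding is dominated by latency while prompt processing is dominated by compute. A diffusion LLM that updates many tokens per step would sit closer to the prompt-processing case, under the same assumptions.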