Llama 3.2 3B and Qwen2.5 3B, quantized to 4-bit, run CPU inference quite fast. You can get a beefier VM and still save a ton of money. Whether this solution feels fast or slow depends on the context length per request. If you stay below 1024 tokens per request, you get a delay of around 10 sec; at around 128 tokens, I'd guess you would be somewhere around 1 sec for time to first token...
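To make those two numbers easier to reason about, here is a rough back-of-the-envelope sketch. It assumes time to first token scales linearly with prompt length, and it uses a prompt-processing rate of about 100 tokens/sec, which is only what the figures above imply (1024 tokens in ~10 sec, 128 tokens in ~1 sec), not a benchmark. The function name and the default rate are made up for illustration; measure on your own VM before trusting any of it.

```python
def estimate_ttft_seconds(prompt_tokens: int, prefill_tokens_per_sec: float = 100.0) -> float:
    """Rough time-to-first-token estimate for CPU inference.

    Assumption: the whole prompt must be processed before the first output
    token appears, and processing time grows linearly with prompt length.
    The default rate of 100 tokens/sec is a guess derived from the figures
    in the comment above, not a measured value.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prefill_tokens_per_sec <= 0:
        raise ValueError("prefill_tokens_per_sec must be positive")
    return prompt_tokens / prefill_tokens_per_sec


if __name__ == "__main__":
    # Compare a short prompt against longer ones.
    for n in (128, 512, 1024):
        print(f"{n:>5} prompt tokens -> ~{estimate_ttft_seconds(n):.1f} s to first token")
```

If your real prompts are long, the practical takeaway is the same as above: trimming the context is the cheapest way to cut the delay on CPU.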