1. It's undocumented. None of the regular rate limit responses are returned.
2. You're charged for the full generation length. So if the output takes 10 minutes to generate, that's what you'll pay for (despite only getting half back).
3. It defeats the point of the larger-context models. Why offer a 32K model if it fails after ~6K tokens?
4. The server response doesn't include any error code or message; it simply terminates unexpectedly. Hit any of the actual rate limits and you get told about it.
I'd expect to be able to generate output until the model reaches its context limit, a stop sequence is detected, or I hit an actual documented rate limit. We're paying for these requests in full. We should get the full response back!
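For anyone who wants to at least detect the silent termination described above, here is a minimal sketch. It assumes a streaming API where each chunk is a dict with a `text` piece and only the final chunk of a clean completion carries a `finish_reason`. That chunk shape and the names `collect_stream` and `is_truncated` are hypothetical, made up for illustration, so adapt them to whatever your client actually returns.

```python
from typing import Iterable, Optional, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Join streamed text and report the finish reason, if one arrived.

    Assumption: each chunk is a dict with an optional "text" piece, and
    only the last chunk of a clean completion sets "finish_reason".
    """
    parts = []
    finish_reason = None
    for chunk in chunks:
        parts.append(chunk.get("text", ""))
        if chunk.get("finish_reason") is not None:
            finish_reason = chunk["finish_reason"]
    return "".join(parts), finish_reason


def is_truncated(finish_reason: Optional[str]) -> bool:
    # No finish reason at all means the stream closed without the
    # server ever saying why, which is the failure described above.
    return finish_reason is None
```

This doesn't fix anything or get the missing output back. It only lets you flag a response that ended without a stop sequence, length limit, or error being reported, so you can log it or retry.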