There are two ways to call out to an externally hosted LLM via an HTTP API:
1. A blocking call, which can take 3-10 seconds.
2. A streaming call, which can also take 3-10 seconds, but where the content is streamed back to you as it is generated (and which you may be proxying through to your user).
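As a sketch of what option 2 looks like with asyncio, here is a streaming call modeled as an async generator. The `fake_llm_stream` function is a stand-in for a real streaming HTTP response (the names and chunk delays are invented for illustration):

```python
import asyncio


async def fake_llm_stream(prompt: str):
    # Stand-in for a streaming LLM API: chunks arrive as the model generates them
    for word in ["Hello", " from", " the", " model"]:
        await asyncio.sleep(0.05)  # simulated per-chunk network latency
        yield word


async def consume(prompt: str) -> str:
    chunks = []
    async for chunk in fake_llm_stream(prompt):
        # In a web app you would forward each chunk to the user here,
        # instead of (or as well as) accumulating it
        chunks.append(chunk)
    return "".join(chunks)


result = asyncio.run(consume("hi"))
# result == "Hello from the model"
```

The key point is the `async for`: between chunks the event loop is free to serve other requests, rather than a thread sitting blocked on a socket read.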
In both cases you risk blocking a thread or process for several seconds. That's the problem asyncio solves: it lets you have hundreds (or even thousands) of users all waiting on the response to that LLM call without needing hundreds or thousands of blocked threads or processes.
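A minimal sketch of that claim, using `asyncio.sleep` in place of a real HTTP call to an LLM API (the function names here are invented for illustration):

```python
import asyncio
import time


async def fake_llm_call(prompt: str) -> str:
    # Stand-in for a slow blocking LLM API call that takes about a second
    await asyncio.sleep(1.0)
    return f"response to {prompt!r}"


async def main() -> list[str]:
    # 100 concurrent "users" all awaiting an LLM response on a single thread
    tasks = [fake_llm_call(f"prompt {i}") for i in range(100)]
    return await asyncio.gather(*tasks)


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
# All 100 calls complete in roughly 1 second of wall-clock time,
# not 100 seconds, and without 100 blocked threads
```

While each coroutine is awaiting its (simulated) response, the event loop is free to service the other 99, which is why the total time is close to that of a single call.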