I was about to release an app based on the new Assistants API, but just a day before the release the response times increased to a flat 8s. With function calls, that meant up to a minute to get a response.
I had to dismantle everything built on the Assistants API and reimplement it with the Chat API. That turned out to be great, because the Assistants API's context management was very poor: after a few back-and-forth messages the cost ballooned to over 10K tokens per message.
When I looked closely at the Assistants API and the Chat API, I noticed that the Assistants API is just a wrapper over the Chat API: it acts as a web service that stores the previous messages (so the slow responses were probably due to the web server that keeps track of the context). So I went ahead and implemented my own assistant layer, which gives me more control. For example, I set a maximum token cost per message, and if the context balloons over that, I send the context to OpenAI and ask it to produce a summary of all the facts so far, add that summary as a system prompt, and the context gets compressed back into reasonable territory.
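Roughly, the compression step looks like this (the helper names and the token budget are illustrative, not from any library):

import tiktoken
from openai import OpenAI

client = OpenAI()
MAX_CONTEXT_TOKENS = 4000  # arbitrary per-request budget, tune to taste

def count_tokens(messages, model="gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    return sum(len(enc.encode(m["content"])) for m in messages)

def compress_context(messages):
    """Replace a ballooning history with a model-written summary."""
    if count_tokens(messages) <= MAX_CONTEXT_TOKENS:
        return messages
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages + [{
            "role": "user",
            "content": "Summarize all the facts established so far, concisely.",
        }],
    ).choices[0].message.content
    # The summary becomes the system prompt; keep only the latest user message.
    return [{"role": "system", "content": summary}, messages[-1]]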
The website could just as well buffer the incoming stream until the user clicks an area to request the next block of the response, once they have finished reading the initial sentences.
LLM streaming must be a cost-saving feature to prevent you from overloading the servers by asking too many questions within a short time frame. Annoying feature IMHO.
I am curious if the Assistants API lets you edit/remove/retry messages yet. I don't see anything implying this has changed. It's annoying that the Assistants API doesn't give you enough control to support basic things that the ChatGPT app does.
I get what you're asking for, though. It would be nice if this was easier. But that would require OpenAI changing their API model to one where conversation history is stored on their server. It would be more of a "ChatGPT conversation API" than just a GPT-4/3.5 API.
There is an API to modify messages, though I am not sure of its constraints.
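In the Python SDK it's something like this, as far as I can tell (it seems to only let you change metadata, not the message content itself; the IDs are placeholders):

from openai import OpenAI

client = OpenAI()

# Modify an existing message on a thread; apparently only the metadata
# is mutable, not the content.
client.beta.threads.messages.update(
    message_id="msg_abc123",
    thread_id="thread_abc123",
    metadata={"edited": "true"},
)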
2012 JavaScript called, it wants its callbacks wrapped in objects back. Why do we have a context manager named "stream" for which you call `.until_done()`? This could've been an iterator, or better, an asynchronous iterator, since this is streaming over the network. We could be destructuring instances of named tuples with pattern matching, or even just doing `"".join(delta.text for delta in prompt(...))`. But no, "subclass this instead," says the wrapper around a web API.
The `stream` context manager actually does expose an async iterator (in the async client), so you could instead do this for the simple case:
async with client.beta.threads.runs.create_and_stream(…) as stream:
    async for text in stream.text_deltas:
        print(text, end="", flush=True)
which I think is roughly what you want. Perhaps the docs should be updated to highlight this simple case earlier.
We are also considering expanding this design, and perhaps replacing the callbacks, like so:
async with client.beta.threads.runs.create_and_stream(…) as stream:
    async for event in stream.all_events:
        if event.type == 'text_delta':
            print(event.delta.value, end='')
        elif event.type == 'run_step_delta':
            event.snapshot.id
            event.delta.step_details...
which I think is also more in line with what you expect (you could also `match event: case TextDelta: …`). Note that the context manager is required because otherwise there's no way to tell if you `break` out of the loop (or otherwise stop listening to the stream), which means we can't close the request (and you both keep burning tokens and leak resources in your app).
And yet the AI is so good I put up with them every day.
If they ever grow into a proper product org they'll be unstoppable.
You can reply here or email me at atty@openai.com.
(Please don't hold back; we would love to hear the pain points so we can fix them.)
Use Claude in Safari and the browser completely locks up after a single response.
The tools are great because they don't invent their own DSL, they "just" use JSON schemas.
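For example, declaring a tool in the Chat Completions API is just a JSON Schema for the parameters (the function name and fields here are made up):

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # made-up example function
        "description": "Get the current weather for a city",
        "parameters": {  # plain JSON Schema, no custom DSL
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)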
Maybe they ought to contribute changes to OpenAPI to support streaming APIs better.
In contrast so many startups make their own annotation-driven DSLs for Python with their branding slapped over everything. It gives desperate-for-lock-in vibes. The last people OpenAI should be taking advice from for their API design is this forum.
What I'm arguing is precisely that the abstractions in the library (such as the `AssistantEventHandler` shown in the article) are ineffective at making things simpler. They force you to over-engineer solutions, distribute state unnecessarily, and learn that specific class interface, when it could've just been something you use in a `for x in y` loop, the way everyone would know to do without spending an afternoon reading docs and figuring out how the underlying implicit FSM works.
Horrendous in non-English languages though; the accents are extremely American.
I tried with Windows Subsystem for Android but the app refused to work.
- is it counted for a single user message or the sum of all previous messages?
- if there's a file, will it be counted every time a user interacts or only the first time?
- it's based on the sum: every new interaction resends the whole history (see the sketch after this list)
- yes, but you probably pay for the retrieved fragments, not the whole file
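A toy sketch of why resending the history gets expensive (the per-turn token count is made up):

TOKENS_PER_TURN = 500  # made-up average per message

def cumulative_prompt_tokens(turns: int) -> int:
    # Turn t resends everything from turns 1..t, so the total grows quadratically.
    return sum(TOKENS_PER_TURN * t for t in range(1, turns + 1))

print(cumulative_prompt_tokens(10))  # 27500 tokens billed over 10 turns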
I'd really like it if the streaming versions of their APIs could return a token usage count at the end.
The non-streaming APIs do this right now:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "A short fun fact about pigeons"
      }
    ]
  }'
Returns: {
  "id": "chatcmpl-92UiIWQaf442wq7Eyp7kF8ge0e3fE",
  "object": "chat.completion",
  "created": 1710381746,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Pigeons are one of the few bird species that can drink water by sucking it up through their beaks, rather than tilting their heads back to swallow."
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 33,
    "total_tokens": 47
  },
  "system_fingerprint": "fp_4f0b692a78"
}
Note the "usage" block there telling me how many tokens were used (which tells me how much this cost).But if I add "stream": true I get back an SSE stream that looks like this:
...
data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{"content":"."},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-92Uk81oNjrcUJQnPX8fSNqFINLfSI","object":"chat.completion.chunk","created":1710381860,"model":"gpt-3.5-turbo-0125","system_fingerprint":"fp_4f0b692a78","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}]}
data: [DONE]
There's no "usage" block, which means I have to try and account for the tokens myself. This is really inconvenient!I noticed the other day that the Claude streaming API returns a "usage" block with the last message. I'd love it if OpenAI's API did the same thing.
I need this right now because I'm starting to build features for end users of my own software, and I want to be able to give them X,000 tokens "free" before starting to charge them for extras. Counting those tokens myself (probably using tiktoken) is code I'd rather not have to write - especially since features like tools/functions or images make counting tokens a lot less obvious.
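For what it's worth, the counting code I'd rather not write looks roughly like this (the framing-overhead constants are approximations, and it ignores tools and images entirely):

import tiktoken

def count_prompt_tokens(messages, model="gpt-3.5-turbo-0125"):
    """Approximate prompt tokens for a chat request; ignores tools and images."""
    enc = tiktoken.encoding_for_model(model)
    tokens = 0
    for message in messages:
        tokens += 3  # rough per-message framing overhead
        for value in message.values():
            tokens += len(enc.encode(value))
    tokens += 3  # priming for the assistant's reply
    return tokens

print(count_prompt_tokens(
    [{"role": "user", "content": "A short fun fact about pigeons"}]
))  # roughly matches the prompt_tokens of 14 reported above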
and then on each run, you have the option to add more guidance to the run explicitly, without modifying the assistant instructions (system prompt)
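If I remember the parameter name right, that per-run guidance looks something like this in the Python SDK (IDs are placeholders):

from openai import OpenAI

client = OpenAI()

run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    # Extra guidance appended for this run only; the assistant's own
    # instructions (system prompt) stay untouched.
    additional_instructions="The user is on the annual plan; mention the discount.",
)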
It's a little bit different but kind of the same
Also, the system prompt in assistants doesn't consume tokens?
Am I just projecting? Relatable, in any case :)
I literally want to give them my money and can't. Every few weeks, for shirts and giggles, I send them an email saying, "any update on this?"
Oh well..
I really hope this will fade and focus will turn back to highlighting some broader, actual human ingenuity in IT, rather than a constant stream of "we used autocomplete for this new thing" or "we built this new API for this glorified autocomplete".
Boring.
Seriously though, it's not going away no matter how much anyone hates it. Emails and blogs will continue to be written with it, letters of recommendation will be/are written with it, presidential speeches will be written with it, academic articles will be/are written with it (almost all ML and CS research is), news is written with it... It's not going to stop, but it will _probably_/_very likely_ get better.
There is no tool, no human, and no method to determine whether text was generated by one of these models at a high F-score (at best high precision with low recall, in narrow domains, for silly examples).
We're stuck with it. Like the English teacher and their despised spell check.
My point is about the repetitiveness of LLM topics, not about the usefulness of LLMs themselves. And LLMs are glorified autocomplete. Their internals are maybe interesting, but that's often not what's being discussed here or even written about in the shared articles.
Just because that makes for a nice narrative in the copyright infringement argument, doesn't make it so.
We know next to nothing about how the human brain works.