I haven't seen a lot of documentation or examples covering this. In most LLM enabled apps I've used, if tokens are currently streaming and the page refreshes/changes, the stream gets interrupted.
One idea I had was writing the streamed tokens into some sort of queue or kafka topic, then connecting my UI to the queue and streaming tokens from there instead. But that seems like a lot of work.
How are most folks doing this?
What's the best way to maintain the SSE connection through a page refresh?
I haven't seen a lot of documentation or examples covering this. In most LLM enabled apps I've used, if tokens are currently streaming and the page refreshes/changes, the stream gets interrupted.
One idea I had was writing the streamed tokens into some sort of queue or kafka topic, then connecting my UI to the queue and streaming tokens from there instead. But that seems like a lot of work.
How are most folks doing this?
I'm curious how others have thought about handling embeddings for multiple file types (txt, pdf, image, docx, ppt, etc.)? Obviously, I could handle each file type individually and then build a flexible search layer on top, but I'm concerned about the level of maintenance required.
One idea I had was to build a translation layer of sorts that would take some arbitrary file type in, map it onto a standardized text schema, and embed that. For images (which are much less common in my dataset), I would use an LLM to describe the image and cast that text into my standard format. The standard format would allow me to simplify the chunking and embedding logic for each file type, and make the vector search layer a lot easier to maintain.
I know this won't be perfect, but I think it could solve most of what I'm trying to achieve.
---
Curious what others think about this and what you have tried.
Cheers,
spruce_tips
While developing new features/testing locally, the LLM flow frequently runs, and I use a bunch of tokens. My openAI bill spikes.
I've made some efforts to stub LLM responses but it adds a decent bit of complexity and work. I don't want to run a model locally with ollama because I need to output to be high quality and fast.
Curious how others are handling similar situations.