We have very different ideas about the meaning of self-hosted.
For example: if a "self-hosted" service supports off-site backups, is it self-hosted or just well designed?
There is a big difference between communicating with external services (your example) vs REQUIRING external services (what parent is complaining about).
If in your example the system can run correctly with just local backups I would consider it self-hosted.
I’ve probably missed a huge wave of programming technology because of this, and I’ve figured out a way to make it work for a consistent paycheck over these past 20 years.
I’m also not a great example, I think I’ve watched 7 whole hours of YouTube videos ever, and those were all for car repair help.
I shy away from tech that needs to be online/connected/whatever.
The difference is that this feature explicitly isn't designed to do a whole lot, which is still the best way to build most LLM-based products: keep the LLM part small and sandwich it between non-LLM stuff.
To give a real-world example: the way Claude Code works versus how Cursor's embedded database works.
This, combined with a subsequent reranker, basically eliminated any of our issues on search.
One thing I’m always curious about is if you could simplify this and get good/better results using SPLADE. The v3 models look really good and seem to provide a good balance of semantic and lexical retrieval.
Disclosure: I work at MS and help maintain our most popular open-source RAG template, so I follow the best practices closely: https://github.com/Azure-Samples/azure-search-openai-demo/
So few developers realize that you need more than just vector search, so I still spend many of my talks emphasizing the FULL retrieval stack for RAG. It's also possible to do it on top of other DBs like Postgres, but takes more effort.
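For anyone wondering what "more than just vector search" can look like in practice: one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). This is a generic sketch, not code from the linked template; the doc IDs are made up and k=60 is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (BM25-style) search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Docs that rank well in both lists float to the top, and a reranker can then be applied to the fused list.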
Once Bedrock KB backed by S3 Vectors is released from Beta it'll eat everybody's lunch.
I'm correcting you less out of pedantry, and more because I find the correct term to be funny.
SOTA for what? Isn't it just a vector store?
Chunking strategy is a big issue. I got acceptable results by shoving large texts into Gemini Flash and having it summarize and extract chunks, instead of using whatever text splitter I had tried. I use the method published by Anthropic, https://www.anthropic.com/engineering/contextual-retrieval, i.e. include the full summary along with the chunk for each embedding.
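A minimal sketch of the "summary plus chunk" idea as described in this comment, which is a simplification of the linked article. The function name and the sample text are hypothetical, and the actual summarization and embedding calls are left out.

```python
def contextualize_chunks(summary, chunks):
    """Prefix every chunk with the document-level summary before embedding."""
    return [f"{summary.strip()}\n\n{chunk.strip()}" for chunk in chunks]


# Hypothetical document: the summary would come from the LLM in practice.
summary = "Quarterly report for a hypothetical company."
chunks = ["Revenue grew over the previous quarter.", "Headcount was flat."]
texts_to_embed = contextualize_chunks(summary, chunks)
```

Each string in `texts_to_embed` is what gets sent to the embedding model, so a chunk like "Headcount was flat." still carries enough context to be found.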
I also created a tool to enable the LLM to do vector search on its own.
I do not use LangChain or Python. I use Clojure plus the LLMs' REST APIs.
I've struggled to find a target market though. Would you mind sharing what your use case is? It would really help give me some direction.
Not sensitive to latency at all. My users would rather have well researched answers than poor answers.
Also, I use batch mode APIs for chunking; it is so much cheaper.
- Classic RAG: `User -> Search -> LLM -> User`
- Agentic RAG: `User <-> LLM <-> Search`
Essentially instead of having a fixed loop, you provide the search as a tool to the LLM, which does three things:
- The LLM can search multiple times
- The LLM can adjust the search query
- The LLM can use multiple tools
The combination of these three things has solved a majority of classic RAG problems. It improves user queries, it can map abbreviations, it can correct bad results on its own, you can also let it list directories and load files directly.
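The agentic loop described above can be sketched in plain Python without any framework. Everything here is a stand-in: `llm_step` is a hypothetical callable that plays the model, and the toy `search` tool is a substring match over one document. The point is only the shape of the loop, including a step budget so the model cannot search forever.

```python
def run_agent(question, llm_step, tools, max_steps=5):
    """Loop until the model answers or the step budget runs out.

    llm_step(question, history) is a hypothetical callable returning either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    history = []
    for _ in range(max_steps):
        action = llm_step(question, history)
        if "answer" in action:
            return action["answer"], history
        result = tools[action["tool"]](**action["args"])
        history.append((action, result))
    return None, history  # budget exhausted without an answer


# Toy stand-ins so the sketch runs without a model or an index.
DOCS = {"retrieval augmented generation": "RAG combines search with an LLM."}


def search(query):
    return [text for key, text in DOCS.items() if query.lower() in key]


def fake_llm_step(question, history):
    if not history:
        return {"tool": "search", "args": {"query": "RAG"}}  # first attempt
    _, last_result = history[-1]
    if not last_result:
        # Empty result: rewrite the query (here, expand the abbreviation).
        return {"tool": "search", "args": {"query": "retrieval augmented"}}
    return {"answer": last_result[0]}


answer, history = run_agent("What is RAG?", fake_llm_step, {"search": search})
```

In this run the first search returns nothing, the "model" adjusts the query, and the second search succeeds, which is the self-correction a fixed `User -> Search -> LLM` pipeline cannot do.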
- Depending on your use case, you have to help the model understand when and when not to use tools: gpt-5 is VERY persistent and often searches more than 10 times in a single run, depending on the results.
We're using pydantic AI where the entire Agent loop is taken care of by the framework. Highly recommend.
What does query generation mean in this context? It’s probably not SQL queries, right?
One of the key features in Claude Code is "Agentic Search" aka using (rip)grep/ls to search a codebase without any of the overhead of RAG.
Sounds like even RAG approaches use a similar approach (Query Generation).
The big LLM-based rerankers (e.g. Qwen3-reranker) are what you always wanted your cross-encoder to be, and I highly recommend giving them a try. Unfortunately they're also quite computationally expensive.
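Because of that cost, the usual pattern is to rerank only a short candidate list from a cheap first-stage search. A rough sketch, where `score_fn` stands in for whatever cross-encoder or LLM-based reranker you use and the word-overlap scorer is just a placeholder so the example runs:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order first-stage candidates with a more expensive scorer.

    score_fn(query, text) -> float stands in for a cross-encoder or
    LLM-based reranker; only the short candidate list is scored.
    """
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


def toy_overlap_score(query, text):
    # Placeholder scorer: fraction of query words found in the text.
    q_words = set(query.lower().split())
    return len(q_words & set(text.lower().split())) / len(q_words)


candidates = [
    "how to bake bread at home",
    "reset a forgotten router password",
    "router setup guide for beginners",
]
top = rerank("reset router password", candidates, toy_overlap_score, top_n=2)
```

Swapping `toy_overlap_score` for a real model call changes the quality, not the structure.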
Your metadata/tabular data often contains basic facts that a human takes for granted, but which aren't repeated in every text chunk - injecting it can help a lot in making the end model seem less clueless.
The point about queries that don't work with simple RAG (like "summarize the most recent twenty documents") is very important to keep in mind. We made our UI very search-oriented and deemphasized the chat, to try to communicate to users that search is what's happening under the hood - the model only sees what you see.
What is re-ranking in the context of RAG? Why not just show the code if it’s only 5 lines?
Here's sample code: https://docs.cohere.com/reference/rerank
That is, there is nothing here that one could not easily write without a library.
Ingestion + Agentic Search are two areas that we're focused on in the short term.
The only place I see that actually operates on chunks does so by fetching them from Redis, and AFAICT nothing in the repo actually writes to Redis, so I assume the chunker is elsewhere.
https://github.com/agentset-ai/agentset/blob/main/packages/j...
What's this roundtrip? Also, the chronology of the LLM (4.1) doesn't match the rest of the stack (text-embedding-large-3); feels weird.
a) has worse instruction following; it doesn't follow the system prompt
b) produces very long answers, which resulted in bad UX
c) has a 125K context window, so extreme cases resulted in an error
Again, these were only observed in RAG when you pass lots of chunks; GPT-5 is probably a better model for other tasks.
https://jakobs.dev/learnings-ingesting-millions-pages-rag-az...
Title: ... Author: ... Text: ...
for each chunk, instead of just passing the text
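Roughly, that formatting step could look like the sketch below. The dict keys and the sample chunk are hypothetical; use whatever metadata your index actually stores.

```python
def format_chunk(title, author, text):
    """Render one retrieved chunk with its metadata for the LLM prompt."""
    return f"Title: {title}\nAuthor: {author}\nText: {text}"


def build_context(chunks):
    # chunks: list of dicts with hypothetical keys "title", "author", "text".
    return "\n\n".join(
        format_chunk(c["title"], c["author"], c["text"]) for c in chunks
    )


context = build_context([
    {"title": "Onboarding guide", "author": "J. Doe", "text": "Laptops ship in week one."},
])
```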
Could you share more about chunking strategies you used?
Has anyone here successfully transitioned into the legal space? My gut has always said that legal is the space where LLMs can be really useful, second only to programming.
* Upload documents via API into a Google Workspace folder
* Use some sort of Google AI search API on those documents in that folder
…placing documents for different customers into different folders.
Or the Azure equivalent whatever that is.
Query expansion and non-naive chunking give the biggest bang for the buck, with chunking being the most resource-intensive task if the input data is chunky (pun intended).
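For a concrete picture of query expansion: search with the original query plus a few rewritten variants, then merge and de-duplicate the hits. This is a sketch under assumptions; `expand_fn` would normally be an LLM call, and here both it and the index are toy stand-ins.

```python
def expanded_search(query, expand_fn, search_fn, per_query=5):
    """Search with the original query plus generated variants, then merge.

    expand_fn(query) -> list[str] stands in for an LLM that rewrites the query.
    search_fn(query, limit) -> list of doc IDs, best first.
    """
    seen, merged = set(), []
    for q in [query, *expand_fn(query)]:
        for doc_id in search_fn(q, per_query):
            if doc_id not in seen:  # keep the first occurrence only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Toy stand-ins: a fixed abbreviation rewrite and a tiny keyword index.
def toy_expand(query):
    return [query.replace("k8s", "kubernetes")] if "k8s" in query else []


INDEX = {"k8s upgrade": ["doc_1"], "kubernetes upgrade": ["doc_2", "doc_1"]}


def toy_search(query, limit):
    return INDEX.get(query, [])[:limit]


results = expanded_search("k8s upgrade", toy_expand, toy_search)
```

The expanded variant surfaces `doc_2`, which the original abbreviation-only query would have missed.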