I like what you are doing! I suggest you create a proxy service that will route the traffic to the AI provider.
You can cache questions and answers heavily, and combine BM25 search with vector embeddings to retrieve the best results.
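To make the caching idea concrete, here is a rough sketch of what the proxy's cache layer could look like. Everything in it is my own assumption for illustration: the `CachingProxy` class, the `normalize` helper, and the `forward` callable that stands in for whatever client actually calls the AI provider. It only does exact-match caching on a normalized question; a real setup would probably want expiry and a persistent store.

```python
import hashlib
from typing import Callable, Dict


def normalize(question: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # spellings of the same question share one cache key.
    return " ".join(question.lower().split())


class CachingProxy:
    """Hypothetical proxy: answers from cache, else forwards to the provider."""

    def __init__(self, forward: Callable[[str], str]) -> None:
        # `forward` is a placeholder for the real provider call.
        self.forward = forward
        self.cache: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def ask(self, question: str) -> str:
        key = hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: pay for one provider call and remember the answer.
        self.misses += 1
        answer = self.forward(question)
        self.cache[key] = answer
        return answer
```

The hit and miss counters are there so you can see how much traffic the cache actually saves before deciding whether fuzzier matching (for example via embeddings) is worth adding.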
RAG pipeline
├── BM25 + TF-IDF + RRF retrieval
├── cross-encoder reranking
├── knowledge-graph entity linking
└── multi-angle intent detection
│
▼
LLM synthesis (Claude / local models)
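For the retrieval step at the top of that diagram, a minimal sketch of BM25 scoring plus reciprocal rank fusion (RRF) might look like the following. The function names, the whitespace tokenizer, and the defaults (`k1=1.5`, `b=0.75`, RRF `k=60`) are my assumptions, chosen because they are commonly used values, not anything taken from your setup. It assumes a non-empty document list, and the other rankings passed to `rrf_fuse` would come from your TF-IDF or vector search.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; swap in a real one for production text.
    return text.lower().split()


def bm25_rank(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices ordered by Okapi BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def rrf_fuse(rankings: List[List[int]], k: int = 60) -> List[int]:
    """Merge several rankings: each doc earns 1 / (k + rank) per ranking."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)
```

RRF only looks at rank positions, not raw scores, which is why it is a convenient way to merge BM25 with embedding similarity: the two score scales never have to be made comparable.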