Is anyone doing anything like that? I have all of the open source stuff downloaded (models, lollms-webui, promptfoo, etc.) and have been experimenting with the interactive chat stuff. I also use txtai for semantic search.
That all seems pretty mature and is progressing nicely. A few more months and I expect a clear reference stack will emerge.
What about the assistant stack? I invest all these resources to self-host and feed in all my data. I want to maximize the ROI.
Seems very realistic!
Here's a bunch of other related methods:
Summarizing context - https://arxiv.org/abs/2305.14239
continuous finetuning - https://arxiv.org/pdf/2307.02839.pdf
retrieval augmented generation - https://arxiv.org/abs/2005.11401
knowledge graphs - https://arxiv.org/abs/2306.08302
augmenting the network with a side network - https://arxiv.org/abs/2306.07174
another long term memory technique - https://arxiv.org/abs/2307.02738
There is nothing stopping someone from keeping an LLM in online-training mode forever. We don't do that because it's economically infeasible, not because it wouldn't work.
What works in my experience: structure the task like a human-driven workflow and break it down into small steps. Each step can be driven by a small prompt, relevant document fragments (if RAG is used), and condensed essays/tutorials/guides written by a powerful LLM (ideally, GPT-4 pre-Turbo).
Using this approach, you can stay well below an 8k token limit even on the most demanding tasks.
(Large contexts are leaky on all LLMs anyway.)
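To make the "small steps under a token budget" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Step` structure, the `build_prompt` helper, and especially `rough_token_count`, which guesses about 4 characters per token instead of using a real tokenizer.

```python
from dataclasses import dataclass, field


def rough_token_count(text: str) -> int:
    """Crude estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Step:
    instruction: str  # the small prompt for this step
    guide: str = ""   # condensed guide written ahead of time by a stronger model
    fragments: list[str] = field(default_factory=list)  # RAG fragments, most relevant first


def build_prompt(step: Step, budget: int = 8000) -> str:
    """Assemble one step's prompt, adding fragments only while they fit the budget."""
    parts = [step.instruction]
    if step.guide:
        parts.append("Guide:\n" + step.guide)
    used = sum(rough_token_count(p) for p in parts)
    for frag in step.fragments:
        block = "Context:\n" + frag
        cost = rough_token_count(block)
        if used + cost > budget:
            break  # fragments are ordered by relevance, so stop at the first that does not fit
        parts.append(block)
        used += cost
    return "\n\n".join(parts)
```

Each step gets its own call to `build_prompt`, so no single request carries the whole task's context.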
I'm starting to feel like the one-eyed king among the blind.
I even remember sometimes when I told people specific things.
What I would love to get working at some point is allowing an LLM access to my schedule, notes and goals and then have it help prompt me at appropriate times. "Hey TJ, I noticed you haven't worked out this week. It's sunny today, so this might be a good time." That sort of thing.
There seems to be good tooling around agents, prompt engineering, RAG, etc. The 'glue' for getting the LLM to help figure out the appropriate time(s) to check in with me is the bit I am missing, but that's probably mostly down to me being an artist and only a very, very junior hobbyist programmer.
FWIW, it's not really what you're seeking... but there is a plugin that allows you to invoke an LLM from within Obsidian (via Ollama): https://github.com/hinterdupfinger/obsidian-ollama
In short, it allows you to set up prompts to act on selected text directly within a file, e.g. 'Summarize this selection as a markdown formatted list of key points', 'Write a PRD', 'Translate to [Language]', 'Run this as a prompt', etc.
If you work in a Microsoft world, this is what GraphAPI is all about: enabling access to all the things using your personal authentication token. This includes emails, calendar, OneNote, One Drive, and essentially everything. I've been working on making an easy to use OneNote provider with GraphAPI underneath it that I can use to work with the LLM.
I think a cron job that runs, say, every hour would be good enough. It would just have to collect all the inputs and make a decision about how "valuable" it would be to check in (haven't worked out in 3 days: very valuable), then it's just about connecting it to the right outputs (email, push notification, ...).
The job itself would be cheap and trivial to host on something like AWS Lambda.
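A rough sketch of what that hourly job could look like. The signals, weights, and the 0.7 threshold are all made up for illustration, and `send` is a hypothetical callable standing in for whatever output you wire up (email, push, etc.).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical crontab entry to run this hourly:
#   0 * * * * python3 /path/to/checkin.py


@dataclass
class Signals:
    last_workout: datetime
    sunny: bool
    busy_now: bool  # e.g. a calendar event is in progress


def checkin_value(s: Signals, now: datetime) -> float:
    """Score from 0 to 1 for how valuable a check-in would be right now."""
    if s.busy_now:
        return 0.0  # never interrupt a meeting
    days_idle = (now - s.last_workout) / timedelta(days=1)
    score = min(days_idle / 3.0, 1.0) * 0.8  # 3+ idle days counts as "very valuable"
    if s.sunny:
        score += 0.2
    return min(score, 1.0)


def maybe_notify(s: Signals, now: datetime, send, threshold: float = 0.7) -> bool:
    """Call send(message) only when the score clears the threshold."""
    if checkin_value(s, now) < threshold:
        return False
    days = (now - s.last_workout).days
    send(f"You haven't worked out in {days} days. Might be a good time?")
    return True
```

The scoring is plain arithmetic here; an LLM could replace `checkin_value` later without changing the cron and notification plumbing.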
If your goal is to bake new concepts into the model weights, your only real option is a dataset with that concept being used in a wide variety of contexts.
A more feasible approach, I think, would be retrieval augmented generation. You’d essentially store your conversations in a database and calculate embeddings as you go. This would allow you to later do a natural language search of the database, and insert the most relevant portion of the conversation into your context window.
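Here's a minimal in-memory sketch of that store-and-search loop. Big caveat: `embed` below is a toy hashed bag-of-words stand-in, not a real embedding model, and `ConversationMemory` is a hypothetical class; in practice you would swap in a proper embedding model and a real database.

```python
import math
import re
import zlib

DIM = 256  # size of the toy vectors


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ConversationMemory:
    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []  # (message, vector)

    def add(self, message: str) -> None:
        """Store a message and compute its embedding as we go."""
        self.rows.append((message, embed(message)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Return the k stored messages most similar to the query."""
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), msg) for msg, vec in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # vectors are normalized, so this is cosine similarity
        return scored[:k]
```

The top results from `search` are what you would paste into the context window ahead of the new question.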
Sort of how they made ChatGPT way more likely to obey requests?
There is a high chance that a plain similarity search (dot product or cosine distance) will bring a lot of noise and junk into the request. And a high noise-to-signal ratio in the context tends to lead to hallucinations.
I've had good experience hiring support folks and working with them on a shared inbox (we use HelpScout).
Cost is a huge problem. An immigration lawyer would continuously eat into my personal income. If I don't get state funding to cover it, it makes zero business sense. It would also come with all the liability of a first employee.
A GPT that cites my website, German law and a dozen official websites would be a game changer. It could not give advice, but it could find answers like Phind does. It's just tech, with lower running costs and virtually no obligations (unlike an employee).
I just don't know if the result would be useful and trustworthy enough, and it's very expensive to try.
My conclusion is that I should focus on building a good knowledge base, and when the time is ripe, I can augment it with fancy tech.
Seriously, this is not far from ChatGPT 3.5 in only 6.7 GB, and it runs on a MacBook Air:
https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF
But yeah current context windows are limiting.
One example was trying to use it as an assistant to beat long games without immediate rewards.
I was trying to log and simultaneously get feedback while playing Stardew Valley. gpt-3.5-turbo-1106 basically went along with me and my daughter in a co-op session, giving nice suggestions, sometimes with huge gaps, but it was easy enough to ask for more after giving more context.
Mistral 7B and 13B were basically mixing up Stardew Valley with WoW and Genshin Impact, even when I gave a lot of context about the day I was on, what the NPCs answered, or things I knew about how to solve a certain quest. They straight up invented nonexistent towns (Stardew Valley only has one), etc.
I was running the model on a separate gaming notebook, with nvidia, while playing the game on the one I'm using now.
Mistral hasn't released their own official 13B/30B models yet, but I'm really looking forward to what they can do.
What is crazy is that UltraFastBERT, speculative, Jacobi, or lookahead decoding could potentially speed up inference by up to 80x depending on model size, which could make GPT-4-like models feasible on entry-level Macs and phones if similar wizardry is done memory-wise.
Yes, I'm very optimistic after the insane progress over the last months with models like Mistral, DeepSeek, etc.
I have an extra computer for services like file sharing, Samba, NFS, git, firewall, etc. For instance, it caches the models I'm downloading with a Squid proxy, so I can test several UIs that each download the same model over again. Not every UI offers an easy way to set a single folder to store all GGUF files, or the documentation is lacking.
I'm already having a lot of fun. There's people already doing much more than this. I'm more worried about integrating and gluing in a way that will become transparent after the new year, local models or not.
Also how to glue this with Obsidian/Logseq/Neovim/etc. in a way that I can use with the fewest keystrokes possible, instead of just uploading a gigantic context or sensitive source code files.
Currently the stack gravitates around:
- GPT-4 - either to drive the entire workflow OR generate prompts, plans and guidelines for the local models to execute.
- structured knowledge bases (either derived from existing sources OR curated manually by companies to drive AI assistants).
- embedding search indexes, augmented by full-text search. Usually the LLM has access to the search engine and can drive the search as needed, refining the queries if results aren't good enough.
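As an illustration of the last bullet, here is one possible way to combine the two indexes and let a model refine the query. Reciprocal rank fusion (RRF) is my own choice of merging technique, not something the stack above prescribes, and `vector_search`, `text_search`, `good_enough`, and `refine` are hypothetical callables you would back with your own indexes and LLM.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists of document ids."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


def search_with_refinement(query, vector_search, text_search, good_enough, refine, max_rounds=3):
    """Search both indexes, then let the LLM rewrite the query until results look usable."""
    results: list[str] = []
    for round_no in range(max_rounds):
        results = rrf([vector_search(query), text_search(query)])
        if good_enough(results) or round_no == max_rounds - 1:
            break
        query = refine(query, results)  # stand-in for an LLM call that rewrites the query
    return query, results
```

A document that ranks well in both the embedding index and the full-text index floats to the top, which helps filter out results that only look similar in embedding space.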
All of that is instrumented with logic to capture user feedback at every single step. This is crucial for the continuous improvement of the model!
A bigger model can use this information once in a while to improve plans and workflow guidelines, making the overall process more efficient.
AMA, if needed!
For now, you can do all the custom/manual training you want, but GPT-4 will almost always outperform it with the right context.
Hopefully that will change in the future. Even then, I don't expect people to want to self-host as in on their own machines. More like custom training, then host either on SaaS or PaaS, or their own on-prem if they have it. Spending a personal laptop's performance on a model isn't worth the slowdown on other tasks. Again, maybe that will change.
On one of the projects I used GPT-4 to write instructions/tutorials that were then executed by a local model on a large chunk of data (cleaning up product catalogues).
GPT-4 then reviewed some results and fine-tuned the instructions.
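Roughly, that loop could be sketched like this. The function names are hypothetical, and `local_model` and `reviewer` are placeholder callables standing in for the local model and the GPT-4 review step; no real API is shown.

```python
import random


def run_batch(records, instructions, local_model):
    """Apply the current instructions to every record using the cheap local model."""
    return [local_model(instructions, record) for record in records]


def improve_instructions(records, instructions, local_model, reviewer,
                         sample_size=5, rounds=2, seed=0):
    """Let the stronger model review a small sample of results and rewrite the instructions."""
    rng = random.Random(seed)
    for _ in range(rounds):
        sample = rng.sample(records, min(sample_size, len(records)))
        outputs = run_batch(sample, instructions, local_model)
        # reviewer sees (input, output) pairs and returns revised instructions
        instructions = reviewer(instructions, list(zip(sample, outputs)))
    return instructions
```

Only the sampled pairs go to the expensive model; the full catalogue is processed locally with whatever instructions come out of the last round.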
Since you're using txtai, this article I just wrote yesterday might be helpful: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
Looks like you've received a lot of great ideas here already though!
An LLM that is specifically trained for Software Development, and to which I feed the code of all my company's repositories. And I keep feeding commits/pull requests.
The idea is that I can query it about architectural issues, code improvements, and other technical aspects at different levels of abstraction (code, architecture, business, etc).
So far, I've played a bit with CodeRabbit and it's "just ok", but it is more of a very small window into what "could be" than something actually useful.
This works reasonably well with GPT-4, but my context is almost always too large for self-hosted models so far.