However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):
1. https://www.honeycomb.io/blog/introducing-query-assistant
2. https://www.youtube.com/watch?v=B_DMMlDuJB0
Other co-authors have worked on significant bodies of work:
Bryan Bischof led the creation of Magic in Hex: https://www.latent.space/p/bryan-bischof
Jason Liu created one of the most popular OSS libraries for structured data, called instructor (https://github.com/jxnl/instructor), and works with some of the leading companies in the space like Limitless and Raycast (https://jxnl.co/services/#current-and-past-clients)
Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (however, he isn't allowed to share specifics about Amazon)
I believe you might find these worth looking at.
I often see messages like these from the community doubting the reality, but LLMs are a powerful tool in the tool chest. I think most companies are not yet staffed with engineers who are skilled and creative enough to really take advantage of them, or are not willing to fund basic research and first-principles toolchain creation. That’s ok. But it’s foolish to assume this is all hype like crypto was. The parallels are obvious but the foundations are different.
But the facts are that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research that suggests this is changing anytime soon. Maybe forever.
There are very strong parallels to crypto in that (a) people are starting with the technology and trying to find problems and (b) there is a cult-like atmosphere where non-believers are seen as being anti-progress and anti-technology.
That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.
- Generate targeted LLM micro summaries of every record (ticket, call, etc.) continually
- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule
- Proactively explain each report row by identifying what’s unusual about it and LLM summarizing a subset of the microsummaries.
- Push the result to webhook
Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.
Another is preventing LLMs from adding intro or conclusion text.
(Plug) I shipped a dedicated OpenAI-compatible API for this, jsonmode.com a couple weeks ago and just integrated Groq (they were nice enough to bump up the rate limits) so it's crazy fast. It's a WIP but so far very comparable to JSON output from frontier models, with some bonus features (web crawling etc).
You can check it out over at https://github.com/BoundaryML/baml. Would love to talk if this is something that seems interesting!
How are you struggling with this, let alone as a significant barrier? JSON adherence with a well thought out schema hasn't been a worry between improved model performance and various grammar based constraint systems in a while.
> Another is preventing LLMs from adding intro or conclusion text.
Also trivial to work around by pre-filling and stop tokens, or just extremely basic text parsing.
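For anyone wondering what the "extremely basic text parsing" route can look like, here is a minimal Python sketch. It assumes the model reply contains one JSON value somewhere between optional intro and outro chatter; the function name `extract_json` and the sample reply are made up for illustration, and pre-filling or stop tokens would be handled on the API side instead.

```python
import json


def extract_json(reply: str):
    """Return the first JSON object or array embedded in an LLM reply.

    Scans for an opening brace or bracket and lets JSONDecoder.raw_decode
    parse from that position, which ignores any trailing chatter.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(reply):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(reply, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from this position, keep scanning
    raise ValueError("no JSON value found in reply")


# Hypothetical reply with intro and conclusion text around the payload.
reply = (
    "Sure! Here is the data you asked for:\n"
    '{"ticket": 42, "tags": ["billing"]}\n'
    "Let me know if you need anything else."
)
print(extract_json(reply))
```

This only strips the wrapper text. It does not check that the payload matches your schema, so a validation step is still needed after it.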
Also would recommend writing out Stream-Triggered Augmented Generation since the term is so barely used it might as well be made up from the POV of someone trying to understand the comment
Might be worth checking out.
This is really interesting, is there any architecture documentation/articles that you can recommend?
-Regex expressions: ChatGPT is the best multi-million regex parser to date.
-Grammar and semantic check: It's a very good revision tool, helped me a lot of times, especially when writing in non-native languages.
-Artwork inspiration: Not only for visual inspiration, in the case of image generators, but descriptive as well. The verbosity of some LLMs can help describe things in more detail than a person would.
-General coding: While your mileage may vary on that one, it has helped me a lot at work building stuff in languages I'm not very familiar with. Just snippets, nothing big.
The problem I see is, how can an "application" be anything but a little window onto the base abilities of ChatGPT, and so effectively offer nothing more to an end-user? The final result still has to be checked and regular end-users have to do their own prompting.
Edit: I should also say that anyone who's designing LLM apps that, rather than being end-user tools, are effectively gatekeepers to getting action or "a human" from a company deserves a big "f* you" 'cause that approach is evil.
For example, we focused on the boring and hard task of web data extraction.
Traditional web scraping is labor-intensive, error-prone, and requires constant updates to handle website changes. It's repetitive and tedious, but couldn't be automated due to the high data diversity and many edge cases. This required a combination of rule-based tools, developers, and constant maintenance.
We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.
If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month to give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.
- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes far enough here that the relative lift of adding fine-tuning isn't worth the many costs
- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3
- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"
- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc
We do use tuned smaller dumber models, such as part of a coarse relevancy filter in a firehose pipeline... but these are outliers. Likewise, we expect to be using them more... but again, for rarer cases and only after we've exhausted other stuff. I'm guessing as we do more fine-tuning, it'll be more on embeddings than LLMs, at least until OSS models get a lot better.
The article has a section called "When to finetune", along with links to separate pages describing how to do so. They absolutely don't say that "fine-tuning isn't even a consideration". Instead, they describe the situations in which fine-tuning is likely to be helpful.
As far as fine-tuning in particular, our consensus is that there are easier options first. I personally have fine-tuned gpt models since 2022; here’s a silly post I wrote about it on gpt 2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...
I went back while writing this comment and realized it might be showing me a diff (better use of color would have helped, I have been trained by github). But I was at a loss for what to do with that. I just now figured out the Keep button exists and it accepted the diff and now it sort of makes sense, but the SQL still doesn't return any results.
My honest feedback is that there is way too much stuff I don't understand on the screen and it makes me confused and a little stressed. Ease me into it please, I'm dumb. There seem to be cells that are linked together and cells that aren't(? separated by purplish background) and I don't understand it. I am a jupyter user and I feel like this should be intuitive to me, but it isn't. I am not a designer, but I suspect the structural markings like cell boundaries are too faint compared to the content of the cells and/or the exterior of a cell having the same color as the interior is making it hard for me. I feel lost in a sea of white.
But the core issue is that, excluding the prompt I copy-pasted word for word which worked like a charm, I am 0 out of 4 on actually leveraging AI to solve the problems I asked of Magic. I like the concept of natural language BI (I worked on it in the early days when Alexa came out) so I probably gave it more chances than I would have for a different product.
For me, it doesn't fit my criteria for good problems to solve with AI in 2024 - the conversational interface and binary right/wrong nature of querying/presenting data accurately make the cost of failure too high, which is a death sentence for AI products IMO (compare to proactive, non-blocking products like copilot or shades-of-wrong problems like image generation or conversations with imaginary characters). But text-to-SQL and data presentation make sense as AI capabilities in 2024 so I can see why that could be a good product to pursue. If it worked, I would definitely use it.
- "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" https://arxiv.org/abs/2405.05904
- "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" https://arxiv.org/abs/2312.05934
But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.
Expensive? Sure, all of AI is crazy expensive. Unfeasible? No
As for lora - in the context of my comment, that's just splitting hairs IMO. It falls in the category of finetuning for me, although I understand why you might disagree. But it's not like the article mentions lora either, nor am I aware of people doing lora without GPUs which the article is against (No GPUs before PMF)
Note that their guidance here is quite practical:
> If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.
If I can do node-red or a function chain for prompts and outputs, that would be sweet.
Instead of trying to do everything into a single chat or chain, add steps to ask the LLM to break down the next tasks, with context, and store that into SQLite or something. Then start new chats/chains on each of those tasks.
Then just loop them back into LLM.
I find that long chats or chains just confuse most models and we start seeing gibberish.
Right now I'm favoring something like:
"We're going to do task {task}. The current situation and context is {context}.
Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."
I find that leaving everything to LLM in a sequence is not as effective as using LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
Also mentioning what to "forget" or not focus on anymore seems to remove some noise from the responses if they are large.
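A rough Python sketch of that loop, in case it helps: the prompt template is the one quoted above, tasks live in a SQLite table, and each pending task gets its own fresh call. Everything else is an assumption for illustration: `fake_llm` stands in for a real model call, `TASK_SCHEMA` is a made-up shape for `{standard_task_json}`, and the `tasks` table layout is hypothetical.

```python
import json
import sqlite3

PROMPT = (
    "We're going to do task {task}. The current situation and context is {context}.\n"
    "Break down what individual steps we need to perform to achieve {goal} and "
    "output these steps with their necessary context as {standard_task_json}. "
    "If the output is already enough to satisfy {goal}, just output the result as text."
)

# Hypothetical task format; adjust to whatever your pipeline expects.
TASK_SCHEMA = '[{"task": "...", "context": "..."}]'


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: splits one known task, answers the rest."""
    if "do task plan a release" in prompt:
        return json.dumps([
            {"task": "write changelog", "context": "v2 features"},
            {"task": "tag build", "context": "main branch"},
        ])
    return "done"


def run(db, llm, root_task, context, goal, max_steps=10):
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks "
        "(id INTEGER PRIMARY KEY, task TEXT, context TEXT, result TEXT)"
    )
    db.execute("INSERT INTO tasks (task, context) VALUES (?, ?)", (root_task, context))
    for _ in range(max_steps):  # hard cap so a model that keeps splitting cannot loop forever
        row = db.execute(
            "SELECT id, task, context FROM tasks WHERE result IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to do
        task_id, task, ctx = row
        # Each task gets a fresh, short prompt instead of one long chat.
        reply = llm(PROMPT.format(task=task, context=ctx, goal=goal,
                                  standard_task_json=TASK_SCHEMA))
        try:
            subtasks = json.loads(reply)
        except json.JSONDecodeError:
            subtasks = None
        if isinstance(subtasks, list):
            # The model broke the task down: queue the pieces for later calls.
            db.executemany(
                "INSERT INTO tasks (task, context) VALUES (?, ?)",
                [(s["task"], s["context"]) for s in subtasks],
            )
            db.execute("UPDATE tasks SET result = ? WHERE id = ?", ("split", task_id))
        else:
            # Plain text answer: store it as the task's result.
            db.execute("UPDATE tasks SET result = ? WHERE id = ?", (reply, task_id))
    return db.execute("SELECT task, result FROM tasks ORDER BY id").fetchall()


db = sqlite3.connect(":memory:")
print(run(db, fake_llm, "plan a release", "v2 is feature complete", "ship v2"))
```

The point of the database is that the code, not the model, owns the state: the model only ever sees one task and its context at a time.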
“You are in charge of game prep and must work with an LLM over many prompts to…”
Here's a more dramatic example: https://www.grey-wing.com/
This company provides deeply integrated LLM-powered software for operating freight ships.
There are a lot of people who are doing this and achieving very good results.
Sorry, if it's not working for you, it doesn't mean that it doesn't work.
Personally, I have only used an LLM for one "serious" application: I used GPT-3.5 Turbo to transform unstructured text into JSON. It was basically just an ad-hoc Node.js script that called the API (the prompt was a few examples of input-output pairs) and then ran some checks (these checks usually failed only because GPT had also corrected misspellings). It would have taken me weeks to do it manually, but with the help of GPT it took a few hours (writing the script, plus I had made a lot of misspellings, so the script stopped a lot). But I cannot imagine anything more complex.
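A minimal sketch of that kind of script, written in Python rather than the Node.js of the original and with the API call left out. The example pairs, the field names, and the specific check (every extracted value must appear verbatim in the input) are assumptions for illustration, not the actual script.

```python
import json

# Hypothetical few-shot examples: (raw input, expected JSON record).
EXAMPLES = [
    ("Ana Novak, 29, Brno", {"name": "Ana Novak", "age": 29, "city": "Brno"}),
    ("Petr Dvorak, 41, Plzen", {"name": "Petr Dvorak", "age": 41, "city": "Plzen"}),
]


def build_prompt(text: str, examples=EXAMPLES) -> str:
    """Build a few-shot prompt from input/output pairs plus the new input."""
    parts = ["Convert each input to JSON with keys name, age, city."]
    for src, out in examples:
        parts.append(f"Input: {src}\nOutput: {json.dumps(out)}")
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)


def check(text: str, reply: str) -> dict:
    """Parse the reply and verify every value appears verbatim in the input."""
    record = json.loads(reply)
    for key, value in record.items():
        if str(value) not in text:
            # Also fires when the model "helpfully" corrects a misspelling.
            raise ValueError(f"{key}={value!r} not found in input")
    return record


print(build_prompt("Jon Smith, 34, Pragu"))
```

With a check like this, a reply that silently turns "Pragu" into "Prague" stops the script, which matches the failure mode described above.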
I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.
That’s why LLMs are good at translating and spellchecking. We’ve been describing the same world and almost all texts respect grammar. Those are the first things that surface. But you can extract the same rules in other ways and create a program that does it without the waste of computing power.
If we describe computing as solving problems, then it’s not computing because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job because the rules change with every model and they are probabilistic. No way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).
Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)
https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...
So I would strongly disagree that LLMs have become “good enough” for real-world applications, based on what was promised.
I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.
Disclosure: author on [1]
It's the "yes, we hallucinate, but don't worry because we provide the sources for users to check" approach.
Even though everyone knows that users will never check unless the hallucination is egregious.
It's such a disingenuous way of handling this.
Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of... Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...
We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.
Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.
> June 8, 2024
Is this an article from the future?
Best guess is that's the anticipated publishing date of the full three parts on the official O'Reilly site.
I would love to see this article also expand to touch upon things like:
- data management: tooling, frameworks, open vs closed data management, labelling & annotations
- inference as a pipeline: frameworks for breaking down model inference into smaller tasks & combining outputs (do DAGs have a role to play here?)
- prompts: areas like caching, management, versioning, evaluations
- model observability: tokens, costs, latency, drift?
- evals for multimodality: how do we tackle evals here, which in turn can go into loops, e.g. quality of audio, speech or visual outputs
Here is a summary of all points:
1. Focus on Prompting Techniques:
1.1. Start with n-shot prompts to provide examples demonstrating tasks.
1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
1.3. Incorporate relevant resources via Retrieval Augmented Generation (RAG).
2. Structure Inputs and Outputs:
2.1. Format inputs using serialization methods like XML, JSON, or Markdown.
2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.
3. Simplify Prompts:
3.1. Break down complex prompts into smaller, focused ones.
3.2. Iterate and evaluate each prompt individually for better performance.
4. Optimize Context Tokens:
4.1. Minimize redundant or irrelevant context in prompts.
4.2. Structure the context clearly to emphasize relationships between parts.
5. Leverage Information Retrieval/RAG:
5.1. Use RAG to provide the LLM with knowledge to improve output.
5.2. Ensure retrieved documents are relevant, dense, and detailed.
5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval.
6. Workflow Optimization:
6.1. Decompose tasks into multi-step workflows for better accuracy.
6.2. Prioritize deterministic execution for reliability and predictability.
6.3. Use caching to save costs and reduce latency.
7. Evaluation and Monitoring:
7.1. Create assertion-based unit tests using real input/output samples.
7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
7.3. Regularly review LLM inputs and outputs for new patterns or issues.
8. Address Hallucinations and Guardrails:
8.1. Combine prompt engineering with factual inconsistency guardrails.
8.2. Use content moderation APIs and PII detection packages to filter outputs.
9. Operational Practices:
9.1. Regularly check for development-prod data skew.
9.2. Ensure data logging and review input/output samples daily.
9.3. Pin specific model versions to maintain consistency and avoid unexpected changes.
10. Team and Roles:
10.1. Educate and empower all team members to use AI technology.
10.2. Include designers early in the process to improve user experience and reframe user needs.
10.3. Ensure the right progression of roles and hire based on the specific phase of the project.
11. Risk Management:
11.1. Calibrate risk tolerance based on the use case and audience.
11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.

Thanks a lot for this!
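To make points 6.3 (caching) and 9.3 (pinned model versions) from the summary a bit more concrete, here is a minimal Python sketch of an in-memory response cache keyed by model version plus prompt. The class name `PromptCache`, the `llm(model, prompt)` callable, and the model string are all hypothetical; a real setup would likely use a persistent store and an actual API client.

```python
import hashlib


class PromptCache:
    """Cache LLM replies by (pinned model version, prompt)."""

    def __init__(self, llm, model: str):
        self.llm = llm      # any callable taking (model, prompt) and returning text
        self.model = model  # pinned version, so cached replies stay comparable
        self.store = {}
        self.hits = 0

    def _key(self, prompt: str) -> str:
        # Including the model in the key means a version bump invalidates old entries.
        return hashlib.sha256(f"{self.model}\n{prompt}".encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        reply = self.llm(self.model, prompt)
        self.store[key] = reply
        return reply


def echo_llm(model: str, prompt: str) -> str:
    """Stand-in for a real API call."""
    return f"[{model}] {prompt}"


cache = PromptCache(echo_llm, "example-model-2024-01-01")  # hypothetical model name
cache.complete("Summarize ticket 42")
cache.complete("Summarize ticket 42")  # served from the cache, no second call
print(cache.hits)
```

Exact-match caching like this only pays off when identical prompts recur, which is one more reason to keep prompts deterministic and free of per-request noise.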