What we've learned from a year of building with LLMs

(eugeneyan.com)

401 points · ViktorasJucikas · 2y ago · 138 comments

73 comments · 16 top-level

dbs2y ago· 20 in thread

Show me the use cases you have supported in production. Then I might read all the 30 pages praising the dozens (soon to be hundreds?) of “best practices” to build LLMs.

mloncode2y ago

Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.

However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):

1. https://www.honeycomb.io/blog/introducing-query-assistant

2. https://www.youtube.com/watch?v=B_DMMlDuJB0

Other co-authors have worked on significant bodies of work:

Bryan Bischof led the creation of Magic in Hex: https://www.latent.space/p/bryan-bischof

Jason Liu created the most popular OSS library for structured data, called instructor https://github.com/jxnl/instructor, and works with some of the leading companies in the space like Limitless and Raycast (https://jxnl.co/services/#current-and-past-clients)

Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)

I believe you might find these worth looking at.

anon3738392y ago

I know it’s a snarky comment you responded to, but I’m glad you did. Those are great resources, as is your excellent article. Thanks for posting!

fnordpiglet2y ago

We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double digit percents of our growth in the last year are entirely due to them. The biggest challenge is tool chain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.

I often see these messages from the community doubting the reality, but LLMs are a powerful tool in the tool chest. But I think most companies are not staffed with skilled enough engineers with a creative enough bent to really take advantage of them yet or be willing to fund basic research and from first principles toolchain creation. That’s ok. But it’s foolish to assume this is all hype like crypto was. The parallels are obvious but the foundations are different.

threeseed2y ago

No one is saying that all of AI is hype. It clearly isn't.

But the facts are that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research that suggests this is changing anytime soon. Maybe for ever.

There are very strong parallels to crypto in that (a) people are starting with the technology and trying to find problems and (b) there is a cult like atmosphere where non-believers are seen as being anti-progress and anti-technology.

TeMPOraL2y ago

> We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double digit percents of our growth in the last year are entirely due to them. The biggest challenge is tool chain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.

That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.

mvdtnz2y ago

Yet another post claiming "dozens" of production use cases without listing a single one.

robbiemitchell2y ago

Processing high volumes of unstructured data (text)… we’re using a STAG architecture.

- Generate targeted LLM micro summaries of every record (ticket, call, etc.) continually

- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule

- Proactively explain each report row by identifying what’s unusual about it and LLM summarizing a subset of the microsummaries.

- Push the result to webhook

Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.

Another is preventing LLMs from adding intro or conclusion text.

adamsbriscoe2y ago

> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.

(Plug) I shipped a dedicated OpenAI-compatible API for this, jsonmode.com a couple weeks ago and just integrated Groq (they were nice enough to bump up the rate limits) so it's crazy fast. It's a WIP but so far very comparable to JSON output from frontier models, with some bonus features (web crawling etc).

joatmon-snoo2y ago

We actually built an error-tolerant JSON parser to handle this. Our customers were reporting exactly the same issue- trying a bunch of different techniques to get more usefully structured data out.

You can check it out over at https://github.com/BoundaryML/baml. Would love to talk if this is something that seems interesting!

BoorishBears2y ago

> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.

How are you struggling with this, let alone as a significant barrier? JSON adherence with a well thought out schema hasn't been a worry between improved model performance and various grammar based constraint systems in a while.

> Another is preventing LLMs from adding intro or conclusion text.

Also trivial to work around by pre-filling and stop tokens, or just extremely basic text parsing.
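For what it's worth, the "extremely basic text parsing" route can be sketched in a few lines. This is only an illustration, assuming the reply contains exactly one JSON payload somewhere in the text; the `extract_json` helper and the sample reply are made up for the example and are not from any commenter's codebase.

```python
import json


def extract_json(text: str):
    """Return the first JSON object or array embedded in `text`.

    Scans for an opening brace/bracket and lets the standard-library decoder
    parse from there, ignoring any chatter before or after the payload.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    raise ValueError("no JSON object or array found in model output")


# Hypothetical model reply with intro and conclusion text around the payload.
reply = (
    "Sure! Here is the summary:\n"
    '{"ticket": 42, "tags": ["billing"]}\n'
    "Let me know if you need more."
)
print(extract_json(reply))
```

Pre-filling and stop tokens avoid the chatter at the source; a parser like this is just a cheap fallback when you cannot control the request.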

Also would recommend writing out Stream-Triggered Augmented Generation since the term is so barely used it might as well be made up from the POV of someone trying to understand the comment

benreesman2y ago

I only became aware of it recently and therefore haven’t done more than play with it in a fairly cursory way, but unstructured.io seems to have a lot of traction and certainly in my little toy tests their open-source stuff seems pretty clearly better than the status quo.

Might be worth checking out.

lastdong2y ago

“Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule”

This is really interesting, is there any architecture documentation/articles that you can recommend?

thallium2052y ago

We have a company mail, fax, and phone room that receives thousands of pages a day that now sorts, categorizes, and extracts useful information from them all in a completely automated way by LLMs. Several FTEs have been reassigned elsewhere as a result.

harrisoned2y ago

It certainly has use cases, just not as many as the hype led people to believe. For me:

-Regex expressions: ChatGPT is the best multi-million regex parser to date.

-Grammar and semantic check: It's a very good revision tool, helped me a lot of times, especially when writing in non-native languages.

-Artwork inspiration: Not only for visual inspiration, in the case of image generators, but descriptive as well. The verbosity of some LLMs can help describe things in more detail than a person would.

-General coding: While your mileage may vary on that one, it has helped me a lot at work building stuff in languages I'm not very familiar with. Just snippets, nothing big.

int_19h2y ago

GPT-4 has amazing translation capabilities, too. Actually usable for long conversations.

joe_the_user2y ago

I have a friend who uses ChatGPT for writing quick policy statements for her clients (mostly schools). I have a friend who uses it to create images and descriptions for DnD adventures. LLMs have uses.

The problem I see is, how can an "application" be anything but a little window onto the base abilities of ChatGPT, and so effectively offer nothing more to an end-user? The final result still has to be checked and regular end-users have to do their own prompt.

Edit: Also, I should also say that anyone who's designing LLM apps that, rather than being end-user tools, are effectively gate keepers to getting action or "a human" from a company deserves a big "f* you" 'cause that approach is evil.

hubraumhugo2y ago

I think it comes down to relatively unexciting use cases that have a high business impact (process automation, RPA, data analysis), not fancy chatbots or generative art.

For example, we focused on the boring and hard task of web data extraction.

Traditional web scraping is labor-intensive, error-prone, and requires constant updates to handle website changes. It's repetitive and tedious, but couldn't be automated due to the high data diversity and many edge cases. This required a combination of rule-based tools, developers, and constant maintenance.

We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.

obiefernandez2y ago

The book I'm writing is almost finished and is based almost entirely on production use cases: https://leanpub.com/patterns-of-application-development-usin...

bbischof2y ago

Hello, it’s Bryan, an author on this piece.

If you’re interested in using one of the LLM-applications I have in prod, check out https://hex.tech/product/magic-ai/ It has a free limit every month to give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.

cqqxo4zV46cp2y ago

Or maybe they could choose to focus their attention on people that aren’t needlessly aggressive and adversarial.

solidasparagus2y ago· 13 in thread

No offense, but I'd love to see what they've successfully built using LLMs before taking their advice too seriously. The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect if the section titles of the unfinished section are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO

lmeyerov2y ago

We work in some pretty serious domains and try to stay away from fine tuning:

- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes so far here that the relative lift of adding fine-tuning isn't worth the many costs

- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3

- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"

- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc

We do use tuned smaller dumber models, such as part of a coarse relevancy filter in a firehose pipeline... but these are outliers. Likewise, we expect to be using them more... but again, for rarer cases and only after we've exhausted other stuff. I'm guessing as we do more fine-tuning, it'll be more on embeddings than LLMs, at least until OSS models get a lot better.
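To make "dynamic RAG example injection" concrete, here is a rough sketch of one way to read it: keep a store of worked examples and, per request, put the closest ones into the prompt as few-shot demonstrations. Everything here is an assumption for illustration. The example store is invented, and word-overlap (Jaccard) similarity stands in for the embedding search a real system would more likely use.

```python
def tokenize(text: str) -> set[str]:
    """Lowercased word set; deliberately naive."""
    return set(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for embedding similarity."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def inject_examples(query: str, examples: list[tuple[str, str]], k: int = 2) -> str:
    """Pick the k stored examples closest to the query and place them in the prompt."""
    ranked = sorted(examples, key=lambda ex: similarity(query, ex[0]), reverse=True)
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in ranked[:k])
    return f"{shots}\n\nQ: {query}\nA:"


# Hypothetical example store of (question, known-good answer) pairs.
EXAMPLE_STORE = [
    ("count failed logins by user", "known-good answer 1"),
    ("list open tickets by team", "known-good answer 2"),
    ("count failed payments by region", "known-good answer 3"),
]

prompt = inject_examples("count failed logins by region", EXAMPLE_STORE, k=2)
print(prompt)
```

The point of the design is that the demonstrations change with every query, which is a large part of why it can substitute for some of what fine-tuning would otherwise provide.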

solidasparagus2y ago

See if the article said this, I would have agreed - fine-tuning is a tool and it should be used thoughtfully. Although I personally believe that in this funding climate it makes sense to make data collection and model training a core capability of any AI product. However that will only be available and wise for some founders.

jph002y ago

> The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect if the section titles of the unfinished section is anything to go by) is very strange to me and suggests a pretty narrow perspective IMO

The article has a section called "When to finetune", along with links to separate pages describing how to do so. They absolutely don't say that "fine-tuning isn't even a consideration". Instead, they describe the situations in which fine-tuning is likely to be helpful.

solidasparagus2y ago

Huh. Well that's embarrassing. I guess I missed it when I lost interest in the caching section and jumped straight to Evaluation and Monitoring.

bbischof2y ago

Hello, it’s Bryan, an author on this piece.

As far as fine-tuning in particular, our consensus is that there are easier options first. I personally have fine-tuned gpt models since 2022; here’s a silly post I wrote about it on gpt 2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...

solidasparagus2y ago

I took a look at Magic earlier today and it didn't work at all for me, sorry to say. After the example prompt, I tried to learn about a table and it generated bad SQL (correct query to pull a row, but with limit 0). I asked it to show me the DDL and it generated invalid SQL. Then I tried to ask it to do some population statistics on the customer table and ended up confused about why there appear to be two windows in the cell, with the previously generated SQL on the left and the newly generated SQL on the right. The new SQL wouldn't run when I hit run cell, the error showed the originally generated SQL. I gave up and bounced.

I went back while writing this comment and realized it might be showing me a diff (better use of color would have helped, I have been trained by github). But I was at a loss for what to do with that. I just now figured out the Keep button exists and it accepted the diff and now it sort of makes sense, but the SQL still doesn't return any results.

My honest feedback is that there is way too much stuff I don't understand on the screen and it makes me confused and a little stressed. Ease me into it please, I'm dumb. There seems to be cells that are linked together and cells that aren't(? separated by purplish background) and I don't understand it. I am a jupyter user and I feel like this should be intuitive to me, but it isn't. I am not a designer, but I suspect the structural markings like cell boundaries are too faint compared to the content of the cells and/or the exterior of a cell having the same color as the interior is making it hard for me. I feel lost in a sea of white.

But the core issue is that, excluding the prompt I copy-pasted word for word which worked like a charm, I am 0 out of 4 on actually leveraging AI to solve the problems I asked of Magic. I like the concept of natural language BI (I worked on it in the early days when Alexa came out) so I probably gave it more chances than I would have for a different product.

For me, it doesn't fit my criteria for good problems to solve with AI in 2024 - the conversational interface and binary right/wrong nature of querying/presenting data accurately make the cost of failure too high, which is a death sentence for AI products IMO (compare to proactive, non-blocking products like copilot or shades-of-wrong problems like image generation or conversations with imaginary characters). But text-to-SQL and data presentation make sense as AI capabilities in 2024 so I can see why that could be a good product to pursue. If it worked, I would definitely use it.

gandalfgeek2y ago

This was kind of conventional wisdom ("fine tune only when absolutely necessary for your domain", "fine-tuning hurts factuality"), but some recent research (some of which they cite) has actually quantitatively shown that RAG is much preferable to FT for adding domain-specific knowledge to an LLM:

- "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" https://arxiv.org/abs/2405.05904

- "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" https://arxiv.org/abs/2312.05934

solidasparagus2y ago

Thanks, I'll read those more fully.

But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.

OutOfHere2y ago

Fine-tuning is absolutely necessary for true AI, but even though it's desirable, it's unfeasible to do for now for any large model considering how expensive GPUs are. If I had infinite money, I'd throw it at continuous fine-tuning and would throw away the RAG. Fine-tuning also requires appropriate measures to prevent forgetting of older concepts.

solidasparagus2y ago

It is not unfeasible. It is absolutely realistic to do distributed finetuning of an 8B text model on previous generation hardware. You can add finetuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt finetuning, you are only going to pay about the equivalent of 1-3 senior engineers.

Expensive? Sure, all of AI is crazy expensive. Unfeasible? No

CuriouslyC2y ago

Fine tuning has been on the way out for a while. It's hard to do right and costly. LoRAs are better for influencing output style as they don't dumb down the model, and they're easier to create. This is on top of RAG just being better for new facts like the other reply mentioned.

solidasparagus2y ago

How much of that is just the flood of traditional engineers into the space and the fact that collecting data and then fine-tuning models is orders of magnitude more complex than just throwing in RAG? I suspect a huge amount of RAG's popularity is just that any engineer can do a version of it + ChatGPT API calls in a day.

As for lora - in the context of my comment, that's just splitting hairs IMO. It falls in the category of finetuning for me, although I understand why you might disagree. But it's not like the article mentions lora either, nor am I aware of people doing lora without GPUs which the article is against (No GPUs before PMF)

phillipcarter2y ago

I don't see why this is seen as an either-or by people? Fine-tuning doesn't eliminate the need for RAG, and RAG doesn't obviate the need for fine-tuning either.

Note that their guidance here is quite practical:

> If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.

Multicomp2y ago· 9 in thread

Anyone have a convenience solution for doing multi-step workflows? For example, I'm filling out the basics of an NPC character sheet on my game prep. I'm using a certain rule system, give the enemy certain tactics, certain stats, certain types of weapons, right now I have a 'god prompt' trying to walk the LLM through creating the basic character sheet, but the responses get squeezed down into what one or two prompt responses can be.

If I can do node-red or a function chain for prompts and outputs, that would be sweet.

hugocbp2y ago

For me, a very simple "break down tasks into a queue and store in a DB" solution has helped tremendously with most requests.

Instead of trying to do everything into a single chat or chain, add steps to ask the LLM to break down the next tasks, with context, and store that into SQLite or something. Then start new chats/chains on each of those tasks.

Then just loop them back into LLM.

I find that long chats or chains just confuse most models and we start seeing gibberish.

Right now I'm favoring something like:

"We're going to do task {task}. The current situation and context is {context}.

Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."

I find that leaving everything to LLM in a sequence is not as effective as using LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
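A minimal sketch of that queue-plus-DB loop, under some loud assumptions: `llm` is any function that maps a prompt to a reply (no real API is called here), the table layout is invented, and a reply that parses as a JSON list of `{"task", "context"}` objects is treated as a breakdown while anything else is treated as a finished result. The `fake_llm` stand-in exists only so the example runs offline.

```python
import json
import sqlite3
from typing import Callable


def run_queue(llm: Callable[[str], str], goal: str, context: str, max_tasks: int = 20) -> list[str]:
    """Process tasks from a SQLite-backed queue, one fresh LLM call per task."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, task TEXT, context TEXT, done INTEGER DEFAULT 0)"
    )
    db.execute("INSERT INTO tasks (task, context) VALUES (?, ?)", (goal, context))

    results = []
    for _ in range(max_tasks):  # hard cap so a chatty model cannot loop forever
        row = db.execute(
            "SELECT id, task, context FROM tasks WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue drained
        task_id, task, task_context = row
        prompt = (
            f"We're going to do task {task}. "
            f"The current situation and context is {task_context}. "
            "Break it into steps as a JSON list of objects with 'task' and 'context' keys, "
            "or, if no breakdown is needed, just output the result as text."
        )
        reply = llm(prompt)  # each task gets a new, short conversation
        try:
            steps = json.loads(reply)
        except json.JSONDecodeError:
            steps = None
        if isinstance(steps, list) and steps and all(isinstance(s, dict) for s in steps):
            db.executemany(
                "INSERT INTO tasks (task, context) VALUES (?, ?)",
                [(s.get("task", ""), s.get("context", "")) for s in steps],
            )
        else:
            results.append(reply)  # plain text counts as a finished result
        db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
    db.close()
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: splits the top-level goal once."""
    if "task build NPC" in prompt:
        return json.dumps([
            {"task": "pick stats", "context": "level 3 bandit"},
            {"task": "pick weapons", "context": "level 3 bandit"},
        ])
    return "done: " + prompt.split("task ", 1)[1].split(".", 1)[0]


print(run_queue(fake_llm, "build NPC", "level 3 bandit"))
```

Keeping the queue in a database rather than in the chat history means each call stays short, and the control flow lives in ordinary code that can be inspected and retried.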

datameta2y ago

Indeed! If I'm met with several misunderstandings in a row, asking it to explain what I'm trying to do is a pretty surefire way to move forward.

Also mentioning what to "forget" or not focus on anymore seems to remove some noise from the responses if they are large.

gpsx2y ago

One option for doing this is to incrementally build up the "document" using isolated prompts for each section. I say document because I am not exactly sure what the character sheet looks like, but I am assuming it can be constructed one section at a time. You create a prompt to create the first section. Then, you create a second prompt that gives the agent your existing document and prompts it to create the next section. You continue until all the sections are finished. In some cases this works better than doing a single conversation.
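The section-by-section approach could look roughly like the sketch below. It assumes a `generate` function that takes a prompt and returns text (a placeholder for whatever model call you use), and the prompt wording and `## ` section headers are arbitrary choices made for the example.

```python
from typing import Callable


def build_document(generate: Callable[[str], str], brief: str, sections: list[str]) -> str:
    """Build a document one section at a time, each in its own isolated prompt."""
    document = ""
    for name in sections:
        prompt = (
            f"Brief: {brief}\n\n"
            f"Document so far:\n{document or '(empty)'}\n\n"
            f"Write only the '{name}' section, consistent with the document so far."
        )
        section_text = generate(prompt).strip()
        document += f"## {name}\n{section_text}\n\n"
    return document.rstrip() + "\n"


# Stand-in generator that records the prompts it receives, so the flow is visible.
prompts_seen: list[str] = []


def fake_generate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"draft {len(prompts_seen)}"


sheet = build_document(fake_generate, "Level 3 bandit NPC", ["Stats", "Tactics", "Weapons"])
print(sheet)
```

Each prompt sees the whole document so far but is asked for only one section, so no single response has to fit the entire sheet.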

e1g2y ago

Perplexity recently released something like this https://www.perplexity.ai/hub/blog/perplexity-pages

punkspider2y ago

Perhaps this would be of use? https://github.com/langgenius/dify/ I use it for quick workflows and it's pretty intuitive.

proc02y ago

Sounds like you need an agent system, some libs are mentioned here: https://lilianweng.github.io/posts/2023-06-23-agent/

CuriouslyC2y ago

You can do multi shot workflows pretty easily. I like to have the model produce markdown, then add code blocks (```json/yaml```) to extract the interim results. You can lay out multiple "phases" in your prompt and have it perform each one in turn, and have each one reference prior phases. Then at the end you just pull out the code blocks for each phase and you have your structured result.
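Pulling the per-phase blocks back out is mostly a regex. A small sketch, assuming the model tags each interim result as a json fenced block; the sample reply and phase names are invented, and the fence string is built indirectly only so the example can itself live inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built indirectly on purpose
BLOCK_RE = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_phases(markdown: str) -> list:
    """Parse every json-tagged fenced block in the reply, in order of appearance."""
    return [json.loads(body) for body in BLOCK_RE.findall(markdown)]


# Hypothetical two-phase reply from the model.
reply = "\n".join([
    "## Phase 1: concept",
    "A wary caravan guard.",
    FENCE + "json",
    '{"name": "Mara", "role": "guard"}',
    FENCE,
    "## Phase 2: stats",
    "Built on phase 1.",
    FENCE + "json",
    '{"hp": 22, "ac": 14}',
    FENCE,
])
print(extract_phases(reply))
```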

1272y ago

Did you force it into a parser? You can define a simple language in llama.cpp for the LLM to obey.

mentos2y ago

I still haven’t played with using one LLM to oversee another.

“You are in charge of game prep and must work with an LLM over many prompts to…”

jakubmazanec2y ago· 4 in thread

I'm not saying the content of the article is wrong, but what apps are people/companies writing articles like this actually building? I'm seriously unable to imagine any useful app. I only use GPT via API (as better Google for documentation, and its output is never usable without heavy editing). This week I tried to use "AI" in Notion: I needed to generate 84 check boxes for each day starting with specific date. I got 10 check boxes and a line "here should go rest..." (or some variation of such lazy output). Completely useless.

exhaze2y ago

I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.

Here's a more dramatic example: https://www.grey-wing.com/

This company provides deeply integrated LLM-powered software for operating freight ships.

There are a lot of people who are doing this and achieving very good results.

Sorry, if it's not working for you, it doesn't mean that it doesn't work.

robbiep2y ago

That’s really interesting. Surely the crewing roster stuff is actually using linear algebra rather than AI though?

qeternity2y ago

I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.

jakubmazanec2y ago

But it seems to me that's what they're doing: "We have LLMs, what to do with them?" But anyway, I'm seriously just looking for an example of an app that is built with the stuff described in the article.

Me personally, I only used an LLM for one "serious" application: I used GPT-3.5 Turbo for transforming unstructured text into JSON; it was basically just an ad-hoc Node.js script that called the API (the prompt was a few examples of input-output pairs), and then it did some checks (these checks usually failed only because GPT also corrected misspellings). It would have taken me weeks to do it manually, but with the help of GPT it was a few hours (writing of the script + I made a lot of misspellings so the script stopped a lot). But I cannot imagine anything more complex.
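A script along those lines might be shaped like the sketch below. It is written in Python rather than Node.js, it does not call any API, and the example pairs, field names, and the particular check (every output value must appear verbatim in the input, which is exactly what trips when the model "helpfully" fixes a misspelling) are all assumptions made up for illustration.

```python
import json

# Hypothetical input/output pairs used as the few-shot examples.
EXAMPLES = [
    ("Jon Smith, 42, Pragu", {"name": "Jon Smith", "age": 42, "city": "Pragu"}),
    ("Ana Novak; age 31; Brno", {"name": "Ana Novak", "age": 31, "city": "Brno"}),
]


def build_prompt(text: str) -> str:
    """Few-shot prompt: example pairs first, then the new input."""
    parts = ["Convert each input to JSON. Copy values verbatim; do not fix spelling."]
    for source, target in EXAMPLES:
        parts.append(f"Input: {source}\nOutput: {json.dumps(target)}")
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)


def check_output(text: str, raw_reply: str) -> dict:
    """Parse the reply and verify every value appears verbatim in the input."""
    record = json.loads(raw_reply)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for key, value in record.items():
        if str(value) not in text:
            raise ValueError(f"{key}={value!r} not found verbatim in input")
    return record


source_text = "Eva Kral, 27, Olomuc"  # misspelling kept on purpose
print(build_prompt(source_text))
print(check_output(source_text, '{"name": "Eva Kral", "age": 27, "city": "Olomuc"}'))
```

A reply that silently corrects the city to "Olomouc" would raise `ValueError` here, which is the kind of stop-and-inspect behavior the comment describes.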

JKCalhoun2y ago· 3 in thread

> Note that in recent times, some doubt has been cast on if this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used...

I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.

ezst2y ago

That's the thing, it's a novel form of computing that's increasingly moving away from computer science. It deserves to be treated as a discipline of its own, with lots of words of caution and danger stickers slapped over it.

skydhash2y ago

It’s text (word) manipulation based on probabilistic rules derived from analyzing human-produced text. And everyone knows language is imperfect. That’s why we have introduced logic and formalism so that we can reliably transmit knowledge.

That’s why LLMs are good at translating and spellchecking. We’ve been describing the same world and almost all texts respect grammar. Those are the first things that surface. But you can extract the same rules in other ways and create a program that does it without the waste of computing power.

If we describe computing as solving problems, then it’s not computing because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job because the rules change with every model and they are probabilistic. No way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).

amelius2y ago

Yeah like psychology being a different field from physics even if it is running on atoms ultimately.

Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)

threeseed2y ago· 3 in thread

RAG does not prevent hallucinations, nor does it guarantee that the quality of your output is contingent solely on the quality of your input. Using LLMs for legal use cases, for example, has shown them to be poor for anything other than initial research, as they are accurate at best 65% of the time:

https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...

So would strongly disagree that LLMs have become “good enough” for real-world applications based on what was promised.

phillipcarter2y ago

> So would strongly disagree that LLMs have become “good enough” for real-world applications based on what was promised.

I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.

mattyyeung2y ago

You may be interested in "Deterministic Quoting"[1]. This doesn't completely "solve" hallucinations, but I would argue that we do get "good enough" in several applications.

Disclosure: author on [1]

[1] https://mattyyeung.github.io/deterministic-quoting

threeseed2y ago

Have seen this approach before.

It's the yes we hallucinate but don't worry because we provide the sources for users to check.

Even though everyone knows that users will never check unless the hallucination is egregious.

It's such a disingenuous way of handling this.

mloncode2y ago· 2 in thread

This is Hamel, one of the authors of the article. We published the article with OReilly here:

Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

We were working on this webpage to collect the entire three part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.

xnx2y ago

The link to part II from part I points back to part I.

seventytwo2y ago

Was wondering about the June 8th date on there :)

sheepscreek2y ago· 2 in thread

I’m sure this has some decent insights but it’s from almost 1 year ago! A lot has changed in this space since then.

bgrainger2y ago

Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.

Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.

jph002y ago

Looks like they made a mistake in the article metadata - they definitely just released this article.

blumomo2y ago· 1 in thread

> PUBLISHED

> June 8, 2024

Is this an article from the future?

defrost2y ago

Good catch.

Best guess is that's the anticipated publishing date of the full three parts on the official O'Reilly site.

See: https://news.ycombinator.com/item?id=40551413

DylanSp2y ago

Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.

mercurialsolo2y ago

As we go about moving LLM enabled products into production we definitely see a bunch of what is being spoken about resonate. We also see the below as areas which need to be expanded upon for developers building in the space to take products to production :

I would love to see this article also expand to touch upon things like:

- data management - (tooling, frameworks, open vs closed data management, labelling & annotations)

- inference as a pipeline - frameworks for breaking down model inference into smaller tasks & combining outputs (do DAG's have a role to play here?)

- prompts - areas like caching, management, versioning, evaluations

- model observability - tokens, costs, latency, drift?

- evals for multimodality - how do we tackle evals here which in turn can go into loops e.g. quality of audio, speech or visual outputs

OutOfHere2y ago

Almost all of this should flow from common-sense. I would use what makes sense for your application, and not worry about the rest. It's a toolbox, not a rulebook. The one point that comes more from experience than from common-sense is to always pin your model versions. As a final tip, if despite trying everything, you still don't like the LLM's output, just run it again!

Here is a summary of all points:

1. Focus on Prompting Techniques:

   1.1. Start with n-shot prompts to provide examples demonstrating tasks.
   1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
   1.3. Incorporate relevant resources via Retrieval Augmented Generation (RAG).

2. Structure Inputs and Outputs:

   2.1. Format inputs using serialization methods like XML, JSON, or Markdown.
   2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.

3. Simplify Prompts:

   3.1. Break down complex prompts into smaller, focused ones.
   3.2. Iterate and evaluate each prompt individually for better performance.

4. Optimize Context Tokens:

   4.1. Minimize redundant or irrelevant context in prompts.
   4.2. Structure the context clearly to emphasize relationships between parts.

5. Leverage Information Retrieval/RAG:

   5.1. Use RAG to provide the LLM with knowledge to improve output.
   5.2. Ensure retrieved documents are relevant, dense, and detailed.
   5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval.

6. Workflow Optimization:

   6.1. Decompose tasks into multi-step workflows for better accuracy.
   6.2. Prioritize deterministic execution for reliability and predictability.
   6.3. Use caching to save costs and reduce latency.

7. Evaluation and Monitoring:

   7.1. Create assertion-based unit tests using real input/output samples.
   7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
   7.3. Regularly review LLM inputs and outputs for new patterns or issues.

8. Address Hallucinations and Guardrails:

   8.1. Combine prompt engineering with factual inconsistency guardrails.
   8.2. Use content moderation APIs and PII detection packages to filter outputs.

9. Operational Practices:

   9.1. Regularly check for development-prod data skew.
   9.2. Ensure data logging and review input/output samples daily.
   9.3. Pin specific model versions to maintain consistency and avoid unexpected changes.

10. Team and Roles:

    10.1. Educate and empower all team members to use AI technology.
    10.2. Include designers early in the process to improve user experience and reframe user needs.
    10.3. Ensure the right progression of roles and hire based on the specific phase of the project.

11. Risk Management:

    11.1. Calibrate risk tolerance based on the use case and audience.
    11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.
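
Points 6.3 (caching) and 9.3 (pinned model versions) fit together naturally, so here is a minimal sketch of a response cache keyed on the pinned model plus the prompt. Everything named here is an assumption for illustration: `call_llm` stands in for whatever client you actually use, `fake_llm` exists only to make the sketch runnable, and the model identifier is a placeholder, not a real model name.

```python
import hashlib
import json
from typing import Callable, Dict

# Placeholder (assumption): pin an exact, dated identifier, not a floating alias.
PINNED_MODEL = "example-model-2024-05-13"


def cache_key(model: str, prompt: str) -> str:
    """Hash the model version together with the prompt, so switching models
    never serves answers cached from the previous version."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedLLM:
    """In-memory cache in front of a hypothetical call_llm(model, prompt)."""

    def __init__(self, call_llm: Callable[[str, str], str], model: str = PINNED_MODEL):
        self._call_llm = call_llm
        self._model = model
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of real (uncached) calls made

    def complete(self, prompt: str) -> str:
        key = cache_key(self._model, prompt)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._call_llm(self._model, prompt)
        return self._cache[key]


def fake_llm(model: str, prompt: str) -> str:
    """Stand-in for a real API call, used only to make the sketch runnable."""
    return f"[{model}] echo: {prompt}"


if __name__ == "__main__":
    llm = CachedLLM(fake_llm)
    llm.complete("Summarize ticket 123")
    llm.complete("Summarize ticket 123")  # second call is served from the cache
    print("real calls made:", llm.misses)
```

Note that an exact-match cache like this only pays off when the same prompt recurs verbatim, and it sits at odds with the "just run it again" tip: a retry has to bypass the cache to get a different answer.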

gengstrand2y ago

Interesting blog. It seems to be a compendium of advice for all kinds of folks ranging from end user to integration partner. For a slightly different take on how to use LLMs to build software, you might be interested in https://www.infoq.com/articles/llm-productivity-experiment/ which documents an experiment where the same prompt was given to various prominent LLMs, asking them to write two unit tests for an already existing code base. The results were collected, metrics were analyzed, then comparisons were made. No advice on how to write better prompts, but some insight on how to work with LLMs and what you can expect from them in order to improve developer productivity.

felixbraun2y ago

related discussion (3 days ago): https://news.ycombinator.com/item?id=40508390

pklee2y ago

This is pure gold!! Thank you so much, Eugene and gang, for doing this. For the points I have encountered myself, I can 100% agree with them. This is fantastic!! So many good insights.

hakanderyal2y ago

If you didn't follow what has been happening in the LLM space, this document gives you everything you need to know about state of the art LLM usage & applications.

Thanks a lot for this!



fnordpiglet2y ago

threeseed2y ago

No one is saying that all of AI is hype. It clearly isn't.

But the fact is that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research that suggests this is changing anytime soon. Maybe forever.


TeMPOraL2y ago

That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.


mvdtnz2y ago

Yet another post claiming "dozens" of production use cases without listing a single one.


robbiemitchell2y ago

Processing high volumes of unstructured data (text)… we’re using a STAG architecture.

- Generate targeted LLM micro summaries of every record (ticket, call, etc.) continually

- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule

- Proactively explain each report row by identifying what’s unusual about it and LLM summarizing a subset of the microsummaries.

- Push the result to webhook

Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.

Another is preventing LLMs from adding intro or conclusion text.

adamsbriscoe2y ago

> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.


joatmon-snoo2y ago

We actually built an error-tolerant JSON parser to handle this. Our customers were reporting exactly the same issue: trying a bunch of different techniques to get more usefully structured data out.

You can check it out over at https://github.com/BoundaryML/baml. Would love to talk if this is something that seems interesting!

BoorishBears2y ago

> Lack of JSON schema restriction is a significant barrier to entry on hooking LLMs up to a multi step process.

> Another is preventing LLMs from adding intro or conclusion text.

Also trivial to work around by pre-filling and stop tokens, or just extremely basic text parsing.

Also, I would recommend writing out Stream-Triggered Augmented Generation, since the term is so rarely used that it might as well be made up from the POV of someone trying to understand the comment.
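
For what "extremely basic text parsing" can look like, here is a minimal sketch that pulls the first JSON value out of a reply that has intro or conclusion text wrapped around it. It uses only the standard library's `json.JSONDecoder.raw_decode`; the function name `extract_first_json` is made up for illustration, and this is one possible approach, not what any commenter here actually runs.

```python
import json
from typing import Any, Optional


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object or array embedded in `text`, or None.

    Scans for an opening brace or bracket and asks the decoder to parse
    from there; raw_decode ignores whatever trails the parsed value.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, keep scanning
    return None


if __name__ == "__main__":
    reply = 'Sure! Here is the JSON you asked for:\n{"status": "ok", "ids": [1, 2]}\nHope that helps.'
    print(extract_first_json(reply))
```

One caveat: anything bracketed in the intro text that happens to be valid JSON, such as a footnote marker like `[1]`, will be returned first, so this is best paired with pre-filling the opening brace or with a schema check on the result.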


benreesman2y ago

Might be worth checking out.

lastdong2y ago

“Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule”

This is really interesting, is there any architecture documentation/articles that you can recommend?


thallium2052y ago

harrisoned2y ago

It certainly has use cases, just not as many as the hype led people to believe. For me:

- Regex expressions: ChatGPT is the best multi-million regex parser to date.

- Grammar and semantic check: It's a very good revision tool that has helped me many times, especially when writing in non-native languages.

- General coding: While your mileage may vary on that one, it has helped me a lot at work building stuff in languages I'm not very familiar with. Just snippets, nothing big.

int_19h2y ago

GPT-4 has amazing translation capabilities, too. Actually usable for long conversations.

joe_the_user2y ago

I have a friend who uses ChatGPT for writing quick policy statements for her clients (mostly schools). I have a friend who uses it to create images and descriptions for DnD adventures. LLMs have uses.

hubraumhugo2y ago

I think it comes down to relatively unexciting use cases that have a high business impact (process automation, RPA, data analysis), not fancy chatbots or generative art.

For example, we focused on the boring and hard task of web data extraction.

We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.

obiefernandez2y ago

The book I'm writing is almost finished and is based almost entirely on production use cases: https://leanpub.com/patterns-of-application-development-usin...

bbischof2y ago

Hello, it’s Bryan, an author on this piece.

cqqxo4zV46cp2y ago

Or maybe they could choose to focus their attention on people that aren’t needlessly aggressive and adversarial.

solidasparagus2y ago· 13 in thread

lmeyerov2y ago

We work in some pretty serious domains and try to stay away from fine tuning:

- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes so far here that the relative lift of adding fine-tuning isn't worth the many costs

- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3

- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"

- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc

solidasparagus2y ago


jph002y ago

solidasparagus2y ago

Huh. Well that's embarrassing. I guess I missed it when I lost interest in the caching section and jumped straight to Evaluation and Monitoring.

bbischof2y ago

Hello, it’s Bryan, an author on this piece.

solidasparagus2y ago

gandalfgeek2y ago

- "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" https://arxiv.org/abs/2405.05904

- "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" https://arxiv.org/abs/2312.05934

solidasparagus2y ago

Thanks, I'll read those more fully.


OutOfHere2y ago

solidasparagus2y ago

Expensive? Sure, all of AI is crazy expensive. Unfeasible? No


CuriouslyC2y ago

solidasparagus2y ago


phillipcarter2y ago

I don't see why this is seen as an either-or by people? Fine-tuning doesn't eliminate the need for RAG, and RAG doesn't obviate the need for fine-tuning either.

Note that their guidance here is quite practical:

> If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.

Multicomp2y ago· 9 in thread

If I can do node-red or a function chain for prompts and outputs, that would be sweet.

hugocbp2y ago

For me, a very simple "break down tasks into a queue and store in a DB" solution has helped tremendously with most requests.

Then just loop them back into the LLM.

I find that long chats or chains just confuse most models and we start seeing gibberish.

Right now I'm favoring something like:

"We're going to do task {task}. The current situation and context is {context}.

I find that leaving everything to the LLM in a sequence is not as effective as using the LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
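
A rough sketch of that queue-in-a-DB loop, using SQLite from the standard library. The table layout, the function names, and the `call_llm` callable are all assumptions for illustration, and the prompt template is only the part of the prompt quoted above; this is not the commenter's actual code.

```python
import sqlite3
from typing import Callable, Iterable, List

# Only the portion of the prompt quoted in the comment above.
PROMPT = "We're going to do task {task}. The current situation and context is {context}."


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical task table: one row per unit of work."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " description TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " result TEXT)"
    )


def enqueue(conn: sqlite3.Connection, descriptions: Iterable[str]) -> None:
    """Store the broken-down tasks (e.g. produced by an earlier LLM call)."""
    conn.executemany(
        "INSERT INTO tasks (description) VALUES (?)", [(d,) for d in descriptions]
    )
    conn.commit()


def run_queue(conn: sqlite3.Connection, call_llm: Callable[[str], str]) -> List[str]:
    """Process pending tasks one at a time, each in a fresh, short prompt."""
    results: List[str] = []
    while True:
        row = conn.execute(
            "SELECT id, description FROM tasks"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            break
        task_id, description = row
        # Code, not the chat history, decides what context each call sees.
        context = "\n".join(results) or "(nothing done yet)"
        result = call_llm(PROMPT.format(task=description, context=context))
        conn.execute(
            "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
            (result, task_id),
        )
        conn.commit()
        results.append(result)
    return results


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    enqueue(conn, ["outline the report", "draft the introduction"])
    # Stub LLM (assumption) so the sketch runs without any API access.
    print(run_queue(conn, lambda prompt: f"completed: {prompt[:40]}"))
```

Because every task gets its own short prompt and the state lives in the database, a failed step can be retried on its own instead of replaying a long chat.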

datameta2y ago

Indeed! If I'm met with several misunderstandings in a row, asking it to explain what I'm trying to do is a pretty surefire way to move forward.

Also mentioning what to "forget" or not focus on anymore seems to remove some noise from the responses if they are large.

gpsx2y ago

e1g2y ago

Perplexity recently released something like this https://www.perplexity.ai/hub/blog/perplexity-pages

punkspider2y ago

Perhaps this would be of use? https://github.com/langgenius/dify/ I use it for quick workflows and it's pretty intuitive.

proc02y ago

Sounds like you need an agent system; some libs are mentioned here: https://lilianweng.github.io/posts/2023-06-23-agent/

CuriouslyC2y ago


1272y ago

Did you force it into a parser? You can define a simple language in llama.cpp for the LLM to obey.

mentos2y ago

I still haven’t played with using one LLM to oversee another.

“You are in charge of game prep and must work with an LLM over many prompts to…”

jakubmazanec2y ago· 4 in thread

exhaze2y ago

I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.

Here's a more dramatic example: https://www.grey-wing.com/

This company provides deeply integrated LLM-powered software for operating freight ships.

There are a lot of people who are doing this and achieving very good results.

Sorry, if it's not working for you, it doesn't mean that it doesn't work.

robbiep2y ago

That’s really interesting. Surely the crewing roster stuff is actually using linear algebra rather than AI though?

qeternity2y ago

I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.

jakubmazanec2y ago

But it seems to me that's what they're doing: "We have LLMs, what to do with them?" But anyway, I'm seriously just looking for an example of an app that is built with the stuff described in the article.


JKCalhoun2y ago· 3 in thread

I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.

ezst2y ago

skydhash2y ago

amelius2y ago

Yeah like psychology being a different field from physics even if it is running on atoms ultimately.

Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)


threeseed2y ago· 3 in thread

https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...

So would strongly disagree that LLMs have become “good enough” for real-world applications based on what was promised.

phillipcarter2y ago

> So would strongly disagree that LLMs have become “good enough” for real-world applications based on what was promised.

I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.

mattyyeung2y ago

You may be interested in "Deterministic Quoting"[1]. This doesn't completely "solve" hallucinations, but I would argue that we do get "good enough" in several applications.

Disclosure: author on [1]

[1] https://mattyyeung.github.io/deterministic-quoting

threeseed2y ago

Have seen this approach before.

It's the "yes, we hallucinate, but don't worry because we provide the sources for users to check" approach.

Even though everyone knows that users will never check unless the hallucination is egregious.

It's such a disingenuous way of handling this.

mloncode2y ago· 2 in thread

This is Hamel, one of the authors of the article. We published the article with OReilly here:

Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

xnx2y ago

The link to part II from part I points back to part I.

seventytwo2y ago

Was wondering about the June 8th date on there :)

sheepscreek2y ago· 2 in thread

I’m sure this has some decent insights but it’s from almost 1 year ago! A lot has changed in this space since then.

bgrainger2y ago

Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.

jph002y ago

Looks like they made a mistake in the article metadata - they definitely just released this article.


blumomo2y ago· 1 in thread

> PUBLISHED

> June 8, 2024

Is this an article from the future?

defrost2y ago

Good catch.

Best guess is that's the anticipated publishing date of the full three parts on the official O'Reilly site.

See: https://news.ycombinator.com/item?id=40551413

DylanSp2y ago

Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.
