Here are my takeaways:
1. There are way too many premature abstractions. Langchain, as one of many examples, might be useful in the future, but at the end of the day prompts are just an API call, and it's easier to write standard code that treats LLM calls as a flaky API call rather than as a special thing.
2. Hallucinations are definitely a big problem. Summarizing is pretty rock solid in my testing, but reasoning is really hard. Action models, where you ask the LLM to take in a user input and decide what to do next, are just really hard; specifically, it's hard to get the LLM to understand the context and to say when it's not sure.
That said, it's still a gamechanger that I can do it at all.
3. I am a bit more hyped than the author that this is a game changer, but like them, I don't think it's going to be the end of the world. There are some jobs that are going to be heavily impacted and I think we are going to have a rough few years of bots astroturfing platforms. But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.
IMHO it's similar to what happened to DevOps in the 2000s: you just don't need a big special team to help you deploy anymore; you hire a few specialists and mostly buy off-the-shelf solutions. Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me.
I advocate for these metaphors to help people better understand a reasonable expectation for LLMs in modern development workflows, mostly because they show it as a trade-off versus a silver bullet. There were trade-offs to the evolution of devops: consider, for example, the loss of key skillsets like database administration as a direct result of "just use AWS RDS" and the explosion in cloud billing costs (especially the OpEx of startups who weren't even dealing with that much data or regional complexity!) - and how it indirectly led to GitLab's big outage and many like it.
This is a function of the language model itself. By the time you get to the output, the uncertainty that is inherent in the computation is lost to the prediction. It is like if you ask me to guess heads or tails, and I guess heads: I could have stated my uncertainty (e.g. Pr[H] = .5) beforehand, but in my actual prediction of heads, and then the coin flip, that uncertainty is lost. It's the same with LLMs. The uncertainty in the computation is lost in the final prediction of the tokens, so unless the predicted text itself expresses uncertainty (which it should rarely do based on the training corpus, I think), you should not really ever find an LLM output saying it does not understand. But that is because it never understands, it just predicts.
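To make the coin-flip point concrete, here is a toy Python sketch. The two distributions are made up for illustration and are not taken from any real model, and `entropy_bits` and `predict` are hypothetical helper names. It only shows that the spread of a distribution is measurable before the pick, while the picked value alone carries none of it.

```python
import math


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def predict(probs):
    """Collapse the distribution to its single most likely outcome."""
    return max(probs, key=probs.get)


# Made-up distributions: a fair coin and a near-certain answer.
coin = {"heads": 0.5, "tails": 0.5}
confident = {"1787": 0.99, "349 B.C.": 0.01}

# The picks look equally assertive even though the uncertainty differs a lot.
print(predict(coin), entropy_bits(coin))
print(predict(confident), round(entropy_bits(confident), 3))
```

Once only the output of `predict` is kept, there is no way to tell the coin-flip case from the near-certain one.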
I haven't tried it myself yet, not sure how well it works in practice.
Or do you mean that fine-tuning distorts these likelihoods so models can no longer accurately signal uncertainty?
Most prompts are written in the form “you are a helpful assistant, you will do X, you will not do Y”
I believe that inclusion of instructions like “if there are possible answers that differ and contradict, state that and estimate the probability of each” would help knowledgeable users.
But for typical users and PR purposes, it would be a disaster. It is better to tell 999 people that the US constitution was signed in 1787 and 1 person that it was signed in 349 B.C. than it is to tell 1000 people that it was probably signed in 1787 but it might have been 349 B.C.
Why shouldn't you ask for uncertainty?
I love asking for scores / probabilities (usually give a range, like 0.0 to 1.0) whenever I ask for a list, and it makes the output much more usable
Basically, I think we’re using GPT as the PaaS/heroku/render equivalent of AI ops.
Thank you for the insight!!
Thank you. Seeing similar things. Clients are also seeing sticker shock on how much the big models cost vs. the output. That will all come down over time.
So will interest, as more and more people realise there's nothing "intelligent" about the technology, it's merely a Markov-chain-word-salad generator with some weights to improve the accuracy somewhat.
I'm sure some people (other than AI investors) are getting some value out of it, but I've found it to be most unsuited to most of the tasks I've applied it to.
Asking for analogies has been interesting and surprisingly useful.
Yet, for some reason, ChatGPT is still pretty bad at generating titles for chats, and I didn't have better luck with the API even after trying to engineer the right prompt for quite a while...
For some odd reason, once in a while I get things in different languages. It's funny when it's in a language I can speak, but I recently got "Relm4 App Yenileştirme Titizliği" which ChatGPT tells me means "Relm4 App Renewal Thoroughness" when I actually was asking it to adapt a snippet of gtk-rs code to relm4, so not particularly helpful
They are also dull APIs (higher latency for the same resources) if you're self-hosting an LLM. Capacity planning needs special attention.
For example?
Now, if cost is little concern you can use zero shot prompting on an inefficient model. If cost is a concern, you can use GPT4 to create your golden dataset way faster and cheaper than human annotations, and then train your more efficient model.
Some example NLP tasks could be classifiers, sentiment, extracting data from documents. But I’d be curious which areas of NLP __weren’t__ disrupted by LLMs.
Much like you working with Bob and opining that Bob is great, and me saying that I find Jack easier to work with.
Yes I've found Claude to be capable of writing closer to the instructions in the prompt, whereas ChatGPT feels obligated to do the classic LLM end to each sentence, "comma, gerund, platitude", allowing us to easily recognize the text as a GPT output (see what I did there?)
My experience has been the opposite. I subscribe to multiple services as well and copy/paste the same question to all. For my software dev related questions, Claude Opus is so far ahead that I am thinking that it no longer is necessary to use GPT4.
For code samples I request, GPT4-produced code often fails to even compile. That almost never happens with Claude.
My new litmus test is “give me 10 quirky bars within 200 miles of Austin.”
This is incredibly difficult for all of them, gpt4 is kind of close, Claude just made shit up, Gemini shat itself.
I wonder why? It seems to work pretty well for me.
> Lesson 4: GPT is really bad at producing the null hypothesis
Tell me about it! Just yesterday I was testing a prompt around text modification rules that ended with “If none of the rules apply to the text, return the original text without any changes”.
Do you know ChatGPT’s response to a text where none of the rules applied?
“The original text without any changes”. Yes, the literal string.
One fun anecdote, a while back I was making an image of three women drinking wine in a fancy garden for a tarot card, and at the end of the prompt I had "lush vegetation" but that was enough to tip the women from classy to red nosed frat girls, because of the double meaning of lush.
I read this as "what we do works just fine to not need to use JSON mode". We're in the same boat at my company. Been live for a year now, no need to switch. Our prompt is effective at getting GPT-3.5 to always produce JSON.
> GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.
This is just a prompt issue. I've had it reliably return up to 200 items in correct order. The trick is to not use lists at all but have JSON keys like "item1":{...} in the output. You can use lists as the values here if you have some input with 0-n outputs.
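A minimal sketch of the keyed-object trick described above, assuming the model is asked for keys named item1 through itemN. The prompt wording and both helper names are assumptions for illustration, not the commenter's actual code.

```python
import json


def keyed_output_instructions(n):
    """Prompt fragment asking for numbered keys instead of a JSON list (hypothetical wording)."""
    return (
        f'Return a JSON object with exactly {n} keys named "item1" through '
        f'"item{n}". Each value must be an object describing one item.'
    )


def parse_keyed_items(raw, expected):
    """Return (items in key order, names of any missing keys)."""
    data = json.loads(raw)
    items, missing = [], []
    for i in range(1, expected + 1):
        key = f"item{i}"
        if key in data:
            items.append(data[key])
        else:
            missing.append(key)
    return items, missing


# Stand-in for a model response: keys out of order and one item short.
raw = '{"item2": {"name": "b"}, "item1": {"name": "a"}}'
items, missing = parse_keyed_items(raw, expected=3)
print(items)    # [{'name': 'a'}, {'name': 'b'}]
print(missing)  # ['item3']
```

A side benefit of numbered keys is that a short response is easy to detect: the missing key names say exactly which items to ask for again.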
I haven’t looked at the prompts we run in prod at $DAYJOB for a while but I think we have at least five or ten things that are REALLY weird out of context.
If I give GPT4 a list of existing items with a defined structure, and it is just having to convert schema or something like that to JSON, it can do that all day long. But if it has to do any sort of reasoning and basically create its own list, it only gives me a very limited subset.
I have similar issues with other LLMs.
Very interested in how you are approaching this.
https://gist.github.com/thbar/a53123cbe7765219c1eca77e03e675...
Are you using the function calling/tool use API?
I was trying to build an intelligent search feature for my notes and asking ChatGPT to return structured JSON data. For example, I wanted to ask "give me all my notes that mention Haskell in the last 2 years that are marked as draft", and let ChatGPT figure out what to return. This only worked some of the time. Instead, I put my data in a SQLite database, sent ChatGPT the schema, and asked it to write a query to return what I wanted. That has worked much better.
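Roughly what that schema-then-query flow could look like, as a sketch using only Python's built-in `sqlite3`. The `notes` table, its columns, the prompt wording, and the helper names are all assumptions, and the model call is replaced by a hard-coded SQL string so the example runs offline.

```python
import sqlite3


def schema_text(conn):
    """Collect the CREATE TABLE statements to send along with the question."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def build_prompt(conn, question):
    """Hypothetical prompt wording: schema first, then the user's question."""
    return (
        "Given this SQLite schema:\n"
        f"{schema_text(conn)}\n"
        "Write one SELECT statement that answers the question below. "
        "Return only SQL.\n"
        f"Question: {question}"
    )


def run_generated_query(conn, sql):
    """Crude guard: refuse anything that does not start with SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT, body TEXT, "
    "status TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO notes (title, body, status, created_at) VALUES (?, ?, ?, ?)",
    [
        ("Monads", "Notes on Haskell monads", "draft", "2024-03-01"),
        ("Old post", "Haskell intro", "published", "2019-05-01"),
        ("Groceries", "Milk, eggs", "draft", "2024-04-01"),
    ],
)

prompt = build_prompt(conn, "All draft notes since 2023 that mention Haskell")
# Hard-coded stand-in for what the model might send back for that prompt.
generated_sql = (
    "SELECT title FROM notes WHERE body LIKE '%Haskell%' "
    "AND status = 'draft' AND created_at >= '2023-01-01'"
)
print(run_generated_query(conn, generated_sql))  # [('Monads',)]
```

The prefix check is only a first line of defense; opening the database read-only is a stronger way to keep generated SQL from changing anything.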
I had better luck with function-calling to get a structured response, but it is more limiting than just getting a JSON body.
1. A recitation of terrible problems 2. A declaration of general satisfaction.
Clearly and obviously, ChatGPT is an unreliable toy. The author seems pleased with it. As an engineer, I find that unacceptable.
That doesn't mean they can't be incredibly useful - but it does mean you have to approach them in a bit of a different way, and design software around them that takes their unreliability into account.
The relatively low price point certainly plays a role here, but it's certainly not a mainly recreational thing for me. These things are kinda hard to measure, but roughly the biggest plus is that engagement with hard stuff goes up, and rate of learning goes up, by a lot.
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
== End of toot.
The price you pay for this bullshit in energy when the sea temperature is literally off the charts and we do not know why makes it not worth it in my opinion.
"return nothing if you find nothing" is the level 0 version of giving the LLM an out. Give it a softer out ("in the event that you do not have sufficient information to make conclusive statements, you may hypothesize as long as you state clearly that you are doing so, and note the evidence and logical basis for your hypothesis") then ask it to evaluate its own response at the end.
Need to verify if it even knows what you mean by nothing.
I think we now know, collectively, a lot more about what’s annoying/hard about building LLM features than we did when LangChain was being furiously developed.
And some things we thought would be important and not-easy, turned out to be very easy: like getting GPT to give back well-formed JSON.
So I think there’s lots of room.
One thing LangChain is doing now that solves something that IS very hard/annoying is testing. I spent 30 minutes yesterday re-running a slow prompt because 1 in 5 runs would produce weird output. Each tweak to the prompt, I had to run at least 10 times to be reasonably sure it was an improvement.
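The manual re-running described above can be scripted with a tiny harness. This is a hedged sketch, not LangChain's tooling: `pass_rate`, `generate`, and `is_acceptable` are hypothetical names, and the model is replaced by canned outputs so the example is deterministic.

```python
def pass_rate(generate, is_acceptable, runs=10):
    """Fraction of `runs` calls to `generate` whose output passes `is_acceptable`."""
    passed = sum(1 for _ in range(runs) if is_acceptable(generate()))
    return passed / runs


# Stand-in for a model that produces weird output 1 time in 5.
canned = iter(["ok", "ok", "weird", "ok", "ok"] * 2)
rate = pass_rate(lambda: next(canned), lambda out: out == "ok", runs=10)
print(rate)  # 0.8
```

Ten runs is a small sample: a prompt that fails 1 time in 5 still passes all ten runs about 11% of the time (0.8^10 ≈ 0.107), so one clean batch is weak evidence that a tweak helped.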
Like listening to my students all going to ‘call some API’ for their projects is really very sad to hear. Many startup fellows share this sentiment, which totally kills all the joy.
It's much better at critical thinking tasks and prose.
Don't mistake benchmarks for real world performance across actual usecases. There's a bit of Goodhart's Law going on with LLM evaluation and optimization.
When you are integrating these things into your business, you are looking for different things. Most of our customers would for example not find it very cool to have a service outage because somebody wanted to not kill all the joy.
But then is the author (and are we) talking experience in reselling APIs or experience in introducing NNs in the pipeline? Not the same thing IMHO.
Agreed that OpenAI provides very good service, Gemini is not quite there yet, Groq (the LPUs) delivered a nice tech demo, Mixtral is cool but lacks in certain areas, and Claude can be lengthy.
But precisely because I’m not sticking with OAI I can then restate my view that if someone is so good with prompts he can get the same results locally if he knows what he’s doing.
Prompting OpenAI the right way can be similarly difficult.
Perhaps the whole idea of local inference only matters for IoT scenarios or whenever data is super sensitive (or CTO super stubborn to let it embed and fly). But then if you start from day 1 with WordPress provisioned for you ready to go in Google Cloud, you’d never understand the underlying details of the technology.
There sure also must be a good reason why Phind tuned their own thing to offer alongside GPT4 APIs.
Disclaimer: tech education is a side thing I do, indeed, and I've been doing it in person for a very long time, across more than a dozen topics, to allow myself to have an opinion. Of course business is a different matter and strategic decisions are not the same. Even so, I’d not advise anyone to blindly use APIs unless they appreciate the need properly.
Sounds like everyone eventually concludes that Langchain is bloated and useless and creates way more problems than it solves. I don’t get the hype.
I also tried the API for some financial analysis of large tables. The response time was around 2 minutes, but it still did really well, and timeout errors were only around 1 to 2%.
is to have it return very specific text that I string-match on and treat as null.
Like: "if there is no warm up for this workout, use the following text in the description: NOPE"
then in code I just do an "if warm up contains NOPE, treat it as null"
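In Python, that string match could be as small as the sketch below. The "NOPE" marker and the "contains" check come from the comment above; the function name and the whitespace stripping are assumptions.

```python
SENTINEL = "NOPE"


def parse_optional(text, sentinel=SENTINEL):
    """Return None when the model used the sentinel, else the cleaned text."""
    if sentinel in text:
        return None
    return text.strip()


print(parse_optional("NOPE"))                  # None
print(parse_optional(" 5 min easy jog \n"))    # 5 min easy jog
```

A "contains" check tolerates extra words or punctuation around the marker, at the cost of treating any real description that happens to include the marker as null, so the marker should be a string that cannot occur in normal output.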
Google Gemini were showing something that I'd call 'adapted output UI' in their launch presentation. Is that close to what you're doing in any way?
You can find some open-source examples here https://github.com/chatbotkit. More coming next week.
I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”
Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.
Natural language is the most probable output for GPT, because the text it was trained with is similar. In this case the developer simply leaned more into what GPT is good at than giving it more work.
You can use simple tasks to make GPT fail. Letter replacements, intentional typos and so on are very hard tasks for GPT. This is also true for ID mappings and similar, especially when the ID mapping diverges significantly from other mappings it may have been trained with (e.g. Non-ISO country codes but similar three letter codes etc.).
The fascinating thing is that GPT "understands" mappings at all, which is the actual hint at higher-order pattern matching.
The answer is a bit boring: the expenditure definitely has helped customers - in that, they're using AI generated responses in all their workflows all the time in the app, and barely notice it.
See what I did there? :) I'm mostly serious though - one weird thing about our app is that you might not even know we're using AI, unless we literally tell you in the app.
And I think that's where we're at with AI and LLMs these days, at least for our use case.
You might find this other post I just put up to have more details too, related to how/where I see the primary value: https://kenkantzer.com/gpt-is-the-heroku-of-ai/
My experience around Langchain/RAG differs, so wanted to dig deeper: Putting some logic around handling relevant results helps us produce useful output. Curious what differs on their end.
It does very badly over diverse business docs, especially with naive chunking. B2B use cases usually have old PDFs and Word docs that need to be searched, and they're often looking for specific keywords (e.g. a person's name, a product, an id, etc). Vectors tend to do badly in those kinds of searches, and just returning chunks misses a lot of important details.
Especially if they aren’t in the token vocab
Not sure if he means training here or using his product. I think the latter.
My end-user exp of GPT3.5 is that I need to be - not just precise but the exact flavor of precise. It's usually after some trial and error. Then more error. Then more trial.
Getting a useful result on the 1st or 3rd try happens maybe 1 in 10 sessions. A bit more common is having 3.5 include what I clearly asked it not to. It often complies eventually.
Numbered bullets work well for this, if you don’t need JSON. With JSON, you can ask it to include an ‘id’ in each item.
> No. Not with this transformers + the data of the internet + $XB infrastructure approach.
Errr ...did they really mean Gen AI .. or AGI?
I don't want a crap intro or waffley summary but it just can't help itself.
As an example of contextual baggage, I wrote a tool where I had to adjust the prompt between Claude and GPT-4 because using the word "website" in the prompt caused GPT-4 (API) to go into its 'I do not have access to the internet' tirade about 30% of the time. The tool was a summary of web pages experiment. By removing 'website' and replacing it with 'content' (e.g. 'summarize the following content') GPT-4 happily complied 100% of the time.
Personally I thought this was an interesting read - and more interesting because it didn’t contain any massive “WE DID THIS AND IT CHANGED OUR LIVES!!!” style revelations.
It is discursive, thoughtful and not overwritten. I find this kind of content valuable and somewhat rare.
LLMs are set up to output tokens. Not to not output tokens.
So instead of "don't return anything" have the lack of results "return the default value of XYZ" and then just do a text search on the result for that default value (i.e. XYZ) the same way you do the text search for the state names.
Also, system prompts can be very useful. It's basically your opportunity to have the LLM roleplay as X. I wish they'd let the system prompt be passed directly, but it's still better than nothing.
If you pass in a whole list of states, you're kind of making the vectors for every state light up. If you just say "state" and the text you passed in has an explicit state, then fewer vectors specific to what you're searching for light up. So when it performs the softmax, the correct state is more likely to be selected.
Along the same lines I think his \n vs comma comparison probably comes down to tokenization differences.
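A toy illustration of the softmax part of that intuition. The logits are invented, and this is not a claim about what actually happens inside GPT; it only shows that raising the scores of competing options lowers the probability of the correct one, even when its own score stays the same.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Invented scores: only the state named in the text is boosted.
focused = {"Texas": 4.0, "Ohio": 0.0, "Utah": 0.0}
# Invented scores: listing every state in the prompt boosts the others too.
crowded = {"Texas": 4.0, "Ohio": 2.5, "Utah": 2.5}

print(round(softmax(focused)["Texas"], 3))
print(round(softmax(crowded)["Texas"], 3))
```

With these numbers the correct option drops from roughly 0.96 to roughly 0.69, which is the shape of the effect the comment describes.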
The take on RAG feels application specific. For our use case, where details of the past get rendered up, the ability to generate loose connections is actually a feature. Things like this are what excite me most about LLMs: having a way to proxy subjective similarities the way we do when we remember things is a benefit of the technology that didn’t really exist before, and it opens up a new kind of product opportunity.
Changing the prompt didn't help, but moving to GPT-4 did help a bit.
I just don't agree with the Claude assessment. In my experience, Claude 3 Opus is vastly superior to GPT-4. Maybe the author was comparing with Claude 2? (And I've never tested Gemini)
> While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn’t explicitly asked it to do that.
> So we switched to a simple string search on the name to find the state, and it’s been working beautifully ever since.
So, using ChatGPT helped uncover the correct schema, right?
> It’s the subtle things mostly, like intuiting intention.
this makes me wonder - what if the author "trained" himself onto chatgpt's "dialect"? How do we even detect that in ourselves?
and are we about to have "preferred_LLM wars" like we had "programming language wars" for the last 2 decades?
Why not really compare the two options, author? I would love to see the results!
I have an ignore command so that it will wait when the user isn't finished speaking. Which it generally judges okay, unless it has 'null' in there.
The nice thing is that I have found most of the problems with the LLM response were just indications that I hadn't finished debugging my program because I had something missing or weird in the prompt I gave it.
Using a multi-billion-parameter model like GPT-4 for such a trivial classification task[1] is insane overkill. And in an era where ChatGPT exists, and can in fact give you what you need to build a simpler classifier for the task, it shows how narrow-minded most people are when AI is involved.
[1] to clarify, it's either trivial or impossible to do reliably depending on how fucked-up your input is
Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens later, and probably 3-4 very diverse LLM features in production, and our openai_service file still has only one, 40-line function in it:
def extract_json(prompt, variable_length_input, number_retries)
The only API we use is chat. We always extract json. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When gpt-4-turbo was released, we updated one string in the codebase.
This is the beauty of a powerful generalized model – less is more."
Well said!
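For anyone wondering what a function with that signature might look like, here is a guess at its shape, not the author's actual code. The first three parameters come from the quoted signature; `call_model` is a hypothetical fourth parameter standing in for the chat API call so the sketch runs without network access. It also shows the "treat it as a flaky API" idea: retry on bad output or transient errors, then give up loudly.

```python
import json


def extract_json(prompt, variable_length_input, number_retries, call_model):
    """Call the model, pull the first {...} span out of the reply, retry on failure."""
    last_error = None
    for _ in range(number_retries + 1):
        try:
            raw = call_model(f"{prompt}\n\n{variable_length_input}")
            start, end = raw.find("{"), raw.rfind("}")
            if start == -1 or end < start:
                raise ValueError("no JSON object in response")
            # json.JSONDecodeError is a ValueError, so bad JSON is retried too.
            return json.loads(raw[start:end + 1])
        except (ValueError, TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError(
        f"no valid JSON after {number_retries + 1} attempts"
    ) from last_error


# Fake model: first reply has no JSON, second wraps it in chatter.
replies = iter(["Sorry, could you rephrase?", 'Sure! {"state": "Texas"} Hope that helps.'])
result = extract_json(
    "Return the state as JSON.", "Some document text", 2, lambda _: next(replies)
)
print(result)  # {'state': 'Texas'}
```

Slicing from the first `{` to the last `}` is a simple way to tolerate text around the JSON; it would not handle a reply containing several separate objects.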
Why? The null stuff would not be a problem if you did and if you're only dealing with JSON anyway I don't see why you wouldn't.