To me this viewpoint looks totally alien. Imagine you have been training this model to predict the next token. At first it can barely interleave vowels and consonants. Then it can start making words, then whole sentences. Then it starts unlocking every cognitive ability one by one. It begins to pass nearly every human test and certification exam and psychological test of theory of mind.
Now imagine thinking at this point "training larger models with more data may not offer significant improvements" and deciding that's why you stop scaling it. That makes absolutely no sense to me unless 1) you have no imagination or 2) you want to stop because you are scared to make superhuman intelligence or 3) you are lying to throw off competitors or regulators or other people.
ChatGPT scrapes all the information given, then predicts the next token. It has no ability to understand what is truthful or correct. It’s as good as the data being fed to it.
To me, this is a step closer to AGI but we’re still far off. There’s a difference between “what’s statistically likely to be the next word” vs “despite this being the most likely next word, it’s actually wrong and here’s why”
If we say, “well, we’ll tell chatgpt what the correct sources of information are” that’s no better really. It’s not reasoning, it’s just a neutered data set.
I imagine they need to add something like the live internet access ChatGPT 4 has, or something else, to get the next meaningful bump.
I don’t recall who said it, but a similar thread had a researcher in the field express that we have squeezed far more juice than expected from these transformer models. Not that no new progress in this direction can be made, but it seems like we’re approaching diminishing returns.
I believe the next step that’s close is to have these train on less and less horsepower. If we can have these models run on a phone locally, oh boy that’s gonna be something
The truth is that functionally/technically, there's plenty left to squeeze. The bigger issue is that we're hitting a wall economically.
That is precisely true of Humans as well though! :-)
It is obscenely expensive to keep training + there is other, lower-hanging fruit + you expect hardware to get better over time.
I don't think Altman is trying to fool anyone. Even if he were it wouldn't work. The competition is not that stupid and he knows that :)
It's just that hardware tends to get better at a rate that resembles Moore's law, so in 18 months the cost of training a $100 million model is $50 million. You certainly can just throw money at the problem, but it's expensive and there are other options that are just as effective for now. Why spend money on things that are half as valuable in 18 months when you can spend money on things that don't devalue as fast, like producing more/better data?
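To put rough numbers on that, here is a minimal sketch of the halving arithmetic. It assumes cost falls by exactly half every 18 months, which is the commenter's rule of thumb and not a measured figure; `projected_cost` is a made-up helper name.

```python
def projected_cost(cost_today: float, months: float, halving_period: float = 18.0) -> float:
    """Cost of the same training run `months` from now.

    Assumption (not a measured fact): cost halves every `halving_period` months.
    """
    return cost_today * 0.5 ** (months / halving_period)


# The same $100M run priced today, in 18 months, and in 36 months.
for months in (0, 18, 36):
    print(months, projected_cost(100e6, months))
```

Under that assumption the run costs half as much after 18 months and a quarter as much after 36, which is the whole argument for spending on data first.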
All that being said you can bet your ass there will be a gpt5 :)
Or work on consistency within a scope. For example, it can't write a novel because it doesn't have object consistency. A character will be 15 years old then 28 years old three sentences later.
Or allow it database/API access so it can interpolate canonical information into its responses.
None of these have to do with scale of data (as far as I understand.) All of them are, in my opinion, higher ROI areas for development for LLM => AGI.
Best you can hope for is that they combine the expertise of all authors in the training data, which would be very impressive, but more top-tier human than super-human. However, achieving this level of performance may well be beyond what a transformer of any size can do. It may take a better architecture.
I suspect that there is also probably a dumbing-down effect from training the model on material from people who themselves are on a spectrum of different abilities. Simply put, the model is being rewarded when trained for being correct as often as possible (i.e. on average), so if it saw the same subject matter in the training set 10 times, once by an expert and 9 times by mid-wits, then it's going to be rewarded for mid-wit performance.
For a squishy example of a known conscious system, if you scoop out certain small, relatively fixed, regions of our brains, you can make consciousness, memory, and learning mostly cease. This suggests it's partly due to special subsystems, rather than total connection count.
It has the same advantages search has over ChatGPT (being able to cite sources, being quite unlikely to hallucinate) and it has some of the advantages ChatGPT has over search (not needing exact query) - but in my experience it's not really in the new category of information discovery that ChatGPT introduced us to.
Maybe with more context I'll change my tune, but it's very much at the whim of the context retrieval finding everything you need to answer the query. That's easy for stuff that search is already good at, and so provides a better interface for search. But it's hard for stuff that search isn't good at, because, well: it's search.
But 90% of the time, it’s two barely distinct personalities chatting back and forth:
Me: Hey brian, what do you think of AI?
Brian: It’s great!
Me: I’m so glad we agree.
Brian: Great, this increases the training weight of Brian agreeing with Brian to a much more accurate level!
Me: Agree!
But these optimizations are applications of technology stacks we already know about. Sometimes, this era of AI research reminds me of all the whacky contraptions from the era before building airplanes became an engineering discipline.
I would likely have tried building a backyard ornithopter powered by mining explosives, if I had been alive during that period of experimentation.
Prediction: the best interfaces for this will be the ones we use for everything else as humans. I am trying to approach it more like that, and less like APIs and “document vs relational vs vector storage”.
I agree that there's probably a better solution than pure embedding-based or mixed embedding/keyword search, but the "better" solution will still be based around semantics... aka embeddings.
I think the two could be paired up effectively. Context windows are getting bigger, but are still limited in the amount of information ChatGPT can sift through. This in turn limits the utility of current plugin based approaches.
Letting ChatGPT ask for relevant information, and sift through it based on its internal knowledge, seems valuable. If nothing else, it allows "learning" from recent developments and would effectively augment its reasoning capability by having more information in working memory.
One trick is to have a LLM hallucinate a document based on the query, and then embed that hallucinated document. Unfortunately this increases the latency since it incurs another round trip to the LLM.
If you could spot the need for it while streaming a response you could possibly even have it ready ahead of time
https://arxiv.org/abs/2212.10496
Summary —
HyDE is a new method for creating effective zero-shot dense retrieval systems that generates hypothetical documents based on queries and encodes them using an unsupervised contrastively learned encoder to identify relevant documents. It outperforms state-of-the-art unsupervised dense retrievers and performs strongly compared to fine-tuned retrievers across various tasks and languages.
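For anyone who wants to see the shape of the trick, here is a minimal sketch of the hallucinate-then-embed flow. Everything in it is a stand-in: `fake_llm` is a hypothetical placeholder for the real LLM call, and `embed` is a toy bag-of-words counter, not the contrastively trained encoder the paper uses.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fake_llm(query):
    """Hypothetical placeholder for the LLM call.

    A real model would write a plausible (possibly wrong) answer to `query`.
    """
    return "The Golden Gate Bridge opened in 1937 and spans the strait in San Francisco."


def hyde_search(query, corpus, generate=fake_llm, top_k=1):
    hypothetical = generate(query)   # the extra LLM round trip mentioned above
    q_vec = embed(hypothetical)      # embed the made-up document, not the query
    ranked = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)
    return ranked[:top_k]


corpus = [
    "Construction of the Golden Gate Bridge finished in 1937 in San Francisco.",
    "Sourdough bread needs a starter and a long fermentation.",
    "Transformers use attention to mix information across tokens.",
]
print(hyde_search("when was that famous red bridge built?", corpus))
```

The point is only that the hallucinated text shares far more vocabulary with the stored document than the short question does, so the details it gets wrong matter less than the neighbourhood it lands in.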
Aleph Alpha provides an asymmetric embedding model which I believe is an attempt to resolve this issue (haven't looked into it much, just saw the entry in langchain's documentation)
I'm not following why you would want to do this? At that point, just asking the LLM without any additional context would/should produce the same (inaccurate) results.
E.g. Today I woke up at 9.am, had a light breakfast and then went on a run in Golden Gate Park.
What questions do you generate from this sentence?
Can this be implemented in current opensource models?
Another option is to ask GPT to compress your tokens into a shorter prompt for itself.
[0] https://www.theverge.com/2023/4/14/23683084/openai-gpt-5-rum...
What these articles don't touch on is what to do once you've got the most relevant documents. Do you use the whole document as context directly? Do you summarize the documents first using the LLM (now the risk of hallucination in this step is added)? What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)? Doing that will be useful but still lossy.
What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybrid search (keyword or bm25 + embeddings) more viable in the context of combining it with an LLM.
Figuring out which of these choices to make, along with an awful lot more choices I'm likely not even thinking about right now, is what will separate the useful from the useless LLM + extractive knowledge systems.
This is news to me. Where could I read about this trick?
> "Do you use the whole document as context directly? Do you summarize the documents first using the LLM (now the risk of hallucination in this step is added)?"
In my opinion the best approach is to take a large document and break it down into chunks before storing as embeddings and only querying back the relevant passages (chunks).
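A rough sketch of what I mean, assuming word-based chunks with a small overlap. The sizes are arbitrary, `chunk_words` and `top_chunks` are made-up names, and `embed` is a toy bag-of-words stand-in for whatever embedding model you actually use.

```python
import math
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


document = " ".join(f"word{i}" for i in range(100))
chunks = chunk_words(document, size=50, overlap=10)
print(len(chunks))
print(top_chunks("word95 word96", chunks, k=1)[0].split()[0])
```

Only the passages returned by `top_chunks` go into the prompt, so the context budget is spent on the relevant parts of the document and not on the whole thing.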
> "What about that trick where you shrink a whole document of context down to the embedding space of a single token (which is how ChatGPT is remembering the previous conversations)"
Not sure I follow here but seems interesting if possible, do you have any references?
> "What about simply asking the LLM to craft its own search prompt to the DB given the user input, rather than returning articles that semantically match the query the closest? This would also make hybrid search (keyword or bm25 + embeddings) more viable in the context of combining it with an LLM"
This is definitely doable but just adds to the overall processing/latency (if that is a concern).
I played with that approach in this post - https://friend.computer/jekyll/update/2023/04/30/wikidata-ll.... "Craft a query" is nice as it gives you a very declarative intermediate state for debugging.
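On the hybrid side of this, one common way to merge a keyword ranking with an embedding ranking is reciprocal rank fusion. Nobody in this thread said they use it, so treat it as my assumption; a minimal sketch follows, where the two input rankings are made-up results you would get by running the LLM-crafted query against each backend.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) per list it appears in; k=60 is the
    value commonly used for this method, not something tuned here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the same LLM-crafted query from two backends.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]      # e.g. from BM25
embedding_ranking = ["doc_a", "doc_d", "doc_b"]    # e.g. from a vector index

print(reciprocal_rank_fusion([keyword_ranking, embedding_ranking]))
```

Documents that both backends like float to the top, and because it only looks at ranks you don't have to reconcile BM25 scores with cosine similarities.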
I can try to make a Ruby client.
A Ruby client would be great. Our FastAPI spec makes this pretty easy - it's at localhost:8000/openapi.json when the docker backend is running.
>> “If you don't know the answer, just say that you don't know, don't try to make up an answer”
//
It seems silly to make this part of the prompt rather than a separate parameter. Surely we could design the response to be close to factual, then run a checker to ascertain a score for the factuality of the output?
Technobabble explanation: such "silly" additions are a natural way to emphasize certain dimensions of the latent space more than others, focusing the proximity search GPTs are doing.
Working model I've been getting some good mileage out of: GPT-4 is like a 4-year-old kid that somehow managed to read half of the Internet. Sure, it kinda remembers and possibly understands a lot, but it still thinks like a 4-year-old, has about as much attention span, and you need to treat it like a kid that age.
The model searches until it finds an answer, including distance and resolution
Search is performed by a DB, the query then sub-queries LLMs on a tree of embeddings
Each coordinate of an embedding vector is a pair of coordinate and LLM
Like a dynamic dictionary, in which the definition for the word is an LLM trained on the word
Indexes become shortcuts to meanings that we can choose based on case and context
Does this exist already?
per·snick·et·y: placing too much emphasis on trivial or minor details; fussy. "she's very persnickety about her food"
A dynamic entry could instead be an LLM that will answer things related to the word, e.g.:
What is the definition of persnickety?
How can I use it in a sentence?
What are some notable documents that include it?
Any famous quotes?
…
So each entry is an LLM trained mostly only on that keyword/concept definition
There are some that believe in smaller models: https://twitter.com/chai_research/status/1655649081035980802...
In this case, it can't possibly be approached. It certainly can't be attained.
Borges' Library of Babel, which represents all possible combinations of letters that can fit into a 410-page book, only contains some 25^1312000 books. And the overwhelming majority of its books are full of gibberish. The amount of "knowledge" that an LLM can learn or describe is VERY strictly bounded and strictly finite. (This is perhaps its defining characteristic.)
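For a sense of scale, a quick sketch of where that exponent comes from and how big the number is. The book dimensions (410 pages, 40 lines per page, 80 characters per line, 25 symbols) are taken from Borges's story, not from anything stated in this thread.

```python
import math

SYMBOLS = 25                                   # orthographic symbols in the story
PAGES, LINES_PER_PAGE, CHARS_PER_LINE = 410, 40, 80

# Characters in a single book; this is the exponent in 25^1312000.
chars_per_book = PAGES * LINES_PER_PAGE * CHARS_PER_LINE

# The count itself is too large to print comfortably, so count its decimal digits.
digits = math.floor(chars_per_book * math.log10(SYMBOLS)) + 1

print(chars_per_book, digits)
```

So the number of books has a little over 1.8 million decimal digits: enormous, but finite, which is the point.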
I know this is pedantic, but I am a philosopher of mathematics and this is a matter that's rather important to me.
I don’t think this is pedantic. Words carry a specific meaning or what’s the point of words otherwise.
We've done this in NLP and search forever. I guess even SQL query planners and other things that automatically rewrite queries might count.
It's just that now the parameters seem squishier with a prompt interface. It's almost like we need some kind of symbolic structure again.