If we look back at the history of mechanical machines, we see many of the same debates happening there that we do around AI today -- comparing them to the abilities of humans or animals, arguing that "sure, this machine can do X, but humans can do Y better..." But over time, we've generally stopped doing that as we've gotten used to mechanical machines. I don't know that I've ever heard anyone compare a wheel to a leg, for instance, even though both "do" the same thing, because at this point we take wheels for granted. Wheels are much more efficient at transporting objects across a surface in some circumstances, but no one's going around saying "yeah, but they will never be able to climb stairs as well" because, well, at this point we recognize that's not an actual argument we need to have. We know what wheels do and don't do.
These AI machines are a fairly novel type of machine, so we don't yet really understand which arguments make sense to have and which are unnecessary. But I like these posts that get more into exactly what an LLM _is_, as I find them helpful in understanding better exactly what kind of machine an LLM is. They're not "intelligent" any more than any other machine is (and historically, people have sometimes ascribed intelligence, even sentience, to simple mechanical machines), but that's not so important. Exactly what we'll end up doing with these machines will be very interesting.
I believe that this is dangerous in many valuable applications, and will mean that the current generation of LLMs will be more limited in value than some people believe. I think this is quite similar to the problems that self-driving cars have: we can make good ones for sure, but they are not good enough or predictable enough to be trusted and used without significant constraints.
My worry is that LLMs will get used inappropriately and will hurt lots of people. I wonder if there is a way to stop this?
It’s just a tool
Language models feed from the same source. They carry as much claim to intelligence; it's the same intelligence. What makes language models inferior today is the lack of access to feedback signals. They are not embodied, embedded, enacted and extended in the environment (the 4 E's). They don't even have a code execution engine to iterate on bugs. But they could have.
And when a model does have access to massive experimentation and search, and can learn from its outcomes, like AlphaGo, then it can beat us at our own game. Training just in self-play mode, learning from verified outcomes, was enough to surpass two thousand years of history, all of our players put together.
I think future code generation models will surpass human level based on massive problem solving experience, and most of it will be generated by its previous version. A human could not experience as much in a lifetime.
This is the second source of intelligence - experience. For language models it only costs money to generate, it's not a matter of getting more human data. So the path is wide open now. Who has the money to crank out millions of questions, problems and tasks + their solutions?
I also cannot fathom how models can develop a sense of time, or structured knowledge of the world consisting of discrete objects, even with a large dose of RLHF, if the internal representations are continuous, and layer normalised, and otherwise incapable of arriving at any hard-ish, logic-like rules. All these models seem to have deep-seated architectural limitations, and they are almost at the limit of the available training data. Being non-vague and positive-minded about this doesn't solve the issue. The models can write polite emails and funny reviews of Persian rugs in haiku, but they are deeply unreasonable and 100% unreliable. There is hardly a solid business or social case for this stuff.
Curiously enough, I imagine that sort of filtering/translation is the sort of thing a Large Language Model would be pretty good at.
The models do not understand language like humans do.
Duh? They are not humans? Of course they differ in some of their mechanisms. They can still tell us a lot about language structure. And for what they don't tell us, we can look elsewhere.
Instead of an argument about "tone," it's always going to be better to be specific about your objection. In my experience, nine times out of ten when asked to be specific, the "tone" problem turns out to be that the author said something "is wrong," and the reviewer is pretending that a horrible mistake has been made by not instead saying "I think it could be wrong," or "this is how I think that this thing might be improved."
Nobody should be required to prefix the things they are saying they believe with the fact that those things are their opinions. Who else's opinions would they be? Also, nobody should be required to describe what they think in a way that compliments and builds on things that they think are wrong. It's up to those people to make their arguments themselves. There is no obligation to try to fix things that you actually just want to replace.
Contrary to what you say here, I don't think those behaviors make anyone more receptive to one's arguments, because those objections are actually vacuous rhetorical distractions from actual disagreements (whether something is true or false) that can be argued on their merits if there are merits to argue. In fact, I think those behaviors indicate an eagerness to reduce conflict that will only be taken advantage of by someone objecting to "tone" in bad faith. If you've said "I think that this method would improve the process," there's really no reason that a "tone"-arguer can't be upset that you said that it "would" improve the process instead of "could" improve the process. In fact, it's an act of presumptuous elitism that you think you could improve the process, and it disrespects the many very well-regarded researchers involved to state as a fact that you could see something that they haven't.
Sorry for the rant, but I think that arguments about "tone" or whether something is "just your opinion, man" are far worse internet pollution than advertising, and I get triggered.
By improving this section I think we can have a standard go-to doc to refute the common-but-boring arguments. By anticipating what they say (and yes, Bender is very predictable... you could almost make a chatbot that predicts her) it greatly weakens their argument.
Criticism is essential for progress in science, and even in AI research (which is far from science). Get over it. The role of the critic is not to be your enemy; the role of the critic is to help you improve your work. It makes no difference whether the critic is a bad person who wants your downfall or not. What makes a difference is whether you can convincingly demonstrate that your critic's criticism no longer holds. Then people stop listening to the critic -- not when you shout louder than the critic.
Oh and, btw, you do that demonstrating by improving your work, which implies that you need to be one of the researchers whose work is criticised to do it, rather than some random cheerleader of the interwebs. What you propose here -- composing some sort of document to paste all over Twitter every time someone says something critical of the "home team" -- is not what researchers do; it's organising an internet mob. And it has exactly zero chance of being of any use to anyone.
Not to mention the focus on Emily Bender is downright creepy.
This is the key comparison. A 747-400 burns 10+ metric tons of kerosene per hour, which means its basic energy consumption is > 110MW. The cost to train GPT-3 was approximately the same energy spent by one 8-hour airline flight.
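The arithmetic behind this comparison can be sketched quickly. The fuel-energy density and the GPT-3 training-energy figure below are rough public estimates I'm assuming, not numbers from the comment:

```python
# Back-of-the-envelope check of the 747-vs-GPT-3 energy comparison.
# Assumptions: ~10 t/h kerosene burn at cruise, ~43 MJ/kg energy density
# for jet fuel, and a commonly cited rough estimate of ~1,300 MWh to
# train GPT-3 (all approximate).
fuel_burn_kg_per_h = 10_000
energy_density_mj_per_kg = 43

# MJ per hour divided by 3600 s/h gives MJ/s, i.e. MW.
power_mw = fuel_burn_kg_per_h * energy_density_mj_per_kg / 3600
print(f"747 cruise power: ~{power_mw:.0f} MW")

training_energy_mwh = 1_300  # assumed GPT-3 training energy
flight_hours_equivalent = training_energy_mwh / power_mw
print(f"Equivalent flight time: ~{flight_hours_equivalent:.0f} h")
```

Under these assumptions the training run comes out in the ballpark of a single long-haul flight, which is the point being made.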
(As a PS, I've seen that last one mainly as a refutation for the "LLMs are ready to kill search" meme. In that context it's a very valid objection.)
Having said that, I would add a note about the whole category of ontological or "nothing but" arguments - saying that an LLM is nothing but a fancy database, search engine, autocomplete or whatever. There's an element of question-begging when these statements are prefaced with "they will never lead to machine understanding because...", and beyond that, the more they are conflated with everyday technology, the more noteworthy their performance appears.
> Also, let's put things in perspective: yes, it is environmentally costly, but we aren't training that many of them, and the total cost is minuscule compared to all the other energy consumption we humans do.
Part of the reason LLMs aren't that big in the grand scheme of things is because they haven't been good enough and businesses haven't started to really adopt them. That will change, but the costs will be high because they're also extremely expensive to run. I think the author is focusing on the training costs for now, but that will likely get dwarfed by operational costs. What then? Waving one's arms and saying it'll just "get cheaper over time" isn't an acceptable answer because it's hard work and we don't really know how cheap we can get right now. It must be a focus if we actually care about widespread adoption and environmental impact.
The interesting details are: the companies with large GPU/TPU fleets are already running them in fairly efficient setups, with high utilization (so you're not blowing carbon emissions on idle machines), and can scale those setups if demand increases. This is not irresponsible. And the scale-up will only happen if the systems are actually useful.
Basically there are 100 other things I'd focus on trimming environment impact for before LLMs.
Edit: This isn’t handwaving btw, this is to say some fairly decent solutions are available now.
Now maybe I'm naive somehow because I'm a machine-learning person who doesn't work on LLMs/big-ass-transformers, but uh... why do they actually have to be this large to get this level of performance?
To add a research-oriented comparison to the others being presented here, the LHC's annual energy budget is about 3,000 times that of training GPT-3.
And transformers are not even the final model; who knows what will come next.
> Yoav Goldberg is a computer science professor and researcher in the field of natural language processing (NLP). He is currently a professor at Bar-Ilan University in Israel and a senior researcher at the Allen Institute for Artificial Intelligence (AI2).
Professor Goldberg has made significant contributions to the NLP field, particularly in the areas of syntactic parsing, word embeddings, and multi-task learning. He has published numerous papers in top-tier conferences and journals, and his work has been widely cited by other researchers.
This is what math is: abstract syntactic rules. GPTs, however, seem to struggle particularly with counting, probably because their structure does not have a notion of order. I wonder if future LLMs built for math will basically solve all of math (i.e., whether they will be able to find a proof for any provable statement).
Grounding LLMs to images will be super interesting to see though, because images have order and so much of abstract thinking is spatial/geometric in its base. Perhaps those will be the first true AIs
I apologize for the confusion caused by my previous response. You are correct that the star-shaped block will not fit into the square hole. That is because the edges of the star shape will obstruct the block from fitting into the square hole. The star-shaped block fits into the round hole.
Block-and-hole puzzles were developed in the early 20th century as children’s teaching time. They’re a common fixture in play rooms and doctors offices throughout the world. The star shape was invented in 1973.
Please let me know if there’s anything else I can assist you with.
I think this misses a big component of RLHF (the reinforcement learning). The approach described above is "just" supervised learning on human demonstrations. RLHF uses a reinforcement learning objective to train the model rather than maximizing likelihood of human demonstrations. In fact, you can then take the utterances your model has generated, collect human feedback on those to improve your reward model, and then train a new (hopefully better) model -- you no longer need a human roleplaying as an AI. This changed objective addresses some of the alignment issues that LMs struggle with: OpenAI does a pretty good job of summarizing the motivation in https://arxiv.org/abs/2009.01325:
> While [supervised learning] has led to markedly improved performance, there is still a misalignment between this fine-tuning objective—maximizing the likelihood of human-written text—and what we care about—generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance. Optimizing for quality may be a principled approach to overcoming these problems.
where RLHF is one approach to "optimizing for quality".
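To make the loop concrete, here is a toy sketch of the sample-score-update cycle. This is not OpenAI's implementation: the "policy" is a bag of word logits, the "reward model" is a hard-coded stand-in for learned human preferences, and the update is a crude REINFORCE-style nudge rather than PPO:

```python
import math
import random

random.seed(0)

# Toy RLHF-style loop: 1) sample an output from the policy,
# 2) score it with a (stand-in) reward model, 3) nudge the policy
# toward outputs the reward model likes. All names/numbers invented.
VOCAB = ["helpful", "rude", "accurate", "made-up"]
logits = {w: 0.0 for w in VOCAB}  # the policy's "parameters"

def sample():
    # Softmax sampling over the current logits.
    weights = [math.exp(logits[w]) for w in VOCAB]
    return random.choices(VOCAB, weights=weights)[0]

def reward(word):
    # Stand-in reward model: "humans" prefer helpful/accurate outputs.
    return 1.0 if word in ("helpful", "accurate") else -1.0

LR = 0.5
for _ in range(200):
    w = sample()
    # Crude policy-gradient step: move the sampled word's logit in
    # the direction of its reward (a baseline-free REINFORCE sketch).
    logits[w] += LR * reward(w)

print(max(logits, key=logits.get))
```

The point of the sketch is the shape of the objective: nothing here maximizes likelihood of demonstrations; the model is trained on scores assigned to its own samples, which is exactly the shift the quoted passage argues for.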
At this point, for me, the notion of machine "intelligence" is a more reasonable proposition. However, this shift is the result of reconsidering the binary proposition of "dumb or intelligent like humans".
First, I propose a possible discriminant between "intelligence" and "computation": given a machine that has produced a reasonable response, ask whether an algorithm could have brute-force computed that same response from the input corpus of the 'AI' under consideration.
It also seems reasonable to begin to differentiate 'kinds' of intelligence. On this very planet there are a variety of creatures that exhibit some form of intelligence. And they seem to be distinct kinds. Social insects are arguably intelligent. Crows are discussed frequently on hacker news. Fluffy is not entirely dumb either. But are these all the same 'kind' of intelligence?
Putting my cards on the table: at this point it seems eminently possible that we will create some form of mechanical insectoid intelligence. I do not believe insects have any need for 'meaning' - form will do. That distinction also takes the sticky 'what is consciousness?' question out of the equation.
And... the claim is that humans can do this? Is it just the boring "This AI can only receive information via tokens, whereas humans get it via more high resolution senses of various types, and somehow that is what causes the ability to figure out two things are actually the same thing?" thing?
I expect if you asked "Did $FAMOUS_EVENT happen before $OTHER_FAMOUS_EVENT" it would do OK, just as "What is $FAMOUS_NUMBER plus $FAMOUS_NUMBER?" does OK, but as you get more obscure it will fall down badly on tasks that humans would generally do OK at.
Though, no, humans are not perfect at this by any means either.
It is important to remember that what this entire technology boils down to is "what word is most likely to follow the content up to this point?", iterated. What that can do is impressive, no question, but at the same time, if you can try to imagine interacting with the world through that one and only tool, you may be able to better understand the limitations of this technology too. There are some tasks that just can't be performed that way.
(You'll have a hard time doing so, though. It is very hard to think in that manner. As a human, I tend to think in whole sentences at a minimum, which I then serialize into words. Trying to imagine operating in terms of "OK, what's the next word?" "OK, what's the next word?" "OK, what's the next word?" with no forward planning beyond what is implied by your choice of this particular word is not something that comes even remotely naturally to us.)
When this tech answers the question "Did $FAMOUS_EVENT happen before $OTHER_FAMOUS_EVENT?", it is not thinking, OK, this event happened in 1876 and the other event happened in 1986, so, yes, it's before. It is thinking "What is the most likely next word after '... $OTHER_FAMOUS_EVENT?" "What is the next most likely word after that?" and so on. For famous events it is reasonably likely to get them right because the training data has relationships for the famous events. It might even make mistakes in a very human manner. But it's not doing temporal logic, because it can't. There's nowhere for "temporal logic" to be taking place.
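The "most likely next word, iterated" loop described above can be sketched with a toy stand-in for a trained model. The bigram table and the example sentence here are invented for illustration; a real LM conditions on the whole context with learned probabilities, but the control flow is the same:

```python
# Minimal sketch of autoregressive generation: pick the most likely
# next word, append it, repeat. BIGRAMS is a made-up stand-in for a
# trained language model's next-word distribution.
BIGRAMS = {
    "the": {"event": 0.6, "war": 0.4},
    "event": {"happened": 1.0},
    "happened": {"in": 1.0},
    "in": {"1876": 0.7, "1986": 0.3},
}

def generate(start, max_words=5):
    out = [start]
    while out[-1] in BIGRAMS and len(out) < max_words:
        nxt = BIGRAMS[out[-1]]
        # Greedy decoding: always take the single most likely next word.
        out.append(max(nxt, key=nxt.get))
    return " ".join(out)

print(generate("the"))  # → "the event happened in 1876"
```

Note what's absent: there is no date comparison anywhere; "1876" wins only because it carries more probability mass, which is the point being made about temporal logic.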
dumb boring critiques, so what? so boring! we'll "be careful", OK? so just shut up!
> Why is this significant? At the core the model is still doing language modeling, right? learning to predict the next word, based on text alone? Sure, but here the human annotators inject some level of grounding to the text. Some symbols ("summarize", "translate", "formal") are used in a consistent way together with the concept/task they denote. And they always appear in the beginning of the text. This makes these symbols (or the "instructions") in some loose sense external to the rest of the data, making the act of producing a summary grounded to the human concept of "summary". Or in other words, this helps the model learn the communicative intent of a user who asks for a "summary" in its "instruction". An objection here would be that such cases likely naturally occur already in large text collections, and the model already learned from them, so what is new here? I argue that it might be much easier to learn from direct instructions like these than it is to learn from non-instruction data (think of a direct statement like "this is a dog" vs needing to infer from over-hearing people talk about dogs). And that by shifting the distribution of the training data towards these annotated cases, we substantially alter how the model acts, and the amount of "grounding" it has. And that maybe with explicit instructions data, we can use much less training text compared to what was needed without them. (I promised you hand waving didn't I?)
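The "instruction first, consistently" property the quoted passage describes can be shown with a tiny data sketch. The record layout and serialization below are hypothetical (real instruction-tuning datasets vary in format), but they illustrate the consistent instruction-at-the-start convention:

```python
# Hypothetical instruction-formatted training pairs: the instruction
# symbol ("summarize", "translate to French") always appears first and
# is used consistently with the task it names, loosely grounding the
# symbol to the human concept. All strings here are invented examples.
raw_example = "The 30-page report concluded that sales fell sharply."

instruction_examples = [
    {"instruction": "summarize",
     "input": raw_example,
     "output": "Sales fell sharply."},
    {"instruction": "translate to French",
     "input": "Good morning.",
     "output": "Bonjour."},
]

def to_training_text(ex):
    # Flatten into the "instruction comes first" layout.
    return f"{ex['instruction']}: {ex['input']}\n=> {ex['output']}"

for ex in instruction_examples:
    print(to_training_text(ex))
```

Contrast this with raw web text, where "summarize" appears in arbitrary positions and senses; the consistent placement is what makes the symbol behave like an external signal rather than ordinary content.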
It's easy to say "Oh well, humans are biased too" when the biases of these machines don't: misgender you, mistranslate text that relates to you, have negative affect toward you, are more likely to write violent stories related to you, have lower performance on tasks related to you, etc.
The models encode many biases and stereotypes.
Well, sure they do. They model observed human language, and we humans are terrible beings; we are biased and are constantly stereotyping. This means we need to be careful when applying these models to real-world tasks, but it doesn't make them less valid, useful or interesting from a scientific perspective.
Not sure how this can be seen as dismissive.

> Yoav can dismiss this because it just doesn't affect him much.
Maybe just maybe someone named Yoav Goldberg might maybe be in a group where bias affects him quite strongly.
I'll take an example. I'm making an adventure/strategy game set in 1990s Finland. We had a lot of Somali refugees coming from the Soviet Union back then, and to reflect that I've created a female Somali character who is unable to find employment due to the racist attitudes of the time.
I'm using DALL-E 2 to create some template graphics for the game and using the prompt "somali middle aged female pixel art with hijab" produces some real monstrosities https://imgur.com/a/1o2CEi9 whereas "nordic female middle age minister short dark hair pixel art portrait pixelated smiling glasses" produces exclusively decent results https://imgur.com/a/ag2ifqi .
I'm an extremely privileged white, middle-aged, straight cis male and I'm able to point out a problem. Of course I'm not against hiring minorities, just saying that you don't need to belong to any minority group to spot the biases.