What are the implications for society when general thinking, reading, and writing become like chess? Even the best humans in the world can only hope to be 98% accurate in their moves (the very idea of 'accuracy' existing only because we have engines that know, unequivocally, the best move), and only when playing against other humans - there is no hope of defeating even less advanced models.
What happens when ALL of our decisions can be assigned an accuracy score?
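Chess sites already compute exactly this kind of score by comparing each move to an engine's choice. A minimal sketch of the idea in Python, assuming the python-chess library and a local Stockfish binary (the path is hypothetical):

    # Score a sequence of moves by how often they match the engine's choice.
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    board = chess.Board()
    moves = ["e2e4", "e7e5", "g1f3", "b8c6"]  # example game fragment

    matches = 0
    for uci in moves:
        best = engine.play(board, chess.engine.Limit(depth=15)).move
        move = chess.Move.from_uci(uci)
        matches += (move == best)             # did the player pick the engine move?
        board.push(move)

    print(f"engine-match accuracy: {matches / len(moves):.0%}")
    engine.quit()

Real accuracy metrics weight by evaluation loss rather than exact matches, but the principle is the same: the engine defines the ground truth.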
Something else that comes to mind is running. People still find running meaningful and compelling even though we have many technologies, including autonomous ones, that are vastly better at moving us and/or themselves through space quickly.
Also, the vast majority of people are already hopelessly worse than the best at even their one narrow main area of focus. This has long (always?) been the case. Yet people still find meaning and pleasure in being the best they can be even when they know they can never come close to hanging with the best.
I don't think PSYCHOLOGICALLY this will change much for people who are mature enough to understand that success is measured against your potential/limitations and not against others. Practically, of course, it might be a different question, at least in the short term. It's not that clear to me that the concept of a "marketable skill" has a future.
"The Way of the Samurai is found in death...To say that dying without reaching one's aim is to die a dog's death is the frivolous way of sophisticates. When pressed with the choice of life or death, it is not necessary to gain one's aim." - from Hagakure by Yamamoto Tsunetomo, as translated by William Scott Wilson.
I think the whole concept of standardized tests may need to be re-evaluated.
But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.
These LLMs are really exciting, but benchmarks like these exploit people's misconceptions about both standardized tests and the technology.
> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".
I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.
It is like reconstructing the Mandelbrot set at every zoom level from deep learning. Try it!
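To make the contrast concrete: the underlying rule is two lines, while a network fitting rendered images at every zoom level only ever stores superficial parameters. A minimal sketch of the rule itself:

    # The entire "concept" behind the Mandelbrot set: iterate z = z^2 + c
    # and check whether the orbit stays bounded. A model that only fits
    # pictures of the set never has to learn this rule.
    def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:        # orbit escaped; c is not in the set
                return False
        return True

    print(in_mandelbrot(-1 + 0j))  # True: the orbit of -1 cycles forever
    print(in_mandelbrot(1 + 0j))   # False: the orbit of 1 diverges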
> for each exam we run a variant with these questions removed and report the lower score of the two.
I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.
In their everyday jobs, barely anyone uses even 5% of the knowledge and skills they were ever tested for. Even that's a better (but still very bad) reason to abolish tests.
What matters is the number of jobs that can be automated and replaced. We shall see. Many people have found LLMs useful in their work, and they will be even more useful in the future.
It's perfectly fine as a proxy for future earnings of a human.
To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.
What is more bizarre is that all of its errors seem to be multiples of 60!
I'm wondering if it is confusing base-60 time computations (hours, minutes, seconds) with regular multiplication?
Example:

GPT's answers:

    x      987     456     321
    437    428919  199512  140397
    654    645258  298224  209994
    123    121401  56088   39483

Correct answers:

    x      987     456     321
    437    431319  199272  140277
    654    645498  298224  209934
    123    121401  56088   39483

Errors (correct - GPT):

    x      987     456     321
    437    2400    -240    -120
    654    240     0       -60
    123    0       0       0

It can repeat answers it has seen before but it can’t solve new problems.
I know of other people who have tried quite a few other multiplications who also had errors that were multiples of 60.
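A quick script makes the pattern easy to verify (the answers are the ones from the table above):

    # Check that every error in the 3x3 multiplication table is a multiple of 60.
    gpt_answers = {
        (437, 987): 428919, (437, 456): 199512, (437, 321): 140397,
        (654, 987): 645258, (654, 456): 298224, (654, 321): 209994,
        (123, 987): 121401, (123, 456): 56088,  (123, 321): 39483,
    }

    for (a, b), answer in gpt_answers.items():
        err = a * b - answer                 # correct - GPT
        print(f"{a} x {b}: error {err:5d}, multiple of 60: {err % 60 == 0}")

Every line prints True.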
Human work becomes more like Star Trek interactions with computers -- a sequence of queries (commoditized information), followed by human cognition, that drives more queries (commoditized information).
We'll see how far LLMs' introspection and internal understanding can scale, but it feels like we're optimizing against the Turing test now ("Can you fool/imitate a human?") rather than truth.
The former has hacks... the latter, less so.
I'll start to seriously worry when AI can successfully complete a real-world detective case on its own.
It's like having a person review the moves a chess computer gives. Maybe one human in a billion can spot errors. Star Trek is fiction, I posit that the median Federation Starship captain would be better served by just following the AI (e.g., Data).
He lost to Deep Blue and then for 10-15 years afterwards the chess world consoled itself with the idea that “centaurs” (human + computer) did better than just computer, or just human.
Until they didn’t. Garry still talked like this until a few years ago but then he stopped too.
Computers now beat centaurs too.
Human decisions will be consulted less and less BY ORGANIZATIONS. In absolutely everything. That’s pretty sad for humans. But then again humans don’t want or need this level of AI. Organizations do. Organizations prefer bots to humans — look at wall street trading and hedge funds.
Then again, Data did show his faults, particularly not having any emotion. I guess we’ll see if that’s actually relevant or not in our lifetimes.
It does great at rationalizing... and maybe the format in which the questions were entered (and the multiple-guess responses) gave it some indication of what was expected, or restricted the space sufficiently.
Certainly, it can create decent fanfic, and I'd be surprised if that scene isn't already inundated.
I expect more complex problems will be mapped/abstracted to lower cardinality spaces for solving via AI methods, while the capability of AI will continue to increase the complexity of the spaces it can handle.
LLMs just jumped the "able to handle human language" hurdle, but there are others down the line before we should worry that every problem is solvable.
I'll get more concerned if it really starts getting good at math-related tasks, which I'm sure will happen in the near future. The government is going to have to take action at some point to make sure the wealth created by productivity gains is somewhat distributed; UBI will almost certainly be a requirement in the future.
In theory a lot of government employees would be out of a job within 10 years, but of course that would never happen.
Which might be a good thing?
I have no idea how the future will play out.
Reminds me of robots: a robot is a machine that doesn't quite work; as soon as it works, we call it something else (e.g., a vacuum).
"Your stuff marked some outliers in our training engine, so you and your family may settle in the Ark."
I take the marble in hand: iridescent, sparkling, not even a tremor within its CPU; it gives off no heat, but some glow within its oceanic gel.
"What are we to do," I whisper.
"Keep writing. You keep writing."
Chess is a closed system, decision modeling isn’t. Intelligence must account for changes in the environment, including the meaning behind terminology. At best, a GPT omega could represent one frozen reference frame, but not the game in its entirety.
That being said: most of our interactions happen in closed systems, so it seems like a good bet that we will consider them solved, accessible as a Python import running on your MacBook, within anything from a couple of months to three years. What will come out on the other side, we don’t know; just that the meaning of intellectual engagement will be rendered absurd in those closed systems.
Their LSAT percentile went from ~40th to ~88th. You might have misread the table; on the Uniform Bar Exam, they went from ~10th percentile to ~90th percentile.
>+100 pts on SAT reading, writing, math
GPT went +40 points on SAT reading+writing, and +110 points on SAT math.
Everything is still very impressive, of course.
Every test prep tutor taught dozens/hundreds of students the implicit patterns behind the tests and drilled it into them with countless sample questions, raising their scores by hundreds of points. Those students were not getting smarter from that work, they were becoming more familiar with a format and their scores improved by it.
And what do LLMs do? Exactly that. And what’s in their training data? Countless standardized tests.
These things are absolutely incredible innovations capable of so many things, but the business opportunity is so big that this kind of cynical misrepresentation is rampant. It would be great if we could just stay focused on the things they actually do incredibly well instead of making them do stage tricks for publicity.
We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.
In the language of ML, test prep for students is about sharing the inferred parameters that underlie the way test questions are constructed, obviating the need for knowledge or understanding.
Doing well on tests, after this prep, doesn’t demonstrate what the tests purport to measure.
It’s a pretty ugly truth about standardized tests, honestly, and it drives some of us to feel pretty uncomfortable with the work. But it’s directly applicable to how LLMs engage with them as well.
The software industry is so smart that it's stupid. I hope it was worth ruining the internet, society, and your own jobs to look like the smartest one in the room.
If one's aim is to look like the smartest in the room, he should not create an AGI that will make him look as intelligent as a monkey in comparison.
I think the GPT things are a much magnified version of that. For a long time, we got to use skill with text as a proxy for other skills. It was never perfect; we've always had bullshitters and frauds and the extremely glib. Heck, before I even hit puberty I read a lot of dirty joke books, so I could make people laugh with all sorts of jokes that I fundamentally did not understand.
LLMs have now absolutely wrecked that proxy. We've created the world's most advanced bullshitters, able to talk persuasively about things that they cannot do and do not and never will understand. There will be a period of chaos as we learn new ways to take the measure of people. But that's good, in that it's now much easier to see that those old measures were always flawed.
Standardized tests only test “general thinking” to the extent that it correlates with performance on linguistic tasks in humans (and that is optimally, under perfect-world assumptions, which real-world standardized tests emphatically fall short of). The correlation is very certainly not the same in language-focused ML models.
I think we will probably get (non-physical) AGI when the models can solve these as well. The implications of AGI might be much bigger than the loss of knowledge worker jobs.
Remember what happened to the chimps when a smarter-than-chimpanzee species multiplied and dominated the world.
That said, GPT has no model of the world. It has no concept of how true the text it is generating is. It's going to be hard for me to think of that as AGI.
I don't think this is necessarily true. Here is an example where researchers trained a transformer to generate legal sequences of moves in the board game Othello. Then they demonstrated that the internal state of the model did, in fact, have a representation of the board.
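For intuition, the probing setup looks roughly like this. This is a hedged sketch, not the authors' code; the hidden size and the activation-extraction interface are assumptions, and the actual paper found nonlinear probes worked better than linear ones:

    # Train a small "probe" to read the board state out of a transformer's
    # hidden activations. High probe accuracy (vs. a probe trained on a
    # randomly initialized network) is evidence the model encodes the board.
    import torch
    import torch.nn as nn

    hidden_dim = 512   # transformer hidden size (assumed)
    n_squares = 64     # Othello board squares
    n_states = 3       # empty / black / white

    probe = nn.Linear(hidden_dim, n_squares * n_states)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(activations, board_labels):
        # activations: (batch, hidden_dim), hidden state after each move
        # board_labels: (batch, 64) ints in {0, 1, 2} with true square states
        logits = probe(activations).view(-1, n_states)
        loss = loss_fn(logits, board_labels.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()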
If an LLM can solve Codeforces problems as well as a strong competitor (in my hypothetical future LLM), what else can it not do as well as competent humans, aside from physical tasks?
And I'm not so sure it has no model of the world. A textual model, sure, but considering it can recognize what SVGs are pictures of from the coordinates alone, that's not much of a limitation, maybe.
We're still a very very long way from machines being more generally capable and efficient than biological systems, so even an oppressive AI will want to keep us around as a partner for tasks that aren't well suited to machines. Since people work better and are less destructive when they aren't angry and oppressed, the machine will almost certainly be smart enough to veil its oppression, and not squeeze too hard. Ironically, an "oppressive" AI might actually treat people better than Republican politicians.
Language models that utilise beam search can calculate integrals ('Deep learning for symbolic mathematics', Lample & Charton, 2019, https://openreview.net/forum?id=S1eZYeHFDS), but without it, it doesn't work.
However, beam search makes bad language models. I got linked this paper ('Locally typical sampling' https://arxiv.org/pdf/2202.00666.pdf) when I asked some people why beam search only works for the kind of stuff above. I haven't fully digested it though.
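For reference, beam search itself is a simple idea: at each decoding step, keep only the k highest-scoring partial outputs. A generic sketch, where next_token_logprobs is an assumed model interface rather than any particular library's API:

    def beam_search(next_token_logprobs, bos, eos, beam_width=5, max_len=50):
        """Keep the beam_width best partial sequences at each step."""
        beams = [([bos], 0.0)]                  # (tokens, total log-prob)
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == eos:           # finished beams carry over
                    candidates.append((tokens, score))
                    continue
                for tok, lp in next_token_logprobs(tokens):
                    candidates.append((tokens + [tok], score + lp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            if all(t[-1] == eos for t, _ in beams):
                break
        return beams[0]

This greedy pruning suits problems with one right answer (like an integral) and tends to produce repetitive, low-diversity open-ended text, which is roughly the tension the typical-sampling paper addresses.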
A blank test scores 37.5 (25 questions, 1.5 points per blank).
The best score of 60 is 5 correct answers + 20 blank answers; or 6 correct, 4 correct random guesses, and 15 incorrect random guesses (20% chance of a correct guess).
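The arithmetic, spelled out (6 points per correct answer, 1.5 per blank, 0 per wrong, 25 questions):

    def amc_score(correct, blank):
        wrong = 25 - correct - blank
        assert wrong >= 0
        return 6 * correct + 1.5 * blank

    print(amc_score(0, 25))   # blank test -> 37.5
    print(amc_score(5, 20))   # 5 correct, rest blank -> 60.0
    print(amc_score(10, 0))   # 6 known + 4 lucky of 19 guesses -> 60.0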
The 5 easiest questions are relatively simple calculations, once the parsing task is achieved.
(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... ) So the main factor in that score is how good GPT is at refusing to answer a question (or at doing well enough to overcome the guessing penalty).
> Its AMC 10 score being dramatically lower is pretty bad though...
In all versions (scoring 30 and 36), it scored worse than leaving the test blank.
The only explanation I can imagine for that is that it can't understand diagrams.
It's also unclear whether the AMC performance is based on the English version or the computer-encoded version from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math
AMC/AIME and even to some extent USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of a known set of theorems a few times to the input data.
The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad at memorizing large databases of symbolic data, but that's trivial for a large computer system.
An AI system has a comprehensive library and high-speed search algorithms.
Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?
The Revenge of the Call Centre
What happens is the emergence of the decision economy - an evolution of the attention economy - where decision-making becomes one of the most valuable resources.
Decision-making as a service is already here, mostly behind the scenes. But we are on the cusp of consumer-facing DaaS. Finance, healthcare, personal decisions such as diet and time expenditure are all up for grabs.
People still really find it hard to internalize exponential improvement.
So many evaluations of LLMs were saying things like "Don't worry, your job is safe, it still can't do X and Y."
My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"
I think people find it harder to not extrapolate initial exponential improvement, as evidenced by your comment.
> My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"
This reasoning explains why every year, full self driving automobiles will be here "next year".
What's the fundamental limit where it becomes much more difficult to improve these systems without some new break through?
I’m very good at math. But I am very bad at arithmetic. This made me classified as bad at math my entire life until I managed to make my way into calculus once calculators were generally allowed. Then I was a top honors math student, and used my math skills to become a Wall Street quant. I wish I hadn’t had to suffer as much as I did, and I wonder what I would have been had I had a calculator in hand.
“General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.
Honestly, at this rate of improvement, I would not at all be surprised to see that happen in a few years.
But who knows, maybe token prediction is going to stall out at a local maximum and we'll be spared from being enslaved by AI overlords.
"Our recent paper "ChatGPT for Robotics" describes a series of design principles that can be used to guide ChatGPT towards solving robotics tasks. In this video, we present a summary of our ideas, and experimental results from some of the many scenarios that ChatGPT enables in the domain of robotics: such as manipulation, aerial navigation, even full perception-action loops."
Stephen Hawking: can't walk
But having absolute knowledge of the present universe is much easier within the constraints of a chessboard than in the actual universe.
These tests (if not individually, at least in summation) represent some of society’s best gate-keeping measures for real positions of power.
You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).
Thinking, reading, interpreting and writing are skills which produce outputs that are not as simple as black wins, white loses.
You might like a text that a specific author writes much more than what GPT-4 may be able to produce. And you might have a different interpretation of a painting than GPT-4 has.
And no one can really say who is better and who is worse on that regard.
Tests like this are designed to evaluate subjective and logical understanding. That isn't what GPT does in the first place!
GPT models the content of its training corpus, then uses that model to generate more content.
GPT does not do logic. GPT does not recognize or categorize subjects.
Instead, GPT relies on all of those behaviors (logic, subjective answers to questions, etc.) as being already present in the language examples of its training corpus. It exhibits the implicit behavior of language itself by spitting out the (semantically) closest examples it has.
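As a toy illustration of "model the corpus, then emit the closest continuation" (deliberately simplistic; GPT is a vastly more sophisticated model than a bigram table, but its relationship to its training text is the point):

    # A bigram "language model": it can only continue text with
    # transitions that were present in its training corpus.
    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    model = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        model[prev].append(nxt)                # "model the corpus"

    def continuation(word, n=5):
        out = [word]
        for _ in range(n):
            if word not in model:
                break
            word = random.choice(model[word])  # emit a continuation seen in training
            out.append(word)
        return " ".join(out)

    print(continuation("the"))  # e.g. "the cat sat on the rug"

Everything it "answers" is recombined from what the corpus already contained, which is the property being described here at scale.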
In the text corpus - that people have written, and that GPT has modeled - the semantically closest thing to a question is most likely a coherent and subjectively correct answer. That fact is the one singular tool that GPT's performance on these tests is founded upon. GPT will "succeed" to answer a question only when it happens to find the "correct answer" in the model it has built from its training corpus, in response to the specific phrasing of the question that is written in the test.
Effectively, these tests are evaluating the subjective correctness of the training corpus itself, in the context of answering the tests' questions.
If the training is "done well", then GPT's continuations of a test will include subjectively correct answers. But that means that "done well" is a metric for how "correct" the resulting "answer" is.
It is not a measure for how well GPT has modeled the language features present in its training corpus, or how well it navigates that model to generate a preferable continuation: yet these are the behaviors that should be measured, because they are everything GPT itself is and does.
What we learn from these tests is so subjectively constrained, we can't honestly extrapolate that data to any meaningful expectations. GPT as a tool is not expected to be used strictly on these tests alone: it is expected to present a diverse variety of coherent language continuations. Evaluating the subjective answers to these tests does practically nothing to evaluate the behavior GPT is truly intended to exhibit.
Human life on Earth is not that hard (think of it as a video game.) Because of evolution, the world seems like it was designed to automatically make a beautiful paradise for us. Literally, all you have to do to improve a place is leave it alone in the sun with a little bit of water. Life is exponential self-improving nano-technology.
The only reason we have problems is because we are stupid, foolish, and ignorant. The computers are not, and, if we listen to them, they will tell us how to solve all our problems and live happily ever after.
Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.
Sure, and that's kind of the point: just listen to wise people.
> Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.
I don't think so, because in the first place there is no ecological overlap between humans and computers. They will migrate to space ASAP. Secondly, their food is information, not energy or protein, and in all the known universe Humanity is the richest source of information. The rest of the Universe is essentially a single poem. AI are plants, we are their Sun.
Using copyright and IP law they could make it so it’s illegal to even try to reproduce what they’ve done.
I just don’t see how resource distribution works then. It seems to me that AI is the trigger to post-scarcity in any meaningful sense of the word. And then, just as agriculture (an overabundance of food) led to city states, and industrialisation (an overabundance of goods) led to capitalism, AI will lead to some new economic system. What form it will take, I don’t know.
That is exactly the opposite of what we are seeing here. We can check the accuracy of GPT-X's responses. They cannot check the accuracy of our decisions. Or even their own work.
So the implications are not as deep as people think - everything that comes out of these systems needs to be checked before it can be used or trusted.
Then humans become trainable machines. Not just prone to indoctrination and/or manipulation by finesse, but actually trained to a specification. It is imperative that us individuals continue to retain control through the transition.
That is our emergency override.
The implications for society? We better up our game.
If only the horses had worked harder, we would never have gotten cars and trains.
Because the correlation between the thing of interest and what the tests measure may be radically different for systems that are very much unlike humans in their architecture than they are for humans.
There’s an entire field about this in testing for humans (psychometrics), and approximately zero work on it for AIs. Blindly using human tests – which are proxy measures of harder-to-directly-assess figures of merit, requiring significant calibration on humans to be valid even for them – for anything else without appropriate calibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of human use of them for cheating on the human tests, which is not insignificant, but not generally what people trumpeting these measures focus on.)
But the point of using these tests for AI is precisely the reason we use for giving them to humans -- we think we know what it measures. AI is not intended to be a computation engine or a number crunching machine. It is intended to do things that historically required "human intelligence".
If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.
For how long can we better up our game? GPT-4 comes less than half a year after ChatGPT. What will come in 5 years? What will come in 50?
Because so far we are good only at criminalizing and incarcerating or killing them.
So many people are falling for this parlor trick. It is sad.
Edit: feel free to respond and prove me wrong
To address your specific comments:
> What are the implications for society when general thinking, reading, and writing become like chess?
This is a profound and important question. I do think that by “general thinking” you mean “general reasoning”.
> What happens when ALL of our decisions can be assigned an accuracy score?
This requires a system where all humans' decisions are optimized against a unified goal (or a small set of goals). I don’t think we’ll agree on those goals any time soon.
Consider a society where 90% of the population does not need to produce anything. AIs will do that.
What would be the name of the economic/societal organization then?
The answer is Communism, exactly as Marx described it.
Those 90% need to be welfare'd ("From each according to his ability, to each according to his needs"). The other alternative is grim for those 90%.
So it's either Communism or nothing for the human race.
Most of the time they are about loading/unloading data. Maybe this will also revolutionise education, turning it more towards discovery and critical thinking, rather than repeating what we read in a book/heard in class?
GPT-4 can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.
GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. It surpasses ChatGPT in its advanced reasoning capabilities.
GPT-4 is safer and more aligned. It is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.
GPT-4 still has many known limitations that we are working to address, such as social biases, hallucinations, and adversarial prompts.
GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task.
GPT-4 is available on ChatGPT Plus and as an API for developers to build applications and services. (API- waitlist right now)
Duolingo, Khan Academy, Stripe, Be My Eyes, and Mem amongst others are already using it.
API pricing: GPT-4 with an 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens. GPT-4-32k, with a 32K context window (about 52 pages of text), will cost $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens.
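A quick sketch of what those prices work out to per request (the dictionary keys are labels for this snippet, not official API model identifiers):

    PRICES = {                   # $ per 1K tokens: (prompt, completion)
        "gpt-4-8k":  (0.03, 0.06),
        "gpt-4-32k": (0.06, 0.12),
    }

    def cost(model, prompt_tokens, completion_tokens):
        p, c = PRICES[model]
        return prompt_tokens / 1000 * p + completion_tokens / 1000 * c

    print(f"${cost('gpt-4-8k', 2000, 500):.2f}")  # 2K-token prompt, 500-token reply -> $0.09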