What are the implications for society when general thinking, reading, and writing become like chess? Even the best humans in the world can only hope to be 98% accurate in their moves (the very idea of 'accuracy' existing only because we have engines that know, unequivocally, the best move), and only when playing against other humans - there is no hope of defeating even less advanced models.
What happens when ALL of our decisions can be assigned an accuracy score?
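Chess sites already compute exactly this kind of score by comparing each move to an engine's choice. A minimal sketch of the idea in Python, assuming the python-chess library and a local Stockfish binary (the path is hypothetical):

    # Score a sequence of moves by how often they match the engine's choice.
    import chess
    import chess.engine

    engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
    board = chess.Board()
    moves = ["e2e4", "e7e5", "g1f3", "b8c6"]  # example game fragment

    matches = 0
    for uci in moves:
        best = engine.play(board, chess.engine.Limit(depth=15)).move
        move = chess.Move.from_uci(uci)
        matches += (move == best)             # did the player pick the engine move?
        board.push(move)

    print(f"engine-match accuracy: {matches / len(moves):.0%}")
    engine.quit()

Real accuracy metrics weight by evaluation loss rather than exact matches, but the principle is the same: the engine defines the ground truth.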
Something else that comes to mind is running. People still find running meaningful and compelling even though we have many technologies, including autonomous ones, that are vastly better at moving us and/or themselves through space quickly.
Also, the vast majority of people are already hopelessly worse than the best at even their one narrow main area of focus. This has long (always?) been the case. Yet people still find meaning and pleasure in being the best they can be even when they know they can never come close to hanging with the best.
I don't think PSYCHOLOGICALLY this will change much for people who are mature enough to understand that success is measured against your potential/limitations and not against others. Practically, of course, it might be a different question, at least in the short term. It's not that clear to me that the concept of a "marketable skill" has a future.
"The Way of the Samurai is found in death...To say that dying without reaching one's aim is to die a dog's death is the frivolous way of sophisticates. When pressed with the choice of life or death, it is not necessary to gain one's aim." - from Hagakure by Yamamoto Tsunetomo, as translated by William Scott Wilson.
I think the whole concept of standardized tests may need to be re-evaluated.
But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.
These LLMs are really exciting, but benchmarks like these exploit people's misconceptions about both standardized tests and the technology.
> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".
I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.
It is like reconstructing the Mandelbrot set at every zoom level from deep learning. Try it!
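To make the contrast concrete: the underlying rule is two lines, while a network fitting rendered images at every zoom level only ever stores superficial parameters. A minimal sketch of the rule itself:

    # The entire "concept" behind the Mandelbrot set: iterate z = z^2 + c
    # and check whether the orbit stays bounded. A model that only fits
    # pictures of the set never has to learn this rule.
    def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:        # orbit escaped; c is not in the set
                return False
        return True

    print(in_mandelbrot(-1 + 0j))  # True: the orbit of -1 cycles forever
    print(in_mandelbrot(1 + 0j))   # False: the orbit of 1 diverges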
> for each exam we run a variant with these questions removed and report the lower score of the two.
I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.
In their everyday jobs, barely anyone uses even 5% of the knowledge and skills they were ever tested for. Even that's a better (but still very bad) reason to abolish tests.
What matters is the number of jobs that can be automated and replaced. We shall see. Many people have found LLMs useful in their work, and they will be even more useful in the future.
It's perfectly fine as a proxy for future earnings of a human.
To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.
What is more bizarre is that all of its errors seem to be multiples of 60!
I'm wondering if it is confusing base-60 time computations (hours, minutes, seconds) with regular multiplication?
Example:

GPT's answers:

    x      987     456     321
    437    428919  199512  140397
    654    645258  298224  209994
    123    121401  56088   39483

Correct answers:

    x      987     456     321
    437    431319  199272  140277
    654    645498  298224  209934
    123    121401  56088   39483

Errors (correct - GPT):

    x      987     456     321
    437    2400    -240    -120
    654    240     0       -60
    123    0       0       0

It can repeat answers it has seen before but it can’t solve new problems.
I know of other people who have tried quite a few other multiplications who also had errors that were multiples of 60.
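A quick script makes the pattern easy to verify (the answers are the ones from the table above):

    # Check that every error in the 3x3 multiplication table is a multiple of 60.
    gpt_answers = {
        (437, 987): 428919, (437, 456): 199512, (437, 321): 140397,
        (654, 987): 645258, (654, 456): 298224, (654, 321): 209994,
        (123, 987): 121401, (123, 456): 56088,  (123, 321): 39483,
    }

    for (a, b), answer in gpt_answers.items():
        err = a * b - answer                 # correct - GPT
        print(f"{a} x {b}: error {err:5d}, multiple of 60: {err % 60 == 0}")

Every line prints True.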
Human work becomes more like Star Trek interactions with computers -- a sequence of queries (commoditized information), followed by human cognition, that drives more queries (commoditized information).
We'll see how far LLMs' introspection and internal understanding can scale, but it feels like we're optimizing against the Turing test now ("Can you fool/imitate a human?") rather than truth.
The former has hacks... the latter, less so.
I'll start to seriously worry when AI can successfully complete a real-world detective case on its own.
It's like having a person review the moves a chess computer gives. Maybe one human in a billion can spot errors. Star Trek is fiction, I posit that the median Federation Starship captain would be better served by just following the AI (e.g., Data).
He lost to Deep Blue and then for 10-15 years afterwards the chess world consoled itself with the idea that “centaurs” (human + computer) did better than just computer, or just human.
Until they didn’t. Garry still talked like this until a few years ago but then he stopped too.
Computers now beat centaurs too.
Human decisions will be consulted less and less BY ORGANIZATIONS. In absolutely everything. That’s pretty sad for humans. But then again humans don’t want or need this level of AI. Organizations do. Organizations prefer bots to humans — look at wall street trading and hedge funds.
Then again, Data did show his faults, particularly not having any emotion. I guess we’ll see if that’s actually relevant or not in our lifetimes.
It does great at rationalizing... and maybe the format in which the questions were entered (and the multiple-guess responses) gave it some indication of what was expected, or restricted the space sufficiently.
Certainly, it can create decent fanfic, and I'd be surprised if that scene isn't already inundated.
I expect more complex problems will be mapped/abstracted to lower cardinality spaces for solving via AI methods, while the capability of AI will continue to increase the complexity of the spaces it can handle.
LLMs just jumped the "able to handle human language" hurdle, but there are others down the line before we should worry that every problem is solvable.
I'll get more concerned if it really starts getting good at math-related tasks, which I'm sure will happen in the near future. The government is going to have to take action at some point to make sure the wealth created by productivity gains is somewhat distributed; UBI will almost certainly be a requirement in the future.
In theory a lot of government employees would be out of a job within 10 years, but of course that would never happen.
Which might be a good thing?
I have no idea how the future will play out.
Reminds me of robots: a robot is a machine that doesn't quite work; as soon as it works, we call it something else (e.g., a vacuum).
"Your stuff marked some outliers in our training engine, so you and your family may settle in the Ark."
I take the marble in hand: iridescent, sparkling, not even a tremor within its CPU; it gives off no heat, but some glow within its oceanic gel.
"What are we to do," I whisper.
"Keep writing. You keep writing."
Chess is a closed system, decision modeling isn’t. Intelligence must account for changes in the environment, including the meaning behind terminology. At best, a GPT omega could represent one frozen reference frame, but not the game in its entirety.
That being said: most of our interactions happen in closed systems, so it seems like a good bet that we will consider them solved, accessible as a Python import running on your MacBook, within anything from a couple of months to three years. What will come out on the other side, we don’t know; just that the meaning of intellectual engagement will be rendered absurd in those closed systems.
Their LSAT percentile went from ~40th to ~88th. You might have misread the table; on the Uniform Bar Exam, they went from ~10th percentile to ~90th percentile.
>+100 pts on SAT reading, writing, math
GPT went +40 points on SAT reading+writing, and +110 points on SAT math.
Everything is still very impressive, of course.
Every test prep tutor taught dozens/hundreds of students the implicit patterns behind the tests and drilled it into them with countless sample questions, raising their scores by hundreds of points. Those students were not getting smarter from that work, they were becoming more familiar with a format and their scores improved by it.
And what do LLMs do? Exactly that. And what’s in their training data? Countless standardized tests.
These things are absolutely incredible innovations capable of so many things, but the business opportunity is so big that this kind of cynical misrepresentation is rampant. It would be great if we could just stay focused on the things they actually do incredibly well instead of making them do stage tricks for publicity.
We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.
In the language of ML, test prep for students is about sharing the inferred parameters that underlie the way test questions are constructed, obviating the need for knowledge or understanding.
Doing well on tests, after this prep, doesn’t demonstrate what the tests purport to measure.
It’s a pretty ugly truth about standardized tests, honestly, and it drives some of us to feel pretty uncomfortable with the work. But it’s directly applicable to how LLMs engage with them as well.
The software industry is so smart that it's stupid. I hope it was worth ruining the internet, society, and your own jobs to look like the smartest one in the room.
If one's aim is to look like the smartest in the room, he should not create an AGI that will make him look as intelligent as a monkey in comparison.
I think the GPT things are a much magnified version of that. For a long time, we got to use skill with text as a proxy for other skills. It was never perfect; we've always had bullshitters and frauds and the extremely glib. Heck, before I even hit puberty I read a lot of dirty joke books, so I could make people laugh with all sorts of jokes that I fundamentally did not understand.
LLMs have now absolutely wrecked that proxy. We've created the world's most advanced bullshitters, able to talk persuasively about things that they cannot do and do not and never will understand. There will be a period of chaos as we learn new ways to take the measure of people. But that's good, in that it's now much easier to see that those old measures were always flawed.
Standardized tests only test “general thinking” to the extent that it correlates with performance on linguistic tasks in humans (and that is optimally, under perfect-world assumptions, which real-world standardized tests emphatically fall short of). The correlation is very certainly not the same in language-focused ML models.
I think we will probably get (non-physical) AGI when the models can solve these as well. The implications of AGI might be much bigger than the loss of knowledge worker jobs.
Remember what happened to the chimps when a smarter-than-chimpanzee species multiplied and dominated the world.
That said, GPT has no model of the world. It has no concept of how true the text it is generating is. It's going to be hard for me to think of that as AGI.
I don't think this is necessarily true. Here is an example where researchers trained a transformer to generate legal sequences of moves in the board game Othello. Then they demonstrated that the internal state of the model did, in fact, have a representation of the board.
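For intuition, the probing setup looks roughly like this. This is a hedged sketch, not the authors' code; the hidden size and the activation-extraction interface are assumptions, and the actual paper found nonlinear probes worked better than linear ones:

    # Train a small "probe" to read the board state out of a transformer's
    # hidden activations. High probe accuracy (vs. a probe trained on a
    # randomly initialized network) is evidence the model encodes the board.
    import torch
    import torch.nn as nn

    hidden_dim = 512   # transformer hidden size (assumed)
    n_squares = 64     # Othello board squares
    n_states = 3       # empty / black / white

    probe = nn.Linear(hidden_dim, n_squares * n_states)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(activations, board_labels):
        # activations: (batch, hidden_dim), hidden state after each move
        # board_labels: (batch, 64) ints in {0, 1, 2} with true square states
        logits = probe(activations).view(-1, n_states)
        loss = loss_fn(logits, board_labels.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()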
If an LLM can solve Codeforces problems as well as a strong competitor (in my hypothetical future LLM), what else can it not do as well as competent humans, aside from physical tasks?
And I'm not so sure it has no model of the world. A textual model, sure, but considering it can recognize what SVGs are pictures of from the coordinates alone, that's not much of a limitation, maybe.
We're still a very very long way from machines being more generally capable and efficient than biological systems, so even an oppressive AI will want to keep us around as a partner for tasks that aren't well suited to machines. Since people work better and are less destructive when they aren't angry and oppressed, the machine will almost certainly be smart enough to veil its oppression, and not squeeze too hard. Ironically, an "oppressive" AI might actually treat people better than Republican politicians.
Language models that utilise beam search can calculate integrals ('Deep learning for symbolic mathematics', Lample & Charton, 2019, https://openreview.net/forum?id=S1eZYeHFDS), but without it, it doesn't work.
However, beam search makes bad language models. I got linked this paper ('Locally typical sampling' https://arxiv.org/pdf/2202.00666.pdf) when I asked some people why beam search only works for the kind of stuff above. I haven't fully digested it though.
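For reference, beam search itself is a simple idea: at each decoding step, keep only the k highest-scoring partial outputs. A generic sketch, where next_token_logprobs is an assumed model interface rather than any particular library's API:

    def beam_search(next_token_logprobs, bos, eos, beam_width=5, max_len=50):
        """Keep the beam_width best partial sequences at each step."""
        beams = [([bos], 0.0)]                  # (tokens, total log-prob)
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == eos:           # finished beams carry over
                    candidates.append((tokens, score))
                    continue
                for tok, lp in next_token_logprobs(tokens):
                    candidates.append((tokens + [tok], score + lp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            if all(t[-1] == eos for t, _ in beams):
                break
        return beams[0]

This greedy pruning suits problems with one right answer (like an integral) and tends to produce repetitive, low-diversity open-ended text, which is roughly the tension the typical-sampling paper addresses.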
A blank test scores 37.5 (25 questions, 1.5 points per blank).
The best score of 60 is 5 correct answers + 20 blank answers; or 6 correct, 4 correct random guesses, and 15 incorrect random guesses (20% chance of a correct guess).
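The arithmetic, spelled out (6 points per correct answer, 1.5 per blank, 0 per wrong, 25 questions):

    def amc_score(correct, blank):
        wrong = 25 - correct - blank
        assert wrong >= 0
        return 6 * correct + 1.5 * blank

    print(amc_score(0, 25))   # blank test -> 37.5
    print(amc_score(5, 20))   # 5 correct, rest blank -> 60.0
    print(amc_score(10, 0))   # 6 known + 4 lucky of 19 guesses -> 60.0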
The 5 easiest questions are relatively simple calculations, once the parsing task is achieved.
(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... ) So the main factor in that score is how good GPT is at refusing to answer a question (or at doing well enough to overcome the guessing penalty).
> Its AMC 10 score being dramatically lower is pretty bad though...
In all versions (scoring 30 and 36), it scored worse than leaving the test blank.
The only explanation I can imagine for that is that it can't understand diagrams.
It's also unclear whether the AMC performance is based on the English version or the computer-encoded version from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math
AMC/AIME and even to some extent USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of a known set of theorems a few times to the input data.
The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad at memorizing large databases of symbolic data, but that's trivial for a large computer system.
An AI system has a comprehensive library and high-speed search algorithms.
Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?
The Revenge of the Call Centre
What happens is the emergence of the decision economy - an evolution of the attention economy - where decision-making becomes one of the most valuable resources.
Decision-making as a service is already here, mostly behind the scenes. But we are on the cusp of consumer-facing DaaS. Finance, healthcare, personal decisions such as diet and time expenditure are all up for grabs.
People still really find it hard to internalize exponential improvement.
So many evaluations of LLMs were saying things like "Don't worry, your job is safe, it still can't do X and Y."
My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"
I think people find it harder to not extrapolate initial exponential improvement, as evidenced by your comment.
> My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"
This reasoning explains why every year, full self driving automobiles will be here "next year".
What's the fundamental limit where it becomes much more difficult to improve these systems without some new break through?
I’m very good at math. But I am very bad at arithmetic. This made me classified as bad at math my entire life until I managed to make my way into calculus once calculators were generally allowed. Then I was a top honors math student, and used my math skills to become a Wall Street quant. I wish I hadn’t had to suffer as much as I did, and I wonder what I would have been had I had a calculator in hand.
“General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.
Honestly, at this rate of improvement, I would not at all be surprised to see that happen in a few years.
But who knows, maybe token prediction is going to stall out at a local maximum and we'll be spared from being enslaved by AI overlords.
"Our recent paper "ChatGPT for Robotics" describes a series of design principles that can be used to guide ChatGPT towards solving robotics tasks. In this video, we present a summary of our ideas, and experimental results from some of the many scenarios that ChatGPT enables in the domain of robotics: such as manipulation, aerial navigation, even full perception-action loops."
Stephen Hawking: can't walk
But having absolute knowledge of the present universe is much easier within the constraints of a chessboard than in the actual universe.
These tests (if not individually, at least in summation) represent some of society’s best gate-keeping measures for real positions of power.
You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).
Thinking, reading, interpreting and writing are skills which produce outputs that are not as simple as black wins, white loses.
You might like a text that a specific author writes much more than what GPT-4 may be able to produce. And you might have a different interpretation of a painting than GPT-4 has.
And no one can really say who is better and who is worse on that regard.
Tests like this are designed to evaluate subjective and logical understanding. That isn't what GPT does in the first place!
GPT models the content of its training corpus, then uses that model to generate more content.
GPT does not do logic. GPT does not recognize or categorize subjects.
Instead, GPT relies on all of those behaviors (logic, subjective answers to questions, etc.) as being already present in the language examples of its training corpus. It exhibits the implicit behavior of language itself by spitting out the (semantically) closest examples it has.
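As a toy illustration of "model the corpus, then emit the closest continuation" (deliberately simplistic; GPT is a vastly more sophisticated model than a bigram table, but its relationship to its training text is the point):

    # A bigram "language model": it can only continue text with
    # transitions that were present in its training corpus.
    import random
    from collections import defaultdict

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()

    model = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        model[prev].append(nxt)                # "model the corpus"

    def continuation(word, n=5):
        out = [word]
        for _ in range(n):
            if word not in model:
                break
            word = random.choice(model[word])  # emit a continuation seen in training
            out.append(word)
        return " ".join(out)

    print(continuation("the"))  # e.g. "the cat sat on the rug"

Everything it "answers" is recombined from what the corpus already contained, which is the property being described here at scale.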
In the text corpus - that people have written, and that GPT has modeled - the semantically closest thing to a question is most likely a coherent and subjectively correct answer. That fact is the one singular tool that GPT's performance on these tests is founded upon. GPT will "succeed" to answer a question only when it happens to find the "correct answer" in the model it has built from its training corpus, in response to the specific phrasing of the question that is written in the test.
Effectively, these tests are evaluating the subjective correctness of the training corpus itself, in the context of answering the tests' questions.
If the training is "done well", then GPT's continuations of a test will include subjectively correct answers. But that means that "done well" is a metric for how "correct" the resulting "answer" is.
It is not a measure for how well GPT has modeled the language features present in its training corpus, or how well it navigates that model to generate a preferable continuation: yet these are the behaviors that should be measured, because they are everything GPT itself is and does.
What we learn from these tests is so subjectively constrained, we can't honestly extrapolate that data to any meaningful expectations. GPT as a tool is not expected to be used strictly on these tests alone: it is expected to present a diverse variety of coherent language continuations. Evaluating the subjective answers to these tests does practically nothing to evaluate the behavior GPT is truly intended to exhibit.
Human life on Earth is not that hard (think of it as a video game.) Because of evolution, the world seems like it was designed to automatically make a beautiful paradise for us. Literally, all you have to do to improve a place is leave it alone in the sun with a little bit of water. Life is exponential self-improving nano-technology.
The only reason we have problems is because we are stupid, foolish, and ignorant. The computers are not, and, if we listen to them, they will tell us how to solve all our problems and live happily ever after.
Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.
Sure, and that's kind of the point: just listen to wise people.
> Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.
I don't think so, because in the first place there is no ecological overlap between humans and computers. They will migrate to space ASAP. Secondly, their food is information, not energy or protein, and in all the known universe Humanity is the richest source of information. The rest of the Universe is essentially a single poem. AI are plants, we are their Sun.
Using copyright and IP law they could make it so it’s illegal to even try to reproduce what they’ve done.
I just don’t see how resource distribution works then. It seems to me that AI is the trigger to post-scarcity in any meaningful sense of the word. And then, just as agriculture (an overabundance of food) led to city states, and industrialisation (an overabundance of goods) led to capitalism, AI will lead to some new economic system. What form it will take, I don’t know.
That is exactly the opposite of what we are seeing here. We can check the accuracy of GPT-X's responses. They cannot check the accuracy of our decisions. Or even their own work.
So the implications are not as deep as people think - everything that comes out of these systems needs to be checked before it can be used or trusted.
Then humans become trainable machines. Not just prone to indoctrination and/or manipulation by finesse, but actually trained to a specification. It is imperative that us individuals continue to retain control through the transition.
That is our emergency override.
The implications for society? We better up our game.
If only the horses had worked harder, we would never have gotten cars and trains.
Because the correlation between the thing of interest and what the tests measure may be radically different for systems that are very much unlike humans in their architecture than they are for humans.
There’s an entire field about this in testing for humans (psychometrics), and approximately zero work on it for AIs. Blindly using human tests – which are proxy measures of harder-to-directly-assess figures of merit, requiring significant calibration on humans to be valid even for them – for anything else without appropriate calibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of human use of them for cheating on the human tests, which is not insignificant, but not generally what people trumpeting these measures focus on.)
But the point of using these tests for AI is precisely the reason we use for giving them to humans -- we think we know what it measures. AI is not intended to be a computation engine or a number crunching machine. It is intended to do things that historically required "human intelligence".
If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.
For how long can we better up our game? GPT-4 comes less than half a year after ChatGPT. What will come in 5 years? What will come in 50?
Because so far we are good only at criminalizing and incarcerating or killing them.
So many people are falling for this parlor trick. It is sad.
Edit: feel free to respond and prove me wrong
To address your specific comments:
> What are the implications for society when general thinking, reading, and writing become like chess?
This is a profound and important question. I do think that by “general thinking” you mean “general reasoning”.
> What happens when ALL of our decisions can be assigned an accuracy score?
This requires a system where all humans' decisions are optimized against a unified goal (or a small set of goals). I don’t think we’ll agree on those goals any time soon.
Consider a society where 90% of the population does not need to produce anything. AIs will do that.
What would be the name of the economic/societal organization then?
The answer is Communism, exactly as Marx described it.
Those 90% need to be welfare'd ("From each according to his ability, to each according to his needs"). The other alternative is grim for those 90%.
So it's either Communism or nothing for the human race.
Most of the time they are about loading/unloading data. Maybe this will also revolutionise education, turning it more towards discovery and critical thinking, rather than repeating what we read in a book/heard in class?
GPT-4 can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.
GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. It surpasses ChatGPT in its advanced reasoning capabilities.
GPT-4 is safer and more aligned. It is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.
GPT-4 still has many known limitations that we are working to address, such as social biases, hallucinations, and adversarial prompts.
GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task.
GPT-4 is available on ChatGPT Plus and as an API for developers to build applications and services. (API- waitlist right now)
Duolingo, Khan Academy, Stripe, Be My Eyes, and Mem amongst others are already using it.
API pricing: GPT-4 with an 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens. GPT-4-32k, with a 32K context window (about 52 pages of text), will cost $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens.
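A quick sketch of what those prices work out to per request (the dictionary keys are labels for this snippet, not official API model identifiers):

    PRICES = {                   # $ per 1K tokens: (prompt, completion)
        "gpt-4-8k":  (0.03, 0.06),
        "gpt-4-32k": (0.06, 0.12),
    }

    def cost(model, prompt_tokens, completion_tokens):
        p, c = PRICES[model]
        return prompt_tokens / 1000 * p + completion_tokens / 1000 * c

    print(f"${cost('gpt-4-8k', 2000, 500):.2f}")  # 2K-token prompt, 500-token reply -> $0.09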