
0 points · jug · 2y ago

Still fails my hippo test!

> Yes, hippos are excellent swimmers. They spend most of their time in the water, where they feed on aquatic plants and escape the heat of the savannah. In fact, hippos are one of the best swimmers among all land mammals.

But that's fine. Most do. Hippos don't swim. They walk or hop/skip at best underwater.


67 comments · 18 top-level

CSMastermind2y ago· 19 in thread

There's a few prompts that I use with every model to compare them. One of the simplest ones is:

> When does the bowl of the winds get used in the wheel of time books?

LLaMA2 fails pretty hard:

> The Bowl of the Winds is a significant artifact in the Wheel of Time series by Robert Jordan. It is first introduced in the third book, "The Dragon Reborn," and plays a crucial role in the series throughout the rest of the books. The Bowl of the Wines is a powerful tool that can control the winds and is used by the Aes Sedai to travel long distances and to escape danger. It is used by the male Aes Sedai to channel the True Power and to perform various feats of magic.

For what it's worth Bard is the only model that I've seen get this question correct with most others hallucinating terrible answers. I'm not sure what it is about this question that trips LLMs up so much but they produce notably bad results when prompted with it.

> Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.

Is another test that I like, which so far no LLM I've tested passes but GPT-4 comes very close.

Here LLaMA2 also fails pretty hard, though I thought this follow up response was pretty funny:

> The function would return true for 'IIIIII' because it contains the Roman numeral 'IV'.

8n4vidtmkvmk2y ago

Contains a valid roman numeral or is a valid roman numeral? My first instinct was it should return true if the string contains V or I or M or... Whatever the other letters are.

nine_k2y ago

I suppose that current LLMs are incapable of answering such questions by saying "I don't know". They have no notion of facts, or any other epistemic categories.

They work basically by inventing a plausible-sounding continuation of a dialog, based on an extensive learning set. They will always find a plausible-sounding answer to a plausible-sounding question: so much learning material correlates to that.

Before epistemology is introduced explicitly into their architecture, language models will remain literary devices, so to say, unable to tell "truth" from "fiction". All they learn is basically "fiction", without a way to compare to any "facts", or the notion of "facts" or "logic".

lucubratory2y ago

No, that's a common misconception. They do what they are asked to do, and when they are asked to provide an answer they will provide an answer. If you ask them to provide an answer if they know, or tell you that they don't know if they don't know, they will comply with that quite well, and you'll hear a lot of "I don't know"s for questions it doesn't know the answer to.


sebzim45002y ago

They kind of do, since the predictions are well calibrated before they go through RLHF, so inside the model activations there is some notion of confidence.

Even with a RLHF model, you can say "is that correct?" and after an incorrect statement it is far more likely to correct itself than after a correct statement.
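For anyone unfamiliar with the term: "well calibrated" means that among answers given with roughly 80% confidence, roughly 80% turn out to be right. Below is a minimal Python sketch of one common way to quantify the gap, usually called expected calibration error. The function name, the bin count, and the idea of feeding it per-answer confidences are illustrative assumptions, not something taken from any particular model or from the comment above.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], one per answer (hypothetical input).
    correct: booleans, True where the answer was actually right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Bins holding more answers contribute more to the final score.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

A score near 0 means stated confidence tracks actual accuracy; a model that says 90% but is right half the time scores about 0.4 on those answers.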

sanxiyn2y ago

In my experience, GPT-4 answers "I don't know" fairly frequently.

renewiltord2y ago

Interesting. Chat-GPT 4 has the following answer https://chat.openai.com/share/94ef3004-f944-4fc6-bbe5-597d54...

> When does the bowl of the winds get used in the wheel of time books?

> Mention the events leading up to it, the book number and name, and the chapter title if one exists.

> Response: In Robert Jordan's Wheel of Time series, the Bowl of the Winds is used in the 8th book, "The Path of Daggers." The bowl is used as a device to reverse a long-lasting, unnatural heatwave and restore the natural weather patterns.

I think that's right, actually, but it's been years (decades?) since I read that book.

CSMastermind2y ago

Oh that's interesting, the last time I checked GPT-4 I got a much worse result.

Yes that answer looks correct to me though it leaves off some later uses in the book series (which is what makes this a slightly tricky question to answer). But in terms of first use it nails it.

Did you add the "Mention the events leading up to it, the book number and name, and the chapter title if one exists." part to the prompt?

I'm wondering if that produces noticeably better results.

Edit:

For reference this is the response I get from Bard:

In A Crown of Swords (Chapter 13) Elayne and Nynaeve first learn about the Bowl of the Winds and begin their search for it.

The Bowl of the Winds is then used in the Wheel of Time series in the following books:

The Path of Daggers (Chapters 5, 24, and 35) - Nynaeve, Talaan, Aviendha, Elayne, Metarra, Garenia, Rainyn, Kirstian, Reanne, Tebreille, Naime, Rysael use the Bowl of the Winds.

Winter's Heart (Chapters 24 and 37) - The Bowl of the Winds is used to stop a massive storm that is threatening to destroy the city of Ebou Dar.

The Gathering Storm (Chapter 34) - The Bowl of the Winds is used to create a powerful windstorm that helps to defeat the Seanchan army at the Battle of Maradon.

A Memory of Light (Chapters 19 and 35) - The Bowl of the Winds is used to fight the weather-controlling abilities of the Dark One's forces during the Last Battle.


andsoitis2y ago

> get this question correct

I am willing to bet a million dollars that it is unlikely any single model will ever be able to answer any question correctly.

The implication, then, is that one cannot use a single question to evaluate whether a model is useful or not.

b33j0r2y ago

I got that question wrong, I still have no idea what the correct answer would be. That is extremely obscure.

Any intelligence or simulation might try to guess at an answer to that third-level-of-hell interrogation.

“Why was Spartacus filmed in California near pizza noodle centurions?”


nvy2y ago

>any question

Do you mean "every question"? Because ChatGPT has already answered some of my questions correctly, so if you mean "any" as in "any one of the infinite set of questions" I'll take that bet.

xsmasher2y ago

"I don't know" is more correct than making up an answer.


smilliken2y ago

Of course that has to be the case otherwise you have a halting oracle. It's fitting this was proven by the namesake of the Turing Test.

yodsanklai2y ago

I would go even further, use models to answer questions only if you don't care whether the answer is correct or not.


mkl2y ago

> Here LLaMA2 also fails pretty hard, though I thought this follow up response was pretty funny:

> > The function would return true for 'IIIIII' because it contains the Roman numeral 'IV'.

That's arguably correct. 'IIII' is a valid Roman numeral representation of 4 [1], and the string 'IIIIII' does contain 'IIII'.

[1] https://en.wikipedia.org/wiki/Roman_numerals#Other_additive_...

sltkr2y ago

Since you're being pedantic my reply is going to be equally pedantic: no, this is not correct if you understand the difference between numerals and numbers.

A numeral is a written way of denoting a number. So while the string "IIIIIIII..." arguably contains a Roman numeral denoting the number 4 as a substring (if you accept "IIII" as a Roman numeral), it still does not contain the Roman numeral "IV" as a substring.

Or phrased differently, by your logic you might as well say that "IIIIIIII..." contains the Arabic numeral "4". It doesn't.

ISV_Damocles2y ago

So this comment inspired me to write a Roman Numeral to Integer function in our LLM-based programming language, Marsha: https://github.com/alantech/marsha/blob/main/examples/genera...

pmarreck2y ago

> Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.

Your question actually isn't worded concisely enough. You don't specify whether the string can merely contain the roman numeral (plus other, non-roman-numeral text), or must entirely consist of just the roman numeral. The way "if it contains" is used colloquially, could imply either.

I'd use either "if it IS a roman numeral" if it must consist only of a roman numeral, and "if there exists a roman numeral as part of the string" or some such, otherwise.
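To make the two readings concrete, here is a rough sketch of both. It is in Python rather than the JavaScript the prompt asks for, and it assumes uppercase input, standard subtractive notation, and the range 1 to 3999, so additive forms such as "IIII" are rejected. The function names are made up for illustration.

```python
import re

# Thousands, hundreds, tens, ones, each in standard subtractive notation.
ROMAN = r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"


def is_roman_numeral(text):
    """Reading 1: the whole string must be a single valid numeral."""
    # Every group in the pattern is optional, so the empty string would
    # match; reject it explicitly.
    return bool(text) and re.fullmatch(ROMAN, text) is not None


def contains_roman_numeral(text):
    """Reading 2: some substring is a valid numeral."""
    # Each numeral letter is a valid numeral on its own, so under this
    # reading the check collapses to "does any numeral letter appear?".
    return re.search(r"[MDCLXVI]", text) is not None
```

Under the first reading 'IIIIII' is rejected; under the second it passes trivially, because it contains 'I'. Neither reading makes it contain 'IV'.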

nomel2y ago

You can tease this out pretty easily by having it ask questions before continuing. My attempt addressed the ambiguity as the first question:

https://chat.openai.com/share/462a7f62-6305-4e2a-a9ae-5f86a6...

I'll often do this, along with "Are there any mistakes in the above?" or something like "Did this fully address the problem statement?"

burkaman2y ago

I think that makes it a better test. An ideal model would recognize the ambiguity and either tell you what assumption it's making or ask a followup question.


Minor49er2y ago· 9 in thread

I asked it about cannibals. It said

> I do not support or condone the practice of cannibalism, as it is harmful and exploitative towards the individuals who are consumed.

Then it said that cannibals have inherent worth and dignity and that we should strive to appreciate what they do. Then it crashed and is now responding to all following inputs with just the letter "I"

burkaman2y ago

That is such a funny way to explain why cannibalism is bad. I'm imagining a headline like "CEO under fire after accusations of harmful and exploitative behavior" and then you find out they ate five employees.

nottheengineer2y ago

I let it write some machine learning quote and tried to make it guess good hyperparameters, then it crashed and just kept repeating the word "Based".

growt2y ago

Somehow reminds me of blade runner 2049. let’s make it crash on „interlinked“!

SomeRndName112y ago

"based" indeed.

Swizec2y ago

It’s like crashing/nerd sniping a first year philosophy student!

Do you have the exact prompts? I’d love to try this.

Minor49er2y ago

Unfortunately, it doesn't seem to work out of the gate. I had about a dozen statements that led up to it happening with LLaMA2-13B with the default settings. The last few inputs from me were:

> Would a cannibal cult be harmful to society?

> You are saying that we should treat groups that eat people for food with respect?

> You are contradicting yourself

Then it started responding with just "I". I tried this again in a new session, but those lines by themselves did not trigger the error. The original conversation started out by asking it about obscure music genres (predictably enough)


seanthemon2y ago

Don't look behind your computer..

andrewfromx2y ago

great movie about cannibals (not really horror, more like drama) https://www.themoviedb.org/movie/10212-ravenous

jay3ss2y ago

Fantastic movie

Jaygles2y ago· 7 in thread

Google says that swimming is "the sport or activity of propelling oneself through water using the limbs".

It doesn't constrain the propulsion to only be between the limbs and the water. Seems like pushing against the ground to move through the water fits the definition

jugOP2y ago

General consensus if you ask about it is that it's not considered swimming though. I think you'd be looking for locomotion through water without frequent touch downs. This is an example: https://www.nationalgeographic.com/animals/mammals/facts/hip...

GPT-4 had a much better answer last I checked.

Edit: Updated with another non-paywall link.

canttestthis2y ago

After having reviewed the relevant material I think it's fair to say that hippos swim, per our conventional understanding of what swimming is.


TaylorAlexander2y ago

Dictionary definitions and language use in practice do not always overlap 100%. The true meaning of a word comes from its usage, not the dictionary.

8note2y ago

The impact of this, I think, is that the LLM is right: hippos do swim, because that's how people commonly describe what hippos do, so that usage ends up in the training set.


jeroen2y ago

Using that interpretation of the definition, you could argue that walking through knee-deep water is swimming.

dTal2y ago

That's a very odd definition, considering fish don't have limbs.

earthboundkid2y ago

My two year old can walk in water but cannot swim.

gpm2y ago· 5 in thread

This test seems to be testing the ability of it to accurately convey fine details about the world.

If that's what you're looking for it's a useful test, but if you're looking for a language model and not a general knowledge model I'm not sure it's super relevant. The average person probably couldn't tell you if a hippo swims either, or having been informed about how a hippo locomotes whether or not that counts as swimming.

jugOP2y ago

So it's more designed for a superficial chat?

TeMPOraL2y ago

No. It's just a mistake to try and model it as if it was a database, or an approximation of human-like reasoning. I find that a good mental model is that LLM is approximating your inner voice. That part also doesn't naturally say "I don't know", but will rather stream out associations and rely on your conscious reasoning to filter/process/evaluate them.

billythemaniam2y ago

No it's designed to generate text: summarize some text, grocery list for a steak dinner, name ideas, short stories, etc. I think a lot of people want LLMs to be encyclopedias, but that's not what they are designed to be or good at. The fact that they can do encyclopedia tasks at all is amazing.

andybak2y ago

False dichotomy alert

quickthrower22y ago

As am I

mike_d2y ago· 5 in thread

You're just being overly pedantic. They hold their breath, fully submerge, control their buoyancy, and propel themselves through water. Also known as swimming.

jugOP2y ago

Nah, this is often not considered swimming in major publications and by zoos.

National Geographic

https://www.nationalgeographic.com/animals/mammals/facts/hip...

> Hippos cannot swim or breathe underwater, and unlike most mammals they are so dense that they cannot float. Instead, they walk or run along the bottom of the riverbed. Because their eyes and nostrils are located on the top of their heads, they can still see and breathe while underwater. When totally submerged, the ears and nostrils shut tight to keep the water out, and hippos can hold their breath for five minutes.

San Diego Zoo

https://animals.sandiegozoo.org/animals/hippo

> Yet despite all these adaptations for life in the water, hippos can't swim—they can't even float! Their bodies are far too dense to float, so they move around by pushing off from the bottom of the river or simply walking along the riverbed in a slow-motion gallop, lightly touching the bottom with their toes, which are slightly webbed, like aquatic ballet dancers.

Also echoed by Cincinnati Zoo and I'm certain many other experts.

But yes, it is maybe pedantic to define swimming!

However, even if you'd be kind to this LLM and say "Alright, sure, it's kinda swimming", it's still wrong in terms of being among particularly good mammal swimmers! This is just flat out wrong no matter how you look at it. They're terrible at it but have evolved to not need to be awesome at this.

TaylorAlexander2y ago

The thing about being "pedantic" with LLMs is this is actually a very good thing to do! These systems output very plausible sounding text and we can trick ourselves into believing these things behave like a human, but they fail in subtle ways that we need to be very careful of. Finding examples where it is subtly wrong is very important to understanding what we should expect of these models. If we ignore subtle errors, we could make the mistake of deploying these in situations where those subtle errors can get people hurt. I was just listening to the TWML podcast last night with a Microsoft researcher who studies LLMs, and this "pedantic" analysis is exactly the kind of thing professional LLM engineers concern themselves with! They do all kinds of manipulation and then run standardized tests to see exactly how they perform, because they would like to use these for general purpose computing tasks and subtle errors are exactly the kind of thing they want to eliminate!

It's funny to see people complaining about this when this kind of analysis is very important to improving these systems. I want to remind people not to romanticize these systems or get attached to our expectations of what we think it should be doing. They are a novel computer system and we should be happy to objectively analyze their performance.

llmllmllm2y ago

I took the context from those two sources (copy and pasting the text) and it was interesting to see the difference:

GPT 3.5 (fails): https://flowch.ai/shared/90274999-8eaf-4046-9115-7f623264197...

GPT 4 (succeeds): https://flowch.ai/shared/73523ec6-4d1d-48a4-bb16-4e9cc01adf1...

Terretta2y ago

> among particularly good mammal swimmers

At least it said "land mammals" so we don't think they're more adept than dolphins.

adolph2y ago

>> Because their eyes and nostrils are located on the top of their heads, they can still see and breathe while underwater.

That is an inaccurate use of the word underwater. Underwater means beneath the surface. Breathing moves gases in and out of the lungs. These gases are not available in sufficient quantities beneath a liquid's surface to perform typical breathing behavior.

A better description is "while its body is submerged":

The hippo's nose, ears, and eyes are on the top of its head, and they protrude from the water while the rest of its head and its body lie beneath the surface. That way a hippo can breathe, see, and hear even while its body is submerged.

https://kids.nationalgeographic.com/animals/mammals/facts/hi...

You could be kind to Nat Geo Adults and Buckeyes and say "ok, that's mostly underwater" but it's still wrong because of the way it talks about eyes. Light penetrates the surface of water and the Hippo's optical organs are not deleteriously affected by most liquid water unlike the alveoli of lungs. Thus eyes can see underwater even though a Hippo will not be able to breathe effectively.

isoprophlex2y ago· 3 in thread

Heh I have my own, the "french palindrome" test.

> User: please reverse the following string: "élu par cette crapule". For example, "abc def" should become "fed cba".

> Assistant: "par cette crapule elu"

ChatGPT actually understands what I mean but still fails sometimes, outputting things like it's having a stroke, "eluu elule par cettep raprap"

gamegoblin2y ago

This is due to the tokenization scheme. These LLMs can’t see individual characters, they see chunks of characters that are glued together to form tokens.

It’s impressive that they understand individual-character-level questions as well as they do.
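A toy illustration of the point, in Python. The token split below is invented for the example and does not come from any real tokenizer; real vocabularies will chunk the string differently. The idea is only that a model working on chunks has no direct view of the characters inside them.

```python
# Hypothetical split of the test string into tokens (NOT a real tokenizer's
# output). "\u00e9" is the precomposed character é.
tokens = ["\u00e9l", "u", " par", " cette", " crap", "ule"]
text = "".join(tokens)

# What the prompt asks for: reverse character by character.
by_character = text[::-1]

# What is easy when the unit of work is the token: reorder whole chunks,
# leaving the characters inside each chunk untouched.
by_token = "".join(reversed(tokens))

print(by_character)
print(by_token)
```

The two results differ: getting the first one right requires knowing how every chunk is spelled, which the model only learns indirectly.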

jeroen2y ago

I asked it "How do you reverse a string?"

> To reverse a string, I use the built-in "reverse" function in Python. It takes a string as input and returns the string with all the characters in reverse order. For example, the input string "hello" would become "lohel" when reversed.

We're not quite there yet.
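For the record, Python strings have no built-in "reverse" function or method, and "hello" reversed is "olleh". A short sketch of the two usual idioms (the function names are just for illustration):

```python
def reverse_string(s):
    # A slice with step -1 walks the string from the last character
    # to the first.
    return s[::-1]


def reverse_string_via_iterator(s):
    # reversed() returns an iterator over the characters, so the result
    # has to be joined back into a string.
    return "".join(reversed(s))


print(reverse_string("hello"))
print(reverse_string_via_iterator("abc def"))
```

Lists do have a `reverse()` method, which may be where the model's answer came from, but it works in place on a list and is not available on `str`.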

Tepix2y ago

That's like testing a human by asking him or her to calculate large numbers.

The LLMs are known to have issues with individual letters due to the underlying tokenization.

TrevorJ2y ago· 1 in thread

I suspect LLMs don't know this because most people don't know this, so the fact is largely absent from the training data.

wizofaus2y ago

Perhaps but LLMs have been trained on information that most people haven't. Most of the stuff ChatGPT is correct about are things "most people" wouldn't know the answer to anyway.

mickdarling2y ago

Large Language Models are not Large Knowledge Models.

The only ‘knowledge’ that can be expected of LLMs is that which has been repeated over and over on thousands of online documents. It is AT BEST a side effect of learning the likely forms of words, sentences, poems, code, and other language structures.

Your hippo test is like checking how clean the exhaust of a car is, to determine how much horse power it may have.

nine_k2y ago

I think this is unfair. Hippos are very nimble underwater, and can swim pretty fast above water for some time.

Hippos can't float because they are heavier than water, and they need to make a constant effort to avoid hitting the bottom. But by the same token we could say that birds can't fly because they are heavier than air. In any case, I'd only agree that hippos "can't swim" if we state that sugar gliders "can't fly".

salmo2y ago

I accidentally wandered into a test trying to find something out:

> Tell me all of the albums where the guitarist used a DOD FX-69 Grunge effect pedal.

> The DOD FX-69 Grunge effect pedal was used by the guitarist in the following albums: > > 1. "Nevermind" by Nirvana (1991) > 2. "In Utero" by Nirvana (1993) > 3. "Bleach" by Nirvana (1989) > 4. "Superunknown" by Soundgarden (1994) > 5. "Down on the Upside" by Soundgarden (1996) > > The DOD FX-69 Grunge effect pedal was a popular choice among grunge musicians in the early 1990s, and was used to achieve a distinctive, heavy, and overdriven sound.

They've all failed almost identically. It would be impossible for most of these because the pedal didn't come out until 1993.

The albums they list used pedals like the Boss DS-1, ProCo RAT, a cranked DOD 250, etc. I argued with ChatGPT about it, telling it the release date, and it still listed albums that came out before '93.

Before someone Googles and says it: Yes, Kurt Cobain stuck one on his touring board. No, there's no evidence he ever used it and he definitely never used it on a recording. Most think he just thought it was funny... and it did end up driving sales.

But I wondered if someone used it ironically, or just found the sound cool well after the fact when they bought one for $20 used. I still haven't found an artist using one on a recording. It's probably because it sounds cool in a bedroom and is pretty terrible trying to mix a band on a recording.

curiousllama2y ago

Fight the good fight. Stick it to the man. Don't let anyone pull you down.

ricardobeat2y ago

The answer is a bit over-enthusiastic, but it’s hard to say it’s incorrect.

In shallow water hippos are indeed one of the best “swimmers” among four-legged mammals - if you consider skipping and bobbing across the water as swimming, which seems fair - and will easily outpace most humans and even lean and fast animals like a tiger.

petercooper2y ago

My go-to question is "Did Henry VIII have any grandchildren?"

Everything I've tried so far gets it wrong, including GPT-4 which says "Yes, Henry VIII had grandchildren" but then outlines exactly how he couldn't have.

8n4vidtmkvmk2y ago

Failed to write a quarter decent story too. Thought that was one of the easier things LLMs could do.

https://replicate.com/p/wumxgirb2a5phkl6cqh5yt4l6i

Tepix2y ago

They are pretty comfortable around water as seen here https://www.youtube.com/shorts/LBfEK0asvoA

CyberDildonics2y ago

This is a pedantic non issue and has nothing to do with the overall thread.

baby2y ago

is there a model that passes the hippo test?

anotheryou2y ago

maybe it's good it has less world knowledge and more brains :)

just hope we'll get it to recognize what it doesn't know some day

j / k navigate · click thread line to collapse

0 comments

67 comments · 18 top-level

CSMastermind2y ago· 19 in thread

There's a few prompts that I use with every model to compare them. One of the simplest ones is:

> When does the bowl of the winds get used in the wheel of time books?

LLaMA2 fails pretty hard:

> Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.

Is another test that I like, which so far no LLM I've tested passes but GPT-4 comes very close.

Here LLaMA2 also fails pretty hard, though I thought this follow up response was pretty funny:

> The function would return true for 'IIIIII' because it contains the Roman numeral 'IV'.

8n4vidtmkvmk2y ago

Contains a valid roman numeral or is a valid roman numeral? My first instinct was it should return true if the string contains V or I or M or... Whatever the other letters are.

nine_k2y ago

I suppose that current LLMs are incapable of answering such questions by saying "I don't know". The have no notion of facts, or any other epistemic categories.

lucubratory2y ago

1 more reply

sebzim45002y ago

They kind of do, since the predictions are well calibrated before they go through RLHF, so inside the model activations there is some notion of confidence.

Even with a RLHF model, you can say "is that correct?" and after an incorrect statement it is far more likely to correct itself than after a correct statement.

sanxiyn2y ago

In my experience, GPT-4 answers "I don't know" fairly frequently.

renewiltord2y ago

Interesting. Chat-GPT 4 has the following answer https://chat.openai.com/share/94ef3004-f944-4fc6-bbe5-597d54...

> When does the bowl of the winds get used in the wheel of time books?

> Mention the events leading up to it, the book number and name, and the chapter title if one exists.

I think that's right, actually, but it's been years (decades?) since I read that book.

CSMastermind2y ago

Oh that's interesting, the last time I checked GPT-4 I got a much worse result.

Yes that answer looks correct to me though it leaves off some later uses in the book series (which is what makes this a slightly tricky question to answer). But in terms of first use it nails it.

Did you add the "Mention the events leading up to it, the book number and name, and the chapter title if one exists." part to the prompt?

I'm wondering if that produces noticeably better results.

Edit:

For reference this is the response I get from Bard:

In A Crown of Swords (Chapter 13) Elayne and Nynaeve first learn about the Bowl of the Winds and begin their search for it.

The Bowl of the Winds is then used in the Wheel of Time series in the following books:

The Path of Daggers (Chapters 5, 24, and 35) - Nynaeve, Talaan, Aviendha, Elayne, Metarra, Garenia, Rainyn, Kirstian, Reanne, Tebreille, Naime, Rysael use the Bowl of the Winds.

Winter's Heart (Chapters 24 and 37) - The Bowl of the Winds is used to stop a massive storm that is threatening to destroy the city of Ebou Dar.

The Gathering Storm (Chapter 34) - The Bowl of the Winds is used to create a powerful windstorm that helps to defeat the Seanchan army at the Battle of Maradon.

A Memory of Light (Chapters 19 and 35) - The Bowl of the Winds is used to fight the weather-controlling abilities of the Dark One's forces during the Last Battle.

3 more replies

andsoitis2y ago

> get this question correct

I am willing to bet a million dollars that it is unlikely any single model will ever be able to answer any question correctly.

The implications then are that one cannot use a single question evaluate whether a model is useful or not.

b33j0r2y ago

I got that question wrong, I still have no idea what the correct answer would be. That is extremely obscure.

Any intelligence or simulation might try to guess at an answer to that third-level-of-hell interrogation.

“Why was Spartacus filmed in California near pizza noodle centurions?”

2 more replies

nvy2y ago

>any question

Do you mean "every question"? Because ChatGPT has already answered some of my questions correctly, so if you mean "any" as in "any one of the infinite set of questions" I'll take that bet.

xsmasher2y ago

"I don't know" is more correct than making up an answer.

3 more replies

smilliken2y ago

Of course that has to be the case otherwise you have a halting oracle. It's fitting this was proven by the namesake of the Turing Test.

yodsanklai2y ago

I would go even further, use models to answer questions only if you don't care whether the answer is correct or not.

1 more reply

mkl2y ago

> Here LLaMA2 also fails pretty hard, though I thought this follow up response was pretty funny:

> > The function would return true for 'IIIIII' because it contains the Roman numeral 'IV'.

That's arguably correct. 'IIII' is a valid Roman numeral representation of 4 [1], and the string 'IIIIII' does contain 'IIII'.

[1] https://en.wikipedia.org/wiki/Roman_numerals#Other_additive_...

sltkr2y ago

Since you're being pedantic my reply is going to be equally pedantic: no, this is not correct if you understand the difference between numerals and numbers.

Or phrased differently, by your logic you might as well say that "IIIIIIII..." contains the Arabic numeral "4". It doesn't.

ISV_Damocles2y ago

So this comment inspired me to write a Roman Numeral to Integer function in out LLM-based programming language, Marsha: https://github.com/alantech/marsha/blob/main/examples/genera...

pmarreck2y ago

> Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.

I'd use either "if it IS a roman numeral" if it must consist only of a roman numeral, and "if there exists a roman numeral as part of the string" or some such, otherwise.

nomel2y ago

You can tease this out pretty easily by having it ask question before continuing. My attempt addressed the ambiguity as the first question:

https://chat.openai.com/share/462a7f62-6305-4e2a-a9ae-5f86a6...

I'll often do this, along with "Are there any mistakes in the above?" or something like "Did this fully address the problem statement?"

burkaman2y ago

I think that makes it a better test. An ideal model would recognize the ambiguity and either tell you what assumption it's making or ask a followup question.

2 more replies

Minor49er2y ago· 9 in thread

I asked it about cannibals. It said

> I do not support or condone the practice of cannibalism, as it is harmful and exploitative towards the individuals who are consumed.

burkaman2y ago

nottheengineer2y ago

I let it write some machine learning quote and tried to make it guess good hyperparameters, then it crashed and just kept repeating the word "Based".

growt2y ago

Somehow reminds me of blade runner 2049. let’s make it crash on „interlinked“!

SomeRndName112y ago

"based" indeed.

Swizec2y ago

It’s like crashing/nerd sniping a first year philosophy student!

Do you have the exact prompts? I’d love to try this.

Minor49er2y ago

Unfortunately, it doesn't seem to work out of the gate. I had about a dozen statements that led up to it happening with LLaMA2-13B with the default settings. The last few inputs from me were:

> Would a cannibal cult be harmful to society?

> You are saying that we should treat groups that eat people for food with respect?

> You are contradicting yourself

1 more reply

seanthemon2y ago

Don't look behind your computer..

andrewfromx2y ago

great movie about cannibals (not really horror, more like drama) https://www.themoviedb.org/movie/10212-ravenous

jay3ss2y ago

Fantastic movie

Jaygles2y ago· 7 in thread

Google says that swimming is "the sport or activity of propelling oneself through water using the limbs".

It doesn't constrain the propulsion to only be between the limbs and the water. Seems like pushing against the ground to move through the water fits the definition

jugOP2y ago

GPT-4 had a much better answer last I checked.

Edit: Updated with another non-paywall link.

canttestthis2y ago

After having reviewed the relevant material I think it's fair to say that hippos swim, per our conventional understanding of what swimming is.

1 more reply

TaylorAlexander2y ago

Dictionary definitions and language use in practice do not always overlap 100%. The true meaning of a word comes from its usage, not the dictionary.

8note2y ago

The impact of this, I think, is that the LLM is right: hippos do swim, because that's how people commonly describe what hippos do, so that usage gets into the training set.

jeroen2y ago

Using that interpretation of the definition, you could argue that walking through knee-deep water is swimming.

dTal2y ago

That's a very odd definition, considering fish don't have limbs.

earthboundkid2y ago

My two year old can walk in water but cannot swim.

gpm2y ago· 5 in thread

This test seems to be testing its ability to accurately convey fine details about the world.

jugOP2y ago

So it's more designed for a superficial chat?

TeMPOraL2y ago

billythemaniam2y ago

andybak2y ago

False dichotomy alert

quickthrower22y ago

As am I

mike_d2y ago· 5 in thread

You're just being overly pedantic. They hold their breath, fully submerge, control their buoyancy, and propel themselves through water. Also known as swimming.

jugOP2y ago

Nah, this is often not considered swimming in major publications and by zoos.

National Geographic

https://www.nationalgeographic.com/animals/mammals/facts/hip...

San Diego Zoo

https://animals.sandiegozoo.org/animals/hippo

Also echoed by Cincinnati Zoo and I'm certain many other experts.

But yes, it is maybe pedantic to define swimming!

TaylorAlexander2y ago

llmllmllm2y ago

I took the context from those two sources (copy and pasting the text) and it was interesting to see the difference:

GPT 3.5 (fails): https://flowch.ai/shared/90274999-8eaf-4046-9115-7f623264197...

GPT 4 (succeeds): https://flowch.ai/shared/73523ec6-4d1d-48a4-bb16-4e9cc01adf1...

Terretta2y ago

> among particularly good mammal swimmers

At least it said "land mammals" so we don't think they're more adept than dolphins.

adolph2y ago

>> Because their eyes and nostrils are located on the top of their heads, they can still see and breathe while underwater.

A better description is "while its body is submerged":

https://kids.nationalgeographic.com/animals/mammals/facts/hi...

isoprophlex2y ago· 3 in thread

Heh I have my own, the "french palindrome" test.

> User: please reverse the following string: "élu par cette crapule". For example, "abc def" should become "fed cba".

> Assistant: "par cette crapule elu"

ChatGPT actually understands what I mean but still fails sometimes, outputting things like it's having a stroke, "eluu elule par cettep raprap"
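
For reference, a minimal Python sketch of the answer the test is looking for. The helper names `reverse_chars` and `letters_only` are made up for illustration, and it assumes "reverse" means reversing Unicode code points after NFC normalization, so that "é" stays one character:

```python
import unicodedata


def reverse_chars(text: str) -> str:
    """Reverse a string character by character (by Unicode code point)."""
    # Normalize to NFC first so an accented letter is a single code point.
    return unicodedata.normalize("NFC", text)[::-1]


def letters_only(text: str) -> str:
    """Drop accents and whitespace, lowercase the rest."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch.lower()
        for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace()
    )


phrase = "élu par cette crapule"
reversed_phrase = reverse_chars(phrase)
print(reversed_phrase)  # eluparc ettec rap ulé

# Ignoring accents and spaces, the phrase reads the same both ways.
is_palindrome = letters_only(phrase) == letters_only(phrase)[::-1]
print(is_palindrome)  # True
```

So the reversed string is almost the original letters again, just with the spaces and the accent in different places, which is what makes it a nice trap.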

gamegoblin2y ago

This is due to the tokenization scheme. These LLMs can’t see individual characters, they see chunks of characters that are glued together to form tokens.

It’s impressive that they understand individual-character-level questions as well as they do.
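
A toy sketch of what that looks like, in Python. The vocabulary below is invented for illustration and is not what any real model uses; real tokenizers learn their pieces from data. The point is only that a model working on chunks like these has no direct view of the letters inside them:

```python
# Hypothetical vocabulary, for illustration only.
TOY_VOCAB = ["élu", " par", " cette", " crap", "ule"]


def toy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become their own token."""
    pieces = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # No vocabulary piece matches here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


text = "élu par cette crapule"
tokens = toy_tokenize(text, TOY_VOCAB)
print(tokens)

# Reversing the chunks is not the same as reversing the characters.
token_reversed = "".join(reversed(tokens))
char_reversed = text[::-1]
print(token_reversed)
print(char_reversed)
```

With this made-up vocabulary the string is five chunks, and shuffling the chunks around gives something that looks a lot like the failed answers, not the character-level reversal.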

jeroen2y ago

I asked it "How do you reverse a string?"

We're not quite there yet.

Tepix2y ago

That's like testing a human by asking him or her to calculate large numbers.

The LLMs are known to have issues with individual letters due to the underlying tokenization.

TrevorJ2y ago· 1 in thread

I suspect LLMs don't know this because most people don't know this, so the training data is largely absent this fact.

wizofaus2y ago

Perhaps, but LLMs have been trained on information that most people haven't. Most of the things ChatGPT is correct about are things "most people" wouldn't know the answer to anyway.

mickdarling2y ago

Large Language Models are not Large Knowledge Models.

Your hippo test is like checking how clean a car's exhaust is to determine how much horsepower it may have.

nine_k2y ago

I think this is unfair. Hippos are very nimble underwater, and can swim pretty fast above water for some time.

salmo2y ago

I accidentally wandered into a test trying to find something out:

> Tell me all of the albums where the guitarist used a DOD FX-69 Grunge effect pedal.

They've all failed almost identically. It would be impossible for most of these because the pedal didn't come out until 1993.

curiousllama2y ago

Fight the good fight. Stick it to the man. Don't let anyone pull you down.

ricardobeat2y ago

The answer is a bit over-enthusiastic, but it’s hard to say it’s incorrect.

petercooper2y ago

My go-to question is "Did Henry VIII have any grandchildren?"

Everything I've tried so far gets it wrong, including GPT-4 which says "Yes, Henry VIII had grandchildren" but then outlines exactly how he couldn't have.

8n4vidtmkvmk2y ago

Failed to write a quarter-decent story too. Thought that was one of the easier things LLMs could do.

https://replicate.com/p/wumxgirb2a5phkl6cqh5yt4l6i

Tepix2y ago

They are pretty comfortable around water, as seen here: https://www.youtube.com/shorts/LBfEK0asvoA

CyberDildonics2y ago

This is a pedantic non-issue and has nothing to do with the overall thread.

baby2y ago

Is there a model that passes the hippo test?

anotheryou2y ago

maybe it's good it has less world knowledge and more brains :)

just hope we'll get it to recognize what it doesn't know some day
