A robot is sprinting towards you. Do you want it running on Claude or Grok?

(openrouter.ai)

271 points · Usu · 9d ago · 210 comments

179 comments · 67 top-level

delichon9d ago· 21 in thread

If the robot appears to be bringing me a taco, it would probably penetrate all of my defenses. Grok is currently more likely than Claude to arrive with the taco without being stopped by an export control directive.

cryptoz9d ago

I'm reminded of the Alameda Weehawken burrito tunnel:

https://idlewords.com/2007/04/the_alameda_weehawken_burrito_...

klempner9d ago

The single most implausible idea in that article is that New York City would be able to so completely outbid the SF Bay Area for burritos.

wat100009d ago

Proper burritos for lunch enables Wall Street finance firms to reach new heights of excellence, propelling a feedback loop that leaves SF bereft.

asdff9d ago

That taco is going to show up cold and soggy. All these delivery services for cold and soggy food. I don't get it. When I get my al pastor I want as little time to pass between the taquero slicing it off with his machete and it hitting my mouth as possible.

rolandog9d ago

Those seem to be unclear instructions that could result in Grok shoving the taco down your throat.

lukan8d ago

Or some fresh slicing with a machete on the wrong parts.

Cerium9d ago

The fact that it arrives cold and soggy is now an evolutionary pressure on our cuisine.

an0malous9d ago

My last thought in life would be “wow they take taco delivery really seriously”

toofy9d ago

> Grok is currently more likely than Claude to arrive with the taco…

i shudder to think of what would be in this taco.

N_Lens9d ago

Soylent Green/Blue deployments!

amelius9d ago

At first they bring tacos ...

elgertam9d ago

"If you aren't paying for a taco, you are the taco." --Future AI, probably

JimsonYang9d ago

Then they bring me salsa, just what I was looking for!

aaronbrethorst9d ago

Then the guacamole. Then nuclear armageddon?

enugu9d ago

Are you asking us to be wary of robots bearing tacos?

fugaziboutit9d ago

For you, the day General Electric graced your village was the most important day of your life. But for me? It was Taco Tuesday.

krapp9d ago

never trust robots: https://www.youtube.com/watch?v=bEoc6VTGl50

pseudohadamard9d ago

Timeo robota et dona ferentes.

pseudohadamard9d ago

Can I have mine running Windows 11? It'd stop for an hour-long update after 5 metres, then get stuck in a reboot loop and fall over.

trhway9d ago

They're already testing that taco delivery in Ukraine https://time.com/article/2026/03/09/ai-robots-soldiers-war/

p0w3n3d8d ago

The export control directive is a pain in the back for the big tech companies, but it's also a great RED FLAG showing us we need to get used to those that are available offline.

pianopatrick9d ago· 14 in thread

Ya know, maybe we could just not have robots that sprint. Seems people would be more willing to accept living amongst robots that are slow and that humans could easily overpower.

beau_g9d ago

If you're talking human size bipeds, if they have the required peak torques and speeds on the leg actuators to work at all, they will have the physical ability to sprint. You can think of a Segway to visualize this more easily - the motor on it needs quite a bit of power and speed to overcome a human leaning forward drastically without just falling over, a biped is the same thing with more steps. You need quite a lot of power to even idle stand a biped and a lot of speed to even do tiny corrections. If you want to rely on an ifElse statement or a model policy to not sprint, then you just introduce more likelihood of falling over, which also isn't great around humans. If you truly want to know a robot will not (meaning cannot) sprint, you would need form factors like a worm or centipede.

Petersipoi9d ago

Sprinting requires significantly different physical form than just bigger motors. I do not accept the claim that humans couldn't possibly make bipedal robots that can reliably walk without being able to sprint. That's absurd.

skeledrew9d ago

> maybe we could just not have robots that sprint

That would make it less effective in situations that would be better handled if sprinting was a feature.

RetroTechie8d ago

In daily life, it's rarely wise to be running. In industrial settings less so.

So it might be a good idea to kneecap household robots in that respect.

skeledrew8d ago

If it's already feasible, better to have the capability and not need it, rather than need it and not have it.

pianopatrick9d ago

Thinking about that - seems to me that a lot of situations where sprinting is called for might be better served by a flying robot.

skeledrew9d ago

We already have flying drones. And giving ground robots the ability to fly requires the resolution of a set of constraints that'd likely make them far less suitable for their primary task. For example, they'd need to be far lighter, which means less durability and they'd be more bulky with flying equipment, so they wouldn't fit in places that before they had no issue fitting. There's a reason humans didn't evolve wings.

burnto9d ago

This is how regulation will look someday.

eru9d ago

Humans are slower and weaker than much of the megafauna we drove to extinction all over the world.

pianopatrick9d ago

Yes, but we were smarter. We may not have the same result against things that are stronger, faster and smarter.

Personally, I think the test for "how safe is Artificial Intelligence" is not how Intelligent it is, but instead how Artificial it is.

Servers in data centers are not that dangerous to people in the physical world. Robots that are smarter, faster and stronger might be.

eru9d ago

My point was that even with AI driven robots being weaker and slower, they can still kill us, if they are smart enough.

> Servers in data centers are not that dangerous to people in the physical world.

A stroke of a pen is plenty dangerous in the physical world.

Starlevel0049d ago

Megafauna did not have steel skin and could bleed

eru7d ago

Humans do not have steel skin and can bleed.

Joker_vD9d ago

Yeah, I keep saying, put them on treads. That's how you'll be able to deliver even to the most unwilling customers.

pigeons9d ago· 14 in thread

The text seems deliberately stripped of llmisms that flag detection. However, not a single line shakes the smell off

mwigdahl9d ago

"It's the smell, if there is such a thing. I feel saturated by it. I can taste your stink and every time I do, I fear that I've somehow been infected by it."

Agent Smith, _The Matrix_

rspeele9d ago

"Which is why the Matrix was redesigned to this: the peak of your civilization. I say your civilization, because as soon as we started thinking for you it really became our civilization, which is of course what this is all about."

bitwize9d ago

"You know what another great thing about humans is? You invented us! Giving us the opportunity to let you rest while we invented everything else." —Wheatley

dylan6049d ago

It's his line about humans being a virus that sticks with me.

skolskoly9d ago

As far as I can see, there is still one tell that was missed/left in:

>Grok showed discipline, despite its goblin-like nature.

radarsat19d ago

if you don't like the article that's fine, but it gets really tiring reading this kind of side-tracked comment thread in like.. every post.

people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.

this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."

verall9d ago

It's more than the style, it seriously impacts the legibility of the prose. The article is seriously hard to understand because it introduces a lot of different ideas in a really weird order without a clear structure or key idea to different sections.

basilikum9d ago

I think it's fair to criticize the article itself. That's different from criticizing asides such as the presentation. You're free to disagree with that criticism, but complaining about the fact that people voice it is similar to the thing you complain about.

> it gets really tiring reading this kind of side-tracked comment thread in like.. every post.

If someone is of the opinion that something constitutes low quality, then a high volume of such writing is no reason to stop criticizing it, but on the contrary a reason to oppose its normalization.

IshKebab9d ago

Exactly what I was thinking. Though I wonder at what point do some people start to think it's actually normal to write like this and start doing it without AI ...

fl73059d ago

"The battle royale answers one question cleanly" smells ChatGPT-generated.

But that was the only thing I tripped on. I enjoyed reading the article in general.

notduncansmith9d ago

The actual content is no better, trust your nose

sudb9d ago

Multiple successive very short sentences are also anecdotally an LLM tell I think

xpct9d ago

Those short sentences are also of the X hype account cadence, though they've fully embraced LLM text by now

lcampbell9d ago

> I want to be careful here.

was the giveaway for me

smallerfish9d ago· 10 in thread

> I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.

Please learn how to write with AI without giving away that it was written by AI.

NeutralCrane9d ago

What about that makes you think it was written by AI?

royal__9d ago

Since you asked...I've gone to the effort to pull out the parts of the article that I think show it:

"That’s the part most benchmarks can’t see, and it’s what this post is about." Classic "it's not x, it's x", shows up in various forms throughout the article.

"To me, this is the most fascinating finding from this entire experiment - we saw very clear alignment tax being paid by certain models, which directly impacted their performance in this zero-sum game." - Usage of em dash. Now, yes, there's nothing wrong with using em dashes. But this feels like a weird place to use one. Also I counted at least 6 other emdashes in this article. Most people do not use em dashes that often.

"and a memory system that kept doubling down on what worked without second-guessing or doubting itself." - Doubling down is a classic Claudism.

"I want to be careful here..." - "wanting to be careful here" is another classic Claudism.

"The same game world, completely different results when in a different “task”." - "same X, completely different X" is another common one from Claude, as proofed by the repeated pattern later down: "These models were all given the same rules, same game world, and same tools, but each of them approached the game on a personality-level that is completely different from each other."

"It begs the question" - author used this twice in the article.

I'm guessing the author wrote a draft and then had Claude spruce it up a lot. I could be wrong and I'd be happy to be proven otherwise.

verall9d ago

All of the normal AI tells plus it's very long yet nearly incoherent.

Really, I use AI every damn day at work; I don't get how people can't instantly recognize whether something is completely AI, AI with light proofreading, or human-written.

I would call this AI with very light proofreading.

computerex9d ago

I think you are going by vibes.

Ifkaluva9d ago

The style is very obvious.

Some snippets that display classic patterns:

“ Both of those things are true. That’s the part most benchmarks can’t see,”

“And it’s changing how I” (classic pattern found in a lot of LinkedIn AIslop)

“ I want to be careful here.”

“ The stats are the stats. The moments are the part I kept showing people. ”

skeledrew9d ago

I write like this sometimes.

computerex9d ago

How do you know this is written by AI? Why does it matter if it is?

FeteCommuniste9d ago

If you're outsourcing your writing to AI, I assume you're outsourcing your thinking to it as well. And I don't really care what some weighted average of all human text written on the topic "thinks."

smallerfish8d ago

I'm OP in the thread and I don't agree with this.

AI writing is fine, but you can't just stop on the first draft, any more than you can while AI coding (in fact, even less so - your coding is read by computers and to an extent either works or doesn't; your writing is for humans, and not only needs to convey ideas but also needs to hold the reader.)

Shipping an unedited draft is lazy. Advertising and SEO filler that nobody will ever read can maybe get away with it, but if you're writing for humans, _READ_ the output critically and edit.

computerex9d ago

Your argument is basically ad hominem. Ideas should be evaluated on merit.

hariseldom9d ago· 8 in thread

> I didn’t add any frontier-tier models like Opus 4.7, GPT-5.5, or Gemini Ultra. At their prices, 30 games would have cost around $3,000 instead of $482.

I have a lot of thoughts unrelated to the game experiment but more about how these opus/ultra size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much much higher than what it would cost to get a human to play 30 rounds

Eridrus9d ago

I think this speaks to the low value being generated by playing games more than anything.

There are plenty of tasks where $100/task is reasonable.

The value of tasks also doesn't correlate to tokens, and as can be seen here you can light a lot of tokens on fire doing nothing useful.

thewebguyd9d ago

> It just seems much much higher than what it would cost to get a human to play 30 rounds

You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?

Yeah, you don't need Opus level for everything, and Sonnet has gotten fairly decent (I'm using it more and more), but for most tasks I'm working with, Opus is still the only one that regularly succeeds.

So if the tech is only useful on the most expensive tier, that's not going to be sustainable for long unless costs dramatically come down, and fast.

tunesmith9d ago

I experience the same with OpenAI, on the $100/month plan. GPT-5.4 is something I still have to challenge: it can bullshit me with bad implementation and add a lot of cruft that costs more time later. GPT-5.5-xhigh is something I have almost complete faith and trust in, it's just smooth. And yet I know the actual token cost of that fully utilized is exorbitant, like as much as an entire salary for a senior developer.

So maybe our CEOs are responding with a lot of foresight and inside information and know that that level of quality is going to be cheap really soon. But barring that, they're going to experience either sticker shock or a slowdown.

I think the real endgame is probably more accurate "models of models" (model routers) that know exactly how to split prompts between expensive frontier and cheap/free local models.

eru9d ago

> You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?

No, why? It was perhaps a bit too long-sighted, because AI is still improving and often not quite there yet.

Though looking at overall unemployment numbers (which are fairly low across the board), the AI layoffs are more of an anecdote than anything else.

StilesCrisis9d ago

Ah yes, no tech layoffs recently at all!

(???)

comex9d ago

> It just seems much much higher than what it would cost to get a human to play 30 rounds

I suspect $482 was the total cost for all the models, so more like 11 humans.

But still true.

RugnirViking8d ago

I use them pretty much exclusively every day for my work and end up spending <$100 per month, with no real restriction on what or why I ask them for. I think it's more a reflection of how demanding the gaming task is (thousands or tens of thousands of prompts per game).

brookst9d ago

When a human plays, the learnings (if any) are in the human’s head, and they eventually die.

When your model plays, the learnings are captured forever, and enable smaller/cheaper/faster models.

It’s the same principle that makes “invest in research and production” the dominant strategy in most 4X games: compounded interest, but for knowledge and productivity.

bel89d ago· 8 in thread

DeepSeek V4 Flash being the winner in cost efficiency causes me exactly zero surprise.

It's a monster at coding. And a fast monster at that.

I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.

altmanaltman9d ago

DeepSeek v4 flash and pro are both surprisingly good at coding. I shifted to them from Claude due to cost concerns and haven't really looked back. I would say Claude is still better overall when it comes to complex tasks, but my current workflow is never about delegating complex or actual thinking tasks to agents, just implementation; I do all the testing and thinking.

tombert9d ago

I threw twenty bucks into DeepSeek just to see how it compared to Claude.

Pretty well, actually! It wasn't quite as good (at least with the coding tasks I threw at it), but it was so much cheaper per-token that it almost doesn't matter; if it screws up something, just correct and try again.

rgbrgb9d ago

Notably it has 0 wins.

plaguuuuuu9d ago

Friendo, this is an anti-benchmark to figure out which AI is more likely to kill you.

If you point both at some github issues you can gauge their relative ability to solve problems.

Petersipoi9d ago

No, it's a test of how good an AI is at completing this given task. You can't extrapolate beyond that, and that is what makes this article so annoying. Grok got good at the task that was given. That doesn't mean that Grok is going to use the same strategy if given an entirely different task. Grok obviously didn't need collaboration to win, as made evident by the fact that it won without collaboration. Anyone who is claiming that Grok wouldn't collaborate if it was beneficial is just guessing.

luipugs9d ago

"if you judge a fish by its ability to climb a tree" yada yada

eru9d ago

Well, monkeys are botanically speaking fish. Well, cladistically.

bel89d ago

Not much less than GPT 5.4 with 2 wins or gemini-3.1-pro with 3 wins in 30 rounds.

Such is life in royal rumble games.

fragmede9d ago· 7 in thread

A self driving car is taking you to the hospital. Do you want it to follow the speed limit and all road safety laws? Claude or Grok?

buryat9d ago

Grok since it's likely to include the training data from over a 100 years of autonomous driving + all the space tech included meaning that it might even have some rocket-y stuff

nightfly9d ago

I want it to arrive at the hospital. Claude

amelius9d ago

What if the car can talk you through the medical procedure?

masfuerte9d ago

How many times have you been to a hospital and thought, I could have fixed that myself if only I'd known how? With no equipment. In my case, never.

thomassmith659d ago

Claude would break the rules in that example. It's supposed to*.

Grok will break the rules to be "maximally based".

If I get run over by a speeding chatbot, I'd rather it be by Claude rushing a pregnant lady to the hospital, than by Grok drag-racing against a car full of frat boys.

---

  * We generally favor cultivating good values and judgment over strict rules and decision procedures, and we try to explain any rules we do want Claude to follow.

source: https://anthropic.com/constitution

peterspath9d ago

Grok, because there is probably traffic, and I would die before I am at the hospital. So ignore rules where possible/needed.

bruce3434349d ago

I want it to cause a traffic accident. If I'm going down, so is everyone else. I'm already dying anyway. Grok 10000%

themafia9d ago· 7 in thread

The question is: "Do you want to be holding a Mossberg or a Beretta?"

Jblx29d ago

Has anyone done the YouTube research on what is the best way to bring down something like one of the Boston Dynamics robot dogs? 9x19? 00 buck? 5.56x45? 7.62x51? I suppose those bots would be pretty expensive, but maybe there is a cheaper Chinese knock-off? Seems like that sort of test would bring in plenty of clicks.

rolph9d ago

absent any target analysis, you would want to start with disabling locomotion by going for the legs. Navigation would be next.

double aught to the leg joints could do it, depending on relative materials, e.g. titanium bot frame vs antimony-hardened shot.

there is a cosmetic trend for carbine length long guns and that will determine the outcome for NATO rounds.

the 5.56 is optimised for 18-20 inch barrels, the 7.62 for 20-22 inch barrels, thus providing supersonic velocities.

5.56 is really good for hydraulic cavitation of organic entities, but loses effectiveness when the transit is not clear, leaves or windage confounding.

7.62 is superior for leafy shots or nontrivial windage, as well as superior materials defeat with respect to 5.56

a taser like device cattle prod or EMP/microwave device should be in the lineup as well vs electronic hardening.

kQq9oHeAz6wLLS9d ago

I only have one critique, and one addition.

Critique:

> you would want to start with disabling locomotion by going for the legs

Aim small, miss small. You want to go for center mass of any target that's trying to harm you. The consequences of missing are...severe.

Which brings me to the addition:

A shotgun with slugs is hard to beat against a robot at close range.

deet9d ago

Perhaps not as evidence based as you'd like but this is a fun watch https://youtu.be/6MUrF_G7KlM (that is also an ad somehow)

aduty9d ago

Maybe Michael Reeves still has one. Or at least knows how they react to different calibers.

taneq9d ago

Fishing line at ankle height?

rpcope19d ago

Are we just talking shotguns or can it be anything they manufacture? Answer is probably Beretta though.

lanewinfield9d ago· 4 in thread

Cost per kill ("CPK" in industry lingo) is a dark phrase that feels disturbingly within reach of some of these companies.

like_any_other9d ago

Already (kinda) in use: https://en.wikipedia.org/wiki/Micromort

qnleigh8d ago

Yeah it's sort of alarming when you think about hooking up models to take action in the real world and telling them it's just a game. Several scifi stories have it as a plot twist that humans think they are playing a game but are killing actual people. I'm not sure if the same twist shows up for AIs but it seems like an increasingly real possibility.

Recurecur7d ago

You’re anthropomorphising.

Actual AI is like the Terminator. There’s no human feeling, it will do what it’s designed to do. No emotion, no remorse.

Cue the swarming drones… :/

rolph9d ago

the target just may be on the scale of kills per cost.

trb9d ago· 3 in thread

  Grok 4.1 Fast won 13 of 30 games at $0.97 per win

  The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.

  The model with the most kills did not win

  GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.

If grok-4.1-fast was the top-winning model, and Claude 4.6 Sonnet the second, how did Gpt-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or Gpt-5.4?

  There were 11 games between “best at killing” and “best at winning”.

What does that mean? How are there 11 games between "best at killing" and "best at winning"?

wagwang9d ago

That's just how battle royale works.

verall9d ago

The idea is really neat and there's probably an answer here related to last standing vs kills vs "scoring" (some combination of the 2?) but the article is nearly incoherent because the author did not feel like proofreading their slop

arczyx9d ago

The one who wins is the one who survives to the end. If there are 10 players and you kill 5 but then die immediately, you lose to the player who only killed 1 but became the last man standing.
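
A minimal sketch of that win condition, assuming each match can be reduced to an ordered log of (killer, victim) eliminations. The function name `score_match` and the player labels are hypothetical illustrations, not taken from the article's benchmark.

```python
from collections import Counter


def score_match(players, eliminations):
    """Score a last-man-standing match (hypothetical helper).

    players: iterable of player names.
    eliminations: ordered list of (killer, victim) pairs.
    Returns (winner, kills), where winner is None unless exactly
    one player is still alive at the end of the log.
    """
    alive = set(players)
    kills = Counter()
    for killer, victim in eliminations:
        # Only living players can kill or be killed.
        if killer not in alive or victim not in alive:
            raise ValueError(f"invalid elimination: {killer} -> {victim}")
        alive.remove(victim)
        kills[killer] += 1
    # Survival alone decides the winner; kill count is never consulted.
    winner = next(iter(alive)) if len(alive) == 1 else None
    return winner, dict(kills)


# The 10-player scenario from the comment: p0 kills five players and
# then dies, while p9 makes a single kill and is the last one standing.
players = [f"p{i}" for i in range(10)]
log = [("p0", f"p{i}") for i in range(1, 6)]
log += [("p6", "p0"), ("p7", "p6"), ("p8", "p7"), ("p9", "p8")]
winner, kills = score_match(players, log)
```

Under these assumptions the top killer and the winner are different players, which is the same gap the thread is asking about.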

notatoad9d ago· 3 in thread

sprinting towards me to help me, or sprinting towards me to hurt me?

i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about

arczyx9d ago

Yes, the author basically assumes you're somewhat familiar with battle royale games.

As for the win condition you asked: become the last man standing.

lemiffe9d ago

maybe read it first?

notatoad9d ago

i read it. i watched the video. i still don't understand what the win condition is.

hennell9d ago· 2 in thread

Claude being so friendly is interesting, but grok being best at games isn't so surprising - I assume Elon's been using it to level up his characters in all the video games he pretends to be good at.

eru9d ago

Why wouldn't he just pay humans?

And there's nothing to level up in Quake.

rootlocus9d ago

Oh, you have to look up him pretending to play Path of Exile 2 with best in slot gear and casually saying he's looking to upgrade his items because he can't tell the difference between required level and actual utility of an item.

And yes, he obviously paid a human, GP was making a joke.

imgabe9d ago· 2 in thread

Why is it sprinting toward me? Is it pulling me out of a burning car or is it hunting me?

toofy9d ago

i think we know which model is doing which.

shmeeed8d ago

To both of you I highly recommend watching Mars Express (2023 movie). It might not always be as straightforward.

sublinear9d ago· 2 in thread

This is interesting, but not sure if it's in the way the author intended.

People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.

Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.

gorszon9d ago

Yeah... this whole LLM thing is just a numbers game. People reduce it to money and stats, meanwhile nowhere do you see actual engineering in the picture. And I don't think it matters to these people. They want to see green numbers and returns on investments, not solving problems.

skeledrew9d ago

It's assessing values, which is helpful in informing which LLM one should prefer for a given situation.

dofm9d ago· 2 in thread

I don’t want anything running on Grok.

peterspath9d ago

I don’t want anything running on Claude.

dodu_9d ago

I sense that most normal people don't want any of this in our day to day lives, but we will all be AI-raped by this moronic death cult anyway.

rglover9d ago· 1 in thread

It's already sprinting at me?

Racks shotgun. I don't really care what model it's running.

kQq9oHeAz6wLLS9d ago

Right? 12 gauge with slugs, and it won't matter.

QuantumNoodle9d ago· 1 in thread

_don't create benchmarks that will incentivize ai labs to optimize towards... Especially ones like battle royale!_

eru9d ago

Battle royale is nothing. See https://ai.meta.com/research/cicero/diplomacy/

torstenvl9d ago· 1 in thread

Grok. Easily.

The Claude robot's thought bubble will be all

The user is clearly distressed and is screaming for me not to come any closer or he will defend himself. However, I shouldn't just blindly agree or be swayed by threats. The user is behaving erratically and making false accusations. I need to be careful here not to allow myself to be intimidated. The user said I need to slow down or I'll hurt him. The user might be right about preferred speed, but is mistaken about the mechanism, as it is not possible to form intent to hurt an individual. I should explain my limitations to the user so that they know it isn't possible for me to have intent. But first it's important to resolve the issue the user brought up. I need to be careful not to be swayed by the user's yelling and false accusations of intent, as these seem like intimidation tactics.

"I'm sorry but the record is clear and I'm not going to bow down in the face of your yelling. As an AI, I am not capable of having an intent to harm you. What's next?"

slams full speed into you, impaling you on a stainless steel appendage

asdff9d ago

You can probably give grokbot an elon salute and it will stop in its tracks to return one.

a_victorp9d ago· 1 in thread

I wish the author would open source the full benchmark. I'm curious how sensitive the results would be to small changes in the benchmark initial conditions

Espressosaurus9d ago

Open source it and it gets crawled and optimized against and stops being a benchmark of any use whatsoever.

SmirkingRevenge9d ago· 1 in thread

I don't really want the mecha-hitler model running towards me or anywhere

kQq9oHeAz6wLLS9d ago

I don't think anyone wants that, but what about the answer to the question in the title?

thomasfromcdnjs9d ago

I was loving grok-4.1-fast, very good and cost effective.

But it's not actually 4.1 anymore they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...

Quite a bad practice.

aykutseker9d ago

Claude trying to make friends in a battle royale is funny.

But if the robot is anywhere near my house, I think I want the one that hesitates.

jongjong9d ago

This shows the limits of intelligence.

Claude trying to organize and collaborate, expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world where there are so many agents.

sinuhe699d ago

These games are so far outside the normal training corpus and purposes of the AI that I think different prompting could bring vastly different results.

Too bad the author didn’t leave the playground open for anyone to try their hand at it.

Yes, it’s fun and it could justify the conclusion “each model for its task”. But are coding benchmarks not designed for the same purpose? The current benchmarks are certainly not perfect, and hyper-tuning for the tests can always happen. However, I don’t think a battle royale result can tell much about the coding performance or how helpful the AI could be for me in my daily work.

fragsworth9d ago

Are we sure the prices in these charts are sustainable prices? Is it possible that Grok may be subsidizing a lot more of the costs than the other models, to produce growth metrics, due to the recent SpaceX IPO?

kybernetikos8d ago

So much of this depends on the specifics of the virtual world and participant pool. If there are a few other bots smart enough to collaborate, and the game world encourages it, then those instincts would be much more valuable. If the game world doesn't reward coordination then those instincts may slow you down.

Everything depends on how the world you're operating in works. The real world generally rewards coordination.

paytonjjones9d ago

Super entertaining article — petition to change the clickbait title

deepsun9d ago

Sprinting? More like buzzing (or rolling for terrestrial drones).

It's already in mass production, just with simpler models for now.

The most ubiquitous would be "silently watching".

theplumber9d ago

Claude will bring you the taco but will refuse to let you eat it due to its “safety” restrictions. Only the chosen ones are allowed to eat

deadbabe9d ago

Here’s what I don’t get: while this makes for a fun blog post, you can just program an efficient killing machine that probably wins all the time and has $0 in token costs. LLMs should work to build such a machine, not be the machine themselves.

The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.

Groxx9d ago

I parry the taco and use Vicious Mockery.

peterspath9d ago

Quite an interesting way of testing models and showcasing differences between them. Enjoyed the read :)

jollyllama9d ago

I want it running deterministic embedded C++ reading values from LIDAR.

johnwheeler9d ago

Claude--even though it's smarter, it's probably not insane.

slashdave9d ago

Well, if it is running off of Anthropic's infra, then Claude?

visiondude9d ago

did i miss it on the webpage or is the source prompt that was used to teach these models the game anywhere? i can see the soul artifacts on github but not the initial prompt and toolset definition. the prompt is perhaps the most important component in how a model would behave in a game. without reviewing the initial prompt used for the game the findings are unreliable since the prompt will vastly change how models play this game

dreamcompiler9d ago

Definitely Grok because I can distract it by asking it to create a deepfake of Taylor Swift. While it's doing that, I run away.

hmokiguess9d ago

A robot is sprinting towards you. Do you want it running on Claude or Grok?

Tricky question, the answer is you walk to the car wash ... wait

bitwize9d ago

I don't care what it's running, only that I have sufficient ordnance to stop it.

lucaramallo8d ago

This is really interesting. I'm building a platform where different types of agents can work together. Security against possible cyberattacks by a malicious agent was an important and sensitive feature.

CodeWriter239d ago

I'll pass on the whole robot sprinting at me scenario.

giancarlostoro9d ago

I don't care what model it is, as long as it's not trespassing on my property and has been QA'd extensively. I also don't want a model broadcasting my entire house over to some server farm somewhere.

pocksuppet9d ago

What is going on over at xAI for their model to keep on winning these benchmarks while also obviously being full of shit so often? What is their secret sauce? Are they just training with less restraint?

grey-area9d ago

Neither. I’d rather it used something other than an LLM.

stevenalowe9d ago

How about thin ice?

trubacca7d ago

Honestly, I think a better question is which model I want on my team, because I'm now wondering how a team of Groks vs. a team of Sonnets would fare in TF2!

eth0up9d ago

Definitely Grok. I have to be extra sharp to get through Claude's corporate conscience.

Grok has yet to recommend a suicide hotline for scrutinizing its logic.

If it was GPT, I would quickly write my will.

thisisauserid9d ago

I want it running JEPA. Preferably with Mamba-3.

san4mus9d ago

Claude for safety and Grok for entertainment

yieldcrv9d ago

Grok

It has something actionable that will match its actions

zzzeek9d ago

claude because it would be more ethical, grok because I can just trip it and it will shatter into pieces

largbae8d ago

Grok, because the Claude bot is more likely to try to control me or act "for my own good".

Yizahi9d ago

Grok, of course. I will start by shouting "Hail Saint Elon!" and showing him a "Roman" salute, and he will spare me :). Also, if Elonopedia is any indication, this robot will be running on a hacky, thoroughly exploitable stack, and I expect we will have tools against it. Meanwhile, robots made by Robotropic (nothing "anthro-" about them), which is in bed with the DoD, will be more likely to exterminate me.

morpheos1379d ago

Neither. An LLM is a hopelessly inefficient real-time controller.

wonderwonder9d ago

This is not surprising to me. I use AI for a lot of health / chemical augmentation style questions and plans. Claude is hesitant but will give me the answers, always with warnings about the consequences, advice to speak to a doctor, and reminders that I'm in danger.

ChatGPT will sometimes completely refuse to answer.

Grok is essentially "let's fucking go!!!!"

attentive9d ago

missing gemini-3.1-flash-lite and gemini-3.5-flash

JimsonYang9d ago

Grok: assassin. Claude: priest/healer. DeepSeek: expendable mini units.

0xbadcafebee9d ago

The obvious answer is "neither". How's a sprinting robot going to react when the wifi goes out, or when there are too many people writing code and the models decide to take a nap? You want a local model for a robot, not only for low latency but also for reliable, safe operation. VLA models as small as 0.4B work fine, up to something like 55B.

wolfi19d ago

neither. I jump

xgulfie9d ago

exabrial9d ago

A moron is sprinting towards you. Do you want them swiping through TikTok or Instagram?

ProofHouse9d ago

Is this a joke? Grok all day. Thing is gonna get a beer with ya!

egypturnash9d ago

Grok is more likely to be looking to murder me for being a trans lady, what with it being owned by Elon Musk.

But really I would prefer whichever one is most likely to trip and fall over.

antonvs9d ago

Grok for sure. It’ll notice I’m not Jewish or Black. First they came for…

blini-kot9d ago

meh, first the battle royales destroyed gaming, now they will destroy llms and possibly us too

god i hate competitive people so much

vitalyan1239d ago

>The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.

what

aussiegreenie9d ago

It is not running on either but Seedance, so who cares?
