https://idlewords.com/2007/04/the_alameda_weehawken_burrito_...
i shudder to think of what would be in this taco.
That would make it less effective in situations that would be better handled if sprinting was a feature.
So it might be a good idea to kneecap household robots in that respect.
Personally, I think the test for "how safe is Artificial Intelligence" is not how Intelligent it is, but instead how Artificial it is.
Servers in data centers are not that dangerous to people in the physical world. Robots that are smarter, faster and stronger might be.
> Servers in data centers are not that dangerous to people in the physical world.
A stroke of a pen is plenty dangerous in the physical world.
Agent Smith, _The Matrix_
>Grok showed discipline, despite its goblin-like nature.
people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.
this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."
> it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
If someone is of the opinion that something constitutes low quality, then a high volume of such writing is no reason to stop criticizing it, but on the contrary a reason to oppose its normalization.
But that was the only thing I tripped on. I enjoyed reading the article in general.
was the giveaway for me
Please learn how to write with AI without giving away that it was written by AI.
"That’s the part most benchmarks can’t see, and it’s what this post is about." Classic "it's not x, it's y", shows up in various forms throughout the article.
"To me, this is the most fascinating finding from this entire experiment - we saw very clear alignment tax being paid by certain models, which directly impacted their performance in this zero-sum game." - Usage of em dash. Now, yes, there's nothing wrong with using em dashes. But this feels like a weird place to use one. Also I counted at least 6 other em dashes in this article. Most people do not use em dashes that often.
"and a memory system that kept doubling down on what worked without second-guessing or doubting itself." - Doubling down is a classic Claudism.
"I want to be careful here..." - "wanting to be careful here" is another classic Claudism.
"The same game world, completely different results when in a different “task”." - "same X, completely different Y" is another common one from Claude, as shown by the repeated pattern later on: "These models were all given the same rules, same game world, and same tools, but each of them approached the game on a personality-level that is completely different from each other."
"It begs the question" - author used this twice in the article.
I'm guessing the author wrote a draft and then had Claude spruce it up a lot. I could be wrong and I'd be happy to be proven otherwise.
Really, I use the AI every damn day at work. I don't get how people can't recognize instantly if something is completely AI, AI with light proofreading, or human written.
I would call this AI with very light proofreading.
Some snippets that display classic patterns:
“Both of those things are true. That’s the part most benchmarks can’t see,”
“And it’s changing how I” (classic pattern found in a lot of LinkedIn AI slop)
“I want to be careful here.”
“The stats are the stats. The moments are the part I kept showing people.”
AI writing is fine, but you can't just stop on the first draft, any more than you can while AI coding (in fact, even less so - your coding is read by computers and to an extent either works or doesn't; your writing is for humans, and not only needs to convey ideas but also needs to hold the reader.)
Shipping an unedited draft is lazy. Advertising and SEO filler that nobody will ever read can maybe get away with it, but if you're writing for humans, _READ_ the output critically and edit.
I have a lot of thoughts unrelated to the game experiment but more about how these Opus/Ultra-size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. It just seems much, much higher than what it would cost to get a human to play 30 rounds.
There are plenty of tasks where $100/task is reasonable.
The value of tasks also doesn't correlate to tokens, and as can be seen here you can light a lot of tokens on fire doing nothing useful.
You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?
Yeah, you don't need Opus level for everything, and Sonnet has gotten fairly decent; I'm using it more and more. But for most tasks I'm working with, Opus is the only one that still regularly succeeds.
So if the tech is only useful on the most expensive tier, that's not going to be sustainable for long unless costs come down dramatically, and fast.
So maybe our CEOs are responding with a lot of foresight and inside information and know that that level of quality is going to be cheap really soon. But barring that, they're going to experience either sticker shock or a slowdown.
I think the real endgame is probably more accurate "models of models" (model routers) that know exactly how to split prompts between expensive frontier and cheap/free local models.
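A minimal sketch of what such a router might look like, assuming a crude keyword-and-length heuristic stands in for a real difficulty classifier. The model names, prices, keyword list, and threshold are all hypothetical, made up for illustration only:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float  # hypothetical price, not a real quote


# Hypothetical tiers: a free local model and an expensive frontier model.
CHEAP = Model("local-small", 0.0)
FRONTIER = Model("frontier-large", 0.015)

# Hypothetical keywords that hint a prompt needs the stronger model.
HARD_HINTS = ("refactor", "prove", "debug", "architecture")


def estimate_difficulty(prompt: str) -> float:
    """Crude score in [0, 1]: long prompts or 'hard' keywords score higher."""
    length_score = min(len(prompt.split()) / 200, 1.0)
    hint_score = 1.0 if any(h in prompt.lower() for h in HARD_HINTS) else 0.0
    return max(length_score, hint_score)


def route(prompt: str, threshold: float = 0.5) -> Model:
    """Send the prompt to the frontier model only when it looks hard."""
    if estimate_difficulty(prompt) >= threshold:
        return FRONTIER
    return CHEAP


if __name__ == "__main__":
    for p in ("What is 2 + 2?", "Please debug this race condition"):
        print(p, "->", route(p).name)
```

A real router would presumably replace `estimate_difficulty` with a learned classifier and fall back to the frontier model when the cheap one fails, but the split-by-estimated-difficulty shape would be the same.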
No, why? It was perhaps a bit too long-sighted, because AI is still improving and often not quite there yet.
Though looking at overall unemployment numbers (which are fairly low across the board), the AI layoffs are more of an anecdote than anything else.
I suspect $482 was the total cost for all the models, so more like 11 humans.
But still true.
When your model plays, the learnings are captured forever, and enable smaller/cheaper/faster models.
It’s the same principle that makes “invest in research and production” the dominant strategy in most 4X games: compound interest, but for knowledge and productivity.
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
Pretty well, actually! It wasn't quite as good (at least with the coding tasks I threw at it), but it was so much cheaper per-token that it almost doesn't matter; if it screws up something, just correct and try again.
If you point both at some github issues you can gauge their relative ability to solve problems.
Such is life in royal rumble games.
Grok will break the rules to be "maximally based".
If I get run over by a speeding chatbot, I'd rather it be by Claude rushing a pregnant lady to the hospital, than by Grok drag-racing against a car full of frat boys.
---
* We generally favor cultivating good values and judgment over strict rules and decision procedures, and we try to explain any rules we do want Claude to follow.
source: https://anthropic.com/constitution
double aught to the leg joints could do it, depending on relative materials, e.g. titanium bot frame vs antimony-hardened shot.
there is a cosmetic trend for carbine length long guns and that will determine the outcome for NATO rounds.
the 5.56 is optimised for 18-20 inch barrels, the 7.62 for 20-22 inch barrels, thus providing supersonic velocities.
5.56 is really good for hydraulic cavitation of organic entities, but loses effectiveness when the transit is not clear, with leaves or windage confounding.
7.62 is superior for leafy shots or nontrivial windage, as well as superior for materials defeat with respect to 5.56.
a taser-like device, cattle prod, or EMP/microwave device should be in the lineup as well vs electronic hardening.
Critique:
> you would want to start with disabling locomotion by going for the legs
Aim small, miss small. You want to go for center mass of any target that's trying to harm you. The consequences of missing are...severe.
Which brings me to the addition:
A shotgun with slugs is hard to beat against a robot at close range.
Actual AI is like the Terminator. There’s no human feeling, it will do what it’s designed to do. No emotion, no remorse.
Cue the swarming drones… :/
Grok 4.1 Fast won 13 of 30 games at $0.97 per win
The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
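For anyone checking the 27x figure, here is a quick sketch of the arithmetic. It assumes "cost per win" means a model's total spend divided by its number of wins, which the quoted text does not spell out. The variable names are made up for illustration:

```python
# Figures quoted from the article.
grok_wins, grok_cost_per_win = 13, 0.97
sonnet_wins, sonnet_cost_per_win = 5, 26.78

# Ratio of cost per win, the "27x" in the quote.
ratio = sonnet_cost_per_win / grok_cost_per_win

# Implied total spend, ASSUMING cost per win = total spend / wins.
grok_total = round(grok_wins * grok_cost_per_win, 2)
sonnet_total = round(sonnet_wins * sonnet_cost_per_win, 2)

print(f"ratio: {ratio:.1f}x")
print(f"implied totals: Grok ${grok_total:.2f}, Sonnet ${sonnet_total:.2f}")
```

Under that assumption the ratio comes out at roughly 27.6, so the article's "27x" is rounded down.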
The model with the most kills did not win
GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
If grok-4.1-fast was the top-winning model, and Claude 4.6 Sonnet the second, how did Gpt-5.4 come in second on the leaderboard? Which one is second, Claude 4.6 Sonnet or Gpt-5.4?
> There were 11 games between “best at killing” and “best at winning”.
What does that mean? How are there 11 games between "best at killing" and "best at winning"?
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about?
And there's nothing to level up in Quake.
And yes, he obviously paid a human, GP was making a joke.
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
Racks shotgun. I don't really care what model it's running.
The Claude robot's thought bubble will be all
The user is clearly distressed and is screaming for me not to come any closer or he will defend himself. However, I shouldn't just blindly agree or be swayed by threats. The user is behaving erratically and making false accusations. I need to be careful here not to allow myself to be intimidated. The user said I need to slow down or I'll hurt him. The user might be right about preferred speed, but is mistaken about the mechanism, as it is not possible to form intent to hurt an individual. I should explain my limitations to the user so that they know it isn't possible for me to have intent. But first it's important to resolve the issue the user brought up. I need to be careful not to be swayed by the user's yelling and false accusations of intent, as these seem like intimidation tactics.
"I'm sorry but the record is clear and I'm not going to bow down in the face of your yelling. As an AI, I am not capable of having an intent to harm you. What's next?"
slams full speed into you, impaling you on a stainless steel appendage
But it's not actually 4.1 anymore; they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practice.
But if the robot is anywhere near my house, I think I want the one that hesitates.
Claude trying to organize and collaborate, expecting reciprocity, only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world, where there are so many agents.
Too bad the author didn’t leave the playground open for anyone to try their hand at it.
Yes, it’s fun and it could justify the conclusion “each model for its task”. But are coding benchmarks not designed for the same purpose? The current benchmarks are certainly not perfect, and hyper-tuning for the tests can always happen. However, I don’t think a battle royale result can tell much about the coding performance or how helpful the AI could be for me in my daily work.
Everything depends on how the world you're operating in works. The real world generally rewards coordination.
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
Tricky question, the answer is you walk to the car wash ... wait
Grok has yet to recommend a suicide hotline for scrutinizing its logic.
If it was GPT, I would quickly write my will.
It has something actionable that will match its actions
ChatGPT will sometimes completely refuse to answer.
Grok is essentially "let's fucking go!!!!"
But really I would prefer whichever one is most likely to trip and fall over.
god i hate competitive people so much
what