Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14 min, 20 min, and 1 h 14 min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that frontier models can now solve in minutes what would take me days.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
I personally think anthropomorphizing LLMs is a bad idea.
> I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]
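The sieve step that chain-of-thought mentions can be sketched generically (a plain Sieve of Eratosthenes; the Rm and Km functions are problem-specific and not reproduced here):

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: return all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]  # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Mark every multiple of p starting at p*p as composite.
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_p in enumerate(sieve) if is_p]

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```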
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
They are good at transforming one format to another. They are good at boilerplate.
They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
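The conversion behind such a clock is easy to sketch (a minimal illustration of splitting a Unix timestamp by SI prefixes, not an actual implementation):

```python
import time

def metric_unix(now=None):
    """Break a Unix timestamp into tera/giga/mega/kilo/second digits."""
    t = int(now if now is not None else time.time())
    parts = {}
    for name, size in [("Ts", 10**12), ("Gs", 10**9), ("Ms", 10**6), ("ks", 10**3)]:
        parts[name], t = divmod(t, size)
    parts["s"] = t
    return parts

# e.g. 1,760,000,000 s after the epoch is 1 Gs + 760 Ms,
# i.e. 760 megaseconds into the second gigasecond.
print(metric_unix(1_760_000_000))
```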
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
The prompt the other person was using is:
``` Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting. ```
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
Generate an SVG animation of the following:
1 - There is a high-fantasy mage tower with a dome and a window at the top
2 - A green goblin comes in front of the tower carrying a torch
3 - A grumpy old mage with a beard and a tall purple hat appears in the tower window
4 - The mage sends a fireball that burns the goblin, and the whole screen is covered in fire.
The camera view must be from behind the goblin's back, so we are basically looking at the tower in front of us:
https://codepen.io/Runway/pen/zxqzPyQ
PS: but yeah, that's attempt #20 or something.
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to extrapolate about society and what this means for others. Meanwhile, I'm still wondering how they're still getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
Is the only thing that prevents a benchmark from being meaningful publicity?
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
it's easy to focus on what they can't do
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well, endowed. The 5th leg, however, is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
I see that AI is reaching the level of a middle school boy...
Meanwhile you could benchmark for something actually useful. If you're about to say "but that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like "how many r's in strawberry?", "uh uh uh, it says two". OK, but so what? What good is a benchmark that is so far from a real use case?
Will run extended benchmarks later, let me know if you want to see actual data.
Results are amazing! 2.5 and 3 seem way, way ahead.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Had to change to Electric Mentor....
The possibility was already a topic in the series "Mozart in the Jungle", where they made a robot that supposedly finished Mozart's Requiem.
That's basically "The Washing Machine Tragedy" by Stanisław Lem in a nutshell.
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think it's also fair to use any means to beat it.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of code so over-engineered that, while it totally works, I would never want to have to interact with it.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline; it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feel to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
but I would never want to have to interact with
That is its job security ;) It's fine, though, if you take the time to learn what it's doing and write a nicer version of it yourself.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Is there evidence that's true? That the other models are significantly better than the ones you named?
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
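For illustration, the field sits alongside the transferred model output in a content part. Only the `thoughtSignature` key and the dummy value below come from the linked docs; the surrounding structure is a simplified sketch:

```python
# Sketch of a transferred assistant turn being replayed to the API.
# Only "thoughtSignature" and its dummy value are from the docs;
# the "text" content here is a placeholder.
part = {
    "text": "(reply produced by the previous model)",
    "thoughtSignature": "context_engineering_is_the_way_to_go",
}
print(part["thoughtSignature"])
```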
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
Very good. Nice work! These things will definitely change lives.
For comparison:
* Gemini 2.5 Pro was $1.25/M input and $10/M output
* Gemini 1.5 Pro was $1.25/M input and $5/M output
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
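A quick sketch of why the unit change matters (the searches-per-prompt figure is hypothetical):

```python
# Old pricing billed per grounded prompt; new pricing bills per search query.
old_per_prompt = 35 / 1000   # $0.035 per grounded prompt
new_per_search = 14 / 1000   # $0.014 per search query

searches_per_prompt = 3      # hypothetical: one prompt may fire several searches
new_per_prompt = searches_per_prompt * new_per_search

# At 3 searches per prompt the new scheme is already pricier:
print(f"${old_per_prompt:.3f} vs ${new_per_prompt:.3f}")  # $0.035 vs $0.042
```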
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French and other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copy-pasting LLM output into some other website.
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks, but I wasn't able to convince it to present a plan before starting implementation.
I have instructions saying it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code's.
For Codex, those instructions make plan mode redundant.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been a great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
It's good there's competition in the space though.
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
But at least this company is not run by a narcissistic sociopath.
Good at vibe coding 10:30 - step change where it's actually useful
AGI still 5-10 years. Needs reasoning, memory, world models.
Is it a bubble? - Partly 22:00
What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)
Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.
Is that all? So they just need to invent:
1. Thought
2. A mechanism for efficiently encoding and decoding arbitrary percepts
3. A formal model of the world
And then the existing large language models can handle the rest.
Yep, 5 years and a hundred billion dollars or so should do the trick.
Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be due to chance.
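For intuition, the 95% margin of error on a pass rate follows from a simple binomial model (the dataset size isn't stated, so n below is illustrative):

```python
import math

def margin_95(p, n):
    """Approximate 95% margin of error for a pass rate p over n trials."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# A pass rate near 50% over ~1000 trials already carries a ~3 point margin:
print(round(margin_95(0.5, 1000) * 100, 1))  # ~3.1 percentage points
```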
Curious to see some third-party testing of this model. Currently it seems to primarily improve "general non-coding and visual reasoning", based on the benchmarks.
That's a "loss" I can deal with.
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Using Anthropic's or OpenAI's models is incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en
The initial separate subscriptions were confusing at best. The current situation is pretty much the same as Anthropic/OpenAI: straightforward.
Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)
I had the exact same experience and walked away to chatgpt.
What a mess.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
The person also claims that with thinking on, the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
I absolutely LOVE that Google themselves drew a sharp distinction here.
Edit: nvm it looks to be up for me again
Basically, these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
How come I can't even see prices without logging in... are they doing regional pricing?
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
"Create me a SVG of a pelican riding on a bicycle"
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...
- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...
Summary from GPT 5.1 is significantly longer and more verbose compared to Gemini 3 Pro (13,129 output tokens vs 3,776). Gemini 3 summary seems more readable, however, GPT 5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT 5 one. This time the readability of the Gemini 3 summary still seems great, while GPT 5.1 feels a bit improved but not quite there yet.
[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
Amazing and hilarious
Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.
No benchmark is safe, when this much money is on the line.
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, in which case I'll live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
Gemini 3 worked much better, and I actually committed the changes it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
Standard context (≤ 200K tokens):
* Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)
* Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)
Long context (> 200K tokens):
* Input: $4.00 vs $2.50 (same +60%)
* Output: $18.00 vs $15.00 (same +20%)
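A back-of-envelope comparison using the list prices above (the token counts are made up for illustration):

```python
def cost(in_tok, out_tok, in_price, out_price):
    """Dollar cost of one request, prices in $ per 1M tokens."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Hypothetical request: 100k input tokens, 10k output, standard context.
old = cost(100_000, 10_000, 1.25, 10.00)  # Gemini 2.5 Pro list price
new = cost(100_000, 10_000, 2.00, 12.00)  # Gemini 3 Pro list price
print(f"${old:.3f} -> ${new:.3f}")  # $0.225 -> $0.320
```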
"Users"? Or people that get presented with it and ignore it?
It also told me that on Plus I am most likely using "Flash" model. There is no "Flash" model in the dropdown to choose from. There is only "Fast" and "Thinking". It then told me "Fast" is just renamed Flash and it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5, it only mentions version 3 for both Plus and Pro plans. Of course on the dropdown menu it's impossible to see which model it is really using.
How can a normal person understand their products when their own super advanced thinking/reasoning model that took months to train on world's most advanced hardware can't?
It's amazing to me they don't see it as an epic failure in communication and marketing.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
Prompt:
Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:
1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge; if mirrored, it should detect it from the right edge. It's important that swiping be responsive - that is, the previous item should be visible under the current item while swiping, the same way it's handled in iOS applications. You should also add to the API the option for the swipe to be detected not just from the edge but from anywhere on the item, with the same behavior. If the swipe is released with x% of the current item out of view, we should animate and move to the previous item. If it's a small percentage, we should animate the current page back into place as if nothing happened.
2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.
Claude 4.5 solved it as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach was a lot more elegant. Just my $0.02.
OpenAI is going for the consumer-grade AI market, as opposed to a company like Anthropic making a specialized developer tool. Google can inject their AI tool in front of everybody in the world, and already have with Google AI search. All of these models are just going to reach parity eventually, but Google is burning cash compared to OpenAI burning debt. It seems like for consumer-grade purposes, AI use will just be free sooner or later (DeepSeek is free, Google AI search is free, students can get Gemini Pro for free for a year already). So all I'm seeing that OpenAI has is Sora, which seems like a business loser though I don't really understand it, and also ChatGPT seems to own the market of people roleplaying with chat bots as companions (which doesn't really seem like a multi-trillion dollar business but I could be wrong).
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
I also asked it for the best play when in backgammon opponent rolls 6-1 (plays 13/7 8/7) and you roll 5-1. It starts alright with mentioning a good move (13/8 6/5) but continues to hallucinate with several alternative but illegal moves. I'm not too impressed.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Temperature continues to be capped at a maximum of 0.2, and there's still a hidden top_k of 64 that you can't turn off.
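For reference, these are the knobs in the API's `generationConfig`; a minimal sketch of the fragment in question (the cap and forced top_k described above are applied server-side, not by the payload):

```python
# Sketch of the generationConfig fragment these settings live in.
# The values reflect the limits described above, not free choices:
generation_config = {
    "temperature": 0.2,  # reportedly the effective maximum in AI Studio
    "topK": 64,          # reportedly fixed and not disableable
}
print(generation_config)
```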
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
Do regular users know how to disable AI Overviews, if they don't love them?
0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...
It's transcribing the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it's done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
Google goes full Apple...
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Glad to see Google still can't get out of its own way.
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
side by side comparison of gemini with other models
gpt-5.1 honestly looking very comfortable given available usage limits and pricing
although gpt-5.1 used from chatgpt website seems to be better for some reason
Sonnet 4.5 agentic coding still holding up well and confirms my own experiences
i guess my reaction to gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month
What useful thing can it demonstrably do that its predecessors couldn't?
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
Good. That said, I wonder if those models are still LLMs.
Come on, you can’t be serious.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
More like 2 billion hostages
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
https://www.ascii-art.de/ascii/uvw/unicorn.txt
However, it is amazing how far spatial comprehension has improved in multimodal models.
I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.
Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.
Temperature: 1.85
Top-P 0.98
Answer: Edit (replaced with URL) https://justpaste.it/kpow3
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- VS Code (Microsoft's)
- Gemini
- Chrome Browser
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
import decimal

def solve_kangaroo_limit():
    # Set precision to handle the "digits different from six" requirement
    decimal.getcontext().prec = 50

    # For U(0,1), H(x) approaches 2x + 2/3 very rapidly (exponential decay of error)
    # At x = 10^6, the value is indistinguishable from the asymptote
    x = 10**6
    limit_value = decimal.Decimal(2) * x + decimal.Decimal(2) / decimal.Decimal(3)
    print(f"H({x}) ≈ {limit_value}")
    # Output: 2000000.66666666666666666666...

if __name__ == "__main__":
    solve_kangaroo_limit()
P.S.: decimal.Decimal(2) / decimal.Decimal(3) == 0.666666... repeated out to the context precision. This is your 'smart' computer?

Yes, I know the Workspaces workaround, but that's silly.
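On the Decimal complaint above: the run of sixes is just the context precision at work, not a bug, and the default rounding actually makes the final stored digit a 7, not a 6. A quick sketch:

```python
import decimal

# 50 significant digits, as in the snippet above
decimal.getcontext().prec = 50
two_thirds = decimal.Decimal(2) / decimal.Decimal(3)

# The default ROUND_HALF_EVEN rounds the 50th digit up: 49 sixes, then a 7
print(two_thirds)
assert str(two_thirds) == "0." + "6" * 49 + "7"
```

So if the puzzle hinges on "digits different from six", a fixed-precision Decimal is exactly the wrong tool: the last digit differs from six only because of rounding.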
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
Meh, not interested already
Also probably over for OpenAI
the token efficiency and context are also mindblowing...
it feels like I am talking to someone who can think instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real world applications that are larger than a few dozen files.
gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware, which in theory allows them way more flexibility, but gemini 3 just works and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmails are being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze user data, which potentially includes Gmail by default.
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...