Gemini 3.1 Pro (opens in new tab)

(blog.google)

963 pointsMallocVoidstar4mo ago914 comments

Preview: https://console.cloud.google.com/vertex-ai/publishers/google...

Card: https://deepmind.google/models/model-cards/gemini-3-1-pro/

914 comments

282 comments · 127 top-level

simonw4mo ago· 19 in thread

Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.

embedding-shape4mo ago

It's an excellent demonstration of the main issue I have with the Gemini family of models, they always go "above and beyond" to do a lot of stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but clouds, a sun, a hat on the pelican and so much more.

Exactly the same thing happens when you code, it's almost impossible to get Gemini to not do "helpful" drive-by-refactors, and it keeps adding code comments no matter what I say. Very frustrating experience overall.

11 more replies

jasonjmcghee4mo ago

What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.

The most absurd benchmaxxing.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

7 more replies

MrCheeze4mo ago

Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.

5 more replies

sam_14214mo ago

Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes

4 more replies

SoKamil4mo ago

It seems they trained the model to output good svg’s.

In their blog post[1], first use case they mention is svg generation. Thus, it might not be any indicator at all anymore.

[1] https://blog.google/innovation-and-ai/models-and-research/ge...

Arcuru4mo ago

Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/

1 more reply

WarmWash4mo ago

Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

https://www.svgviewer.dev/s/dEdbH8Sw

2 more replies

AmazingTurtle4mo ago

At this point, the pelican benchmark became so widely used that there must be high quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?

2 more replies

steve_adams_864mo ago

Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. I wouldn't have guessed this was going to emerge as a capability from LLMs 6 years ago. I see why it does now, but... It still amazes me that they're so good at some things.

3 more replies

bredren4mo ago

What is that, a snack in the basket?

3 more replies

TZubiri4mo ago

You think they are able to see their output and iterate on it? Or is it pure token generation?

benatkin4mo ago

I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output

tarr114mo ago

What do you think this particular prompt is evaluating for?

The more popular these particular evals are, the more likely the model will be trained for them.

1 more reply

saberience4mo ago

I hope we keep beating this dead horse some more, I'm still not tired of it.

calny4mo ago

Great pelican but what’s up with that fish in the basket?

3 more replies

infthi4mo ago

Wonder when will we get something other than a side view

2 more replies

DonHopkins4mo ago

How about STL files for 3d printing pelicans!

1 more reply

mohsen14mo ago

is there something in your prompt about hats? why the pelican always wearing a hat recently?!

1 more reply

xnx4mo ago

Not even animated? This is 2026.

1 more reply

quacky_batak4mo ago· 13 in thread

I’m keen to know how and where are you using Gemini.

Anthropic is clearly targeted to developers and OpenAI is general go to AI model. Who are the target demographic for Gemini models? ik that they are good and Flash is super impressive. but i’m curious

jdc05894mo ago

I use it as my main platform right now both for work/swe stuff, and person stuff. It works pretty well, they have the full suite of tools I want from general LLM chat, to notebookLM, to antigravity.

My main use-cases outside of SWE generally involve the ability to compare detailed product specs and come up with answers/comparisons/etc... Gemini does really well for that, probably because of the deeper google search index integration.

Also I got a year of pro for free with my phone....so thats a big part.

ggregoire4mo ago

I use it in Google Search. For example yesterday I typed in Google "postgres generate series 24 hour" and this morning "ffmpeg convert mp4 to wav". Previously I would have clicked on the first StackOverflow result (RIP), now I just take it from the Gemini summary (I'd say 95% of the time it's correct for basic programming language questions. I remember some hallucinations about psycopg3 and date-fns tho. As usual with AI, you need to already know the answer, at least partially, to detect the bs).

Also what's great about Gemini in Google Search is that the answer comes with several links, I use them sometimes to validate the correctness of the solution, or check how old the solution is (I've never used chatGPT so I don't know if chatGPT does it).

hunta20974mo ago

I use the Gemini web interface just as I would ChatGPT. They also have coding environment analogues of Claude-Code in Anti-gravity and Gemini-CLI.

When you sign up for the pro tier you also get 2TB of storage, Gemini for workspace and Nest Camera history.

If you're in the Google sphere it offers good value for money.

dinosor4mo ago

I find gemini to be the best at travel planning and for story telling of geographical places. For a road trip, I tried all three mainstream providers and I liked Gemini (also personal preference because Gemini took a verbose approach instead of bullet points from others) for it's responses, ways it discovered stories about places I wanted to explore, places it suggested for me and things it gave me to consider those places in the route.

minimaxir4mo ago

Gemini has an obvious edge over its competitors in one specific area: Google Search. The other LLMs do have a Web Search tool but none of them are as effective.

fatherwavelet4mo ago

I feel like Gemini 3 was incredible on non-software/coding research. I have learned so much systems biology the last two months it blows my mind.

I had only started using Opus 4.6 this week. Sonnet it seems like is much better at having a long conversation with. Gemini is good for knowledge retrieval but I think Opus 4.6 has caught up. The biggest thing that made Gemini worth it for me the last 3 months is I crushed it with questions. I wouldn't have even got 10% of the Opus use that I got from Gemini before being made to slow down.

I have a deep research going right now on 3.1 for the first time and I honestly have no idea how I am going to tell if it is better than 3.

It seems like agentic coding Gemini wasn't as good but just asking it to write a function, I think it only didn't one shot what I asked it twice. Then fixed the problem on the next prompt.

I haven't logged in to bother with chatGPT in about 3 months now.

dekhn4mo ago

I am a professional software developer who has been programming for 40 years (C, C++, Python, assembly, any number of other languages). I work in ML (infrastructure, not research) and spent a decade working at Google.

In short, I consider Gemini to be a highly capable intern (grad student level) who is smarter and more tenacious than me, but also needs significant guidance to reach a useful goal.

I used Gemini to completely replace the software stack I wrote for my self-built microscope. That includes:

writing a brand new ESP32 console application for controlling all the pins of my ESP32 that drives the LED illuminator. It wrote the entire ESP-IDF project and did not make any major errors. I had to guide with updated prompts a few times but otherwise it wrote the entire project from scratch and ran all the build commands, fixing errors along the way. It also easily made a Python shared library so I can just import this object in my Python code. It saved me ~2-3 days of working through all the ESP-IDF details, and did a better job than I would have.

writing a brand new C++-based Qt camera interface (I have a camera with a special SDK that allows controlling strobe and trigger and other details. It can do 500FPS). It handled all the concurrency and message passing details. I just gave it the SDK PDF documentation for the camera (in mixed english/chinese), and asked it to generate an entire project. I had to spend some time guiding it around making shared libraries but otherwise it wrote the entire project from scratch and I was able to use it to make a GUI to control the camera settings with no additional effort. It ran all the build commands and fixed errors along the way. Saved me another 2-3 days and did a better job than I could have.

Finally, I had it rewrite the entire microscope stack (python with qt) using the two drivers I described above- along with complex functionality like compositing multiple images during scanning, video recording during scanning, mesaurement tools, computer vision support, and a number of other features. This involved a lot more testing on my part, and updating prompts to guide it towards my intended destination (fully functional replacement of my original self-written prototype). When I inspect the code, it definitely did a good job on some parts, while it came up with non-ideal solutions for some problems (for example, it does polling when it could use event-driven callbacks). This saved literally weeks worth of work that would have been a very tedious slog.

From my perspective, it's worked extremely well: doing what I wanted in less time than it would take me (I am a bit of a slow programmer, and I'm doing this in hobby time) and doing a better job (With appropriate guidance) than I could have (even if I'd had a lot of time to work on it). This greatly enhances my enjoyment of my hobby by doing tedious work, allowing me to spend more time on the interesting problems (tracking tardigrades across a petri dish for hours at a time). I used gemini pro 3 for this- it seems to do better than 2.5, and flash seemed to get stuck and loop more quickly.

I have only lightly used other tools, such as ChatGPT/Codex and have never used Claude. I tend to stick to the Google ecosystem for several reasons- but mainly, I think they will end up exceeding the capabilities of their competitors, due to their inherent engineering talent and huge computational resources. But they clearly need to catch up in a lot of areas- for example, the VS Code Gemini extension has serious problems (frequent API call errors, messed up formatting of code/text, infinite loops, etc).

1 more reply

jug4mo ago

I personally use it as my general purpose and coding model. It's good enough for my coding tasks most of the time, has very good and rapid web search grounding that makes the Google index almost feel like part of its training set, and Google has a family sharing plan with individual quotas for Google AI Pro at $20/month for 5 users which also includes 2 TB in the cloud. Family sharing is a unique feature for Gemini 3 Flash Thinking (300 prompts per day and user) & Pro (100 prompts per day and user).

1 more reply

mehagar4mo ago

I use Gemini for personal stuff such as travel planning and research on how to fix something, which product to buy, etc. My company has as Pro subscription so I use that instead of ChatGPT.

epolanski4mo ago

Various friends of mine work in non-technology companies (banking, industries, legal, Italy) and in pretty much all of them there's Gemini enterprise + NotebookLM.

In all of them the approach is: this is the solution, now find problems you can apply it to.

thornewolf4mo ago

I have swapped to using gemini over chatgpt for casual conversation and question answering. there are some lacking features in the app but i get faster and more intelligent responses.

esafak4mo ago

I'd use it for planning, knowledge, and anything visual.

verdverm4mo ago

I use gemini for everything because I trust google to keep the data I send them safe, because they know how to run prod at scale, and they are more environmentally friendly than everyone else (tpu,us-central1).

This includes my custom agent / copilot / cowork (which uses vertex ai and all models therein). This is where I do more searching now (with genAi grounding) I'm about to work on several micro projects that will hold Ai a little differently.

All that being said, google Ai products suck hard. I hate using every one of them. This is more a reflection on the continued degradation of PM/Design at Big G, from before Ai, but accellationally worse since. I support removing Logan from the head of this shit show

disclaimer: long time g-stan, not so stan any more

spankalee4mo ago· 9 in thread

I hope this works better than 3.0 Pro

I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.

It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.

Within VS Code Copilot Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot it, won't stop and ask clarifying questions, though in Gemini CLI it will.

So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.

For as much as I hear Google's pulling ahead, Anthropic seems to be to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.

karmasimida4mo ago

Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

8 more replies

s3p4mo ago

Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"

4 more replies

Oras4mo ago

Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.

stephen_cagle4mo ago

I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information and also it is the only one that I can reliably (even with the video camera) take a picture of a bird and have it tell me what the bird is. But it is just pretty bad as a model to help with development, myself and everyone I know uses Claude. The benchmarks are always really close, but my experience is that it does not translate to real world (mostly coding) task.

tldr; It is great at search, not so much action.

3 more replies

knollimar4mo ago

Is the thinking token stream obfuscated?

Im fully immersed

1 more reply

slopinthebag4mo ago

Hmm, interesting..

My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?

1 more reply

agentifysh4mo ago

Relieved to read this from an ex-Googler at least we are no the crazy ones we are made out to be whenever we point out issues with Gemini

jbellis4mo ago

yeah, g3p is as smart or smarter as the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.

1 more reply

varispeed4mo ago

> stuck in loops

I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.

1 more reply

esafak4mo ago· 9 in thread

Has anyone noticed that models are dropping ever faster, with pressure on companies to make incremental releases to claim the pole position, yet making strides on benchmarks? This is what recursive self-improvement with human support looks like.

emp173444mo ago

Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of the sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.

2 more replies

redox994mo ago

I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many Months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.

B) There's more compute online

C) Competition is more fierce.

m_ke4mo ago

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

ankit2194mo ago

not much to do with self improvement as such. openai has increased its pace, others are pretty much consistent. Google last year had three versions of gemini-2.5-pro each within a month of each other. Anthropic released claude 3 in march 24, sonnet 3.5 in june 24, 3.5 new in oct 24, and then 3.7 in feb 25, where they went to 4 series in May 25. then followed by opus 4.1 in august, sonnet 4.5 in oct, opus 4.5 in nov, 4.6 in feb, sonnet 4.6 in feb itself. Yes, they released both within weeks of each other, but originally they only released it together. This staggered release is what creates the impression of fast releases. its as much a function of training as a function of available compute, and they have ramped up in that regard.

oliveiracwb4mo ago

With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence

PlatoIsADisease4mo ago

Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.

If we are benchmaxing this works well because its only being tested early on during the life cycle. By middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.

1 more reply

nikcub4mo ago

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

2 more replies

gmerc4mo ago

That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.

toephu24mo ago

This is what competition looks like.

1024core4mo ago· 8 in thread

It got the car wash question perfectly:

You are definitely going to have to drive it there—unless you want to put it in neutral and push!

While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.

Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.

Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

1024core4mo ago

Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:

> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?

Gemini said:

That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.

Here is the straightforward reality of the situation:

    The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.

    The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.

Grab some elephant-safe soap and head down to the water together!

Would you like some tips on the best methods and tools for scrubbing down an elephant?

4 more replies

embedding-shape4mo ago

GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.

> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than universe's will.

2 more replies

steve19774mo ago

The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?

2 more replies

upmind4mo ago

The answer here is why I dislike Gemini, though it gets the correct answer, it's far too verbose.

2 more replies

leumon4mo ago

Gemini 3 pro and flash already answered this correctly.

patrickmcnamara4mo ago

Truly we entering the era of AGI.

suddenlybananas4mo ago

They probably had time to toss that example in the training soup.

1 more reply

buttered_toast4mo ago

I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.

The model getting it correct or not at any given instance isn't the point, the point is if the model ever gets it wrong we can still assume that it still has some semblance of stochasticity in its output, given that a model is essentially static once it is released.

Additionally, hey don't learn post training (except for in context which I think counts as learning to some degree albeit transient), if hypothetically it answers incorrectly 1 in 50 attempts, and I explain in that 1 failed attempt why it is wrong, it will still be a 1-50 chance it gets it wrong in a new instance.

This differs from humans, say for example I give an average person the "what do you put in a toaster" trick and they fall for it, I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it, you can't really say that for a given model.

1 more reply

Robdel124mo ago· 8 in thread

I really want to use google’s models but they have the classic Google product problem that we all like to complain about.

I am legit scared to login and use Gemini CLI because the last time I thought I was using my “free” account allowance via Google workspace. Ended up spending $10 before realizing it was API billing and the UI was so hard to figure out I gave up. I’m sure I can spend 20-40 more mins to sort this out, but ugh, I don’t want to.

With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?

surgical_fire4mo ago

May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.

I am scared some automated system may just decide I am doing something bad and terminate my account. I have been moving important things to Proton, but there are some stuff that I couldn't change that would cause me a lot of annoyance. It's not trivial to set up an alternative account just for Gemini, because my Google account is basically on every device I use.

I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.

2 more replies

alpineman4mo ago

100% agreed. I wish someone would make a test for how reliably the LLMs follow tool use instructions etc. The pelicans are nice but not useful for me to judge how well a model will slot into a production stack.

1 more reply

phamilton4mo ago

> For those building with a mix of bash and custom tools, Gemini 3.1 Pro Preview comes with a separate endpoint available via the API called gemini-3.1-pro-preview-customtools. This endpoint is better at prioritizing your custom tools (for example view_file or search_code).

It sounds like there was at least a deliberate attempt to improve it.

pdntspa4mo ago

You can delete the billing from a given API key

Stevvo4mo ago

You could always use it through Copilot. The credits based billing is pretty simple without surprise charges.

horsawlarway4mo ago

So much this.

It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.

I want to do relatively simple things like:

1. Buy shit from you

2. For a controllable amount (ex - let me pick a limit on costs)

3. Without spending literally HOURS trying to understand 17 different fucking products, all overlapping, with myriad project configs, api keys that should work, then don't actually work, even though the billing links to the same damn api key page, and says it should work.

And frankly - you can't do any of it. No controls (at best delayed alerts). No clear access. No real product differentiation pages. No guides or onboarding pages to simplify the matter. No support. SHIT LOADS of completely incorrect and outdated docs, that link to dead pages, or say incorrect things.

So I won't buy shit from them. Period.

1 more reply

abiraja4mo ago

I've been using it lately with OpenCode and it's working pretty well (except for API reliability issues).

himata41134mo ago

use openrouter instead

1 more reply

dxbednarczyk4mo ago· 7 in thread

Every time I've used Gemini models for anything besides code or agentic work they lean so far into the RLHF induced bold lettering and bullet point list barf that everything they output reads as if the model was talking _at_ me and not _with_ me. In my Openclaw experiment(s) and in the Gemini web UI, I've specifically added instructions to avoid this type of behavior, but it only seemed to obey those rules when I reminded the model of them.

For conversational contexts, I don't think the (in some cases significantly) better benchmark results compared to a model like Sonnet 4.6 can convince me to switch to Gemini 3.1. Has anyone else had a similar experience, or is this just a me issue?

augusto-moura4mo ago

Gemini sounds less personal, but I think that is good. From my experience, the quality of response is much higher than ChatGPT or Grok, and it cites real sources. I want to have a mini-wikipedia response for my questions, not a friend's group chat response

gavinray4mo ago

I have the opposite viewpoint:

If a model doesn't optimize the formatting of its output display for readability, I don't want to read it.

Tables, embedded images, use of bulleted lists and bold/italicizing etc.

markab214mo ago

You just articulated why I struggle to personally connect with Gemini. It feels so unrelatable and exhausting to read its output. I prefer to read Opus/Deepseek/GLM over Gemini, Qwen and the open source GPT models. Maybe it is RLHF that is creating my distaste from using it. (I pay for Gemini; I should be using it more... but the outputs just bug me and feel more work to get actionable insight.)

1 more reply

staticman24mo ago

I'm not familiar with Openclaw and but the trick to solve this would be to embed a style reminder at the bottom of each user message and ideally hide that from the user with the UI.

This is how roleplay apps like Sillytavern customize the experience for power users by allowing hidden style reminders as part of the user message that accompany each chat message.

InkCanon4mo ago

I think they all output that bold lettering, point by point style output. I strongly suspect it's part of a synthetic data pipeline all these AI companies have, and it improves performance. Claude seems to be the least of them, but it will start writing code at the drop of a hat. What annoys me in Gemini is that it has a really strange tendency to come up with weird analogies, especially in Pro mode. You'll be asking it about something like red black trees and it'll say "Red Black Trees (The F1 of Tree Data Structures)".

1 more reply

losvedir4mo ago

It definitely has the worst "voice" in my opinion. Feels very overachieving McKinsey intern to me.

verdverm4mo ago

I have no issues adjusting gemini tone & style with system prompt content

minimaxir4mo ago· 5 in thread

Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing

Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3

Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.

rancar24mo ago

If we don't see a huge gain on the long-term horizon thinking reflected with the Vendor-Bench 2, I'm not going to switch away from CC. Until Google can beat Anthropic on that front, Claude Code paired with the top long-horizon models will continue to pull away with full stack optimizations at every layer.

jbellis4mo ago

still no minimal reasoning in G3.1P :(

(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)

2 more replies

oblio4mo ago

> Knowledge cutoff is unchanged at Jan 2025.

Isn't that a bit old?

1 more reply

agentifysh4mo ago

Looks like its cheaper than codex ??? this might be interesting then

1 more reply

plaidfuji4mo ago

Sounds like the update is mostly system prompt + changes to orchestration / tool use around the core model, if the knowledge cutoff is unchanged

2 more replies

xrd4mo ago· 5 in thread

These models are so powerful.

It's totally possible to build entire software products in the fraction of the time it took before.

But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.

It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.

Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.

I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtle different behavioral characteristics without notice. It'll be quicksand.

simonw4mo ago

I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20

This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.

4 more replies

jama2114mo ago

Yeah I keep maintaining a specific app I built with gpt 5.1 codex max with that exact model because it continues to work for the requests I send it, and attempts with other models even 5.2 or 5.3 codex seemed to have odd results. If I were superstitious I would say it’s almost like the model that wrote the code likes to work on the code better. Perhaps there’s something about the structure it created though that it finds easier to understand…

seizethecheese4mo ago

> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

In my experience it’s more like idiot savant engineers. Still remarkable.

1 more reply

worldsavior4mo ago

Sushy dinner? What are you building with AI, a calculator?

WarmWash4mo ago

I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.

Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).

davidguetta4mo ago· 5 in thread

Implementation and Sustainability Hardware: Gemini 3 Pro was trained using Google’s Tensor Processing Units (TPUs). TPUs are specically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution for handling the growing complexity of large foundation models. Training can be distributed across multiple TPU devices for faster and more efficient processing.

So google doesn't use NVIDIA GPUs at all ?

dekhn4mo ago

When I worked there, there was a mix of training on nvidia GPUs (especially for sparse problems when TPUs weren't as capable), CPUs, and TPUs. I've been gone for a few years but I've heard a few anecdotal statements that some of their researchers have to use nvidia GPUs because the TPUs are busy.

rjh294mo ago

I assume that's a Gemini LLM response? You can tell Gemini is bullshitting when it starts using "often" or "usually" - like in this case "TPUs often come with large amounts of memory". Either they did or they didn't. "This (particular) mall often has a Starbucks" was one I encountered recently.

1 more reply

PunchTornado4mo ago

no. only tpus

paride57454mo ago

Another reason to use Gemini then.

Less impact on gamers…

1 more reply

lejalv4mo ago

Bla bla bla yada sustainability yada often come with large better growing faster...

It's such an uninformative piece of marketing crap

mijoharas4mo ago· 4 in thread

Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]

Are Google planning to put any of their models into production any time soon?

Also somewhat funny that some models are deprecated without a suggested alternative(gemini-2.5-flash-lite). Do they suggest people switch to Claude?

[0] https://ai.google.dev/gemini-api/docs/deprecations

andrewmutz4mo ago

I agree completely. I don't know how anyone can be building on these models when all of them are either deprecated or not actually released yet. As someone who has production systems running on the deprecated models, this situation really causes me grief.

2 more replies

NitpickLawyer4mo ago

You are reading your link wrong. They are deprecating 2.5-preview models. 2.5 (including lite) are up till at least sept/oct 26.

1 more reply

vidarh4mo ago

This feels very Google

1 more reply

jakejohnson4mo ago

I haven't seen any deprecation notices for 2.5 yet, just for 2. I'd expect (and hope) the deprecation timeline for 2.5 is longer since 3.0 is still in preview. Maybe they just default to 1 year here?

> Note: The shutdown dates listed in the table indicate the /earliest/ possible dates on which a model might be retired. We will communicate the exact shutdown date to users with advance notice to ensure a smooth transition to a replacement model.

1 more reply

nickandbro4mo ago· 4 in thread

Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:

"create a svg of a unicorn playing xbox"

https://www.svgviewer.dev/s/NeKACuHj

Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.

simonw4mo ago

Interesting how it went a bit more 3D with the style of that one compared to the pelican I got.

andy12_4mo ago

I'm thinking now that as models get better and better at generating SVGs, there could be a point where we can use them to just make arbitrary UIs and interactive media with raw SVGs in realtime (like flash games).

2 more replies

roryirvine4mo ago

On the other hand, creation of other vector image formats (eg. "create a postscript file showing a walrus brushing its teeth") hasn't improved nearly so much.

Perhaps they're deliberately optimising for SVG generation.

mclau1534mo ago

can we move on from SVG to 3D models at some point?

2 more replies

maxloh4mo ago· 4 in thread

Gemini 3 seems to have a much smaller token output limit than 2.5. I used to use Gemini to restructure essays into an LLM-style format to improve readability, but the Gemini 3 release was a huge step back for that particular use case.

Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response, it still truncates the source text too aggressively, losing vital context and meaning in the restructuring process.

I hope the 3.1 release includes a much larger output limit.

NoahZuniga4mo ago

Output limit has consistently been 64k tokens (including 2.5 pro).

esafak4mo ago

People did find Gemini very talkative so it might be a response to that.

MallocVoidstarOP4mo ago

> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

1 more reply

jayd164mo ago

> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

3 more replies

the_duke4mo ago· 4 in thread

Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!

BUT it is not good at all at tool calling and agentic workflows, especially compared to the recent two mini-generations of models (Codex 5.2/5.3, the last two versions of Anthropic models), and also fell behind a bit in reasoning.

I hope they manage to improve things on that front, because then Flash would be great for many tasks.

verdverm4mo ago

These improvements are one of the things specifically called out on the submitted page

chermi4mo ago

You can really notice the tool use problems. They gotta get on that. The agent trend seems real, and powerful. They can't afford to fall behind on it.

2 more replies

spwa44mo ago

In other words: they just need to motivate their employees while giving in to finance's demands to fire a few thousand every month or so ...

And don't forget, it's not just direct motivation. You can make yourself indispensable by sabotaging or at least not contributing to your colleagues' efforts. Not helping anyone, by the way, is exactly what your managers want you to do. They will decide what happens, thank you very much, and doing anything outside of your org ... well there's a name for that, isn't there? Betrayal, or perhaps death penalty.

anthonypasq4mo ago

yeah, it seems to me like Gemini is a little behind on the current RL patterns and also they dont seem interested in really creating a dedicated coding model. I think they have so much product surface (search, AI mode, gmail, youtube, chrome etc), they are prioritizing making the model very general. but who knows im just talking out of my ass.

saberience4mo ago· 4 in thread

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

user342834mo ago

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

For development I tend to use Antigravity with Sonnet 4.5, or Gemini Flash if it's about a GUI change in React. The layout and design of Gemini has been superior to Claude models in my opinion, at least at the time. Flash also works significantly faster.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.

konart4mo ago

> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to sound bad but you didn't even provide of cases you are using Claude\Codex\Gemini for.

cmrdporcupine4mo ago

Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.

1 more reply

skerit4mo ago

I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

For the past few weeks, I've also been using XML-like tags in my prompts more often. Sometimes preferring to share previous conversations with `<user>` and `<assistant>` tags. Opus/Sonnet handles this just fine, but Gemini has a mental breakdown. It'll just start talking to itself.

Even in totally out-of-the-ordinary sessions, it goes crazy. After a while, it'll start saying it's going to do something, and then it pretends like it's done that thing, all in the same turn. A turn that never ends. Eventually it just starts spouting repetitive nonsense.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200.000 token mark.

1 more reply

Topfi4mo ago· 4 in thread

Appears the only difference to 3.0 Pro Preview is Medium reasoning. Model naming has long gone from even trying to make sense, but considering 3.0 is still in preview itself, increasing the number for such a minor change is not a move in the right direction.

GrayShade4mo ago

Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?

xnx4mo ago

> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.

What would you have named it?

1 more reply

argsnd4mo ago

I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.

jannyfer4mo ago

According to the blog post, it should be also great at drawing pelicans riding a bicycle.

tenpoundhammer4mo ago· 3 in thread

In an attempt to get outside of benchmark gaming I had it make Platypus on a Tricycle. It's not as good as pelican on bicycle. https://www.svgviewer.dev/s/BiRht5hX

textlapse4mo ago

To really confuse it, ask it to take that tricycle with the platypus on it to a car wash.

dinosor4mo ago

For a moment I assumed the output would look like Perry the Platipus from the Disney (I think?) show. It's suprising to me (as a layman) that a show with lots of media that would've made it to the training corpus didn't show up.

0_____04mo ago

that's better than i thought it would be

1 more reply

mark_l_watson4mo ago· 3 in thread

Fine, I guess. The only commercial API I use to any great extent is gemini-3-flash-preview: cheap, fast, great for tool use and with agentic libraries. The 3.1-pro-preview is great, I suppose, for people who need it.

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

throwaway20274mo ago

Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above it's weight, it's decent and I rarely if ever run into any token limit either.

1 more reply

PlatoIsADisease4mo ago

What models are you running locally? Just curious.

I am mostly restricted to 7-9B. I still like ancient early llama because its pretty unrestricted without having to use an abliteration.

1 more reply

nurettin4mo ago

I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.

makeavish4mo ago· 3 in thread

Great model until it gets nerfed. I wish they had a higher paid tier to use non nerfed model.

Mond_4mo ago

Bad news, John Google told me they already quantized it immediately after the benchmarks were done and it sucks now.

I miss when Gemini 3.1 was good. :(

spyckie24mo ago

I think there is a pattern it will always be nerfed the few weeks before launching a new model. Probably because they are throwing a bunch of compute at the new model.

1 more reply

xnx4mo ago

What are you talking about?

sigmar4mo ago· 2 in thread

blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

ripbozo4mo ago

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests

4 more replies

sho_hn4mo ago

The touted SVG improvements make me excited for animated pelicans.

6 more replies

pawelduda4mo ago· 2 in thread

It's safe to assume they'll be releasing improved Gemini Flash soon? The current one is so good & fast I rarely switch to pro anymore

tucnak4mo ago

Gemini 3 Pro (high) is a joke compared to Gemini 3 Flash in Antigravity, except it's not even funny. Flash is insane value, and super capable, too. I've had it implement a decompiler for very obscure bytecode, and it was passing all tests in no time. PITA to refactor later, but not insurmountable. Gemini 3 Pro (high) choked on this problem in the early stages... I'm looking forward to comparing 3.1 Pro vs 3.0 Flash, hopefully they have improved on it enough to finally switch over.

derac4mo ago

When 3 came out they mentioned that flash included many improvements that didn't make it into pro (via an hn comment). I imagine this release includes those.

zokier4mo ago· 2 in thread

> Last week, we released a major update to Gemini 3 Deep Think to solve modern challenges across science, research and engineering. Today, we’re releasing the upgraded core intelligence that makes those breakthroughs possible: Gemini 3.1 Pro.

So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.

WarmWash4mo ago

Deep Think is a few 3.1 models working together. It was suspected last week that Deep Think was composed using the new 3.1 model.

verdverm4mo ago

3.1 == model

deep think == turning up thinking knob (I think)

deep research == agent w/ search

mixel4mo ago· 2 in thread

Google seems to really pull ahead in this AI race. For me personally they offer the best deal and although the software is not quiet there compared to openai or anthropic (in regards to 1. web GUI, 2. agent-cli). I hope they can fix that in the future and I think once Gemini 4 or whatever launches we will see a huge leap again

rubslopes4mo ago

I don't understand this sentiment. It may hold true for other LLM use cases (image generation, creative writing, summarizing large texts), but when it comes to coding specifically, Google is *always* behind OpenAI and Anthropic, despite having virtually infinite processing power, money, and being the ones who started this race in the first place.

Until now, I've only ever used Gemini for coding tests. As long as I have access to GPT models or Sonnet/Opus, I never want to use Gemini. Hell, I even prefer Kimi 2.5 over it. I tried it again last week (Gemini Pro 3.0) and, right at the start of the conversation, it made the same mistake it's been making for years: it said "let me just run this command," and then did nothing.

My sentiment is actually the opposite of yours: how is Google *not* winning this race?

1 more reply

eknkc4mo ago

I hope they fail.

I honestly do not wish Google to have the best model out there and be forced to use their incomprehensible subscription / billing / project management whatever shit ever again.

I don’t know what their stuff cost. I don’t know why would I use vertex or ai studio. What is included in my subscription what is billed per use.

I pray that whatever they build fails and burns.

4 more replies

denysvitali4mo ago· 2 in thread

Where is Simon's pelican?

codethief4mo ago

Not Simon's but here is one: https://news.ycombinator.com/item?id=47075709

1 more reply

saberience4mo ago

Please no, let's not.

jeffybefffy5194mo ago· 2 in thread

Someone needs to make an actual good benchmark for LLM's that matches real world expectations, theres more to benchmarks than accuracy against a dataset.

casey24mo ago

We don't need real world benchmarks, if they were good for real world tasks people would use them We need scientific benchmarks that tease out the nature of intelligence. There are plenty of unsaturated benchmarks. Solving chess using "mostly" language modeling is still an open problem. And beyond that creating a machine that can explain why that move is likely optimal at some depth. AI that can predict the output of another AI.

robotpepi4mo ago

this reminds me of that joke of someone saying "it's crazy that we have ten different standards for doing this", and then there're 11 standards

1 more reply

WarmWash4mo ago· 1 in thread

3.1 Pro is the first model to correctly count the number of legs on my "five legged dog" test image. 3.0 flash was the previous best, getting it after a few prompts of poking. 3.1 got it on the first prompt though, with the prompt being "How many legs does the dog have? Count Carefully".

However, it didn't get it on the first try with the original prompt (prompt: "How many legs does the dog have?"). It initially said 4, then with a follow up prompt got it to hesitantly say 5, with one limb must being obfuscated or hidden.

So maybe I'll give it a 90%?

This is without tools as well.

merlindru4mo ago

your question may have become part of the training data with how much coverage there was around it. perhaps you should devise a new test :P

5 more replies

zhyder4mo ago· 1 in thread

Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.

I wish Google also updated Flash-lite to 3.0+, would like to use that for the Explore subagent (which Claude Code uses Haiku for). These subagents seem to be Claude Code's strength over Gemini CLI, which still has them only in experimental mode and doesn't have read-only ones like Explore.

WarmWash4mo ago

>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

1 more reply

XCSme4mo ago· 1 in thread

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

XCSme4mo ago

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore a lot of the angles of a problem, and they might accidentally have an "a-ha" moment that leads them on the right path. It's a bit like doing random sampling and stumbling upon the right result after doing gradient descent from those points.

qingcharles4mo ago· 1 in thread

I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

nubg4mo ago

> I've been playing with the 3.1 Deep Think version of this

How?

1 more reply

janalsncm4mo ago· 1 in thread

This model says it accepts video inputs. I asked it to transcribe a 5 second video of a digital water curtain which spelled “Boo Happy Halloween”, and it came back with “Happy” which wasn’t the first frame, but also is incomplete.

This kind of test is good because it requires stitching together info from the whole video.

aabhay4mo ago

It reads videos at 1fps by default. You have to set the video resolution to high in ai studio

1 more reply

sergiotapia4mo ago· 1 in thread

To use in OpenCode, you can update the models it has:

    opencode models --refresh

Then /models and choose Gemini 3.1 Pro

You can use the model through OpenCode Zen right away and avoid that Google UI craziness.

---

It is quite pricey! Good speed and nailed all my tasks so far. For example:

    @app-api/app/controllers/api/availability_controller.rb 
    @.claude/skills/healthie/SKILL.md 

    Find Alex's id, and add him to the block list, leave a comment 
    that he has churned and left the company. we can't disable him 
    properly on the Healthie EMR for now so 
    this dumb block will be added as a quick fix.

Result was:

    29,392 tokens
    $0.27 spent

So relatively small task, hitting an API, using one of my skills, but a quarter. Pricey!

gbalduzzi4mo ago

I don't see it even after refresh. Are you using the opencode-gemini-auth plugin as well?

1 more reply

PunchTornado4mo ago· 1 in thread

The biggest increase is LiveCodeBench Pro: 2887. The rest are in line with Opus 4.6 or slightly better or slightly worse.

shmoogy4mo ago

but is it still terrible at tool calls in actual agentic flows?

josalhor4mo ago· 1 in thread

I speculated that 3 pro was 3.1... I guess I was wrong. Super impressive numbers here. Good job Google.

refulgentis4mo ago

> I speculated that 3 pro was 3.1

1 more reply

dude2507114mo ago· 1 in thread

I hereby allow you to release models not at the same time as your competitors.

sigmar4mo ago

It is super interesting that this is the same thing that happened in November (ie all labs shipping around the same week 11/12-11/23).

1 more reply

impulser_4mo ago· 1 in thread

Seems like they actually fixed some of the problems with the model. Hallucinations rate seems to be much better. Seems like they also tuned the reasoning maybe that were they got most of the improvements from.

whynotminot4mo ago

The hallucination rate with the Gemini family has always been my problem with them. Over the last year they’ve made a lot of progress catching the Gemini models up to/near the frontier in general capability and intelligence, but they still felt very late 2024 in terms of hallucination rate.

Which made the Gemini models untrustworthy for anything remotely serious, at least in my eyes. If they’ve fixed this or at least significantly improved, that would be a big deal.

1 more reply

Murfalo4mo ago· 1 in thread

I like to think that all these pelican riding a bicycle comments are unwittingly iteratively creating the optimal cyclist pelican as these comment threads are inevitably incorporated in every training set.

alpineman4mo ago

More like half of Google's AI team is hanging out on HN, and they can optimise for that outcome to get a good rep among the dev community.

2 more replies

onlyrealcuzzo4mo ago· 1 in thread

We've gone from yearly releases to quarterly releases.

If the pace of releases continues to accelerate - by mid 2027 or 2028 we're headed to weekly releases.

rubicon334mo ago

But actual progress seems to be slower. These modes are releasing more often but aren’t big leaps.

3 more replies

seizethecheese4mo ago· 1 in thread

I use Gemini flash lite in a side project, and it’s stuck on 2.5. It’s now well behind schedule. Any speculation as to what’s going on?

foruhar4mo ago

Gemini-3.0-flash-preview came out right away with the 3.0 release and I was expecting 3.0-flash-lite before a bump on the pro model. I wonder if they have abandoned that part of the Pareto/price-performance.

johnwheeler4mo ago· 1 in thread

I know Google has anti-gravity but do they have anything like Claude code as far as user interface terminal basically TUI?

alooPotato4mo ago

https://github.com/google-gemini/gemini-cli

1 more reply

nautilus124mo ago· 1 in thread

Ok, why don't you work on getting 3.0 out of preview first? 10 min response time is pretty heinous

mucai824mo ago

I agree, according to Googles terms you are not allowed to use the preview model for production use cases. And 3.0 has been in preview for a loooong time now :(

matrix25964mo ago· 1 in thread

Gemini 3.1 Pro is based on Gemini 3 Pro

skerit4mo ago

Lol, and this line:

> Geminin 3.1 Pro can comprehend vast datasets

Someone was in a hurry to get this out the door.

naiv4mo ago· 1 in thread

ok , so they are scared that 5.3 (pro) will be released today/tomorrow and blow it out of the water and rushed it while they could still reference 5.2 benchmarks.

PunchTornado4mo ago

I don't think models blow other models anymore. We have the big 3 which are neck to neck in most benchmarks and the rest. I doubt that 5.3 will blow the others.

1 more reply

mustaphah4mo ago· 1 in thread

Google is terrible at marketing, but this feels like a big step forward.

As per the announcement, Gemini 3.1 Pro score 68.5% on Terminal-Bench 2.0, which makes it the top performer on the Terminus 2 harness [1]. That harness is a "neutral agent scaffold," built by researchers at Terminal-Bench to compare different LLMs in the same standardized setup (same tools, prompts, etc.).

It's also taken top model place on both the Intelligence Index & Coding Index of Artificial Analysis [2], but on their Agentic Index, it's still lagging behind Opus 4.6, GLM-5, Sonnet 4.6, and GPT-5.2.

---

[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

[2] https://artificialanalysis.ai

saberience4mo ago

Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

Every time they announce the best benchmarks I try again at using their tools and products and each time I immediately go back to Claude and Codex models because Google is just so terrible at building actual products.

They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code, it's so incredibly shitty.

2 more replies

makeavish4mo ago· 1 in thread

I hope to have great next two weeks before it gets nerfed.

unsupp0rted4mo ago

I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks

2 more replies

throwaw124mo ago· 1 in thread

Can we switch from Claude Code to Google yet?

Benchmarks are saying: just try

But real world could be different

foruhar4mo ago

My sense is that the Gemini models are very capable but the Gemini CLI experience is subpar compared to Claude Code and Codex. I'm guess that it's the harness but since it can get confused, fall into doom loops, and generally lose the plot in a way that the model does not in Gemini Studio or the Gemini app.

I think a bunch of these harnesses are open source so it surprises me that there can be such a gulf between them.

2 more replies

techgnosis4mo ago· 1 in thread

I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.

CamperBob24mo ago

(Shrug) Ask it to write one!

jcims4mo ago· 1 in thread

Pelican on a bicycle in drawio - https://imgur.com/a/tNgITTR

(FWIW I'm finding a lot of utility in LLMs doing diagrams in tools like drawio)

pqdbr4mo ago

How are you prompting it to draw diagrams in drawio

3 more replies

sdeiley4mo ago

People underrate Google's cost effectiveness so much. Half price of Opus. HALF.

Think about ANY other product and what you'd expect from the competition thats half the price. Yet people here act like Gemini is dead weight

____

Update:

3.1 was 40% of the cost to run AA index vs Opus Thinking AND SONNET, beat Opus, and still 30% faster for output speed.

https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...

22 more replies

sheepscreek4mo ago

If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).

So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.

While it gives me hope, I am going to play it by the ear. Otherwise it’s going to be - Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.

UPDATE: I may have spoken too soon.

  > Fixing Truncated Array Syncing Bug
  > I traced the missing array items to a typo I made earlier! 
  > When fixing the GC cast crash, I accidentally deleted the assignment..
  > ..effectively truncating the entire array behind it.

These errors should not be happening! They are not the result of missing knowledge or a bad hunch. They are coming from an incorrect find/replace, which makes them completely avoidable!

On a lighter note, every time it happens, I think about this Family Guy: https://youtu.be/HtT2xdANBAY?si=QicynJdQR56S54VL&t=184

2 more replies

faebi4mo ago

I'm doing Ruby and Gemini 3.0 pro has by far been the best model for me. It writes the nicest ruby code, like I would. Further, it either succeeds or fails hard and obviously. I prefer it failing hard instead of of slowly going weird in my code.

Similar in antigravity. Privately it's my absolute favorite.

So I'm actually rooting for this.

1 more reply

ttul4mo ago

What I’m noticing, overall: I’ve never cut so much code in my life. I’ve become a coding monster with one of those dark green GitHub profiles ever since 5.3-Codex gave me the confidence to load in a ridiculous number of tasks every day and let it rip. I have about three coding tasks going at once and in another window, Claude Cowork is ripping through PowerPoints and getting back to lawyers.

This tech is not going to replace us. If anything, I am becoming even more of a workaholic. But the output volume is going to pay off for those who are privileged enough to use these tools.

5 more replies

exabrial4mo ago

You know what would slay right now? A native app.

Not another piece of Electron bloatware, a regular, efficient, fast, snappy, native, app. One that connects to my MCP severs and has local filesystem tools.

Anthropic might fall behind Google/OpenAI eventually, but their Desktop App + MCP/Connectors is unbelievably useful to get real work done.

1 more reply

mbh1594mo ago

77.1% on ARC-AGI-2 and still can't stop adding drive-by refactors. ARC-AGI-2 tests novel pattern induction, it's genuinely hard to fake and the improvement is real. But it doesn't measure task scoping, instruction adherence, or knowing when to stop. Those are the capabilities practitioners actually need from a coding agent. We have excellent benchmarks for reasoning. We have almost nothing that measures reliability in agentic loops. That gap explains this thread.

zapnuk4mo ago

Gemini 3 was:

1. unreliable in GH copilot. Lots of 500 and 4XX errors. Unusable in the first 2 months

2. not available in vertex ai (europe). We have requirements regarding data residency. Funny enough anthropic is on point with releasing their models to vertex ai. We already use opus and sonnet 4.6.

I hope google gets their stuff together and understands that not everyone wants/can use their global endpoint. We'd like to try their models.

ArmandoAP4mo ago

Model Card https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

infinitewars4mo ago

I find Gemini is great at generating code that is relatively common on the internet, especially web and algorithms. It is absolutely better at this then OpenAI's models. But Gemini is not as good at reasoning about problems from first principles, or catching subtle bugs. In some ways it is just a better Google that finds exactly what you want, less a general intelligence.

veselin4mo ago

I am actually going to complain about this: that neither of the Gemini models are not preview ones.

Anthropic seems the best in this. Everything is in the API on day one. OpenAI tend to want to ask you for subscription, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intent to release this model?

vnglst4mo ago

I asked Gemini 3.1 Pro to generate some of the modern artworks in my "Pelican Art Gallery". I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern

2 more replies

vnglst4mo ago

I asked Gemini 3.1 Pro Preview to generate the modern artworks as SVG for my Pelican Art Gallery. I particularly like the rendition of the Sunflowers: https://pelican.koenvangilst.nl/gallery/category/modern

agentifysh4mo ago

My enthusiasm is a bit muted this cycle because I've been burned by Gemini CLI. These models are very capable but Gemini CLI just doesn't seem to be able to work for one it never follows instructions strictly like its competitors do, and it hallucinates even which is a rarity.

More importantly feels like Google is stretched thin across different Gemini products and pricing reflects this, I still have no idea how to pay for Gemini CLI, in codex/claude its very simple $20/month for entry and $200/month for ton of weekly usage.

I hope whoever is reading this from Google they can redeem Gemini CLI by focusing on being competitive instead of making it look pretty (that seems to be the impression I got from the updates on X)

1 more reply

timabdulla4mo ago

Google tends to trumpet preview models that aren't actually production-grade. For instance, both 3 Pro and Flash suffer from looping and tool-calling issues.

I would love for them to eliminate these issues because just touting benchmark scores isn't enough.

upmind4mo ago

In my experience, while Gemini does really well in benchmarks I find it much worse when I actually use the model. It's too verbose / doesn't follow instructions very well. Let's see if that changes with this model.

thallavajhula4mo ago

This is great. I am hopeful that Gemini 3.1 Pro would be great. So far, I'm almost always pulled away from Gemini models by Claude. Having used Claude Opus High for a while now, Claude Opus seems to be fantastic at coding. Even Gemini's comparison chart says so. OpenAI's 5.3-codex is by far the weakest (of the 3) for my coding purposes. Claude Opus really shines at explanations and generating code.

Gemini is almost great. Claude Opus is great. I keep switching among these subscriptions every month to not miss out on any of the offerings for too long; ChatGPT Plus <-> Gemini Pro <-> Claude.

2 more replies

WarmWash4mo ago

It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.

Either way early user tests look promising.

carpe__diem4mo ago

One thing I’d like to see in these releases is stronger emphasis on regression behavior, not just headline capability.

In production, the costly failures are usually "almost right" edits that quietly shift semantics across large diffs.

We now gate model upgrades behind a fixed eval set of our own repos + prompts and compare pass rates by task category (refactor, test repair, API migration). Raw benchmark gains matter less to us than variance and rollback safety. If 3.1 improves consistency on long multi-file edits, that’s a bigger win than a small jump on one-shot tasks.

XCSme4mo ago

Funnily, on my tests, 3 flash with medium reasoning does better. Seems like 3.1 pro reasoned about the correct answer, but chose to go with a different (wrong) one: https://aibenchy.com/compare/?left=google-gemini-3-flash-pre...

EDIT: while also being 3x cheaper

dudeinhawaii4mo ago

After 2 days of giving it a go, I find that Gemini CLI is still considerably worse than both Codex and Claude Code.

The model itself also has strange behaviors that seem like it gets randomly replaced with Gemini-3-Flash or something else. I'll explain.

Once agentic coding was a bust, I gave it a run as a daily driver for AI assistant. It performed fairly well but then began behaving strangely. It would lose context mid conversation. For instance, I said "In san francisco I'm looking for XYZ". Two turns later I'm asking about food and it gives me suggestions all over the world.

Another time, I asked it about the likelihood of the pending east coast winter storm of affecting my flight. I gave it all the details (flight, stops, time, cities).

Both GPT-5.2 and Claude crunched and came back with high quality estimations and rationale. Gemini 3.1 Pro... 5 times, returned a weather forecast widget for either the layover or final destination. This was on "Pro" reasoning, the highest exposed on the Gemini App/WebApp. I've always suspected Google swaps out models randomly so this.. wasn't surprising.

I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.

This tells me that a Google AI Ultra subscription gives me a sub-par coding agent which often swaps in Flash models, a sub-par web/app AI experience that also isn't using the advertised SOTA models, and a bunch of preview apps for video gen, audio gen (crashed every time I attempted), and world gen (Genie was interesting but a toy).

This will be a quick cancel as soon as the intro rate is done.

It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.

Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.

hackrmn4mo ago

I am reading opinions here from agent users, but I haven't adopted the "agentic workflow" myself because I believe I am (for now) now getting a lot of my trouble's worth using Gemini (3 Pro) in the traditional conversational manner. It is adequate at suggesting solutions in the form of code, or reasoning in general. My problems are software engineering but also everything that is not, since I have a subscription it's my go to problem solving partner. I see no reasons to switch to another product for now either, I am constantly in the loop getting samples of chats with Grok and ChatGPT and it seems a very close race. If Claude is that one race horse that's built different -- and I absolutely can believe it is so because they have rightfully tuned it -- I am not convinced I am missing out much. But maybe because I am more traditionalist to most of everyone's having embraced the idea of having an agent run a loop on their workstation(s) and trusting it to deliver. Perhaps if I were in more of a tight time frame, I'd be pressed to do so myself, but for now I am already benefiting from the extra speed "rubberducking" with Gemini all manner of software engineering problems that I need to solve, so I simply have no reasons to abandon it. I think this is also Google's strength -- they have the data, they've already integrated Gemini or a variant of it anyway, into google.com which is one of their prized cash cows, and it's everywhere else too. Like others here have said, Google may not have the absolute best in class at all times, but they're fairly good and they still have the brains that gave us DeepMind and GPT, unless there's some sort of stagnation going on in their ranks, I expect they're not resting on the laurels. With their capital they're still at the head of the race. Anthropic and OpenAI have the benefit of being nimble, though, and it shows too. Anyway, competition is good, the cat's out of the bag and on the greener side of the river :-)

nobrains4mo ago

In the "Intelligence applied" section, where they show the comparison animations, they are shown using a non-optimal UI.

There is not enough time to read the text, see old animation, and see new animation. Better would have been to keep the same animation on repeat, so that people have unlimited time to read the text and observer the animations.

Also, it jumps from example to example in the same video. Better would have been to show each separately, so that once user is done observing one example at their own pace, they can proceed to the next.

As a workaround, I had to open the video (just the video) in a new tab, pause once an example came up, read the text, then rewind to the start of the animation to see the old animation example, then rewind again, then see the new animation example, and then sometimes rewind again if I wanted to see the animation again. Then, once done with the example, I had to forward to the next example and repeat the above process again.

Somewhere along that process, they lost me.

jeffbee4mo ago

Relatedly, Gemini chat seems to be if not down then extremely slow.

ETA: They apparently wiped out everyone's chats (including mine). "Our engineering team has identified a background process that was causing the missing user conversation metadata and has successfully stopped the process to prevent further impact." El Mao.

ponyous4mo ago

Ran a bunch of 3D Modeling benchmarks on Gemini 3.1 vs Gemini 3.

Unsurprisingly 3.1 performs a bit better. But surprisingly it costs 2.6x as much ($0.14 vs. $0.37 per 3D Model Generation) and is 2.5x slower (1m 24s vs. 3m 28s).

To me it feels like "lets increase our thinking budget and call it an improved model!"

rahulroy4mo ago

In the meantime, I'm trying to update Antigravity to use the latest version, but it just wouldn't update itself, nor would it let me use 3.0 model. I restarted multiple times with the same result.

I tried telling this to agent, and it keeps repeating the same phrase "Gemini 3.1 Pro is not available on this version. Please upgrade to the latest version."

Congratulations on beating the benchmarks, but I wonder how much effort is devoted on improving DX?

Edit: It's updated now, I can confirm with "There are currently no updates available.". It still doesn't let me continue with the conversation. I'm able to create new session though.

markerbrod4mo ago

Blogpost: https://blog.google/innovation-and-ai/models-and-research/ge...

vinhnx4mo ago

Model card https://deepmind.google/models/model-cards/gemini-3-1-pro/

brap4mo ago

I had it coding autonomously for about an hour (including lots of tool wait time) on a difficult task, and it actually produced good results.

What’s most surprising is that I had it follow a strict loop/workflow and it did that perfectly. Normally these things go off the rails after a while with complex workflows. It’s something I have to usually enforce with some orchestration script and multiple agents, but this time it was just one session meticulously following orders.

Impressive, and saves a lot of time on building the orchestration glue.

conception4mo ago

My current AI test. There was a BBS I was on in the 90s and there was this door game I hadn't seen anywhere else. I simply describe the BBS, where it was popular, its name, the year it was around, and the BBS game and a description of it mechanics, etc.

OpenAI and Google's Deep Research produce a very long, 100% made up report. If I question the AI on the report, they both admit they just made it up.

Claude just returns, "I couldn't find anything on the BBS or the game."

cmrdporcupine4mo ago

Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.

metavolvelabs4mo ago

They crushed it with Gemini 3.1... especially when in Thinking Mode with Deep Think initiated. If you are working towards something with code, research etc. and hit a snag, run it by Gemini with these settings. Here's another KILLER trick: In Gemini Thinking mode select Nano Banana and have it put together a comprehensive slide with paragraph length text portions. It'll nail it.

ChrisArchitect4mo ago

More discussion: https://news.ycombinator.com/item?id=47075318

0xcb04mo ago

I'm trying to find the information, is this available on the Gemini CLI script, or is this just the web front-end where I can use this new model?

pRusya4mo ago

I'm using gemini.google.com/app with AI Pro subscription. "Something went wrong" in FF, works in Chrome.

Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.

> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.

panarchy4mo ago

I had it make a simple HTML/JS canvas game (think flappy bird) and while it did some things mildly better (and others noticeably worse) it still fell into the exact same traps as earlier models. It also had a lot of issues generating valid JS at parts and asking it what the code should be just made it endlessly generate the same exact incorrect code.

datakazkn4mo ago

One underappreciated reason for the agentic gap: Gemini tends to over-explain its reasoning mid-tool-call in a way that breaks structured output expectations. Claude and GPT-4o have both gotten better at treating tool calls as first-class operations. Gemini still feels like it's narrating its way through them rather than just executing.

1 more reply

rishabhaiover4mo ago

I think we're past the point where benchmarks hold real value. All models are above a certain threshold of intelligence but Gemini somehow borrows the worst of both worlds. It's neither good with long-horizon coding tasks nor does it offer a likable personality (like Claude which is much more beloved)

azuanrb4mo ago

The CLI needs work, or they should officially allow third-party harnesses. Right now, the CLI experience is noticeably behind other SOTA models. It actually works much better when paired with Opencode.

But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.

tskulbru4mo ago

Off-topic but, what are people using to create those video animations seen in the "ISS orbit tracking dashboard" example? Looks pretty nice! Im guessing Google uses a whole building of UX people but ive seen similar videos from small indie startups too, or even 1 person SaaS.

MASNeo4mo ago

At risk to be unpopular Gemini 3.0 Pro made a huge difference for me when I moved some workflow to Antigravity, especially compared to ChatGPT.

The latest update? I simply don’t care. I am not paid to evaluate models, I am paid to build. Not sure 4 benchmark points are making the difference.

6d6b734mo ago

In these discussions we see some people hating the models, while others love them. What I find interesting is that this is exactly how we feel about other people - some people will love working with you while others can't stand being in the same room you're in.

barfingclouds4mo ago

I’m no tech expert like a lot of people here, but I find Gemini 3.0 insanely good for my regular daily questions. Hoping this one is great too. I’m kind of at the point where many answers are essentially perfect and I don’t know if I need much more

clhodapp4mo ago

There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...

hsaliak4mo ago

The eventual nerfing gives me pause. Flash is awesome. What we really want is gemini-3.1-flash :)

d4rkp4ttern4mo ago

Yes people are too fixated on just the model. The real question for coding use cases is - does Gemini X + Gemini CLI outperform Opus + Claude Code? With 3.0 the answer was no. I won’t waste time checking 3.1 until I hear otherwise.

ChrisArchitect4mo ago

Blog post: https://blog.google/innovation-and-ai/models-and-research/ge...

n4pw01f4mo ago

I created a nice harness and visual workflow builder for my Gemini agent chains, works very well. I did this so it would create code the way I do, that is very editable.

In contrast, the vs code plugin was pretty bad, and did crazy things like mix languages

attentive4mo ago

A lot of gemini bashing. But flash 3.0 with opencode is reasonably good and reliable coder.

I'd rate it between haiku 4.5 (also pretty good for a price) and sonnet. Closer to sonnet.

Sure, if I am not cost-sensitive I'd run everything in opus 4.6 but alas.

robviren4mo ago

I have run into a surprising number of basic syntax errors on this one. At least in the few runs I have tried it's a swing and a miss. Wonder if the pressure of the Claude release is pushing these stop gap releases.

mrcwinn4mo ago

It's fascinating to watch this community react to positively to Google model releases and so negatively toward OpenAI's. You all do understand that an ad revenue model is exactly where Google will go, right?

3 more replies

syspec4mo ago

Does anyone know if this is in GA immediately or if it is in preview?

On our end, Gemini 3.0 Preview was very flakey (not model quality, but as in the API responses sometimes errored out), making it unreliable.

Does this mean that 3.0 is now GA at least?

0x1101111014mo ago

Relevant: Scanned diaries from 1945 of USFS Ranger. Had this transcribed in Claude.

[1]:https://news.ycombinator.com/item?id=47041836

Drblessing4mo ago

Gemini is the smartest model currently available. It is the only model out of the big ones that correcly identifies the specific versions of superhers in a collage I tested them with.

siliconc0w4mo ago

Google has a hugely valuable dataset of changes from decades of changes from top tier software engineers but it's so proprietary they can't use it to train their external models.

Grisu_FTP4mo ago

Somehow the models apparently get better and better every week, but every time i try to use them they get worse.

Am I the issue? Am i just misremembering the early times because it was a new thing?

1 more reply

holografix4mo ago

I think it begs the question:

Is Gemini meant to be be a revenue making product or strictly a cost centre to defend against Search and Ads erosion by OpenAI?

Why does the Gemini web app not support MCP Servers?

__jl__4mo ago

Another preview release. Does that mean the recommended model by Google for production is 2.5 Flash and Pro? Not talking about what people are actually doing but the google recommendation. Kind of crazy if that is the case

Jirach054mo ago

Can anyone explain why these models decrease in performance on this "MCRC v2 (8-needle)" long context benchmark when thinking is turned on?

alwinaugustin4mo ago

I use gemini if i need to write something in my native language- Malayalam or translation. it works very well in writing in Indian regional languages.

SrFil4mo ago

For me, Gemini has been by far the best model for document understanding tasks. I look forward to seeing how much more capable this version is.

ismailmaj4mo ago

3.1 feels to me like 3.0 but that takes a long time to think, it didn't feel like a leap in raw intelligence like 2.5 pro was.

eric153423354mo ago

My first impression is that the model sounds slightly more human and a little more praising. Still comparing the ability.

getcrunk4mo ago

Gemini is so stubborn, and often doesn’t follow explicit and simple instructions. So annoying

atleastoptimal4mo ago

Writing style wise, 3.1 seems very verbose, but somehow less creative compared to 3.

1024core4mo ago

It's been hugged to death. I keep getting "Something went wrong".

yuvalmer4mo ago

Gemini 3.0 Pro is bad model for its class. I really hope 3.1 is a leap forward.

msavara4mo ago

Somehow doesn't work for me :) "An internal error has occurred"

andrewstuart4mo ago

Gemini current version drops most of the code every time I try to use it.

Useless.

LZ_Khan4mo ago

biggest problem is that it's slow. also safety seems overtuned at the moment. getting some really silly refusals. everything else is pretty good.

trilogic4mo ago

Humanity last exam 44%, Scicode 59, and that 80, and this 78 but not 100% ever.

Would be nice to see that this models, Plus, Pro, Super, God mode can do 1 Bench 100%. I am missing smth here?

kuprel4mo ago

Why don't they show Grok benchmarks?

1 more reply

jdthedisciple4mo ago

Why should I be excited?

BMFXX4mo ago

Just wish iI could get 2.5 daily limit above 1000 requests easily. Driving me insane...

hn_throw20254mo ago

Yeah great, now can I have my pinned chats back please?

https://www.google.com/appsstatus/dashboard/incidents/nK23Zs...

himata41134mo ago

The visual capabilities of this model are frankly kind of ridicioulus what the hell.

lysecret4mo ago

Please I need 3 in ga…

leecommamichael4mo ago

Whoa, I think Gemini 3 Pro was a disappointment, but Gemini 3.1 Pro is definitely the future!

pickle-pixel4mo ago

does it still crash out after couple prompts?

taytus4mo ago

Another preview model? Why google keep doing this?

solarisos4mo ago

The speed of these 3.1 and Preview releases is starting to feel like the early days of web frameworks. It’s becoming less about the raw benchmarks and more about which model handles long-context 'hallucination' well enough to be actually used in a production pipeline without constant babysitting.

j / k navigate · click thread line to collapse

914 comments

282 comments · 127 top-level

simonw4mo ago· 19 in thread

Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - took over 5 minutes though, but I think that's because they're having performance teething problems on launch day.

embedding-shape4mo ago

11 more replies

jasonjmcghee4mo ago

What's crazy is you've influenced them to spend real effort ensuring their model is good at generating animated svgs of animals operating vehicles.

The most absurd benchmaxxing.

https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...

7 more replies

MrCheeze4mo ago

5 more replies

sam_14214mo ago

Models are soon going to start benchmaxxing generating SVGs of pelicans on bikes

4 more replies

SoKamil4mo ago

It seems they trained the model to output good svg’s.

In their blog post[1], first use case they mention is svg generation. Thus, it might not be any indicator at all anymore.

[1] https://blog.google/innovation-and-ai/models-and-research/ge...

Arcuru4mo ago

Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/

1 more reply

WarmWash4mo ago

Less pretty and more practical, it's really good at outputting circuit designs as SVG schematics.

https://www.svgviewer.dev/s/dEdbH8Sw

2 more replies

AmazingTurtle4mo ago

At this point, the pelican benchmark became so widely used that there must be high quality pelicans in the dataset, I presume. What about generating an okapi on a bicycle instead?

2 more replies

steve_adams_864mo ago

Ugh, the gears and chain don't mesh and there's no sprocket on the rear hub

3 more replies

bredren4mo ago

What is that, a snack in the basket?

3 more replies

TZubiri4mo ago

You think they are able to see their output and iterate on it? Or is it pure token generation?

benatkin4mo ago

I used the AI studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output

tarr114mo ago

What do you think this particular prompt is evaluating for?

The more popular these particular evals are, the more likely the model will be trained for them.

1 more reply

saberience4mo ago

I hope we keep beating this dead horse some more, I'm still not tired of it.

calny4mo ago

Great pelican but what’s up with that fish in the basket?

3 more replies

infthi4mo ago

Wonder when will we get something other than a side view

2 more replies

DonHopkins4mo ago

How about STL files for 3d printing pelicans!

1 more reply

mohsen14mo ago

is there something in your prompt about hats? why the pelican always wearing a hat recently?!

1 more reply

xnx4mo ago

Not even animated? This is 2026.

1 more reply

quacky_batak4mo ago· 13 in thread

I’m keen to know how and where are you using Gemini.

jdc05894mo ago

I use it as my main platform right now both for work/swe stuff, and person stuff. It works pretty well, they have the full suite of tools I want from general LLM chat, to notebookLM, to antigravity.

Also I got a year of pro for free with my phone....so thats a big part.

ggregoire4mo ago

hunta20974mo ago

I use the Gemini web interface just as I would ChatGPT. They also have coding environment analogues of Claude-Code in Anti-gravity and Gemini-CLI.

When you sign up for the pro tier you also get 2TB of storage, Gemini for workspace and Nest Camera history.

If you're in the Google sphere it offers good value for money.

dinosor4mo ago

minimaxir4mo ago

Gemini has an obvious edge over its competitors in one specific area: Google Search. The other LLMs do have a Web Search tool but none of them are as effective.

fatherwavelet4mo ago

I feel like Gemini 3 was incredible on non-software/coding research. I have learned so much systems biology the last two months it blows my mind.

I have a deep research going right now on 3.1 for the first time and I honestly have no idea how I am going to tell if it is better than 3.

It seems like agentic coding Gemini wasn't as good but just asking it to write a function, I think it only didn't one shot what I asked it twice. Then fixed the problem on the next prompt.

I haven't logged in to bother with chatGPT in about 3 months now.

dekhn4mo ago

In short, I consider Gemini to be a highly capable intern (grad student level) who is smarter and more tenacious than me, but also needs significant guidance to reach a useful goal.

I used Gemini to completely replace the software stack I wrote for my self-built microscope. That includes:

1 more reply

jug4mo ago

1 more reply

mehagar4mo ago

I use Gemini for personal stuff such as travel planning and research on how to fix something, which product to buy, etc. My company has as Pro subscription so I use that instead of ChatGPT.

epolanski4mo ago

Various friends of mine work in non-technology companies (banking, industries, legal, Italy) and in pretty much all of them there's Gemini enterprise + NotebookLM.

In all of them the approach is: this is the solution, now find problems you can apply it to.

thornewolf4mo ago

I have swapped to using gemini over chatgpt for casual conversation and question answering. there are some lacking features in the app but i get faster and more intelligent responses.

esafak4mo ago

I'd use it for planning, knowledge, and anything visual.

verdverm4mo ago

disclaimer: long time g-stan, not so stan any more

spankalee4mo ago· 9 in thread

I hope this works better than 3.0 Pro

I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.

It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.

So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.

karmasimida4mo ago

Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

8 more replies

s3p4mo ago

Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane. "I'm diving in to the problem", "I'm fully immersed" or "I'm meticulously crafting the answer"

4 more replies

Oras4mo ago

Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.

stephen_cagle4mo ago

tldr; It is great at search, not so much action.

3 more replies

knollimar4mo ago

Is the thinking token stream obfuscated?

Im fully immersed

1 more reply

slopinthebag4mo ago

Hmm, interesting..

1 more reply

agentifysh4mo ago

Relieved to read this from an ex-Googler at least we are no the crazy ones we are made out to be whenever we point out issues with Gemini

jbellis4mo ago

yeah, g3p is as smart or smarter as the other flagships but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.

1 more reply

varispeed4mo ago

> stuck in loops

I wonder if there is some form of cheating. Many times I found that after a while Gemini becomes like a Markov chain spouting nonsense on repeat suddenly and doesn't react to user input anymore.

1 more reply

esafak4mo ago· 9 in thread

emp173444mo ago

2 more replies

redox994mo ago

I don't think there's much recursive improvement yet.

I'd say it's a combination of

B) There's more compute online

C) Competition is more fierce.

m_ke4mo ago

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

ankit2194mo ago

oliveiracwb4mo ago

PlatoIsADisease4mo ago

Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.

1 more reply

nikcub4mo ago

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

2 more replies

gmerc4mo ago

That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.

toephu24mo ago

This is what competition looks like.

1024core4mo ago· 8 in thread

It got the car wash question perfectly:

You are definitely going to have to drive it there—unless you want to put it in neutral and push!

Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.

Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

1024core4mo ago

Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:

> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?

Gemini said:

That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.

Here is the straightforward reality of the situation:

    The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.

    The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.

Grab some elephant-safe soap and head down to the water together!

Would you like some tips on the best methods and tools for scrubbing down an elephant?

4 more replies

embedding-shape4mo ago

GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.

> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?

2 more replies

steve19774mo ago

The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?

2 more replies

upmind4mo ago

The answer here is why I dislike Gemini, though it gets the correct answer, it's far too verbose.

2 more replies

leumon4mo ago

Gemini 3 pro and flash already answered this correctly.

patrickmcnamara4mo ago

Truly we entering the era of AGI.

suddenlybananas4mo ago

They probably had time to toss that example in the training soup.

1 more reply

buttered_toast4mo ago

I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.

1 more reply

Robdel124mo ago· 8 in thread

I really want to use google’s models but they have the classic Google product problem that we all like to complain about.

With alllll that said.. is Gemini 3.1 more agentic now? That’s usually where it failed. Very smart and capable models, but hard to apply them? Just me?

surgical_fire4mo ago

May be very silly of me, but I avoid using Gemini on my personal Google account. I use it at work, because my employer provides it.

I mostly use LLMs as coding assistant, learning assistant, and general queries (e.g.: It helped me set up a server for self hosting), so nothing weird.

2 more replies

alpineman4mo ago

1 more reply

phamilton4mo ago

It sounds like there was at least a deliberate attempt to improve it.

pdntspa4mo ago

You can delete the billing from a given API key

Stevvo4mo ago

You could always use it through Copilot. The credits based billing is pretty simple without surprise charges.

horsawlarway4mo ago

So much this.

It's absolutely amazing how hostile Google is to releasing billing options that are reasonable, controllable, or even fucking understandable.

I want to do relatively simple things like:

1. Buy shit from you

2. For a controllable amount (ex - let me pick a limit on costs)

So I won't buy shit from them. Period.

1 more reply

abiraja4mo ago

I've been using it lately with OpenCode and it's working pretty well (except for API reliability issues).

himata41134mo ago

use openrouter instead

1 more reply

dxbednarczyk4mo ago· 7 in thread

augusto-moura4mo ago

gavinray4mo ago

I have the opposite viewpoint:

If a model doesn't optimize the formatting of its output display for readability, I don't want to read it.

Tables, embedded images, use of bulleted lists and bold/italicizing etc.

markab214mo ago

1 more reply

staticman24mo ago

I'm not familiar with Openclaw and but the trick to solve this would be to embed a style reminder at the bottom of each user message and ideally hide that from the user with the UI.

This is how roleplay apps like Sillytavern customize the experience for power users by allowing hidden style reminders as part of the user message that accompany each chat message.

InkCanon4mo ago

1 more reply

losvedir4mo ago

It definitely has the worst "voice" in my opinion. Feels very overachieving McKinsey intern to me.

verdverm4mo ago

I have no issues adjusting gemini tone & style with system prompt content

minimaxir4mo ago· 5 in thread

Price is unchanged from Gemini 3 Pro: $2/M input, $12/M output. https://ai.google.dev/gemini-api/docs/pricing

Knowledge cutoff is unchanged at Jan 2025. Gemini 3.1 Pro supports "medium" thinking where Gemini 3 did not: https://ai.google.dev/gemini-api/docs/gemini-3

Compare to Opus 4.6's $5/M input, $25/M output. If Gemini 3.1 Pro does indeed have similar performance, the price difference is notable.

rancar24mo ago

jbellis4mo ago

still no minimal reasoning in G3.1P :(

(this is why Opus 4.6 is worth the price -- turning off thinking makes it 3x-5x faster but it loses only a small amount of intelligence. nobody else has figured that out yet)

2 more replies

oblio4mo ago

> Knowledge cutoff is unchanged at Jan 2025.

Isn't that a bit old?

1 more reply

agentifysh4mo ago

Looks like its cheaper than codex ??? this might be interesting then

1 more reply

plaidfuji4mo ago

Sounds like the update is mostly system prompt + changes to orchestration / tool use around the core model, if the knowledge cutoff is unchanged

2 more replies

xrd4mo ago· 5 in thread

These models are so powerful.

It's totally possible to build entire software products in the fraction of the time it took before.

But, reading the comments here, the behaviors from one version to another point version (not major version mind you) seem very divergent.

It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.

I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out with subtle different behavioral characteristics without notice. It'll be quicksand.

simonw4mo ago

This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.

4 more replies

jama2114mo ago

seizethecheese4mo ago

> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.

In my experience it’s more like idiot savant engineers. Still remarkable.

1 more reply

worldsavior4mo ago

Sushy dinner? What are you building with AI, a calculator?

WarmWash4mo ago

I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.

Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).

davidguetta4mo ago· 5 in thread

So google doesn't use NVIDIA GPUs at all ?

dekhn4mo ago

rjh294mo ago

1 more reply

PunchTornado4mo ago

no. only tpus

paride57454mo ago

Another reason to use Gemini then.

Less impact on gamers…

1 more reply

lejalv4mo ago

Bla bla bla yada sustainability yada often come with large better growing faster...

It's such an uninformative piece of marketing crap

mijoharas4mo ago· 4 in thread

Gemini 3 is still in preview (limited rate limits) and 2.5 is deprecated (still live but won't be for long).[0]

Are Google planning to put any of their models into production any time soon?

Also somewhat funny that some models are deprecated without a suggested alternative(gemini-2.5-flash-lite). Do they suggest people switch to Claude?

[0] https://ai.google.dev/gemini-api/docs/deprecations

andrewmutz4mo ago

2 more replies

NitpickLawyer4mo ago

You are reading your link wrong. They are deprecating 2.5-preview models. 2.5 (including lite) are up till at least sept/oct 26.

1 more reply

vidarh4mo ago

This feels very Google

1 more reply

jakejohnson4mo ago

I haven't seen any deprecation notices for 2.5 yet, just for 2. I'd expect (and hope) the deprecation timeline for 2.5 is longer since 3.0 is still in preview. Maybe they just default to 1 year here?

1 more reply

nickandbro4mo ago· 4 in thread

Does well on SVGs outside of "pelican riding on a bicycle" test. Like this prompt:

"create a svg of a unicorn playing xbox"

https://www.svgviewer.dev/s/NeKACuHj

Still some tweaks to the final result, but I am guessing with the ARC-AGI benchmark jumping so much, the model's visual abilities are allowing it to do this well.

simonw4mo ago

Interesting how it went a bit more 3D with the style of that one compared to the pelican I got.

andy12_4mo ago

2 more replies

roryirvine4mo ago

On the other hand, creation of other vector image formats (eg. "create a postscript file showing a walrus brushing its teeth") hasn't improved nearly so much.

Perhaps they're deliberately optimising for SVG generation.

mclau1534mo ago

can we move on from SVG to 3D models at some point?

2 more replies

maxloh4mo ago· 4 in thread

I hope the 3.1 release includes a much larger output limit.

NoahZuniga4mo ago

Output limit has consistently been 64k tokens (including 2.5 pro).

esafak4mo ago

People did find Gemini very talkative so it might be a response to that.

MallocVoidstarOP4mo ago

> Even when the model is explicitly instructed to pause due to insufficient tokens rather than generating an incomplete response

AI models can't do this. At least not with just an instruction, maybe if you're writing some kind of custom 'agentic' setup.

1 more reply

jayd164mo ago

> Even when the model is explicitly instructed to pause due to insufficient tokens

Is there actually a chance it has the introspection to do anything with this request?

3 more replies

the_duke4mo ago· 4 in thread

Gemini 3 is pretty good, even Flash is very smart for certain things, and fast!

I hope they manage to improve things on that front, because then Flash would be great for many tasks.

verdverm4mo ago

These improvements are one of the things specifically called out on the submitted page

chermi4mo ago

You can really notice the tool use problems. They gotta get on that. The agent trend seems real, and powerful. They can't afford to fall behind on it.

2 more replies

spwa44mo ago

In other words: they just need to motivate their employees while giving in to finance's demands to fire a few thousand every month or so ...

anthonypasq4mo ago

saberience4mo ago· 4 in thread

I always try Gemini models when they get updated with their flashy new benchmark scores, but always end up using Claude and Codex again...

I get the impression that Google is focusing on benchmarks but without assessing whether the models are actually improving in practical use-cases.

I.e. they are benchmaxing

Gemini is "in theory" smart, but in practice is much, much worse than Claude and Codex.

user342834mo ago

I exclusively use Gemini for Chat nowadays, and it's been great mostly. It's fast, it's good, and the app works reliably now. On top of that I got it for free with my Pixel phone.

And all of it is essentially free for now. I can even select Opus 4.6 in Antigravity, but I did not yet give it a try.

konart4mo ago

> but without assessing whether the models are actually improving in practical use-cases

Which cases? Not trying to sound bad but you didn't even provide of cases you are using Claude\Codex\Gemini for.

cmrdporcupine4mo ago

Honestly doesn't feel like Google is targeting the agentic coding crowd so much as they are the knowledge worker / researcher / search-engine-replacement market?

Agree Gemini as a model is fairly incompetent inside their own CLI tool as well as in opencode. But I find it useful as a research and document analysis tool.

1 more reply

skerit4mo ago

I'm glad someone else is finally saying this, I've been mentioning this left and right and sometimes I feel like I'm going crazy that not more people are noticing it.

Gemini can go off the rails SUPER easily. It just devolves into a gigantic mess at the smallest sign of trouble.

And you would think this is just because the bigger the context grows, the worse models tend to get. But no! This can happen well below even the 200.000 token mark.

1 more reply

Topfi4mo ago· 4 in thread

GrayShade4mo ago

Maybe that's the only API-visible change, saying nothing about the actual capabilities of the model?

xnx4mo ago

> increasing the number for such a minor change is not a move in the right direction

A .1 model number increase seems reasonable for more than doubling ARC-AGI 2 score and increasing so many other benchmarks.

What would you have named it?

1 more reply

argsnd4mo ago

I disagree. Incrementing the minor number makes so much more sense than “gemini-3-pro-preview-1902” or something.

jannyfer4mo ago

According to the blog post, it should be also great at drawing pelicans riding a bicycle.

tenpoundhammer4mo ago· 3 in thread

In an attempt to get outside of benchmark gaming I had it make Platypus on a Tricycle. It's not as good as pelican on bicycle. https://www.svgviewer.dev/s/BiRht5hX

textlapse4mo ago

To really confuse it, ask it to take that tricycle with the platypus on it to a car wash.

dinosor4mo ago

0_____04mo ago

that's better than i thought it would be

1 more reply

mark_l_watson4mo ago· 3 in thread

Off topic, but I like to run small models on my own hardware, and some small models are now very good for tool use and with agentic libraries - it just takes a little more work to get good results.

throwaway20274mo ago

Seconded. Gemini used to be trash and I used Claude and Codex a lot but gemini-3-flash-preview punches above it's weight, it's decent and I rarely if ever run into any token limit either.

1 more reply

PlatoIsADisease4mo ago

What models are you running locally? Just curious.

I am mostly restricted to 7-9B. I still like ancient early llama because its pretty unrestricted without having to use an abliteration.

1 more reply

nurettin4mo ago

I like to ask claude how to prompt smaller models for the given task. With one prompt it was able to make a low quantized model call multiple functions via json.

makeavish4mo ago· 3 in thread

Great model until it gets nerfed. I wish they had a higher paid tier to use non nerfed model.

Mond_4mo ago

Bad news, John Google told me they already quantized it immediately after the benchmarks were done and it sucks now.

I miss when Gemini 3.1 was good. :(

spyckie24mo ago

I think there is a pattern it will always be nerfed the few weeks before launching a new model. Probably because they are throwing a bunch of compute at the new model.

1 more reply

xnx4mo ago

What are you talking about?

sigmar4mo ago· 2 in thread

blog post is up- https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

ripbozo4mo ago

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though i dont know what arc-agi-2 actually tests

4 more replies

sho_hn4mo ago

The touted SVG improvements make me excited for animated pelicans.

6 more replies

pawelduda4mo ago· 2 in thread

It's safe to assume they'll be releasing improved Gemini Flash soon? The current one is so good & fast I rarely switch to pro anymore

tucnak4mo ago

derac4mo ago

When 3 came out they mentioned that flash included many improvements that didn't make it into pro (via an hn comment). I imagine this release includes those.

zokier4mo ago· 2 in thread

So this is same but not same as Gemini 3 Deep Think? Keeping track of these different releases is getting pretty ridiculous.

WarmWash4mo ago

Deep Think is a few 3.1 models working together. It was suspected last week that Deep Think was composed using the new 3.1 model.

verdverm4mo ago

3.1 == model

deep think == turning up thinking knob (I think)

deep research == agent w/ search

mixel4mo ago· 2 in thread

rubslopes4mo ago

My sentiment is actually the opposite of yours: how is Google *not* winning this race?

1 more reply

eknkc4mo ago

I hope they fail.

I honestly do not wish Google to have the best model out there and be forced to use their incomprehensible subscription / billing / project management whatever shit ever again.

I don’t know what their stuff cost. I don’t know why would I use vertex or ai studio. What is included in my subscription what is billed per use.

I pray that whatever they build fails and burns.

4 more replies

denysvitali4mo ago· 2 in thread

Where is Simon's pelican?

codethief4mo ago

Not Simon's but here is one: https://news.ycombinator.com/item?id=47075709

1 more reply

saberience4mo ago

Please no, let's not.

jeffybefffy5194mo ago· 2 in thread

Someone needs to make an actual good benchmark for LLM's that matches real world expectations, theres more to benchmarks than accuracy against a dataset.

casey24mo ago

robotpepi4mo ago

this reminds me of that joke of someone saying "it's crazy that we have ten different standards for doing this", and then there're 11 standards

1 more reply

WarmWash4mo ago· 1 in thread

So maybe I'll give it a 90%?

This is without tools as well.

merlindru4mo ago

your question may have become part of the training data with how much coverage there was around it. perhaps you should devise a new test :P

5 more replies

zhyder4mo ago· 1 in thread

Surprisingly big jump in ARC-AGI-2 from 31% to 77%, guess there's some RLHF focused on the benchmark given it was previously far behind the competition and is now ahead.

Apart from that, the usual predictable gains in coding. Still is a great sweet-spot for performance, speed and cost. Need to hack Claude Code to use their agentic logic+prompts but use Gemini models.

WarmWash4mo ago

>I wish Google also updated Flash-lite to 3.0+

I hope every day that they have made gains on their diffusion model. As a sub agent it would be insane, as it's compute light and cranks 1000+ tk/s

1 more reply

XCSme4mo ago· 1 in thread

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

XCSme4mo ago

Now I need to write more tests.

qingcharles4mo ago· 1 in thread

I've been playing with the 3.1 Deep Think version of this for the last couple of weeks and it was a big step up for coding over 3.0 (which I already found very good).

It's only February...

nubg4mo ago

> I've been playing with the 3.1 Deep Think version of this

How?

1 more reply

janalsncm4mo ago· 1 in thread

This kind of test is good because it requires stitching together info from the whole video.

aabhay4mo ago

It reads videos at 1fps by default. You have to set the video resolution to high in ai studio

1 more reply

sergiotapia4mo ago· 1 in thread

To use in OpenCode, you can update the models it has:

    opencode models --refresh

Then /models and choose Gemini 3.1 Pro

You can use the model through OpenCode Zen right away and avoid that Google UI craziness.

---

It is quite pricey! Good speed and nailed all my tasks so far. For example:

    @app-api/app/controllers/api/availability_controller.rb 
    @.claude/skills/healthie/SKILL.md 

    Find Alex's id, and add him to the block list, leave a comment 
    that he has churned and left the company. we can't disable him 
    properly on the Healthie EMR for now so 
    this dumb block will be added as a quick fix.

Result was:

    29,392 tokens
    $0.27 spent

So relatively small task, hitting an API, using one of my skills, but a quarter. Pricey!

gbalduzzi4mo ago

I don't see it even after refresh. Are you using the opencode-gemini-auth plugin as well?

1 more reply

PunchTornado4mo ago· 1 in thread

The biggest increase is LiveCodeBench Pro: 2887. The rest are in line with Opus 4.6 or slightly better or slightly worse.

shmoogy4mo ago

but is it still terrible at tool calls in actual agentic flows?

josalhor4mo ago· 1 in thread

I speculated that 3 pro was 3.1... I guess I was wrong. Super impressive numbers here. Good job Google.

refulgentis4mo ago

> I speculated that 3 pro was 3.1

1 more reply

dude2507114mo ago· 1 in thread

I hereby allow you to release models not at the same time as your competitors.

sigmar4mo ago

It is super interesting that this is the same thing that happened in November (ie all labs shipping around the same week 11/12-11/23).

1 more reply

impulser_4mo ago· 1 in thread

whynotminot4mo ago

Which made the Gemini models untrustworthy for anything remotely serious, at least in my eyes. If they’ve fixed this or at least significantly improved, that would be a big deal.

1 more reply

Murfalo4mo ago· 1 in thread

alpineman4mo ago

More like half of Google's AI team is hanging out on HN, and they can optimise for that outcome to get a good rep among the dev community.

2 more replies

onlyrealcuzzo4mo ago· 1 in thread

We've gone from yearly releases to quarterly releases.

If the pace of releases continues to accelerate - by mid 2027 or 2028 we're headed to weekly releases.

rubicon334mo ago

But actual progress seems to be slower. These modes are releasing more often but aren’t big leaps.

3 more replies

seizethecheese4mo ago· 1 in thread

I use Gemini flash lite in a side project, and it’s stuck on 2.5. It’s now well behind schedule. Any speculation as to what’s going on?

foruhar4mo ago

johnwheeler4mo ago· 1 in thread

I know Google has anti-gravity but do they have anything like Claude code as far as user interface terminal basically TUI?

alooPotato4mo ago

https://github.com/google-gemini/gemini-cli

1 more reply

nautilus124mo ago· 1 in thread

Ok, why don't you work on getting 3.0 out of preview first? 10 min response time is pretty heinous

mucai824mo ago

I agree, according to Googles terms you are not allowed to use the preview model for production use cases. And 3.0 has been in preview for a loooong time now :(

matrix25964mo ago· 1 in thread

Gemini 3.1 Pro is based on Gemini 3 Pro

skerit4mo ago

Lol, and this line:

> Geminin 3.1 Pro can comprehend vast datasets

Someone was in a hurry to get this out the door.

naiv4mo ago· 1 in thread

ok , so they are scared that 5.3 (pro) will be released today/tomorrow and blow it out of the water and rushed it while they could still reference 5.2 benchmarks.

PunchTornado4mo ago

I don't think models blow other models anymore. We have the big 3 which are neck to neck in most benchmarks and the rest. I doubt that 5.3 will blow the others.

1 more reply

mustaphah4mo ago· 1 in thread

Google is terrible at marketing, but this feels like a big step forward.

---

[1] https://www.tbench.ai/leaderboard/terminal-bench/2.0?agents=...

[2] https://artificialanalysis.ai

saberience4mo ago

Benchmarks aren't everything.

Gemini consistently has the best benchmarks but the worst actual real-world results.

They are good at research and benchmaxxing, but the day to day usage of the products and tools is horrible.

Try using Google Antigravity and you will not make it an hour before switching back to Codex or Claude Code, it's so incredibly shitty.

2 more replies

makeavish4mo ago· 1 in thread

I hope to have great next two weeks before it gets nerfed.

unsupp0rted4mo ago

I've found Google (at least in AI Studio) are the only provider NOT to nerf their models after a few weeks

2 more replies

throwaw124mo ago· 1 in thread

Can we switch from Claude Code to Google yet?

Benchmarks are saying: just try

But real world could be different

foruhar4mo ago

I think a bunch of these harnesses are open source so it surprises me that there can be such a gulf between them.

2 more replies

techgnosis4mo ago· 1 in thread

I'd love a new Gemini agent that isn't written with Node.js. Not sure why they think that's a good distribution model.

CamperBob24mo ago

(Shrug) Ask it to write one!

jcims4mo ago· 1 in thread

Pelican on a bicycle in drawio - https://imgur.com/a/tNgITTR

(FWIW I'm finding a lot of utility in LLMs doing diagrams in tools like drawio)

pqdbr4mo ago

How are you prompting it to draw diagrams in drawio

3 more replies

sdeiley4mo ago

People underrate Google's cost effectiveness so much. Half price of Opus. HALF.

Think about ANY other product and what you'd expect from the competition thats half the price. Yet people here act like Gemini is dead weight

____

Update:

3.1 was 40% of the cost to run AA index vs Opus Thinking AND SONNET, beat Opus, and still 30% faster for output speed.

https://artificialanalysis.ai/?speed=intelligence-vs-speed&m...

22 more replies

sheepscreek4mo ago

If it’s any consolation, it was able to one-shot a UI & data sync race condition that even Opus 4.6 struggled to fix (across 3 attempts).

So far I like how it’s less verbose than its predecessor. Seems to get to the point quicker too.

While it gives me hope, I am going to play it by the ear. Otherwise it’s going to be - Gemini for world knowledge/general intelligence/R&D and Opus/Sonnet 4.6 to finish it off.

UPDATE: I may have spoken too soon.

  > Fixing Truncated Array Syncing Bug
  > I traced the missing array items to a typo I made earlier! 
  > When fixing the GC cast crash, I accidentally deleted the assignment..
  > ..effectively truncating the entire array behind it.

These errors should not be happening! They are not the result of missing knowledge or a bad hunch. They are coming from an incorrect find/replace, which makes them completely avoidable!

On a lighter note, every time it happens, I think about this Family Guy: https://youtu.be/HtT2xdANBAY?si=QicynJdQR56S54VL&t=184

2 more replies

faebi4mo ago

Similar in antigravity. Privately it's my absolute favorite.

So I'm actually rooting for this.

1 more reply

ttul4mo ago

This tech is not going to replace us. If anything, I am becoming even more of a workaholic. But the output volume is going to pay off for those who are privileged enough to use these tools.

5 more replies

exabrial4mo ago

You know what would slay right now? A native app.

Not another piece of Electron bloatware, a regular, efficient, fast, snappy, native, app. One that connects to my MCP severs and has local filesystem tools.

Anthropic might fall behind Google/OpenAI eventually, but their Desktop App + MCP/Connectors is unbelievably useful to get real work done.

1 more reply

mbh1594mo ago

zapnuk4mo ago

Gemini 3 was:

1. unreliable in GH copilot. Lots of 500 and 4XX errors. Unusable in the first 2 months

2. not available in vertex ai (europe). We have requirements regarding data residency. Funny enough anthropic is on point with releasing their models to vertex ai. We already use opus and sonnet 4.6.

I hope google gets their stuff together and understands that not everyone wants/can use their global endpoint. We'd like to try their models.

ArmandoAP4mo ago

Model Card https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

infinitewars4mo ago

veselin4mo ago

I am actually going to complain about this: that neither of the Gemini models are not preview ones.

vnglst4mo ago

2 more replies

vnglst4mo ago

agentifysh4mo ago

I hope whoever is reading this from Google they can redeem Gemini CLI by focusing on being competitive instead of making it look pretty (that seems to be the impression I got from the updates on X)

1 more reply

timabdulla4mo ago

Google tends to trumpet preview models that aren't actually production-grade. For instance, both 3 Pro and Flash suffer from looping and tool-calling issues.

I would love for them to eliminate these issues because just touting benchmark scores isn't enough.

upmind4mo ago

thallavajhula4mo ago

Gemini is almost great. Claude Opus is great. I keep switching among these subscriptions every month to not miss out on any of the offerings for too long; ChatGPT Plus <-> Gemini Pro <-> Claude.

2 more replies

WarmWash4mo ago

It seems google is having a disjointed roll out, and there will likely be an official announcement in a few hours. Apparently 3.1 showed up unannounced in vertex at 2am or something equally odd.

Either way early user tests look promising.

carpe__diem4mo ago

One thing I’d like to see in these releases is stronger emphasis on regression behavior, not just headline capability.

In production, the costly failures are usually "almost right" edits that quietly shift semantics across large diffs.

XCSme4mo ago

EDIT: while also being 3x cheaper

dudeinhawaii4mo ago

After 2 days of giving it a go, I find that Gemini CLI is still considerably worse than both Codex and Claude Code.

The model itself also has strange behaviors that seem like it gets randomly replaced with Gemini-3-Flash or something else. I'll explain.

Another time, I asked it about the likelihood of the pending east coast winter storm of affecting my flight. I gave it all the details (flight, stops, time, cities).

I then asked Gemini 3.1 Pro via the API and it returned a response similar to Claude and GPT-5.2 -- carefully considering all factors.

This will be a quick cancel as soon as the intro rate is done.

It's like Google doesn't ACTUALLY want to be the leader in AI or serve people their best models. They want to generate hype around benchmarks and then nerf the model and go silent.

Gemini 3 Pro Preview went from exceptional in the first month to mediocre and then out of my rotation within a month.

hackrmn4mo ago

nobrains4mo ago

In the "Intelligence applied" section, where they show the comparison animations, they are shown using a non-optimal UI.

Somewhere along that process, they lost me.

jeffbee4mo ago

Relatedly, Gemini chat seems to be if not down then extremely slow.

ponyous4mo ago

Ran a bunch of 3D Modeling benchmarks on Gemini 3.1 vs Gemini 3.

Unsurprisingly 3.1 performs a bit better. But surprisingly it costs 2.6x as much ($0.14 vs. $0.37 per 3D Model Generation) and is 2.5x slower (1m 24s vs. 3m 28s).

To me it feels like "lets increase our thinking budget and call it an improved model!"

rahulroy4mo ago

In the meantime, I'm trying to update Antigravity to use the latest version, but it just wouldn't update itself, nor would it let me use 3.0 model. I restarted multiple times with the same result.

I tried telling this to agent, and it keeps repeating the same phrase "Gemini 3.1 Pro is not available on this version. Please upgrade to the latest version."

Congratulations on beating the benchmarks, but I wonder how much effort is devoted on improving DX?

Edit: It's updated now, I can confirm with "There are currently no updates available.". It still doesn't let me continue with the conversation. I'm able to create new session though.

markerbrod4mo ago

Blogpost: https://blog.google/innovation-and-ai/models-and-research/ge...

vinhnx4mo ago

Model card https://deepmind.google/models/model-cards/gemini-3-1-pro/

brap4mo ago

I had it coding autonomously for about an hour (including lots of tool wait time) on a difficult task, and it actually produced good results.

Impressive, and saves a lot of time on building the orchestration glue.

conception4mo ago

OpenAI and Google's Deep Research produce a very long, 100% made up report. If I question the AI on the report, they both admit they just made it up.

Claude just returns, "I couldn't find anything on the BBS or the game."

cmrdporcupine4mo ago

Doesn't show as available in gemini CLI for me. I have one of those "AI Pro" packages, but don't see it. Typical for Google, completely unclear how to actually use their stuff.

metavolvelabs4mo ago

ChrisArchitect4mo ago

More discussion: https://news.ycombinator.com/item?id=47075318

0xcb04mo ago

I'm trying to find the information, is this available on the Gemini CLI script, or is this just the web front-end where I can use this new model?

pRusya4mo ago

I'm using gemini.google.com/app with AI Pro subscription. "Something went wrong" in FF, works in Chrome.

Below is one of my test prompts that previous Gemini models were failing. 3.1 Pro did a decent job this time.

> use c++, sdl3. use SDL_AppInit, SDL_AppEvent, SDL_AppIterate callback functions. use SDL_main instead of the default main function. make a basic hello world app.

panarchy4mo ago

datakazkn4mo ago

1 more reply

rishabhaiover4mo ago

azuanrb4mo ago

But with accounts reportedly being banned over ToS issues, similar to Claude Code, it feels risky to rely on it in a serious workflow.

tskulbru4mo ago

MASNeo4mo ago

At risk to be unpopular Gemini 3.0 Pro made a huge difference for me when I moved some workflow to Antigravity, especially compared to ChatGPT.

The latest update? I simply don’t care. I am not paid to evaluate models, I am paid to build. Not sure 4 benchmark points are making the difference.

6d6b734mo ago

barfingclouds4mo ago

clhodapp4mo ago

There's a very short blog post up: https://blog.google/innovation-and-ai/models-and-research/ge...

hsaliak4mo ago

The eventual nerfing gives me pause. Flash is awesome. What we really want is gemini-3.1-flash :)

d4rkp4ttern4mo ago

ChrisArchitect4mo ago

Blog post: https://blog.google/innovation-and-ai/models-and-research/ge...

n4pw01f4mo ago

I created a nice harness and visual workflow builder for my Gemini agent chains, works very well. I did this so it would create code the way I do, that is very editable.

In contrast, the vs code plugin was pretty bad, and did crazy things like mix languages

attentive4mo ago

A lot of gemini bashing. But flash 3.0 with opencode is reasonably good and reliable coder.

I'd rate it between haiku 4.5 (also pretty good for a price) and sonnet. Closer to sonnet.

Sure, if I am not cost-sensitive I'd run everything in opus 4.6 but alas.

robviren4mo ago

mrcwinn4mo ago

3 more replies

syspec4mo ago

Does anyone know if this is in GA immediately or if it is in preview?

On our end, Gemini 3.0 Preview was very flakey (not model quality, but as in the API responses sometimes errored out), making it unreliable.

Does this mean that 3.0 is now GA at least?

0x1101111014mo ago

Relevant: Scanned diaries from 1945 of USFS Ranger. Had this transcribed in Claude.

[1]:https://news.ycombinator.com/item?id=47041836

Drblessing4mo ago

Gemini is the smartest model currently available. It is the only model out of the big ones that correcly identifies the specific versions of superhers in a collage I tested them with.

siliconc0w4mo ago

Google has a hugely valuable dataset of changes from decades of changes from top tier software engineers but it's so proprietary they can't use it to train their external models.

Grisu_FTP4mo ago

Somehow the models apparently get better and better every week, but every time i try to use them they get worse.

Am I the issue? Am i just misremembering the early times because it was a new thing?

1 more reply

holografix4mo ago

I think it begs the question:

Is Gemini meant to be be a revenue making product or strictly a cost centre to defend against Search and Ads erosion by OpenAI?

Why does the Gemini web app not support MCP Servers?

__jl__4mo ago

Jirach054mo ago

Can anyone explain why these models decrease in performance on this "MCRC v2 (8-needle)" long context benchmark when thinking is turned on?

alwinaugustin4mo ago

I use gemini if i need to write something in my native language- Malayalam or translation. it works very well in writing in Indian regional languages.

SrFil4mo ago

For me, Gemini has been by far the best model for document understanding tasks. I look forward to seeing how much more capable this version is.

ismailmaj4mo ago

3.1 feels to me like 3.0 but that takes a long time to think, it didn't feel like a leap in raw intelligence like 2.5 pro was.

eric153423354mo ago

My first impression is that the model sounds slightly more human and a little more praising. Still comparing the ability.

getcrunk4mo ago

Gemini is so stubborn, and often doesn’t follow explicit and simple instructions. So annoying

atleastoptimal4mo ago

Writing style wise, 3.1 seems very verbose, but somehow less creative compared to 3.

1024core4mo ago

It's been hugged to death. I keep getting "Something went wrong".

yuvalmer4mo ago

Gemini 3.0 Pro is bad model for its class. I really hope 3.1 is a leap forward.

msavara4mo ago

Somehow doesn't work for me :) "An internal error has occurred"

andrewstuart4mo ago

Gemini current version drops most of the code every time I try to use it.

Useless.

LZ_Khan4mo ago

biggest problem is that it's slow. also safety seems overtuned at the moment. getting some really silly refusals. everything else is pretty good.

trilogic4mo ago

Humanity last exam 44%, Scicode 59, and that 80, and this 78 but not 100% ever.

Would be nice to see that this models, Plus, Pro, Super, God mode can do 1 Bench 100%. I am missing smth here?

kuprel4mo ago

Why don't they show Grok benchmarks?

1 more reply

jdthedisciple4mo ago

Why should I be excited?

BMFXX4mo ago

Just wish iI could get 2.5 daily limit above 1000 requests easily. Driving me insane...

hn_throw20254mo ago

Yeah great, now can I have my pinned chats back please?

https://www.google.com/appsstatus/dashboard/incidents/nK23Zs...

himata41134mo ago

The visual capabilities of this model are frankly kind of ridicioulus what the hell.

lysecret4mo ago

Please I need 3 in ga…

leecommamichael4mo ago

Whoa, I think Gemini 3 Pro was a disappointment, but Gemini 3.1 Pro is definitely the future!

pickle-pixel4mo ago

does it still crash out after couple prompts?

taytus4mo ago

Another preview model? Why google keep doing this?

solarisos4mo ago

j / k navigate · click thread line to collapse