Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14 min, 20 min, and 1 h 14 min respectively.
Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that frontier models can now solve in minutes what would take me days.
Sadly, the answer was wrong.
It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use websearch.
Still a useful tool though. It definitely gets the majority of the insights.
Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.
I had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed.
What if they have an internal corpus of new and curated knowledge that is constantly updated by humans and accessed in a similar manner? It could be active even if web search is turned off.
They would surely add the latest Euler problems with solutions in order to show off in benchmarks.
I personally think anthropomorphizing LLMs is a bad idea.
> I'm now writing a Python script to automate the summation computation. I'm implementing a prime sieve and focusing on functions for Rm and Km calculation [...]
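The sieve step that chain-of-thought mentions can be sketched generically (a plain Sieve of Eratosthenes; the Rm and Km functions are problem-specific and not reproduced here):

```python
def primes_up_to(n):
    """Sieve of Eratosthenes: return all primes <= n."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]  # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            # Mark every multiple of p starting at p*p as composite.
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_p in enumerate(sieve) if is_p]

print(primes_up_to(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```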
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
They are good at transforming one format to another. They are good at boilerplate.
They are bad at deciding requirements by themselves. They are bad at original research, for example developing a new algorithm.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
as a calendar or "date" system, we start at UNIX time's creation, so it's currently 1.76 gigaseconds AUNIX. You might use megaseconds as the "week" and gigaseconds more like an era, e.g. Queen Elizabeth III's reign, persisting through the entire fourth gigasecond and into the fifth. The clock also displays teraseconds, though this is just a little purple speck atm. of course, this can work off-Earth where you would simply use 88.775ks as the "day"; the "dates" a Martian and Earthling share with each other would be interchangeable.
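The conversion behind such a clock is easy to sketch (a minimal illustration of splitting a Unix timestamp by SI prefixes, not an actual implementation):

```python
import time

def metric_unix(now=None):
    """Break a Unix timestamp into tera/giga/mega/kilo/second digits."""
    t = int(now if now is not None else time.time())
    parts = {}
    for name, size in [("Ts", 10**12), ("Gs", 10**9), ("Ms", 10**6), ("ks", 10**3)]:
        parts[name], t = divmod(t, size)
    parts["s"] = t
    return parts

# e.g. 1,760,000,000 s after the epoch is 1 Gs + 760 Ms,
# i.e. 760 megaseconds into the second gigasecond.
print(metric_unix(1_760_000_000))
```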
I can't seem to get anyone interested in this very serious venture, though... I guess I'll have to wait until the 50th or so iteration of Figure, whenever it becomes useful, to be able to build a 20-foot-tall physical metric UNIX clock in my front yard.
"Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo
I made a few improvements... which all worked on the first try... except the ticking sound, which worked on the second try (the first try was too much like a "blip")
.... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google.
The prompt the other person was using is:
``` Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting. ```
Which is much more difficult.
For what it's worth, I supplied the same prompt as the OG clock challenge and it utterly failed, not only generating a terrible clock, but doing so with a fair bit of typescript: https://ai.studio/apps/drive/1c_7C5J5ZBg7VyMWpa175c_3i7NO7ry...
Generate an SVG animation of the following:
1 - There is a high-fantasy mage tower with a dome and a window at the top
2 - A green goblin comes in front of the tower carrying a torch
3 - A grumpy old mage with a beard and a tall purple hat appears in the tower window
4 - The mage sends a fireball that burns the goblin, and the whole screen is covered in fire.
The camera view must be from behind the goblin's back, so we are basically looking at the tower in front of us:
https://codepen.io/Runway/pen/zxqzPyQ
PS: but yeah, that's attempt #20 or something.
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to extrapolate about society and what this means for others. Meanwhile, I'm still wondering how they're still getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
Is the only thing that prevents a benchmark from being meaningful publicity?
Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now.
I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it.
IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research.
This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off!
it's easy to focus on what they can't do
Is it possible they use the same base pre-trained model and just fine-tuned and RL-ed it better (which, of course, is where all the secret sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
https://gally.net/temp/20251107pelican-alternatives/index.ht...
There seem to be one or two parsing errors. I'll fix those later.
- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts
- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.
- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.
Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well, endowed. The 5th leg, however, is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
I see that AI is reaching the level of a middle school boy...
Meanwhile you could benchmark for something actually useful. If you're about to say "but that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on non-productive tasks with no real-world application. Just like "how many r's in strawberry?", "uh uh uh, it says two". OK, but so what? What good is a benchmark that is so far from a real use case?
Will run extended benchmarks later, let me know if you want to see actual data.
Results are amazing! 2.5 and 3 seem way, way ahead.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
- from "Dirk Gently's Holistic Detective Agency"
Had to change to Electric Mentor....
The possibility was already a topic in the series "Mozart in the Jungle", where they made a robot that supposedly finished Mozart's Requiem.
That's basically "The Washing Machine Tragedy" by Stanisław Lem in a nutshell.
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
The ARC puzzles in question: https://arcprize.org/arc-agi/2/
Ultimately, most benchmarks can be gamed and their real utility is thus short-lived.
But I think it's also fair to use any means to beat it.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of code so over-engineered that, while it totally works, I would never want to have to interact with it.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline; it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feel to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
but I would never want to have to interact with
That is its job security ;) It's fine, though, if you take the time to learn what it's doing and write a nicer version of it yourself.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
One of them isn't the same as the others (hint: it's Apple). The only thing Apple is doing with Maps is adding ads: https://www.macrumors.com/2025/10/26/apple-moving-ahead-with...
Is there evidence that's true? That the other models are significantly better than the ones you named?
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
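For illustration, the field sits alongside the transferred model output in a content part. Only the `thoughtSignature` key and the dummy value below come from the linked docs; the surrounding structure is a simplified sketch:

```python
# Sketch of a transferred assistant turn being replayed to the API.
# Only "thoughtSignature" and its dummy value are from the docs;
# the "text" content here is a placeholder.
part = {
    "text": "(reply produced by the previous model)",
    "thoughtSignature": "context_engineering_is_the_way_to_go",
}
print(part["thoughtSignature"])
```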
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high...
Very good. Nice work! These things will definitely change lives.
For comparison:
* Gemini 2.5 Pro was $1.25/M input and $10/M output
* Gemini 1.5 Pro was $1.25/M input and $5/M output
* 1,500 RPD (free), then $35 / 1,000 grounded prompts
to
* 1,500 RPD (free), then (Coming soon) $14 / 1,000 search queries
It looks like the pricing changed from per-prompt (previous models) to per-search (Gemini 3)
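A quick sketch of why the unit change matters (the searches-per-prompt figure is hypothetical):

```python
# Old pricing billed per grounded prompt; new pricing bills per search query.
old_per_prompt = 35 / 1000   # $0.035 per grounded prompt
new_per_search = 14 / 1000   # $0.014 per search query

searches_per_prompt = 3      # hypothetical: one prompt may fire several searches
new_per_prompt = searches_per_prompt * new_per_search

# At 3 searches per prompt the new scheme is already pricier:
print(f"${old_per_prompt:.3f} vs ${new_per_prompt:.3f}")  # $0.035 vs $0.042
```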
I'm not surprised. I'm French, and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French and other languages where there is no such thing. A 100% American thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on US-centric data). At the very least it makes it easy to tell when someone is just copy-pasting LLM output into some other website.
Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/
Developer blog: https://blog.google/technology/developers/gemini-3-developer...
Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3
Google Antigravity: https://antigravity.google/
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
Code quality was fine for my very limited tests but I was disappointed with instruction following.
I tried a few tricks, but I wasn't able to convince it to present a plan before starting implementation.
I have instructions saying it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code.
This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code's.
For Codex, those instructions make plan mode redundant.
Many can point to a long history of killed products and soured opinions, but you can't deny they've been a great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
It's good there's competition in the space though.
You are now seeing their valuation finally adjusting to that fact all thanks to DeepMind finally being put to use.
They have done a lot, but most of it was in the "don't be evil" days and they are a fading memory.
But at least this company is not run by a narcissistic sociopath.
Good at vibe coding 10:30 - step change where it's actually useful
AGI still 5-10 years. Needs reasoning, memory, world models.
Is it a bubble? - Partly 22:00
What's fun to do with Gemini to show the relatives? Suggested taking a selfie with the app and having it edit. 24:00 (I tried and said make me younger. Worked pretty well.)
Also interesting - apparently they are doing an agent to go through your email inbox and propose replies automatically 4:00. I could see that getting some use.
Is that all? So they just need to invent:
1. Thought
2. A mechanism for efficiently encoding and decoding arbitrary percepts
3. A formal model of the world
And then the existing large language models can handle the rest.
Yep, 5 years and a hundred billion dollars or so should do the trick.
Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be due to chance.
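For intuition, the 95% margin of error on a pass rate follows from a simple binomial model (the dataset size isn't stated, so n below is illustrative):

```python
import math

def margin_95(p, n):
    """Approximate 95% margin of error for a pass rate p over n trials."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# A pass rate near 50% over ~1000 trials already carries a ~3 point margin:
print(round(margin_95(0.5, 1000) * 100, 1))  # ~3.1 percentage points
```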
Curious to see some third-party testing of this model. Currently it seems to primarily improve "general non-coding and visual reasoning", based on the benchmarks.
That's a "loss" I can deal with.
Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8.
Gemini 2.5 Pro scored 57.6, so this is a huge improvement.
The pattern I keep seeing is that I ask it to iterate on a design document. It will, but then it will immediately jump into changing source files despite explicit asks to only update the plan. It may be a gemini CLI problem more than a model problem.
Also, whoever at these labs is deciding to put ASCII boxes around their inputs needs to try using their own tool for a day.
People copy and paste text in terminals. Someone at Gemini clearly thought about this as they have an annoying `ctrl-s` hotkey that you need to use for some unnecessary reason.. But they then also provide the stellar experience of copying "a line of text where you then get | random pipes | in the middle of your content".
Codex figured this out. Claude took a while but eventually figured it out. Google, you should also figure it out.
Despite model supremacy, the products still matter.
Using Anthropic's or OpenAI's models is incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go for this for these Google models?
I can't find the announcement anymore, but you can see it under benefits here https://support.google.com/googleone/answer/14534406?hl=en
The initial separate subscriptions were confusing at best. The current situation is pretty much the same as Anthropic/OpenAI: straightforward.
Edit: changed ~1 month ago (https://old.reddit.com/r/Bard/comments/1npiv2o/google_ai_pro...)
I had the exact same experience and walked away to chatgpt.
What a mess.
https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=...
Also, the full document:
https://archive.org/details/gemini-3-pro-model-card/page/n3/...
The person also claims that with thinking on, the gap narrows considerably.
We'll probably have 3rd party benchmarks in a couple of days.
I absolutely LOVE that Google themselves drew a sharp distinction here.
Edit: nvm it looks to be up for me again
Basically, these tools can get you to a 100k LOC project without much effort, but it's not going to be a serious product. A serious product still requires understanding.
My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it)
The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here.
It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc...
Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model)
I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true.
How come I can't even see prices without logging in... are they doing regional pricing?
Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit.
Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS, where every user is someone who went to the App Store and downloaded it. Admittedly, on Android Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than an existing product they happen to use otherwise.
Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest.
For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have Google One cloud storage sub for my photos and it comes with a Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go to LLM app/service. I suspect the same goes for many others.
I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic.
"Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow."
"You've reached your rate limit. Please try again later."
Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled.
"Create me a SVG of a pelican riding on a bicycle"
That said, OpenAI was in the same place earlier in the year and very quickly became the top agentic platform with GPT-5-Codex.
The AI crowd is surprisingly not sticky. Coders quickly move to whatever the best model is.
Excited to see Gemini making a leap here.
It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures.
- gemini-3-pro summary: https://gist.github.com/primaprashant/948c5b0f89f1d5bc919f90...
- gpt-5.1 summary: https://gist.github.com/primaprashant/3786f3833043d8dcccae4b...
Summary from GPT 5.1 is significantly longer and more verbose compared to Gemini 3 Pro (13,129 output tokens vs 3,776). Gemini 3 summary seems more readable, however, GPT 5.1 one has interesting insights missed by Gemini.
Last time I did this comparison, at the time of the GPT 5 release [2], the summary from Gemini 2.5 Pro was way better and more readable than the GPT 5 one. This time the readability of the Gemini 3 summary still seems great, while GPT 5.1 feels a bit improved but not quite there yet.
[1]: https://gist.github.com/primaprashant/f181ed685ae563fd06c49d...
Input: $1.25 -> $2.00 (1M tokens)
Output: $10.00 -> $12.00
Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified.
"Incredible"! When they insert it into literally every google request without an option to disable it. How incredibly shocking so many people use it.
Amazing and hilarious
Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that.
No benchmark is safe, when this much money is on the line.
https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new...
I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far, in which case I'll live without the desktop app.
It's really, really, really good so far. Wow.
Note that I haven't tried it for coding yet!
Gemini 3 worked much better, and I actually committed the changes it created. I don't mean it's revolutionary or anything, but it provided a nice summary of my request and created a decent, simple solution. Sonnet had created a bunch of overarching changes that I would not even bother reviewing. Seems nice. Will probably use it for 2 weeks until someone else releases a 1.0001x better model.
- GPT-5 medium is the best
- GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5 but it’s quite a bit faster
Really wonder how well Gemini 3 will perform
Standard context (≤ 200K tokens):
* Input: $2.00 vs $1.25 (Gemini 3 Pro input is 60% more expensive than 2.5)
* Output: $12.00 vs $10.00 (Gemini 3 Pro output is 20% more expensive than 2.5)
Long context (> 200K tokens):
* Input: $4.00 vs $2.50 (same +60%)
* Output: $18.00 vs $15.00 (same +20%)
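A back-of-envelope comparison using the list prices above (the token counts are made up for illustration):

```python
def cost(in_tok, out_tok, in_price, out_price):
    """Dollar cost of one request, prices in $ per 1M tokens."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Hypothetical request: 100k input tokens, 10k output, standard context.
old = cost(100_000, 10_000, 1.25, 10.00)  # Gemini 2.5 Pro list price
new = cost(100_000, 10_000, 2.00, 12.00)  # Gemini 3 Pro list price
print(f"${old:.3f} -> ${new:.3f}")  # $0.225 -> $0.320
```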
"Users"? Or people that get presented with it and ignore it?
It also told me that on Plus I am most likely using "Flash" model. There is no "Flash" model in the dropdown to choose from. There is only "Fast" and "Thinking". It then told me "Fast" is just renamed Flash and it likely uses Gemini 2.5. On the product comparison page there is nothing about 2.5, it only mentions version 3 for both Plus and Pro plans. Of course on the dropdown menu it's impossible to see which model it is really using.
How can a normal person understand their products when their own super advanced thinking/reasoning model that took months to train on world's most advanced hardware can't?
It's amazing to me they don't see it as an epic failure in communication and marketing.
I still use ChatGPT and Codex as a user but in the API project I’ve been working on Gemini 2.5 Pro absolutely crushed GPT-5 in the accuracy benchmarks I ran.
As it stands Gemini is my de facto standard for API work and I’ll be following very closely the performance of 3.0 in coming weeks.
Prompt:
Instead of the current StackView, I want you to implement a new StackView that will have a similar api with the differences that:
1. It automatically handles swiping to the previous page/item. If not mirrored, it should detect swiping from the left edge; if mirrored, it should detect it from the right edge. It's important that swiping be responsive - that is, the previous item should be visible under the current item while swiping, the same way it's handled in iOS applications. You should also add to the API the option for the swipe to be detected not just from the edge but from anywhere on the item, with the same behavior. If the swipe is released with x% of the current item out of view, we should animate and move to the previous item. If it's a small percentage, we should animate the current page back into place as if nothing happened.
2. The current page transitions are horrible and look nothing like native iOS transitions. Please make the transitions feel the same.
Claude 4.5 solved it as well (the Caesar/transposition ciphers), but Gemini 3.0 Pro's method and approach was a lot more elegant. Just my $0.02.
OpenAI is going for the consumer-grade AI market, as opposed to a company like Anthropic making a specialized developer tool. Google can inject their AI tool in front of everybody in the world, and already have with Google AI search. All of these models are just going to reach parity eventually, but Google is burning cash compared to OpenAI burning debt. It seems like for consumer-grade purposes, AI use will just be free sooner or later (DeepSeek is free, Google AI search is free, students can get Gemini Pro for free for a year already). So all I'm seeing that OpenAI has is Sora, which seems like a business loser though I don't really understand it, and also ChatGPT seems to own the market of people roleplaying with chat bots as companions (which doesn't really seem like a multi-trillion dollar business but I could be wrong).
It used to be an algorithmic game for a Microsoft student competition that ran in the mid/late 2000s. The game invents a new, very simple, recursive language to move the robot (Herbert) on a board and catch all the dots while avoiding obstacles. Amazingly, this clone's executable still works today on Windows machines.
The interesting thing is that there is virtually no training data for this problem, and the rules of the game and the language are pretty clear and fit into a prompt. The levels can be downloaded from that website and they are text based.
What I noticed last time I tried is that none of the publicly available models could solve even the most simple problem. A reasonably decent programmer would solve the easiest problems in a very short amount of time.
I used the pro version and it started out well (as they all did), but it couldn't prove it. The interesting part is that it typoed the name of a tactic, spelling it "abjel" instead of "abel", even though it correctly named the concept. I didn't expect the model to make this kind of error, because they all seem so good at programming lately, and none of the other models did, although they made some other naming errors.
I am sure I can get it to solve the problem with good context engineering, but it's interesting to see how they struggle with lesser represented programming languages by themselves.
I also asked it for the best play when in backgammon opponent rolls 6-1 (plays 13/7 8/7) and you roll 5-1. It starts alright with mentioning a good move (13/8 6/5) but continues to hallucinate with several alternative but illegal moves. I'm not too impressed.
But I am looking forward to trying it out. I find Gemini to be great as handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Temperature continues to be capped at a maximum of 0.2, and there's still a hidden top_k of 64 that you can't turn off.
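For reference, these are the knobs in the API's `generationConfig`; a minimal sketch of the fragment in question (the cap and forced top_k described above are applied server-side, not by the payload):

```python
# Sketch of the generationConfig fragment these settings live in.
# The values reflect the limits described above, not free choices:
generation_config = {
    "temperature": 0.2,  # reportedly the effective maximum in AI Studio
    "topK": 64,          # reportedly fixed and not disableable
}
print(generation_config)
```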
I love the google AI studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions on a tool that is ostensibly aimed at prosumers...
Do regular users know how to disable AI Overviews, if they don't love them?
0: https://en.wikipedia.org/wiki/51st_G7_summit#/media/File:Pri...
It's transcribing the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it's done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other.
I'm concerned.
- Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro)
[Edit: working for me now in ai studio]
Google goes full Apple...
> gemini-3-pro-preview-ais-applets
> gemini-3-pro-preview
Glad to see Google still can't get out of its own way.
Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude.
side by side comparison of gemini with other models
gpt-5.1 honestly looking very comfortable given available usage limits and pricing
although gpt-5.1 used from chatgpt website seems to be better for some reason
Sonnet 4.5 agentic coding still holding up well and confirms my own experiences
i guess my reaction to gemini 3 is a bit mixed, as coding is the primary reason many of us pay $200/month
What useful thing can it demonstrably do that its predecessors couldn't?
https://www.reddit.com/r/Bard/comments/1p0fene/gemini_3_the_...
Good. That said, I wonder if those models are still LLMs.
Come on, you can’t be serious.
Edit: Okay, I admit I'm used to dealing with OpenAI models and it seems you have to be extra careful with wording with Gemini. Once you have right wording like "explore my own sexuality" and avoid certain words, you can get it going pretty interestingly.
Can the mathematicians here try asking it to invent novel math related to [insert your field of specialization] and see if it comes up with something new and useful?
Try lowering the temperature, use SymPy etc.
More like 2 billion hostages
Gemini 3:
The cognitive dissonance in this thread is staggering.
We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements.
Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG":
1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence.
2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates.
Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it.
Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"?
> I'm now zeroing in on the temporal aspect. Examining the search snippets reveals dates like "2025-10-27," suggesting a future context relative to 2024. My initial suspicion was that the system time was simply misaligned, but the consistent appearance of future dates strengthens the argument that the prompt's implied "present" is indeed 2025. I am now treating the provided timestamps as accurate for a simulated 2025. It is probable, however, that the user meant 2024.
Um, huh? It's found search results for October 2025, but this has led it to believe it's in a simulated future, not a real one?
https://www.ascii-art.de/ascii/uvw/unicorn.txt
However, it is amazing how far spatial comprehension has improved in multimodal models.
I'm not sure the below would be properly displayed on HN; you'll probably need to cut and paste it into a text editor.
Prompt: Draw me an ASCII world map with tags or markings for the areas and special places.
Temperature: 1.85
Top-P 0.98
Answer: Edit (replaced with URL) https://justpaste.it/kpow3
REALITY: It's just 3 existing products rolled into one. One of which isn't even a Google product.
- VS Code (Microsoft's)
- Gemini
- Chrome Browser
[0] Fake racial slur for a robot, LLM chatbot, or other automated system
import decimal

def solve_kangaroo_limit():
    # Set precision to handle the "digits different from six" requirement
    decimal.getcontext().prec = 50

    # For U(0,1), H(x) approaches 2x + 2/3 very rapidly (exponential decay of error)
    # At x = 10^6, the value is indistinguishable from the asymptote
    x = 10**6
    limit_value = decimal.Decimal(2) * x + decimal.Decimal(2) / decimal.Decimal(3)
    print(f"H({x}) ≈ {limit_value}")
    # Output: 2000000.66666666666666666666...

if __name__ == "__main__":
    solve_kangaroo_limit()
P.S.: decimal.Decimal(2) / decimal.Decimal(3) == 0.666666... repeated out to the context precision. This is your 'smart' computer?

Yes, I know the Workspaces workaround, but that's silly.
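On the Decimal complaint above: the run of sixes is just the context precision at work, not a bug, and the default rounding actually makes the final stored digit a 7, not a 6. A quick sketch:

```python
import decimal

# 50 significant digits, as in the snippet above
decimal.getcontext().prec = 50
two_thirds = decimal.Decimal(2) / decimal.Decimal(3)

# The default ROUND_HALF_EVEN rounds the 50th digit up: 49 sixes, then a 7
print(two_thirds)
assert str(two_thirds) == "0." + "6" * 49 + "7"
```

So if the puzzle hinges on "digits different from six", a fixed-precision Decimal is exactly the wrong tool: the last digit differs from six only because of rounding.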
This blows Sonnet 4.5 (64.8%), GPT5.1 (58.7%) and Opus 4.1 (66.3%) away.
An example output:
'''
The cabin was cold. It smelled of pine needles and old dust. Bryan sat on the porch with a glass of scotch balanced on his knee. It was cheap scotch. He missed the good stuff, but he wasn't in Paris anymore, and he certainly wasn't in Beverly Hills.
He checked his watch. It was a bulky thing, black rubber. He checked the Glock on the table next to him. That was black, too.
"I told you," Bryan said. He wasn't talking to anyone. "I told you not to go into the woods."
He was thinking about Kim. He was always thinking about Kim. She was safe, probably. At a mall in Santa Monica. Or maybe a library. But it felt like she was gone. It always felt like she was gone.
The trees at the edge of the clearing shifted. There wasn't any wind. The air was heavy and humid, unusual for October. A bird stopped singing. Then the crickets stopped. It got very quiet.
Bryan took a sip of the scotch. It burned his throat. He liked the burn. It reminded him he was still there.
Three red dots appeared on his chest. They made a small triangle.
Bryan didn't look at them. He set the glass down on the railing. He picked up the phone. It was an old flip phone. He pressed a button.
"I don't know who you are," Bryan said into the phone. "I don't know what you want. If you are looking for ransom, I can tell you I don't have money."
The leaves rustled. A shape began to form out of the empty air. It looked like heat rising off the highway in August. It was tall. Much taller than a man.
"But what I do have," Bryan said, "are a very particular set of skills."
The shape solidified. It was ugly. It had skin like a lizard and dreadlocks that looked like heavy black cables. It wore a metal mask. It made a clicking sound. Click-click-click. Like a Geiger counter.
"Skills I have acquired over a very long career," Bryan said. He stood up. His knees popped. He felt the ache in his lower back. He was getting too old for this.
The thing roared. It wasn't a human sound. It was a wet, guttural noise.
"I will look for you," Bryan said. He put the phone in his pocket. "I will find you."
The creature raised a metal gauntlet. Two jagged blades slid out. They caught the moonlight.
"And I will kill you," Bryan said.
He raised the Glock and fired twice. Pop-pop.
The creature didn't fall. It took a step forward. Green blood, thick and glowing, leaked from its shoulder. It looked like antifreeze. The creature touched the wound, then looked at its hand. It seemed surprised.
Bryan didn't wait. He moved off the porch. He moved fast for a man his age. He went left, toward the woodpile. The creature fired a blast of blue light from its shoulder. It hit the porch swing. The swing exploded. Wood splinters rained down like confetti.
Bryan rolled behind the logs. He breathed in. He breathed out. He could hear the heavy footsteps crunching the dry leaves. Crunch. Crunch.
"You're big," Bryan whispered. "That makes you slow."
He grabbed a splitting maul from the stump. It was heavy. The handle was smooth hickory.
The creature came around the woodpile. It was scanning the trees. It didn't see Bryan down low. Bryan swung the maul. He put his hips into it. He swung it like he was chopping a cord of oak.
The blade hit the creature in the knee. There was a snap. A loud, wet snap.
The thing howled. It fell onto one knee.
Bryan dropped the maul. He stepped inside the creature’s reach. He knew exactly where to hit. The throat. The armpit. The soft spot under the ribs. He hit the creature three times, hard chops with the side of his hand. It was like hitting a brick wall, but he felt something give.
The creature swung its arm back. It hit Bryan in the chest.
Bryan flew backward. He hit the dirt. The wind went out of him. He lay there for a second, staring up at the stars. They looked very far away. He wondered if Lenore was looking at the same stars. Probably not. She was probably sleeping.
He sat up. His ribs hurt. Maybe broken.
The creature was trying to stand. It was clicking again. It tapped something on its wrist. A series of red symbols started flashing. They counted down.
Bryan knew a bomb when he saw one.
"No," Bryan said.
He tackled the thing. He didn't think about it. He just did it. He grabbed the creature’s arm. He twisted the wrist mechanism. He’d seen something like it in Baghdad once. Or maybe Istanbul. The memories ran together now.
He ripped the gauntlet loose. Wires sparked. He threw it as hard as he could into the darkness of the woods.
Three seconds later, there was a flash. A boom. A shockwave that shook the pine needles from the trees.
Silence came back.
The creature lay on the ground. It was breathing shallowly. The green blood was pooling under it. It took off its mask.
The face was hideous. Mandibles. Beady eyes. It looked at Bryan. It said something, a garbled copy of Bryan's own voice.
"...good luck..."
Then it died. It just stopped.
Bryan stood up. He dusted off his pants. He walked back to the porch. The swing was gone. The railing was scorched.
His glass of scotch was still sitting there, untouched. The ice hadn't even melted.
He picked it up. He took a drink. It still tasted cheap.
He took his phone out and looked at it. No service.
"Well," he said.
He went inside the cabin and locked the door. He sat on the couch and waited for the sun to come up. He hoped Kim would call. He really hoped she would call.
'''
Meh, not interested already
Also probably over for OpenAI
the token efficiency and context are also mindblowing...
it feels like I am talking to someone who can think instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real world applications that are larger than a few dozen files.
gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware, which in theory allows them way more flexibility, but gemini 3 just works and it truly understands, which is something I didn't think I'd be saying for a couple more years.
Everyone is talking about the release of Gemini 3. The benchmark scores are incredible. But as we know in the AI world, paper stats don't always translate to production performance on all tasks.
We decided to put Gemini 3 through its paces on some standard Vision Language Model (VLM) tasks – specifically simple image detection and processing.
The result? It struggled where I didn't expect it to.
Surprisingly, VLM Run's Orion (https://chat.vlm.run/) significantly outperformed Gemini 3 on these specific visual tasks. While the industry chases the "biggest" model, it’s a good reminder that specialized agents like Orion are often punching way above their weight class in practical applications.
Has anyone else noticed a gap between Gemini 3's benchmarks and its VLM capabilities?
Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213...
> The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data.
So your Gmails are being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over using Gemini to analyze user data, which potentially includes Gmail by default.
Where is the outrage?
[0] https://web.archive.org/web/20251118111103/https://storage.g...
[1] https://www.yahoo.com/news/articles/google-sued-over-gemini-...