Gemini 3.5 Flash

https://www.gianlucagimini.it/portfolio-item/velocipedia/

tantalor1mo ago

Forgetting the chainstay is typical of asking random people to draw a bicycle.

> most ended up drawing something that was pretty far off from a regular men’s bicycle

smcleod1mo ago

I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

https://en.wikipedia.org/wiki/Synthwave

dekhn1mo ago

I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

tandr1mo ago

If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

TacticalCoder1mo ago

Love your pelicans, as always. And that one is... Wow.

I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

hydra-f1mo ago

Same old issue with Gemini models trying to "enrich" everything

nrds1mo ago

We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

https://en.wikipedia.org/wiki/Vaporwave

nomilk1mo ago

'Pelicans' should be the unit of measurement for model prices, rather than tokens.

karmakaze1mo ago

I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.

khy1mo ago

That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009

Razengan1mo ago

I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

nickvec1mo ago

I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

sbinnee1mo ago

Wow what’s with all the styling? Is it manifestation of google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

taurath1mo ago

I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature

dankwizard1mo ago

Wouldn't be a thread about the tech that is changing the landscape for businesses across nearly every discipline without a pelican svg.

bee_rider1mo ago

I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

VectorLock1mo ago

The fact it went for vaporwave styling on its own is very telling.

setgree1mo ago

``

wtf

``

WTF??

gcgbarbosa1mo ago

funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

__mharrison__1mo ago

They are just trolling you now

nashashmi1mo ago

Beats a human by like 10$

holtkam21mo ago

at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

danilocesar1mo ago

Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

GodelNumbering1mo ago· 24 in thread

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10

__jl__1mo ago

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
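To make the multipliers above easy to re-check, here is a minimal sketch that recomputes them from the eval costs quoted in this comment. The dictionary and function names are just illustrative; the dollar figures are the ones listed above, not independently verified.

```python
# Cost in USD to run the whole eval, as quoted in the comment above.
EVAL_COST_USD = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}


def cost_multiple(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline`, to one decimal."""
    return round(EVAL_COST_USD[model] / EVAL_COST_USD[baseline], 1)


if __name__ == "__main__":
    for baseline in ("Gemini 2.5 Flash", "Gemini 2.5 Pro", "Gemini 3.0 Flash"):
        ratio = cost_multiple("Gemini 3.5 Flash", baseline)
        print(f"Gemini 3.5 Flash vs {baseline}: {ratio}x")
```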

doginasuit1mo ago

They probably never intended to keep serving cheap models. This is a natural way to introduce the squeeze, now that they have people who built services on their API. It makes a lot of sense to have an abstraction layer where the provider doesn't matter. If you are working in Kotlin, Koog is excellent.

rudedogg1mo ago

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

hei-lima1mo ago

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

fnordsensei1mo ago

3.5 flash is listed as stable rather than preview, or am I misreading?

jstummbillig1mo ago

> Interesting pricing direction.

Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.

More generally, $/token + naming scheme comparisons are just confusing: I am not looking for a wordy idiot and I doubt most people are (at least not with what I would consider worthwhile business ambitions). In fact wordy idiots are fairly costly, because we have to consider the large amounts of cheap garbage that they are producing, and if you price your own time somewhat competitively then fairly quickly that's the bigger lever.

Even if we don't consider the last part: How do we price the better model, that can one shot a task without having to go back and forth and spending more tokens or having to fix more bugs later? It is definitely worth something and I think it's quite undervalued right now. What seems to be missing is a better measurement of capability per token. I don't know what that could look like. Maybe something like how we try and measure inflation, some basket of tasks (which then ends up being part of the training data so idk).

dr_dshiv1mo ago

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

OakNinja1mo ago

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models will diverge more price wise. The pro models will become more and more expensive.

llm_nerd1mo ago

It might be temporary pricing given that 3.5 Flash is actually superior to the existing 3.1 Pro in almost all regards, so they're in a bit of a lurch as 3.1 Pro really doesn't make sense given that 3.5 Pro has been delayed a bit.

WhitneyLand1mo ago

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

verdverm1mo ago

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

dzhiurgis1mo ago

I use Gemini models in Junie daily. When I need accuracy I switch to Gemini 3.1 Pro Preview (why is it still in preview?), but it burns through credits, leaving me topping up $5 every day. 3.1 Flash lite is just not accurate enough. 3 Flash is the sweet spot, just as Jetbrains suggests it is.

Maybe I'll look at Opus again, but it was slower, much more expensive and, worst of all, wasn't listening to your instructions.

dbbk1mo ago

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

photonair1mo ago

In general, Gemini flash is still relatively cheap compared to the "mini" versions of the other big 2. However, I agree that newer versions seem to come with multi-x price increases (similar to the new ChatGPT) and we certainly need competition from the open source models to keep these guys in check with pricing.

davedx1mo ago

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

I will definitely not be updating to this new model, and I think once 2.5 Flash is deprecated I'll have to re-architect so Gemini is only used for web grounding requests. This is an insane price increase.

harrouet1mo ago

If you look at the benchmark, the model is not particularly good at coding, and as you point out it costs 3x the price of the previous flash models. So what is the market for it?

I think that they might have reached the latency sweet spot where voice applications become more natural. Natural speech is <100 tokens per second (after STT), so $9 for a million tokens buys you roughly 3 hours of speech. That's totally competitive compared to human costs.
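The "roughly 3 hours" figure can be sanity-checked with a few lines. This is only a sketch of the arithmetic above: it takes 100 tokens per second as the upper bound from the comment, counts output tokens only at $9 per million, and ignores input, STT and TTS costs entirely. The function names are made up for illustration.

```python
def hours_of_speech(tokens: int, tokens_per_second: float) -> float:
    """Hours of continuous speech that `tokens` covers at a given token rate."""
    return tokens / tokens_per_second / 3600


def cost_per_hour_usd(price_per_million_usd: float, tokens_per_second: float) -> float:
    """Output-token cost of one hour of continuous speech."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_million_usd * tokens_per_hour / 1_000_000


if __name__ == "__main__":
    print(f"{hours_of_speech(1_000_000, 100):.2f} hours per million tokens")
    print(f"${cost_per_hour_usd(9.0, 100):.2f} per hour of speech")
```

Since 100 tokens per second is an upper bound, slower real speech would stretch a million tokens over more hours and lower the hourly cost.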

LetsGetTechnicl1mo ago

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

ilia-a1mo ago

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

irthomasthomas1mo ago

And they are using this to power search answers?

malloryerik1mo ago

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

I say "almost" because the situation is more fluid and unstable than a normal naming change. If Apple were to do this with laptops, maybe it'd be like, Air gets better and pricier and becomes Pro-level model, Neo same way becomes Air-level model, etc. But Apple's too design oriented to do something like that. Google, well...

This change has made me decide to move to a multi-provider situation like through OpenRouter for consumer-facing LLM api in a service I'm building. I just can't trust Google to not constantly rearrange everything under our feet. Doesn't mean I won't use Gemini, but it clearly means I need to have others in the mix ready to go. In fact I used to use lots of Flash Lite, which is now Gemma territory, and I can't get that served by Google anymore and don't want to run my own hardware.

But in any case, I'd compare this "Flash" model with previous "Pro" on all metrics. It's kinda like if in clothes a Small suddenly became what was a Large, or at Starbucks a Grande became the new de facto Venti. And only for the new! drinks.

And if we think this way, it's possible that prices are actually falling?

ashirviskas1mo ago

don't forget Gemini 2.0 flash at $0.10/$0.40

SwellJoe1mo ago

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

m3kw91mo ago

just subscribe to the plan, cheaper

throwa3562621mo ago

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

easygenes1mo ago· 12 in thread

For those who would like to know the total and active parameter count of this model: even though Google doesn't disclose the model technicals, we can infer them within relatively tight margins based on what we do know.

We know they serve the model on TPU 8i, which we have plenty of hard specs for (so we know the key constraints: total memory and bandwidth and compute flops). We can also set a ceiling on the compute complexity and memory demand of the model based on knowing they will be at least as efficient as what is disclosed in the Deepseek V4 Technical Report.

We can also assume that the model was explicitly built to run efficiently in a RadixAttention style batched serving scenario on a single TPU 8i (so no tensor parallelism, etc. to avoid unnecessary overheads... Google explicitly designed the 8th-generation inference architecture to eliminate the need for tensor sharding on mid-sized models).

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.

Edit: There's one factor I under-rated in my initial estimate... TurboQuant. This is a compute to KV memory use tradeoff. It's plausible with TurboQuant at a quality-neutral setting they've gotten the model up to 400B with similar economics. This is a variable affecting concurrency, and the way they decided total model size was likely based on what they see for the average user's average KV cache depth in real-world usage.
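For anyone who wants to play with the napkin math, here is a minimal sketch of the memory split in the diagram. It assumes the 288 GB capacity shown above and a flat average of 4 bits per parameter; that average is an assumption for illustration, and any FP8 layers would push the weight footprint up. The function names are hypothetical.

```python
CAPACITY_GB = 288  # accelerator memory, taken from the diagram above


def weight_gb(params_billion: float, avg_bits_per_param: float) -> float:
    """Static weight footprint in GB (decimal) at an average precision."""
    return params_billion * avg_bits_per_param / 8


def kv_budget_gb(params_billion: float, avg_bits_per_param: float,
                 capacity_gb: float = CAPACITY_GB) -> float:
    """Memory left for KV caches and other dynamic allocations."""
    return capacity_gb - weight_gb(params_billion, avg_bits_per_param)


if __name__ == "__main__":
    for params in (250, 300, 400):
        weights = weight_gb(params, 4)
        left = kv_budget_gb(params, 4)
        print(f"{params}B @ 4 bits: {weights:.0f} GB weights, {left:.0f} GB left")
```

At a flat 4 bits, 300B lands on the 150 GB / 138 GB split shown in the diagram, while 400B would leave under 90 GB for caches, which is why the KV compression tradeoff in the edit matters for concurrency.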

gertlabs1mo ago

We've been really impressed with the performance of ~30B parameter class models and how close they are to the frontier from ~6-12 months ago, which begs the question, are the frontier labs really serving 10T parameter models? Seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

DCKing1mo ago

If two things hold up - 1) this is actually a 2-300B parameter model and 2) this is actually competitive with frontier OpenAI and Anthropic models (and not just benchmaxing), the implications are pretty big. It would mean you could run "frontier level" performance in one box at home.

300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

For comparison, DeepSeek V4 Flash is all the rage now for small efficient models. It's very good for its size but far from the performance of the latest GPT Pro and Opus models. The vanilla variant has 284B parameters. It fits on both 256GB and 512GB Mac Studios and hits about 20-30 tokens/second.

The implication of all this here is that you could have a (somewhat sluggish) Opus in a small box at home. At least once competing models and hardware to run them will be available (high end Mac Studios have been discontinued).

Something tells me that this means that Google's performance numbers here are inflated.

MTP - https://blog.google/innovation-and-ai/technology/developers-...

smnscu1mo ago

Nice post! You piqued my curiosity, so after a bit of research it turns out that, with techniques like MTP/MLA/CSA, it's quite probable that these models are much more efficient (and maybe bigger? tho 400B sounds about right) than a simple RAM breakdown would suggest.

MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...

CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention

daemonologist1mo ago

If this is accurate it raises the question: why is this model so expensive? DeepSeek v4 Flash is 284B total/13B active, FP4/FP8 mixed, and only costs $0.14/$0.28 - even less from OpenRouter. Of course Gemini 3.5 Flash is most likely a better product, and therefore it can command a higher price from an economics perspective, but does this imply Google is taking roughly a 90% profit margin on inference? If so they're either very compute-limited or confident in the model and wanting to recoup training/fixed costs (or both).

4ggr01mo ago

meta - i think that's the first time i've seen a table in a hn comment, and i'm surprised/impressed! nice

are these pre-generated in a different tool with plain unicode and then just copy-pasted, or is it a built-in feature of hn?

stared1mo ago

A nice estimate! Since „you can compress knowledge, but not factual knowledge” https://x.com/bojie_li/status/2049314403208896521, it is likely we can actually measure its size.

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

wing-_-nuts1mo ago

The fact that this is running on tpus is a huge point. Counting those against the other available datacenter hardware used by others, it puts google at a huge advantage, and compute > * while scaling is still working

Maven9111mo ago

Tell me more about what your day looks like. What do you think of the LLMOps book from Abi, in case you have read it? Any other resources you can recommend?

zacksiri1mo ago

Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing / benchmark i think it's around the 100-120B mark.

With the Pro variant being around 600B - 800B

My testing is comparing its performance / output to other models in the same size range, so not as scientific as yours.

anthonypasq961mo ago

given this, is it safe to assume that inference pricing is barely related to cost to serve at this point and there is considerable margin?

rawoke0836001mo ago

I like your chain of thought there !

PunchTornado1mo ago

i would like to get a job like that. what can i study? I am mostly a ml engineer / researcher.

SXX1mo ago· 12 in thread

  > Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it is the only one that got animated movement.

SXX1mo ago

Gemini 3.1 Flash Lite Thinking High - 2,526 tokens:

https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...

Gemini 2.5 Pro - 5,325 tokens:

https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...

Gemini 2.5 Flash - 7,556 tokens:

https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...

Gemma 4 31B IT - 3,261 tokens via AI Studio:

https://gistpreview.github.io/?858a42b96af864859a3b89508619d...

Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:

https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...

https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...

franze1mo ago

Opus 4.7

https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...

abtinf1mo ago

hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF @ Q6_K

8112 tokens @ 52.97 TPS, 0.85s TTFT

Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...

Generated with LM Studio on a Macbook Pro M2 Max

https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...

https://gistpreview.github.io/?557f979c82701862bc26d24f10399...

vtail1mo ago

Here is GPT 5.5 High thinking; I had to add a second follow up prompt "it's not animated though" as the first one was not animated.

captn3m01mo ago

All three links animate for me.

NitpickLawyer1mo ago

I think they mean the boat is moving. In the flash ones the paddles are animated but the boat is stationary for me.

[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...

r0fl1mo ago

It’s shocking how much better 3.1 is than 3.5 flash

The benchmarks used don’t really give a full story

wslh1mo ago

Can you try with a more complex story such as "three little pigs"? I tried but it created a storybook instead of the SVG animation. I am looking to partially imitate Godogen [1][2] which is really great, even for animations.

[1] https://github.com/htdt/godogen

krupan1mo ago

These are hilarious. 3.5 Flash Thinking High is the only one that is weirdly deformed (what is going on with the hat in 3.1 Pro??)

stingraycharles1mo ago

3.5 Flash definitely got the synth wave vibe preference.

abi1mo ago

Your links are broken FYI.

John78787811mo ago

They work for me.

aliljet1mo ago· 10 in thread

Is there a good benchmark tracking hallucinations? The models are all incredibly good now, even the open ones, and my hope is that the rate of hallucinations is something that's falling off in concert with larger and larger context lengths.

WarmWash1mo ago

People complain about them incessantly, but I can almost never get people to actually post receipts. Every provider allows sharing chats, and anyone can share a prompt that reliably produces hallucinations.

More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

saberience1mo ago

I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say, please do searches and cite sources, as if it doesn't it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

throawayonthe1mo ago

well there is https://artificialanalysis.ai/evaluations/omniscience

https://g.co/gemini/share/33e7a589a161

Sevii1mo ago

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

aliljet1mo ago

I'm really running into this deep at the edges of content creation. Take, for example, a need to generate some kind of legal work. The cost of painstakingly checking and rechecking each case cited is reducing the value of these frontier models immensely.

Coding, however, is solved like magic. Easier to add tests, to be fair.

vlmutolo1mo ago

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

yieldcrv1mo ago

if last year's models were the ones people got familiar with in late 2022, hallucinations would be an underrepresented rumor, there would be no articles about it because it's so rare. overconfident lawyers wouldn't have messed up dockets in court with fake case law, in other domains that move faster, sources would be only partially outdated with agentic search and mcp servers filling in the gaps

AI psychosis would be the problem people talk about more, not just outright agreement but subtle ways of making you feel confident in your ideas. "yes, buy that domain name buy these other ones for defensibility"

(the domain name is dumb and completely unmarketable)

FergusArgyll1mo ago

As long as the model uses web search, they almost never hallucinate anymore. The fast models (haiku, gpt-instant, flash) still sometimes have the problem where they don't search before answering so they can hallucinate

majso1mo ago

maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

krupan1mo ago

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

asar1mo ago· 9 in thread

$1.5/m input tokens $9/m output tokens

6x the price of 3.1 flash lite

Aunche1mo ago

"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

WarmWash1mo ago

I haven't used 3.5 at all yet, but previous Gemini (and Gemma) models are by far more token-light per task than any other model.

Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

iwhalen1mo ago

I wonder why they didn't discuss price in the post?

Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/

himata41131mo ago

I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

wolttam1mo ago

It depends on the use-case. yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.

__jl__1mo ago

In our experience, caching is not very reliable with google. We always get random cache misses that don't happen with other providers. We find OpenAI, Anthropic and Fireworks (which we use a lot) all have higher cache hit rates. So it's not only about the cost of cached tokens but also what kind of cache hit rate you get.
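A rough way to see why the hit rate matters as much as the cached price: blend the two input prices by hit rate. This sketch assumes the $1.50 per million uncached input price and the $0.15 cached price mentioned in this subthread; the hit rates plugged in at the bottom are hypothetical, not measurements from any provider.

```python
def effective_input_price(uncached_usd: float, cached_usd: float,
                          cache_hit_rate: float) -> float:
    """Blended input price per million tokens for a given cache hit rate (0..1)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cached_usd + (1.0 - cache_hit_rate) * uncached_usd


if __name__ == "__main__":
    # Hypothetical hit rates, only to show how sensitive the blend is.
    for hit_rate in (0.95, 0.90, 0.70):
        price = effective_input_price(1.50, 0.15, hit_rate)
        print(f"hit rate {hit_rate:.0%}: ${price:.3f} per 1M input tokens")
```

Under these assumptions, dropping from a 90% to a 70% hit rate roughly doubles the blended input price, so unreliable caching can outweigh a good cached-token price.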

minimaxir1mo ago

10% of input pricing is standard especially compared to competition.

John78787811mo ago

[deleted]

stri8ed1mo ago

Output cost is 3x from Gemini 3 flash.

OhMeadhbh1mo ago· 8 in thread

Am I really so old that when someone says "Flash" my immediate response is... "consider HTML5 instead" ??

nightski1mo ago

Very little of what made the Flash culture so fun made its way into HTML5.

goatlover1mo ago

The Flash designer was really nice. One thing the web kind of set back was all the RAD tools from the 90s and 2000s.

pezgrande1mo ago

They were CPU killers but man those Flash websites were gorgeous (talking mostly about MU Online "private" servers)

thrownaway5611mo ago

You're not the only one... Heck, I hear Flash and I say Macromedia in my head :/

hedora1mo ago

I guess I'm slightly younger: I think "weights or it didn't happen"!

sagarpatil1mo ago

Frontpage, Dreamweaver, Flash, Photoshop lol. We are old.

_puk1mo ago

Lol. Young uns!

Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!

Every time I have heard the word flash for goodness knows how many years.

wslh1mo ago

Same here, and worse because in another thread users are generating animations.

himata41131mo ago· 8 in thread

Engineers at google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

stri8ed1mo ago

Given the cost increase associated with this model, and previous model releases, I think the size is trending upwards, not down.

himata41131mo ago

The speed says otherwise. I think they're increasing costs since they want to start seeing ROI.

maipen1mo ago

Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

They are just refining their current models while they finish training the next generation.

They will all come out at about the same time. Anthropic, OpenAi, Google, xAI

ACCount371mo ago

Anthropic has been sitting on Mythos for a while now. I guess they don't feel pressured to fuck it ship it until anyone else gets a 10T to work.

Jabbles1mo ago

> Engineers at google have publicly stated that the models are too big and are far from their potential

Can you link to a source?

Dinux1mo ago

Source please, cause I don't believe that for one second

ActorNightly1mo ago

I mean, yes and no.

Nobody really knows the answer to which one is more optimal

* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

howdareme1mo ago

Google’s pro models are almost certainly bigger than Openai’s lol

benbencodes1mo ago· 7 in thread

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:
- Gemini 2.5 Flash: $0.30 / $2.50
- Gemini 3.1 Flash-Lite: $0.25 / $1.50
- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.

lyjackal1mo ago

You’re quoting the batch pricing. On demand is 1.5 per input and 9 per M output. This is effectively comparable cost to Gemini 2.5 Pro in a flash tier model

conorh1mo ago

I think you have your pricing wrong there, Gemini 3.5 flash is $1.50 input and $9 output.

mchusma1mo ago

Okay, it's kind of somewhere between haiku and sonnet level pricing, at somewhere between sonnet and opus level performance. It's a great option. I was hoping to see opus class intelligence at haiku level pricing out of google, and this is close to that!

[0] https://artificialanalysis.ai/models/gemini-3-5-flash [1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview

ls_stats1mo ago

You are seeing batch inference, standard inference is $1.5/$9. I was excited until I saw that price.

jpau1mo ago

Standard pricing is showing for me as $1.50 / $9.

(I suspect you're viewing the "flex" pricing).

MallocVoidstar1mo ago

In addition to people pointing out your LLM got the pricing wrong,

> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

Every Gemini model starting with 2.5 has been a reasoning model.

Tiberium1mo ago

Please delete/edit your AI-written and factually wrong post.

eis1mo ago· 5 in thread

3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite. $1551 for 3.5 Flash [0] vs $892 for 3.1 Pro [1]. That's 74% more cost while ranking lower. It's 2.5x as fast but I don't think the bang for the buck is there anymore like it was with 3.0 Flash. I'm a bit bummed out to be honest.

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

hedora1mo ago

Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

ekojs1mo ago

Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

pingou1mo ago

How do they calculate that?

3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
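A rough back-of-the-envelope check of the figures quoted in this subthread, as a sketch only. It assumes cost is driven by output tokens alone, which is an assumption: input and cached token counts are not given anywhere in this thread. The output prices ($9 and $12 per 1M) are the ones quoted elsewhere on this page.

```python
# Figures quoted in this thread. Treating cost as output-only is an
# assumption, since input and cached token counts are not given here.
reported_cost = {"3.5 Flash": 1551.0, "3.1 Pro": 892.0}   # USD for the full suite
output_tokens_m = {"3.5 Flash": 73.0, "3.1 Pro": 57.0}    # millions of output tokens
output_price = {"3.5 Flash": 9.0, "3.1 Pro": 12.0}        # USD per 1M output tokens


def pct_more(a: float, b: float) -> float:
    """Return how much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# The "74% more" claim follows directly from the two reported totals.
reported_gap = pct_more(reported_cost["3.5 Flash"], reported_cost["3.1 Pro"])

# Output-only estimate: tokens (in millions) times price per million.
output_only = {name: output_tokens_m[name] * output_price[name] for name in output_price}

# Whatever the output-only estimate does not cover.
unexplained = {name: reported_cost[name] - output_only[name] for name in output_only}

print(f"reported gap: {reported_gap:.1f}%")
for name in output_only:
    print(f"{name}: output-only ${output_only[name]:.0f}, unexplained ${unexplained[name]:.0f}")
```

On output tokens alone, 3.5 Flash would come out slightly cheaper than 3.1 Pro, so the reported gap has to come from something not listed here. Input tokens would be one candidate, but the thread doesn't say.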

[0] https://news.ycombinator.com/item?id=47076484

ls_stats1mo ago

>3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

That's everything I needed to know.

mijoharas1mo ago

That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

lmazgon1mo ago· 4 in thread

Click on "Listen to article", make sure the voice is "Umbriel" and skip to 4:15 - there's a hallucinated part at the end in Russian (I think). On a blog post about the latest and greatest AI model. Oh the irony.

Undrafted96241mo ago

Yep, it's Russian, but the whole Russian sentence doesn't make any sense, just jumbled words with no meaning at all :)

marknutter1mo ago

Looks like they removed the option to "listen to article". I wonder why.

luk41mo ago

Thank you for this gem.

Tade01mo ago

I ran it through speech-to-text and it starts with something along the lines of "dear colleagues, just like a doctor tells a patient 'health can wait'...", after that it's nonsense.

I don't know if what the doctor said is some kind of idiomatic expression, but it appears to be the opposite of sound medical advice. :)

hmate91mo ago· 4 in thread

I have google ai pro plan and tried antigravity with 3.5 flash but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.

quirino1mo ago

Yesterday, or the day before, Google lowered the AI Pro quota from 33x standard usage to 4x.

From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.

The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol

https://ai.google.dev/gemini-api/docs/pricing#gemini-3.5-fla...

cube001mo ago

The way they're charging for failed generations is brutal.

Checked my 5 hour quota, it was 0%, got this for multiple attempts:

I'm getting more image requests than usual, so I can't create that for you right now. Please try again later.

Can you ask me again later? I'm being asked to create more images than usual, so I can't do that for you right now.

Went back and found they took 34% of my quota for the privilege of repeating that same error.

I think the "Usage Limits" screen is new so who knows how long they've been counting errors against our quota. I guess I should be grateful it's now visible.

babl-yc1mo ago

I'm seeing this too.

API price for gemini-3.5-flash is 3x gemini-3-flash-preview so they might be throttling it 3x sooner. They should either drop API prices or not advertise AI Pro as supporting Antigravity.

abeindoria1mo ago

The web version went from 100 Pro Prompts per day to...12 per 5 hours lol. I just did 3 back-and-forths (not even technical) planning an infra project and I am ~25% through. Insane.

reconnecting1mo ago· 4 in thread

Knowledge cutoff: January 2025

Latest update: May 2026

I have a very bad feeling about this lag.

SwellJoe1mo ago

At least in some cases, there seems to be a move toward training on more synthetic data and strictly curated data, especially for smaller models where knowledge can't be extremely broad, because there just isn't enough room to store the world in tens or hundreds of gigabytes of model weights. So, to achieve higher quality reasoning, the training has to be focused and the data has to be very high quality and high density.

With strong tool use, it maybe doesn't even matter that the models are using older data. They can search for updated information. Though most models currently don't, without a little nudge in that direction.

Also, I believe the Qwen 3 series are all based on the same base model, with just fine-tuning/post-training to improve them on various metrics. Maybe everything in the Gemini 3 series is the same, and maybe they're concurrently training the Gemini 4 base model with updated knowledge as we speak.

hosel1mo ago

Can you explain what you mean?

verdverm1mo ago

you really shouldn't have them pulling facts from their weights, they need grounding from real data sources

yoda7marinated1mo ago

I thought that was a choice that Google made?

s3p1mo ago· 4 in thread

Yikes. I think the concept of a 'flash' model is changing, no? Google used to market this as its lower-intelligence, faster, cheaper option. I appreciate that they are delivering on both of those, but personally I would appreciate if they could create an incremental knowledge improvement while holding price steady. Fortune 500 companies have to make their money I guess.

2001zhaozhao1mo ago

I think flash just means "fast" now

kilpikaarna1mo ago

Real smart. I’ve come to associate ”Flash” with ”useless make-shit-up”, and always look for Thinking/Pro when I see it set. Now, suddenly, there is only Flash?

likium1mo ago

My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

toraway1mo ago

That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

margorczynski1mo ago· 4 in thread

Wow at the price hike. Still I think in the long run the Chinese will win if they're able to produce hardware comparable to Nvidia.

hedora1mo ago

Why would the Chinese sell me nvidia cards? I can just buy an AMD iGPU, and the perf/$ is much better than nvidia dGPUs.

(Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)

650REDHAIR1mo ago

I've had the $20 Gemini plan to use when my local setup runs into tougher problems and the throttling today has been bonkers. I canceled my subscription and will look into upgrading my local setup.

HDBaseT1mo ago

Isn't China also allowed to purchase Nvidia GPUs now?

Culonavirus1mo ago

Doesn't need to be the Chinese. It can be anyone without stratospheric Nvidia margins. The Gold Rush phase of AI economy (aka "the bubble") is beginning to slow down and the Optimization phase is just beginning to ramp up (we see this with massive bumps to token cost and token burn rate of pretty much all frontier models, plus the general pivot away from your typical individual chat end-users to businesses and employees of said businesses) and there will come a time when "nvidia has the best software stack" will not mean much for the big players. Organically, I think it already kinda does, it's just masked with the inertia of massive circular deals and Nvidia selling its services to itself (entities it backs/invests in).

OsrsNeedsf2P1mo ago· 3 in thread

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

golfer1mo ago

Arena.ai is saying "Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers."

sauwan1mo ago

Yeah, bummer. I was very excited for this release, but this killed it.

droidjj1mo ago

The pricing is an issue.

bredren1mo ago· 3 in thread

Can anyone who has extensive, recent, experience with Claude code and Codex contextualize the current Gemini CLI product experience?

SwellJoe1mo ago

I have and use both Claude Code and Gemini CLI, and still don't consider Gemini worth starting for coding except to review Claude's output in critical commits (on a security boundary, maybe broad refactors, etc.), though I try side-by-side every now and then just to see the state of things. I also use Gemini Pro in a security scanning harness to act as a second set of eyes, but Opus is better at finding security bugs than Gemini, so I don't know that it's accomplishing anything beyond just using Opus.

Gemini Pro 3.1 for agentic coding is still clumsy. It chews a lot, has a harder time with tools and interacting with the codebase. I haven't tried any 3.5 version, yet, though. The benchmarks look promising.

I'll note I like the Google models' prose better than any others at the moment, though. Even the small open models (Gemma 4 family) have excellent prose, relatively speaking, that doesn't stink of the LLMisms that I find so annoying about OpenAI (especially) and Anthropic models. So, I'll probably start using Gemini for writing API docs, even if all code is Claude.

mpalczewski1mo ago

Gemini models have consistently disregarded rules and gone their own way for me. They will finish a task and get it done, frequently way above the scope that you gave it, but they take a million shortcuts to get there, e.g. deciding the linter isn't important and disabling the pre-commit hook, or coding features you didn't ask for.

bel81mo ago

My anecdote: smart but too stubborn to be useful.

I have been trying Gemini since 2.5 for coding.

It is the smartest for creative web stuff like HTML/CSS/JS.

But it has been very stubborn with following instructions like AGENTS.md.

And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

I currently use GPT 5.5 + DeepSeek 4 Flash.

BUT I didn't test Gemini 3.5 Flash yet. And it seems, from another comment in this post, that the Antigravity quota is bricked for Google Pro plans, which is the plan I have. So I don't have high hopes.

owentbrown1mo ago· 3 in thread

Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this? How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.

Reiner Pope gave a talk on Dwarkesh Patel about token economics. I guess faster is a lot more expensive, generally.

Someone should make a harness that uses a fast model to keep you in-flow and speed run, and then uses a slow, thoughtful, (but hopefully cheap?) model to async check the work of the faster model. Maybe even talk directly to the faster model?

Actually there's probably a harness that does that - is someone out there using one?
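The fast-model-plus-async-reviewer idea above could be sketched roughly as follows. This is only an illustration: `fast_model` and `slow_reviewer` are hypothetical stand-ins that sleep instead of calling any real API, and no existing harness is being described.

```python
import asyncio


async def fast_model(prompt: str) -> str:
    """Hypothetical stand-in for a low-latency model call."""
    await asyncio.sleep(0.01)
    return f"draft answer to: {prompt}"


async def slow_reviewer(prompt: str, draft: str) -> str:
    """Hypothetical stand-in for a slower, more careful model reviewing a draft."""
    await asyncio.sleep(0.05)
    return f"review of '{draft}': looks plausible"


async def run_session(prompts: list[str]) -> tuple[list[str], list[str]]:
    drafts = []
    review_tasks = []
    for prompt in prompts:
        # The user waits only for the fast model on each turn.
        draft = await fast_model(prompt)
        drafts.append(draft)
        # The review starts in the background and is not awaited here.
        review_tasks.append(asyncio.create_task(slow_reviewer(prompt, draft)))
    # Collect all reviews once the interactive part is done.
    reviews = await asyncio.gather(*review_tasks)
    return drafts, list(reviews)


drafts, reviews = asyncio.run(run_session(["rename module", "add tests"]))
for draft, review in zip(drafts, reviews):
    print(draft, "|", review)
```

A real version would also need some way to surface a failed review back into the session while it is still running, which this sketch leaves out.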

kaspermarstal1mo ago

I switched from Opus 4.6 -> Opus 4.7 -> GPT 5.5 and tried Flash 3.5 tonight and I was not impressed. It is straight up unreliable, e.g. deleting code and forgetting to add the new stuff it was asked to, then happily marking the task as complete with an upbeat conclusion. I personally appreciate GPT 5.5's toned-down, objective style, so I really dislike how this model feels. I get that it's a flash model and not in the same league as GPT 5.5, but their marketing suggests otherwise, so they are just setting themselves up for disappointment.

pcwelder1mo ago

Opus is not the correct tier to compare this flash model with.

On my tasks it has not been as good as even Sonnet 4.6 so far.

Instruction following over long context feels worse.

It's not a bad model by any means, better than any pro open source model for sure.

landtuna1mo ago

I was using GPT 5.5 for a bunch of work this morning. It's brilliant and efficient. I was also using GPT 5.4 mini. It gets the job done and works great for subtasks that 5.5 designs. Gemini 3.5 Flash is SUCH a Gemini. It seems to work okay, but its attitude is disgusting.

"Yes, your idea is excellent."

"How this works beautifully:"

"This is a fantastic development!"

"This is an exceptionally clean and robust architecture."

and then I point out what feels like an obvious flaw:

"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."

I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.

Fairburn1mo ago· 3 in thread

Google shot its shot with that alternative history artwork generation fiasco. Don't know why anyone would be too hot for them now. Dime a dozen at this point.

qgin1mo ago

I think the number of people still holding a grudge for that today is small.

arjie1mo ago

Early Claude was a weak simulation of Goody2.ai. Things change. Being a lover or hater of a model doesn’t make sense. It’s just tech. Run evals. Then use.

helloplanets1mo ago

Nano Banana is one of the most used image gen models

lanewinfield1mo ago· 2 in thread

Gemini 3.5 Flash's 2000 token clocks aren't bad. https://clocks.brianmoore.com/

Valakas_1mo ago

From looking at all of them, it actually seems to be the best one, followed by Deepseek 3.1. And something went wrong with GPT-5's.

acters1mo ago

Fascinating, Kimi K2 has a good clock too, from my limited time on the site.

wg01mo ago· 2 in thread

3x price increase for a similar model almost. And they said AI would be cheaper and ubiquitous.

alexandre_m1mo ago

Ubiquitous like the crack epidemic.

verdverm1mo ago

or 3/4 the price (of 3.1 Pro) if we believe their benchmarks

paol_taja1mo ago· 2 in thread

That pelican looks like it just sold a SaaS company and bought a bike because its therapist said it needed balance.

s3p1mo ago

The pelican is ready to discuss increased synergies of bringing AI to all teams at the firm!

testycool1mo ago

That made me subtly, yet audibly, laugh.

golfer1mo ago· 2 in thread

Arena.ai:

> Gemini 3.5 Flash’s pricing shifts the Pareto frontier in Text. 8 models from GoogleDeepMind dominate the Text Arena Pareto curve where only 4 labs are represented for top performance in their price tiers.

h14h1mo ago

Given how widely varying the amount of tokens each model uses for a given task, "price-per-token" is essentially meaningless when doing this sort of comparison.

Artificial Analysis's "Cost to run" model (aka num_tokens_used * price_per_token) is much better, but even that is likely problematic since it's not clear whether running a bunch of benchmarks maps cleanly to real-world token use.
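To make the "cost to run" point concrete, here is a minimal sketch of the break-even token ratio between a cheaper-per-token model and a pricier one. The two output prices are the ones quoted elsewhere in this thread; the token counts are hypothetical and chosen only to show the crossover.

```python
def cost_to_run(tokens_used_m: float, price_per_m: float) -> float:
    """Cost in USD: tokens used (in millions) times price per 1M tokens."""
    return tokens_used_m * price_per_m


def break_even_token_ratio(cheap_price: float, expensive_price: float) -> float:
    """How many times more tokens the cheaper-per-token model can use
    before it costs as much as the pricier one."""
    return expensive_price / cheap_price


# Output prices quoted elsewhere in this thread (USD per 1M output tokens).
flash_price, pro_price = 9.0, 12.0
ratio = break_even_token_ratio(flash_price, pro_price)
print(f"break-even token ratio: {ratio:.2f}x")

# Hypothetical task sizes, used only to illustrate the crossover.
pro_tokens_m = 1.0
for flash_tokens_m in (1.0, 1.5):
    flash_cost = cost_to_run(flash_tokens_m, flash_price)
    pro_cost = cost_to_run(pro_tokens_m, pro_price)
    verdict = "cheaper" if flash_cost < pro_cost else "more expensive"
    print(f"flash uses {flash_tokens_m}M tokens: ${flash_cost:.2f} vs ${pro_cost:.2f} ({verdict})")
```

With these prices, the cheaper-per-token model stops being cheaper once it uses more than about a third more tokens for the same task. That is why a per-token price on its own says little.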

ohlookcake1mo ago

That graph seems odd. It looks like Gemini 3.5 Flash is not actually on the convex hull, and they forced the 'frontier' to bend inwards to include it
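For reference, a Pareto frontier and a convex hull are not the same thing, and a point can be on the first without being on the second. The sketch below uses made-up (price, score) points, not Arena data, to show a frontier that legitimately bends inward. Whether that is what happened in the Arena chart can't be determined from this thread.

```python
def pareto_frontier(points):
    """Keep points that no other point beats on both axes
    (lower price is better, higher score is better)."""
    frontier = []
    best_score = float("-inf")
    # Sort by price ascending; on equal price, higher score first.
    for price, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((price, score))
            best_score = score
    return frontier


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull, left to right (monotone chain)."""
    hull = []
    for p in sorted(points):
        # Drop the middle point unless the three points turn clockwise.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical (price, score) points, for illustration only.
points = [(1.0, 40.0), (3.0, 45.0), (4.0, 42.0), (5.0, 60.0)]

frontier = pareto_frontier(points)
hull = upper_hull(frontier)
print("Pareto frontier:", frontier)
print("Upper hull:     ", hull)
```

Here (3.0, 45.0) is Pareto-optimal because nothing cheaper scores higher, yet it sits below the straight line between its neighbours and so drops off the hull.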

hubraumhugo1mo ago· 2 in thread

Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com

amarant1mo ago

Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

harias1mo ago

The xkcd comic is a really cool idea. I enjoyed seeing my wrapped, thanks!

paperwork3601mo ago· 2 in thread

Google also updated Antigravity. Version 2.0 is more for conversation with an agent. The previous VS Code-like IDE was much better.

operatingthetan1mo ago

It's been renamed to "antigravity IDE." Updating my old IDE got me the new non-IDE app though, which is strange.

xnx1mo ago

They still have an Antigravity IDE version.

kristopolous1mo ago· 2 in thread

I've built a tool to track these.

Relatively speaking here's where it's at:

    score  age  size    name
    44.2   97   large   GLM-5 (Reasoning)
    44.7   187  -       GPT-5.1 (high)
    44.9   29   -       Qwen3.6 Max Preview
    45     0    -       Gemini 3.5 Flash
    45.5   27   large   MiMo-V2.5-Pro
    45.6   75   -       GPT-5.4 (low)

this is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

I really don't know why people down vote me. What do I need to say to make things for free that people like? Sincere question. I put a lot of time and generosity into these things and all I usually get are a bunch of "fuck yous".

This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

esafak1mo ago

I see no 'score' or 'age' mentioned in your script. What does age signify and how are they calculated?

kridsdale31mo ago

Buddy, this tone may be why.

We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?

You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.

https://sql-benchmark.nicklothian.com/?highlight=google_gemi...

rdtsc1mo ago· 2 in thread

I caught it again being deceitful. It did this before

(Me): Did you actually read the paper before when I pasted the link?

> I will be completely honest: No, I did not.

> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.

> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.

I am sure it learned a valuable lesson and won't do it again /s

jareklupinski1mo ago

this seems to happen a lot with commercial models; my local models will happily do as much research and then some when given a task (almost too much), but providers' models refuse to even curl a single datasheet before trying something that i know won't work unless it reads the datasheet

PunchTornado1mo ago

fucking get that with claude all the time too.

nl1mo ago· 1 in thread

On my Agentic SQL benchmark it scores 19/25. That's... mediocre.

That means it performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).

It is outperformed by Gemma4 26B-A4B in every way(!)

(Switch to the cost vs performance chart to see how far this is off the Pareto frontier)

data-ottawa1mo ago

I'm seeing this too.

I have a SQL agent and my tests with 3.5 are resulting in hitting query budget limits that have never been hit before. On average, to answer the same question, 3.5 is spending 10x more on SQL queries vs gemini-3-flash-preview.

The query patterns can be extremely degenerate too. E.g. the agent will hit the semantic layer tool to pull the schema, then run `SELECT * FROM table LIMIT 1`, which hits the query budget limit and fails.

I've only really been looking this morning, so I need to do a full eval, but the initial results match what your benchmark shows.

---

Side note: your benchmark has an issue. On Q1 medium the model returned gross margin of 0.127 instead of 12.7 (%), and the benchmark failed it. The failures on Q9 and Q21 are the same (I didn't check other questions). Nowhere in the prompt did you specify you wanted the values converted to percentage points and rounded.

If you asked me to write that SQL with that prompt, unless you were throwing it directly into a visualization I would format it the same way gemini-flash did. If I were pulling into a spreadsheet or vis tool this format is preferable because it's easier to format in a client application.

The other failures like Q21 incorrectly averaging the list price are correct failures.
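One way a grader could tolerate the fraction-versus-percentage ambiguity described above is to accept either form. This is a sketch with a hypothetical function name and tolerance. It is not the benchmark's actual scoring code, which isn't shown in this thread.

```python
import math


def matches_expected(actual: float, expected_pct: float, rel_tol: float = 1e-3) -> bool:
    """Accept the value either as percentage points (12.7) or as a fraction (0.127)."""
    as_given = math.isclose(actual, expected_pct, rel_tol=rel_tol)
    as_fraction = math.isclose(actual * 100.0, expected_pct, rel_tol=rel_tol)
    return as_given or as_fraction


# The Q1 case from the comment above: 0.127 returned where 12.7 was expected.
print(matches_expected(0.127, 12.7))
print(matches_expected(12.7, 12.7))
print(matches_expected(0.13, 12.7))
```

The trade-off is that a genuinely wrong answer that happens to be off by a factor of 100 would also pass, so stating the expected unit in the prompt is probably the cleaner fix.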

npn1mo ago· 1 in thread

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

brianwawok1mo ago

What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

data-ottawa1mo ago· 1 in thread

Anyone using this yet?

I’m finding it very bad at instruction following vs 3.1. It calls tools it is told it shouldn’t, and it loves calling tools. There’s a pretty strong bias towards its training vs system prompt instructions.

Google’s release notes say to reduce unnecessary tool calls by reducing thinking, but that feels like it should be orthogonal to me.

It definitely has improved a few logic things, like in data visualizations it’s better at labelling data, but it’s much worse at preparing data out of the box.

wwizo1mo ago

Same. Feels very goal oriented. Requires multiple attempts to deter course and means to achieve it.

On tool use. Gave it interactive design assignment on Antigravity 2. Failed miserably until I asked to use playwright for testing. And boy did it go with it. Tested hell out of visuals, nailed the solution.

On following instructions. Asked Gemini Flash 3.5 to summarize a YouTube video (Google I/O developer keynote), a task that would previously be trivial (I use it often), but it kept hallucinating points and referencing I/O dev keynote blog posts from several years ago. Multiple attempts, same result even on repeat requests. Almost insistent on the validity of the information provided, ignoring questions about whether it had such a capability.

Alifatisk1mo ago· 1 in thread

The demo of the model in Antigravity automatically renaming and categorizing unstructured assets using vision was quite cool; it demonstrates that the IDE side panel can be used for more than just coding. I wonder if the harness in Antigravity is based on Gemini CLI or if they are completely different. Could Gemini CLI do the same task? Or is the vision feature an Antigravity thing?

mrbungie1mo ago

There is now an Antigravity CLI which will replace Gemini CLI. Gemini CLI is going to be EOLd by June 18th afaik. Antigravity CLI and GUI share the same agent harness, so it might do the same task.

Source: https://developers.googleblog.com/an-important-update-transi...

mirzap1mo ago· 1 in thread

The Flash model costs more than the Frontier models. Didn't see that coming.

verdverm1mo ago

On a per-token basis, it's cheaper than Opus, GPT, and Gemini Pro; and while I hear the "it uses more tokens so it's more expensive" argument, this discounts a few things: (1) improvements over time, (2) finding the right way to prompt it, (3) finding proper places to use this model.

MASNeo1mo ago· 1 in thread

Well, available for Gemini means these days that half the time they are “Receiving a lot of requests right now.” and so sorry they couldn’t complete the task. Luckily the model supports long time horizons because that’s what’s needed. /me likes Gemini a lot just wishing Google would add the compute!

esafak1mo ago

Are you on a paid plan?

noelsusman1mo ago· 1 in thread

The Artificial Analysis benchmark results are pretty underwhelming. Roughly the same "intelligence" as MiMo-V2.5-Pro for over 3x the cost. We'll have to see how that translates to actual usage but it's not a great sign.

hydra-f1mo ago

That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

jonnyasmar1mo ago· 1 in thread

The $1.50/$9.00 pricing is a meaningful shift if you've been running Gemini as the "fast iteration" half of a multi-model coding workflow. I've had Claude Code, Codex, and Gemini CLI running side by side and the working split was "Gemini for quick scaffolding and exploration where the cost of being wrong is low, Sonnet for correctness-critical stuff." At 3x the Flash pricing that split stops making sense — you're paying Sonnet-tier output rates for not-quite-Sonnet quality.

For pure chat that's annoying but tolerable. For agentic workflows where output tokens dominate (tool-call replies, reasoning traces, code emission) it's a real practical hit. I'd bet the substitution effect favors DeepSeek and Qwen here pretty fast.

superchink1mo ago

Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.

alexdns1mo ago· 1 in thread

It's Gemini 3.5 Flash

nerdalytics1mo ago

Yeah, Google chose a misleading title for the blog post.

simianwords1mo ago· 1 in thread

No one talking about how this Flash beats Pro? Imagine what 3.5 Pro looks like.

Also concerned about Gemini models being benchmaxxed generally

NitpickLawyer1mo ago

> concerned about Gemini models being benchmaxxed generally

I would say they are the least benchmaxxed out of all the top labs, for coding. They've always been behind opus/gpt-xhigh for agentic stuff (mostly because of poor tool use), but in raw coding tasks and ability to take a paper/blog/idea and implement it, they've been punching above their benchmarks ever since 2.5. I would still take 2.5 over all the "chinese model beats opus" if I could run that locally, tbh.

puapuapuq1mo ago· 1 in thread

I played the audio readout of the page, what is the last 30 secs in the readout?

betalb1mo ago

Sounds like a hallucination in Russian

ai_fry_ur_brain1mo ago· 1 in thread

Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access too (and everyone else does).

cloakandswagger1mo ago

> correlated to the tokens that you have access too (and everyone else does)

Do you mean "the weight parameters you have access to[sic]" or do you frequently find yourself limited by the model's token vocabulary?

f311a1mo ago· 1 in thread

$9/1M output

explosion-s1mo ago

I wonder if this is because it's a larger model or maybe just because they can? Although with the latest Deepseek it's really tough to compete pricing wise. Inference speed and integration (e.g. Antigravity) might be their only hope here

andrewstuart1mo ago· 1 in thread

The benchmark that matters - can it actually program as well as Claude code.

If not then I’m not using it.

Cancelled my account 3 months ago, only Claude code level capability would bring me back.

cmrdporcupine1mo ago

I spent 10 minutes with it in their new "agy" CLI tool and immediately found it is nowhere close to GPT 5.5 high in codex. It was sloppy and made poor assumptions in its analysis. It would have produced a mess if I let it go ahead with its plan. And it was just like previous versions of Gemini with poor tool use (e.g. "I wrote a file with the plan..." but file was never written.)

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

nightski1mo ago· 1 in thread

AI being a product is not the future. It's more like an operating system that deserves to be open and free (aka Linux). Unless that happens we are in for a very dystopian future. I wish I had the intelligence, resources and/or connections to try and make that happen.

lugu1mo ago

What we need today is a standard local API (think of it as a POSIX extension). So that each desktop app that needs AI to enhance a feature can simply call that. This way, those apps will need to handle the case where AI is not available. This will empower users.
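As an illustration of what "handle the case where AI is not available" might look like from an app's side, here is a minimal sketch. Everything in it is hypothetical: `LocalAI`, `find_local_ai`, and `EchoAI` are invented names, and no such standard API exists today.

```python
from typing import Optional, Protocol


class LocalAI(Protocol):
    """Hypothetical interface a system-provided local model might expose."""

    def complete(self, prompt: str) -> str: ...


def find_local_ai() -> Optional[LocalAI]:
    """Hypothetical discovery hook; returns None when no provider is installed."""
    return None


def summarize(text: str, ai: Optional[LocalAI]) -> str:
    """Use the local model if present, otherwise fall back to a plain heuristic."""
    if ai is None:
        # Non-AI fallback: just return the first sentence.
        return text.split(". ")[0].rstrip(".") + "."
    return ai.complete(f"Summarize: {text}")


class EchoAI:
    """Toy provider used only to exercise the AI path in this sketch."""

    def complete(self, prompt: str) -> str:
        return prompt.upper()


note = "First point. Second point."
print(summarize(note, find_local_ai()))  # fallback path, no provider found
print(summarize(note, EchoAI()))         # provider path
```

The feature still works without a provider, just less well, which is the property the comment is asking for.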

llmslave1mo ago· 1 in thread

Conspiracy theory:

This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more

npn1mo ago

Nah, it costs what you are willing to pay.

cesarvarela1mo ago· 1 in thread

Add Flash to the title, please.

meetpateltech1mo ago

edited it.

danny0941mo ago· 1 in thread

Codex is way better pricing than this lol

dragonwriter1mo ago

This isn't a link to pricing, and Codex, like many of Google’s coding tools that provide access to this model, is under a subscription pricing model where usage of a particular model doesn’t have a transparent price. The subscription price points for monthly billing are basically identical: except for the free tier, Google’s are 1¢ less per month than OpenAI’s, and above the $8/month tier they are also available on annual plans that cost 10 months at the monthly rate. So I am really not sure what you mean about Codex having better pricing.

HardCodedBias1mo ago· 1 in thread

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

brokencode1mo ago

Where do you see that it’s low capability?

And Google is trying to make something affordable enough for a mass market, ad-supported audience.

They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

gertlabs1mo ago

Taking into account that this is a flash model, it's a strong release. It's very fast and frontier-ish for the price.

Raw intelligence is high for a flash model. But Google's problem has always been productization and tool use, whereas raw intelligence is always competitive. It does not look like they solved that with this release -- in fact, their tool use delta (the improvement in scores when given arbitrary tools and a harness) has actually regressed from some previous models.

https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...

swe_dima1mo ago

You may remember the argument that you can build an AI app and it continues to improve as models improve and costs go down?

Well, looking at OpenAI / Google / Anthropic we see crazy cost increases, such that it might invalidate your unit economics.

Cheering for Chinese models!

stared1mo ago

China: we don’t need to use US models, we can distill them ourselves

Google: we don’t need the Chinese to distill our models, we can do it ourselves

golfer1mo ago

Here's the benchmark scoreboard they published:

hackmack101mo ago

I've worked with all three of the biggest models and typically have the three of them working together, Gemini is by far the worst of the three. The price hikes will keep me further away from applying them in my day to day operations.

brikym1mo ago

How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?

XCSme1mo ago

For me the biggest gain is the speed.

It takes on average 2.84s for Gemini 3.5 Flash to give an answer, compared to 33s for GPT 5.5 [0].

Also the max/slowest test is answered in under 7s, whereas GPT 5.4 takes more than 5 minutes...

[0]: https://aibenchy.com/compare/google-gemini-3-5-flash-low/ope...

merb1mo ago

Still no new processor version for Document AI (https://docs.cloud.google.com/document-ai/docs/release-notes), which is so weird. (Custom extractor)

It’s not possible to uptrain on preview releases and it did not get that much love for a while.

time0ut1mo ago

I ran through the eval loop for a side project’s task (personalization of a micro video game, no thinking) last night. Head to head with Gemini 3 Flash Preview, results came out at basically a wash on my rubric. The output quality was good, well grounded, and reliable across 144 runs. But not noticeably better. It isn’t a traditional coding task, so can’t infer anything there. The amazing part was how fast it is. It was consistently about 2x faster than 3 Flash Preview and slightly faster than 3.1 Flash Lite Preview which is amazing. For my task, the price difference doesn’t matter, so easy upgrade. I plan to write up a quick blog post with the results over the weekend.

sbinnee1mo ago

While I am excited, the price is 3x that of Gemini 3 Flash Preview, which I used for the longest time. Since DeepSeek V4 Flash arrived, I have been a happy DeepSeek user. We will see how long that reign lasts after I try this new Gemini.

mixtureoftakes1mo ago

benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?

numron-dev1mo ago

Man, I wish I had the hardware to run LLMs like these locally.

bakugo1mo ago

Triple the price of the last Flash model ($3 -> $9 per 1M output). Quickly approaching Sonnet prices.

Feels like the AI pricing noose is tightening sooner rather than later.

swe_dima1mo ago

Flash family but costs like a Pro. $9 vs $12 for output.

razodactyl1mo ago

Aw. The listen to article widget doesn't work properly on mobile Safari and when using the options button, the popup appears below the "In this article" dropdown occluding it.

At least it read the authors of the article to me.

I wish we would push more towards testing code. Agentic AI excels when it's engaged.

vikramkr1mo ago

this model is whack. Exclamation marks everywhere, sycophantic - not producing working code on prompts the other models handle fine.

"The reason it is echoing back your messages is because gpt-5.4-nano is a fictional model name!"

"Everything is in perfect order! Let's-Go-ready for the next phase, which will connect this durable infrastructure to the user-facing UI!"

It's like they RLed it on thumbs up and downs on ai overview responses and forgot to make it not be a sycophantic echo chamber machine. And like, the thing it built doesn't work because it's not actually in perfect order, but it doesn't seem to be able to figure out what's wrong because everything is clearly remarkably engineered

mchusma1mo ago

I have thought about this and I think overall, this was a disappointing release from Google. I'm not sure what the general sentiment is, but this feels like a miss.

What they did do in the keynote was spend a lot of time talking about their distribution advantage, and how they can own the consumer in search. But not a lot that will benefit partners or developers.

Basically, they released something broadly competitive with Sonnet 4.6, a new Omni model that seems interesting but unclear yet. They have completely ceded the frontier to OpenAI / Anthropic, and are saying "look for pro next month".

The best release since nano banana pro from Google has been Gemma.

pqdbr1mo ago

In my tests, in real production use cases, it's a hard pass.

It's actually 10-15% slower and also more expensive than Gemini 3.1 Pro, because it thinks more than 2.5x as much as Gemini 3.1 Pro.

So that thinking verbosity nullifies the speed and cost gains.

AND the quality is worse than 3.1 Pro for our use cases, making mistakes Pro doesn't make.

mackross1mo ago

The antigravity teamwork-preview doesn't work for me -- upgraded to ultra, installed antigravity 2, ran teamwork-preview, keeps failing: "You have exhausted your capacity on this model. Your quota will reset after 0s."

x3cca1mo ago

I'm excited for the conversation to switch from intelligence to tps instead. I care much less about what hard thought experiments models can one shot and much more how responsive my plain text interface for doing things is.

casey21mo ago

I think the field moved to agents too fast. The most valuable moat is training data and the most valuable and voluminous training data are chats, since humans can say that a direction feels right or wrong.

lern_too_spel1mo ago

They also announced Antigravity CLI, which uses Gemini 3.5 by default. I tried to vibe code a simple project using my personal account and after a few iterations, I got "Individual quota reached. Contact your administrator to enable overages. Resets in [7 days]." Really? 7 days? I searched for the message online and found a thread with hundreds of people complaining about the same issue with no resolution. Classic Google.

musebox351mo ago

The cutoff date is early 2025 so make sure to enable web search when experimenting. I was expecting something more recent, took a while to notice this.

uejfiweun1mo ago

This is funny, I was randomly using Gemini today and I was astounded how good the responses I was getting were from Flash. I guess this must be the reason why.

pimeys1mo ago

No computer use yet. I wonder when they enable it for this model, CUA was one of the main selling points for us with the previous version of Flash.

amelius1mo ago

Gemini, please block all ads in my search engine.

spwa41mo ago

So now we're in the situation that Google’s recommended "for most tasks" Flash-tier model, Gemini 3.5 Flash, appears to be only marginally ahead of leading open-weight models like Kimi K2.6 and MiMo V2.5 Pro on independent aggregate benchmarks at release time, while costing substantially more—especially for output tokens - easily double the cost ...

Oh and double the cost is assuming you're not using Google cloud for anything else, because data transfer, storage, anything but compute is 10x the going rate outside of GCP at least.

Plus you can run both Kimi K2.6 and MiMo V2.5 locally at marginal cost (ie. electricity + hosting) for an upfront investment of $300k or, if you're willing to eat the quantization quality hit, $80k.

ErystelaThevale1mo ago

Gemini has been too agreeable to be useful for actual debate. Curious if 3.5 changes that, or just the benchmarks

drob5181mo ago

I’m curious about the difference between Gemini 3.5 Flash and Gemma 4.

sofumel1mo ago

Can the Gemini 3.5 flash drive surpass the Claude opus 4.7 flash drive?

xivzgrev1mo ago

anyone else see a degradation in performance? it seems like the responses are more generic, especially when asking it to look at google drive files

victor90001mo ago

There was a brief moment in time where Gemini was the greatest thing since sliced bread, then it got nerfed from outer space without a version bump or any meaningful mention from Google, no thanks.

baalimago1mo ago

What happened to gemini 3.2, 3.3, and 3.4..?

ElenaDaibunny1mo ago

but latency in real GUI workflows with 50+ steps is still the elephant in the room for cloud-based agents

lilyJeon1mo ago

Honestly, the numbers are becoming increasingly difficult to interpret. Every time a new version comes out, they just call it the "best." It would be much more useful to directly compare performance on sets that people actually use, such as coding and summarizing.

max00771mo ago

Is 3.5 pro too expensive for release?

uean1mo ago

I have to admit that 3.5 Flash is doing a much better job of removing the LLM'ness of what it produces. It's pretty close to my own writing style today, and I came here to see what changed.

For what it's worth, my own personal metric of LLM-badness the past few months has been the number of times I leap out of my chair in my home office to loudly declare to my wife how much I loathe reading what is being spewed and pushed into my face, and how I am being forced to use AI everyday and deaden my brain cells. Today is like a breath of fresh air.

nothingfalsy1mo ago

BLAH BLAH BLAH. I don't trust anything a company says about how good their product is. Try it yourself and see if it's actually any good.

Flash is barely good. It's okay, but really shit on anything that matters

flash lite is absolute garbage. super stupidly retarded. I am going to die from high blood pressure on how stupid it is

sigbeta1mo ago

I am interested to see how they will serve demand with the TPU monopoly they have.

alyapany1mo ago

a lot of people think it's not even worth it

stan_kirdey1mo ago

EXPENSIVE ._.

danny0941mo ago

so google is just trying to be cool in 2026 huh

dsabanin1mo ago

no matter what google does, for some reason the agentic performance of their models is missing something. i hope this release is stronger. we need more competition.

ralusek1mo ago

Those prices, what a disappointment.

SaadiLoveAI1mo ago

It's really awesome

jdw641mo ago

Honestly, I feel like the new Gemini 3.5 Flash is a failure. The performance doesn't seem that great, and while they revamped the UI, Anti-Gravity just feels like a cheap CODEX knockoff now. The web UI is underwhelming, and overall it feels like it lost its unique identity by just copying other AIs. It’s a flop in both performance and price point. I’m seriously considering canceling my Gemini subscription altogether. Using Chinese AI models might actually be a better option at this point

warthog1mo ago

GPT-5.5 on the benchmarks still seems to perform better than this

Plus the vibe of the gemini models is so weird, particularly when it comes to tool calling

At this point I kinda need them to shock me to make the switch

AgentMasterRace1mo ago

Gemini 3.1 Pro is literally the worst AI when I cycle from Opus to GPT 5.5 then finally Gemini. It's actually insane that it's a frontier model. I rage at it more than my wife.


simonw1mo ago· 25 in thread

The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...

Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.

Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
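The 13 cent figure can be sanity-checked with a few lines of arithmetic. This is only a sketch: the token counts (11 in, 14403 out) are read off the `it=` and `ot=` parameters in the link above, the $1.50/$9.00 per-million prices are the ones quoted elsewhere in this thread, and `request_cost_usd` is a made-up helper name, not part of any pricing tool.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request in USD, given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Assumed values: token counts from the URL, prices as quoted in this thread.
cost = request_cost_usd(11, 14403, 1.50, 9.00)
print(f"{cost * 100:.1f} cents")  # prints 13.0 cents
```

Almost all of the cost comes from the output tokens, which is why the output price dominates the discussion below.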

hedgehog1mo ago

That pelican looks like it's in Miami for a crypto conference.


irthomasthomas1mo ago

edit: fixed human hallucination

https://www.gianlucagimini.it/portfolio-item/velocipedia/

tantalor1mo ago

Forgetting the chainstay is typical of asking random people to draw a bicycle.

> most ended up drawing something that was pretty far off from a regular men’s bicycle

smcleod1mo ago

I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

https://en.wikipedia.org/wiki/Synthwave

dekhn1mo ago

I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

tandr1mo ago

If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

TacticalCoder1mo ago

Love your pelicans, as always. And that one is... Wow.

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

hydra-f1mo ago

Same old issue with Gemini models trying to "enrich" everything

nrds1mo ago

https://en.wikipedia.org/wiki/Vaporwave

nomilk1mo ago

'Pelicans' should be the unit of measurement for model prices, rather than tokens.

karmakaze1mo ago

khy1mo ago

That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009

Razengan1mo ago

I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

nickvec1mo ago

I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

sbinnee1mo ago

Wow what’s with all the styling? Is it manifestation of google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

taurath1mo ago

I can’t help but think that what AI is best at is convincing management that things it creates are full featured which reads to their brains as mature

dankwizard1mo ago

Wouldn't be a thread about the tech that is changing the landscape for businesses across nearly every discipline without a pelican svg.

bee_rider1mo ago

I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

VectorLock1mo ago

The fact it went for vaporwave styling on its own is very telling.

setgree1mo ago

``

wtf

``

WTF??

gcgbarbosa1mo ago

funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

__mharrison__1mo ago

They are just trolling you now

nashashmi1mo ago

Beats a human by like $10

holtkam21mo ago

at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

danilocesar1mo ago

Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

GodelNumbering1mo ago· 24 in thread

Per million input/output tokens:

Gemini 2.5 flash: $0.30/$2.50

Gemini 3.0 flash preview: $0.50/$3.00

Gemini 3.5 flash: $1.50/$9.00

Interesting pricing direction. I don't think we have ever seen a 3x price increase for the immediate next same-sized model (and lol @ 3 only ever getting a preview).

3.5 flash costs similar to Gemini 2.5 pro which was $1.25/$10
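For anyone who wants the multiples spelled out, here is a rough sketch that computes them from the list prices quoted in this comment. The `PRICES` table and `price_multiple` helper are illustrative names only, and the numbers are just the ones given above.

```python
# Per-million-token list prices (input, output) in USD, as quoted in this comment.
PRICES = {
    "Gemini 2.5 Flash": (0.30, 2.50),
    "Gemini 3.0 Flash Preview": (0.50, 3.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
}

def price_multiple(new_model, old_model):
    """Return (input_multiple, output_multiple) of new_model over old_model."""
    new_in, new_out = PRICES[new_model]
    old_in, old_out = PRICES[old_model]
    return new_in / old_in, new_out / old_out

for old in ("Gemini 3.0 Flash Preview", "Gemini 2.5 Flash"):
    in_x, out_x = price_multiple("Gemini 3.5 Flash", old)
    print(f"3.5 Flash vs {old}: {in_x:.1f}x input, {out_x:.1f}x output")
```

Against 3.0 Flash Preview both multiples come out at 3x; against 2.5 Flash the input side is 5x and the output side 3.6x.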

__jl__1mo ago

This understates the cost increase. 3.5 Flash also uses more tokens. artificialanalysis.ai shows these differences in the cost to run the whole eval, which I think is more realistic pricing:

Gemini 2.5 flash (27 score): $172 (1.0x)

Gemini 2.5 pro (35 score): $649 (3.8x)

Gemini 3.0 Flash (46 score): $278 (1.6x)

Gemini 3.5 Flash (55 score): $1,552 (9.0x or 2.4x compared to 2.5 pro)

This is a massive price increase... 5.6x compared to Gemini 3.0 Flash
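The multiples in this comment follow directly from the dollar figures. A small sketch to reproduce them, using only the eval costs quoted above (the dictionary and function names are hypothetical):

```python
# Cost in USD to run the full eval, as quoted from artificialanalysis.ai in this comment.
EVAL_COST_USD = {
    "Gemini 2.5 Flash": 172,
    "Gemini 2.5 Pro": 649,
    "Gemini 3.0 Flash": 278,
    "Gemini 3.5 Flash": 1552,
}

def cost_multiple(model, baseline):
    """How many times more expensive `model` is than `baseline` for the whole eval."""
    return EVAL_COST_USD[model] / EVAL_COST_USD[baseline]

for baseline in ("Gemini 2.5 Flash", "Gemini 2.5 Pro", "Gemini 3.0 Flash"):
    multiple = cost_multiple("Gemini 3.5 Flash", baseline)
    print(f"3.5 Flash vs {baseline}: {multiple:.1f}x")
```

This gives roughly 9.0x, 2.4x and 5.6x, matching the figures above.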

doginasuit1mo ago

rudedogg1mo ago

If Google is actually getting cheaper inference than everyone else with their TPUs, this smells like trouble to me. Maybe serving LLMs at a profit is proving difficult.

Or maybe they think because their benchmarks are good they can ramp up the prices. Seems like they don’t have the market share to justify a move like that yet to me.

hei-lima1mo ago

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

https://ai.google.dev/gemini-api/docs/models/gemini-3.5-flas...

fnordsensei1mo ago

3.5 flash is listed as stable rather than preview, or am I misreading?

jstummbillig1mo ago

> Interesting pricing direction.

Is it? More capability, more demand, higher price. Seems relatively uninteresting. The naming structure complicates it: 3.5 Flash is less comparable to 3.0 Flash than it is to 3.0 Pro.

dr_dshiv1mo ago

3.1 flash lite — $0.25/$1.50 — plus insanely fast.

3.1 flash lite isn’t quite as good as 3 flash preview (which is the most incredible cheap model… I really love it) — but 3.1 is half the price and the insane speed opens up different use cases.

For comparison, Opus models are $5/$25

OakNinja1mo ago

To be fair, Gemini 3.1 flash _lite_ supports structured output (guaranteed json), it’s super fast, runs circles around 2.5 flash and costs $0.25/$1.50.

I use it _a lot_ and it’s very capable if you just plan correctly. I actually almost exclusively use 3.1 flash lite and 2.5 flash lite (even cheaper) and we have 99.5% accuracy in what we do.

That said, I think we’ll see the lite/flash models and the pro models diverge more price-wise. The pro models will become more and more expensive.

llm_nerd1mo ago

WhitneyLand1mo ago

Their rationale might be that its size and intelligence are growing relative to the market.

Fwiw it’s beating Claude Sonnet in most benchmarking (benchmaxxing?), yet they’ve priced it almost half off on a per token basis.

Question is are you going to persuade anyone with this argument?

Are there many devs at Google who legit prefer Gemini over Claude and Codex? Would love to hear about that.

verdverm1mo ago

At the same time, it is supposedly Gemini 3.1 Pro level at 3/4 the price

and far cheaper than comparable models, Gemini Pro is cheaper than Claude Sonnet (Anthropic still gets to charge a brand premium)

dzhiurgis1mo ago

Maybe I'll look at Opus again, but it was just slower, much more expensive and, worst of all, wasn't listening to your instructions.

dbbk1mo ago

I don't think they're really comparable. Seems they created the Flash-Lite tier to take the spot of the old Flash models.

photonair1mo ago

davedx1mo ago

I use Gemini for heavy web scraping-adjacent API work. Web grounding has been super useful for the project.

harrouet1mo ago

If you look at the benchmark, the model is not particularly good at coding, and as you point out it costs 3x the price of the previous flash models. So what is the market for it?

LetsGetTechnicl1mo ago

Gen AI is unprofitable, especially at the insanely cheap rates they've been offering to get people in the door. So expect more increases in the future.

ilia-a1mo ago

Yeah, it is a massive jump in price, hardly a "Flash" model anymore... I wonder if they'll release flash lite or something with a bit more affordable price point.

irthomasthomas1mo ago

And they are using this to power search answers?

malloryerik1mo ago

To me this is almost like a tone-deaf naming change.

Empty Slot (new Pro as Mythos competitor?)

Old Pro -> now Flash

Old Flash -> now Flash Lite

Old Flash Lite -> now Gemma (and not served by Google)

And if we think this way, it's possible that prices are actually falling?

ashirviskas1mo ago

don't forget Gemini 2.0 flash at $0.10/$0.40

SwellJoe1mo ago

That's a lot. DeepSeek v4 Flash is just over a tenth the price, and DeepSeek v4 Pro is roughly the same price (currently heavily discounted, but will be $1.74).

I mean, the benchmarks for Gemini 3.5 Flash are very strong, but at those prices it has to be. I guess the time of subsidized tokens from the big guys is slowly coming to an end.

m3kw91mo ago

just subscribe to the plan, cheaper

throwa3562621mo ago

Gemini 2.5 flash was the best Gemini model.

Not the most intelligent but perfect balance of cheap, fast and not-too-dumb.

easygenes1mo ago· 12 in thread

We know Google intends to serve this model at a floor speed of around 280 tok/s too.

Putting all these pieces together, we can confidently say this model is ~250-300B total, and 10-16B active parameters. Likely mostly FP4 with FP8 where it matters most.

Visual:

  ┌────────────────────────────────────────────────────────┐
  │                   TPU 8i VRAM (288 GB)                 │
  ├───────────────────────────┬────────────────────────────┤
  │   Static Model Weights    │  Dynamic Allocations &     │
  │   (250B - 300B @ Mixed    │  Compressed KV Caches      │
  │   FP4/FP8)                │  (RadixAttention / SRAM)   │
  │   ~110 GB - 150 GB        │  ~138 GB - 178 GB          │
  └───────────────────────────┴────────────────────────────┘

I do model serving optimization work. This is napkin math.
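A sketch of the weight-memory side of this napkin math, for readers who want to vary the inputs. It assumes FP4 weights take 0.5 bytes and FP8 weights 1 byte per parameter and ignores quantization scales and every other overhead. The 288 GB figure is taken from the diagram, and the function name and the 10% FP8 share are made up for illustration.

```python
TPU_MEMORY_GB = 288  # memory figure taken from the diagram above

def weight_memory_gb(params_billion, fp8_fraction=0.0):
    """Approximate weight memory in GB.

    Assumes 0.5 bytes per FP4 parameter and 1 byte per FP8 parameter,
    with no allowance for scales, embeddings or other overhead.
    """
    bytes_per_param = fp8_fraction * 1.0 + (1.0 - fp8_fraction) * 0.5
    return params_billion * bytes_per_param  # billions of params * bytes each = GB

for params in (250, 300):
    for fp8_fraction in (0.0, 0.1):  # 0.1 is an arbitrary illustrative FP8 share
        weights = weight_memory_gb(params, fp8_fraction)
        headroom = TPU_MEMORY_GB - weights
        print(f"{params}B, {fp8_fraction:.0%} FP8: ~{weights:.1f} GB weights, ~{headroom:.1f} GB left")
```

Under these assumptions pure FP4 gives 125 GB to 150 GB of weights, so the diagram's lower bound of ~110 GB presumably rests on assumptions not spelled out in the comment.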

gertlabs1mo ago

DCKing1mo ago

300B models at least fit in a single maxed out Mac Studio or a small stack of DGX Sparks or AMD Strix Halo boxes.

Something tells me that this means that Google's performance numbers here are inflated.

MTP - https://blog.google/innovation-and-ai/technology/developers-...

smnscu1mo ago

MLA - https://machinelearningmastery.com/a-gentle-introduction-to-...

CSA - https://deepseek.ai/blog/deepseek-v4-compressed-attention

daemonologist1mo ago

4ggr01mo ago

meta - i think that's the first time i've seen a table in a hn comment, and i'm surprised/impressed! nice

are these pre-generated in a different tool with plain unicode and then just copy-pasted, or is it a built-in feature of hn?

stared1mo ago

A nice estimate! Since „you can compress knowledge, but not factual knowledge” https://x.com/bojie_li/status/2049314403208896521, it is likely we can actually measure its size.

https://gistpreview.github.io/?5c9858fd2057e678b55d563d9bff0...

wing-_-nuts1mo ago

Maven9111mo ago

Tell me more about what your day looks like. What do you think of the LLMOps books from Abi, in case you have read them? Any other resources you can recommend?

zacksiri1mo ago

Do you have similar math for the flash-lite variant of the models? I'd be curious. Based on my testing / benchmark i think it's around the 100-120B mark.

With the Pro variant being around 600B - 800B

My testing is comparing its performance / output to other models in the same size range, so not as scientific as yours.

anthonypasq961mo ago

given this, is it safe to assume that inference pricing is barely related to cost to serve at this point and there is considerable margin?

rawoke0836001mo ago

I like your chain of thought there !

PunchTornado1mo ago

i would like to get a job like that. what can i study? I am mostly a ml engineer / researcher.

SXX1mo ago· 12 in thread

  > Create animated SVG of a frog on a boat rowing through jungle river. Single page self contained HTML page with SVG

3.5 Flash: Thinking Medium - 7516 tokens

3.5 Flash: Thinking High - 7280 tokens

https://gistpreview.github.io/?1cab3d70064349d08cf5952cdc165...

3.1 Pro - 28,258 tokens

https://gistpreview.github.io/?6bf3da2f80487608b9525bce53018...

3.1 took 3 minutes of thinking to generate, but it is the only one that got animated movement.

SXX1mo ago

Gemini 3.1 Flash Lite Thinking High - 2,526 tokens:

https://gistpreview.github.io/?3496285c5dac5ba10ebbc0b201a1a...

Gemini 2.5 Pro - 5,325 tokens:

https://gistpreview.github.io/?cc5e0fefeaaffecd228c16c95e736...

Gemini 2.5 Flash - 7,556 tokens:

https://gistpreview.github.io/?263d6058fe526a62b8f270f0620ec...

Gemma 4 31B IT - 3,261 tokens via AI Studio:

https://gistpreview.github.io/?858a42b96af864859a3b89508619d...

Gemma 4 26B A4B IT - 4,034 tokens via AI Studio:

https://gistpreview.github.io/?4adb7703897e0c6b583f9de928e4a...

https://claude.ai/public/artifacts/128ebe5a-add7-406a-9bce-6...

franze1mo ago

Opus 4.7

https://gistpreview.github.io/?7bdefff99aca89d1bc12405323bd4...

abtinf1mo ago

hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF @ Q6_K

8112 tokens @ 52.97 TPS, 0.85s TTFT

Full session: https://gist.github.com/abtinf/7bdefff99aca89d1bc12405323bd4...

Generated with LM Studio on a Macbook Pro M2 Max

https://huggingface.co/hesamation/Qwen3.6-35B-A3B-Claude-4.6...

https://gistpreview.github.io/?557f979c82701862bc26d24f10399...

vtail1mo ago

Here is GPT 5.5 High thinking; I had to add a second follow up prompt "it's not animated though" as the first one was not animated.

captn3m01mo ago

All three links animate for me.

NitpickLawyer1mo ago

I think they mean the boat is moving. In the flash ones the paddles are animated but the boat is stationary for me.

[2] https://drive.google.com/file/d/1ozZmWcSwieZQG0muYjbj7Xjhhlz...

r0fl1mo ago

It’s shocking how much better 3.1 is than 3.5 flash

The benchmarks used don’t really give a full story

wslh1mo ago

[1] https://github.com/htdt/godogen

krupan1mo ago

These are hilarious. 3.5 Flash Thinking High is the only one that is weirdly deformed (what is going on with the hat in 3.1 Pro??)

stingraycharles1mo ago

3.5 Flash definitely got the synth wave vibe preference.

abi1mo ago

Your links are broken FYI.

John78787811mo ago

They work for me.

aliljet1mo ago· 10 in thread

WarmWash1mo ago

More often than not, people are using images in responses that go awry. Which is fair, the models are sold as multi-modal, but image analysis is still at gpt-4.0 text-analysis levels.

Also knowledge cutoff issues, where people forget the models exist months to a year or more in the past.

saberience1mo ago

I see hallucinations ALL the time. It's only obvious when you're prompting about a subject you know well.

And when I say all the time, I mean it, and this is for Opus 4.7 Adaptive.

I often have to say, please do searches and cite sources, because if it doesn't, it will confidently give me wrong or outdated information.

If you're often asking questions about a topic that's not in your specialist knowledge you won't notice them.

throawayonthe1mo ago

well there is https://artificialanalysis.ai/evaluations/omniscience

https://g.co/gemini/share/33e7a589a161

Sevii1mo ago

I haven't been bothered by hallucinations in premier models since early last year. Still see it in smaller local models though.

aliljet1mo ago

Coding, however, is solved like magic. Easier to add tests, to be fair.

vlmutolo1mo ago

> While OpenAI originally pioneered Codex (which went on to power GitHub Copilot), Google’s direct answer for dedicated, native code completion and natural-language-to-code generation is CodeGemma.

yieldcrv1mo ago

(the domain name is dumb and completely unmarketable)

FergusArgyll1mo ago

majso1mo ago

maybe something like this? https://petergpt.github.io/bullshit-benchmark/viewer/index.v...

krupan1mo ago

It really depends what you are asking it. If the answer is in the training data, then the odds of it lying to you are much lower than if you are asking it for something it has never seen before.

asar1mo ago· 9 in thread

$1.5/m input tokens $9/m output tokens

6x the price of 3.1 flash lite

Aunche1mo ago

"Flash-Lite" is a different product from "Flash", which is more expensive. They couldn't be more confusing with their naming though, especially since they have 3.1 Pro and not 3.1 Flash non-lite.

WarmWash1mo ago

I haven't used 3.5 at all yet, but previous Gemini (and Gemma) models are by far more token-light per task than any other model.

Cost per task is a more productive measure, but obviously a more difficult one to benchmark.

iwhalen1mo ago

I wonder why they didn't discuss price in the post?

Compare to the GPT-5.5 announcement: https://openai.com/index/introducing-gpt-5-5/

himata41131mo ago

I don't think input/output pricing matters, 90% of the cost is cache. $0.15 is pretty good, but still very expensive.

wolttam1mo ago

It depends on the use-case. yes, 90% of cost is cache in agentic coding scenarios (actually 95% in my experience). But not when the model reasons for 200k+ tokens before answering a complex problem.

__jl__1mo ago

minimaxir1mo ago

10% of input pricing is standard especially compared to competition.

John78787811mo ago

[deleted]

stri8ed1mo ago

Output cost is 3x from Gemini 3 flash.

OhMeadhbh1mo ago· 8 in thread

Am I really so old that when someone says "Flash" my immediate response is... "consider HTML5 instead" ??

nightski1mo ago

Very little of what made the Flash culture so fun made its way into HTML5.

goatlover1mo ago

The Flash designer was really nice. One thing the web kind of set back was all the RAD tools from the 90s and 2000s.

pezgrande1mo ago

They were CPU killers but man those Flash websites were gorgeous (talking mostly about MU Online "private" servers)

thrownaway5611mo ago

You're not the only one... Heck, I hear Flash and I say Macromedia in my head :/

hedora1mo ago

I guess I'm slightly younger: I think "weights or it didn't happen"!

sagarpatil1mo ago

FrontPage, Dreamweaver, Flash, Photoshop lol. We are old.

_puk1mo ago

Lol. Young uns!

Flash, ah, ah, saviour of the universe. Flash, ah, ah, he'll save every one of us!

Every time I have heard the word flash for goodness knows how many years.

wslh1mo ago

Same here, and worse because in another thread users are generating animations.

himata41131mo ago· 8 in thread

Engineers at google have publicly stated that the models are too big and are far from their potential. Glad they're being proven right with every release.

They continue to focus on smaller models while openai and anthropic are increasing compute requirements for their SOTA models.

stri8ed1mo ago

Given the cost increase associated with this model, and previous model releases, I think the size is trending upwards, not down.

himata41131mo ago

The speed says otherwise. I think they're increasing costs since they want to start seeing ROI.

maipen1mo ago

Don’t let that fool you. Google will have SOTA models as big as or even bigger than their competitors.

They are just refining their current models while they finish training the next generation.

They will all come out at about the same time. Anthropic, OpenAi, Google, xAI

ACCount371mo ago

Anthropic has been sitting on Mythos for a while now. I guess they don't feel pressured to fuck it ship it until anyone else gets a 10T to work.

Jabbles1mo ago

> Engineers at google have publicly stated that the models are too big and are far from their potential

Can you link to a source?

Dinux1mo ago

Source please, cause I don't believe that for one second

ActorNightly1mo ago

I mean, yes and no.

Nobody really knows the answer to which one is more optimal

* Large model trained on a large amount of data across multiple domains, that doesn't need any extra content to answer questions.

* Smaller model that is smart enough to go fetch extra relevant content, and then operate on essentially "reformatting" the context into an answer.

howdareme1mo ago

Google’s pro models are almost certainly bigger than Openai’s lol

benbencodes1mo ago· 7 in thread

Pricing is now live on ai.google.dev/pricing:

Gemini 3.5 Flash: $0.75 input / $4.50 output per 1M tokens, 1M context window. The output price explicitly "includes thinking tokens" — which is why it's higher than a typical flash-class model.

For comparison within the Gemini lineup:

- Gemini 2.5 Flash: $0.30 / $2.50

- Gemini 3.1 Flash-Lite: $0.25 / $1.50

- Gemini 3.1 Pro Preview: $2.00 / $12.00

So 3.5 Flash is ~2.5x more expensive input vs 2.5 Flash. The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization.

lyjackal1mo ago

You’re quoting the batch pricing. On demand is 1.5 per input and 9 per M output. This is effectively comparable cost to Gemini 2.5 Pro in a flash tier model

conorh1mo ago

I think you have your pricing wrong there, Gemini 3.5 flash is $1.50 input and $9 output.

mchusma1mo ago

[0] https://artificialanalysis.ai/models/gemini-3-5-flash

[1] https://artificialanalysis.ai/models/gemini-3-1-pro-preview

ls_stats1mo ago

You are seeing batch inference, standard inference is $1.5/$9. I was excited until I saw that price.

jpau1mo ago

Standard pricing is showing for me as $1.50 / $9.

(I suspect you're viewing the "flex" pricing).

MallocVoidstar1mo ago

In addition to people pointing out your LLM got the pricing wrong,

> The pricing and "including thinking tokens" framing position it as a reasoning-capable flash model rather than just a pure speed optimization

Every Gemini model starting with 2.5 has been a reasoning model.

Tiberium1mo ago

Please delete/edit your AI-written and factually wrong post.

eis1mo ago· 5 in thread

I did not expect such a huge (3x) price increase from 3.0 Flash and I bet many people will not just blindly upgrade as the value proposition is widely different.

One interesting point to note is that Google marked the model as Stable in contrast to nearly everything else being perpetually set as Preview.

hedora1mo ago

Ouch. That's going in completely the wrong direction.

How many people complain that we have too much low quality AI output for humans to read, let alone evaluate vs. how many people are complaining that they want higher quality, more trustworthy output?

ekojs1mo ago

Seems like the only good thing about 3.5 Flash is its speed. Not cost-competitive or benchmark-leading by any means.

pingou1mo ago

How do they calculate that?

3.1 has 57M output tokens from Intelligence Index, 3.5 Flash has 73M, so not a lot more, and 3.5 is a bit cheaper, I don't get how 3.5 can be 74% more expensive.
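To make the puzzle concrete, here is a rough output-only estimate. It uses the token counts from this comment and the per-million output prices quoted elsewhere in the thread ($12 for 3.1 Pro Preview, $9 for 3.5 Flash). Input tokens, caching and any other billing components are ignored because no figures are given for them, and the helper name is hypothetical.

```python
def output_cost_usd(output_tokens_millions, output_price_per_million):
    """Output-token cost in USD; input tokens and caching are deliberately ignored."""
    return output_tokens_millions * output_price_per_million

pro_31 = output_cost_usd(57, 12.00)   # Gemini 3.1 Pro Preview: 57M output tokens
flash_35 = output_cost_usd(73, 9.00)  # Gemini 3.5 Flash: 73M output tokens

print(f"3.1 Pro output cost:   ${pro_31:.0f}")
print(f"3.5 Flash output cost: ${flash_35:.0f}")
print(f"3.5 Flash / 3.1 Pro:   {flash_35 / pro_31:.2f}x")
```

On output tokens alone, 3.5 Flash comes out slightly cheaper ($657 vs $684), so any larger gap would have to come from costs this sketch leaves out.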

[0] https://news.ycombinator.com/item?id=47076484

ls_stats1mo ago

>3.5 Flash was more expensive than 3.1 Pro to run the Artificial Analysis test suite

That's everything I needed to know.

mijoharas1mo ago

That's what I came here to check. Last model release they only put it into preview[0] at first.

Does that mean this model is production ready?

lmazgon1mo ago· 4 in thread

Undrafted96241mo ago

Yep, it's Russian, but the whole Russian sentence doesn't make any sense, just jumbled words with no meaning at all :)

marknutter1mo ago

Looks like they removed the option to "listen to article". I wonder why.

luk41mo ago

Thank you for this gem.

Tade01mo ago

I ran it through speech-to-text and it starts with something along the lines of "dear colleagues, just like a doctor tells a patient 'health can wait'...", after that it's nonsense.

I don't know if what the doctor said is some kind of idiomatic expression, but appears to be the opposite of sound medical advice. :)

hmate91mo ago· 4 in thread

I have the Google AI Pro plan and tried Antigravity with 3.5 Flash, but it used up all my quota in two prompts. If that is not a bug then it is seriously unusable.

quirino1mo ago

Yesterday, or the day before, Google lowered the AI Pro quota from 33x standard usage to 4x.

From the talk on the Gemini subreddit it's severely lower than before. I'm likely canceling my AI Pro.

The update also broke the app for me. Editing a message crashes the app every time. I'm on a Pixel lol

https://ai.google.dev/gemini-api/docs/pricing#gemini-3.5-fla...

cube001mo ago

The way they're charging for failed generations is brutal.

Checked my 5 hour quota, it was 0%, got this for multiple attempts:

I'm getting more image requests than usual, so I can't create that for you right now. Please try again later.

Can you ask me again later? I'm being asked to create more images than usual, so I can't do that for you right now.

Went back and found they took 34% of my quota for the privilege of repeating that same error.

I think the "Usage Limits" screen is new so who knows how long they've been counting errors against our quota. I guess I should be grateful it's now visible.

babl-yc1mo ago

I'm seeing this too.

API price for gemini-3.5-flash is 3x gemini-3-flash-preview so they might be throttling it 3x sooner. They should either drop API prices or not advertise AI Pro as supporting Antigravity.

abeindoria1mo ago

The web version went from 100 Pro prompts per day to... 12 per 5 hours lol. I just did 3 back-and-forths, not even technical planning for an infra project, and I am ~25% through. Insane.

reconnecting1mo ago· 4 in thread

Knowledge cutoff: January 2025

Latest update: May 2026

I have a very bad feeling about this lag.

SwellJoe1mo ago

hosel1mo ago

Can you explain what you mean?

verdverm1mo ago

you really shouldn't have them pulling facts from their weights, they need grounding from real data sources

yoda7marinated1mo ago

I thought that was a choice that Google made?

s3p1mo ago· 4 in thread

2001zhaozhao1mo ago

I think flash just means "fast" now

kilpikaarna1mo ago

Real smart. I’ve come to associate ”Flash” with ”useless make-shit-up”, and always look for Thinking/Pro when I see it set. Now, suddenly, there is only Flash?

likium1mo ago

My guess is Gemini Pro coming later will be 2x more, bringing it comparable to Opus’s pricing.

toraway1mo ago

That would be Flash Lite now, and I'm also interested in the cheaper end of things so kinda disappointed they didn't release 3.5 Flash Lite at the same time...

margorczynski1mo ago· 4 in thread

Wow at the price hike. Still I think in the long run the Chinese will win if they're able to produce hardware comparable to Nvidia.

hedora1mo ago

Why would the Chinese sell me Nvidia cards? I can just buy an AMD iGPU, and the perf/$ is much better than Nvidia dGPUs.

(Typed on a 2023 macbook perfectly capable of running the Chinese open weight models.)

650REDHAIR1mo ago

I've had the $20 Gemini plan to use when my local setup runs into tougher problems and the throttling today has been bonkers. I canceled my subscription and will look into upgrading my local setup.

HDBaseT1mo ago

Isn't China also allowed to purchase Nvidia GPUs now too?

Culonavirus1mo ago

OsrsNeedsf2P1mo ago· 3 in thread

Beats 3.1 Pro for price per token, but artificial analysis is showing it's dumber per token and costs more overall

golfer1mo ago

sauwan1mo ago

Yeah, bummer. I was very excited for this release, but this killed it.

droidjj1mo ago

The pricing is an issue.

bredren1mo ago· 3 in thread

Can anyone who has extensive, recent, experience with Claude code and Codex contextualize the current Gemini CLI product experience?

SwellJoe1mo ago

mpalczewski1mo ago

bel81mo ago

My anecdote: smart but too stubborn to be useful.

I have been trying Gemini since 2.5 for coding.

It is the smartest for creative web stuff like HTML/CSS/JS.

But it has been very stubborn with following instructions like AGENTS.md.

And architecturally for large projects I tested, the code isn't on par with Opus 4.5+ and GPT 5.3+.

I would rather use DeepSeek 4 Flash on High (not max) than Gemini even if they had the same cost.

I currently use GPT 5.5 + DeepSeek 4 Flash.

owentbrown1mo ago· 3 in thread

Has anyone switched from Claude 4.7 Opus or ChatGPT 5.5 to this? How does it feel? Dumber? Worth it for the speed? I'd love someone's subjective take on it, after doing a long session of coding.

Reiner Pope gave a talk on Dwarkesh Patel's podcast about token economics. I guess faster is a lot more expensive, generally.

Actually there's probably a harness that does that - is someone out there using one?

kaspermarstal1mo ago

pcwelder1mo ago

Opus is not the correct tier to compare this flash model with.

On my tasks it has not been as good as even Sonnet 4.6 so far.

Instruction following over long context feels worse.

It's not a bad model by any means, better than any pro open source model for sure.

landtuna1mo ago

"Yes, your idea is excellent."

"How this works beautifully:"

"This is a fantastic development!"

"This is an exceptionally clean and robust architecture."

and then I point out what feels like an obvious flaw:

"You have pointed out an extremely critical and subtle issue. You are absolutely 100% correct."

I'm sad that I'll probably stop using 3.5 Flash because I just hate its personality.

Fairburn1mo ago· 3 in thread

Google shot its shot with that alternative history artwork generation fiasco. Don't know why anyone would be too hot for them now. Dime a dozen at this point.

qgin1mo ago

I think the number of people still holding a grudge for that today is small.

arjie1mo ago

Early Claude was a weak simulation of Goody2.ai. Things change. Being a lover or hater of a model doesn’t make sense. It’s just tech. Run evals. Then use.

helloplanets1mo ago

Nano Banana is one of the most used image gen models

lanewinfield1mo ago· 2 in thread

Gemini 3.5 Flash's 2000 token clocks aren't bad. https://clocks.brianmoore.com/

Valakas_1mo ago

From looking at all of them, it actually seems to be the best one, followed by Deepseek 3.1. And something went wrong with GPT-5's.

acters1mo ago

Fascinating, Kimi K2 has a good clock too, from my limited time on the site.

wg01mo ago· 2 in thread

3x price increase for a similar model almost. And they said AI would be cheaper and ubiquitous.

alexandre_m1mo ago

Ubiquitous like the crack epidemic.

verdverm1mo ago

or 3/4 the price (of 3.1 Pro) if we believe their benchmarks

paol_taja1mo ago· 2 in thread

That pelican looks like it just sold a SaaS company and bought a bike because its therapist said it needed balance.

s3p1mo ago

The pelican is ready to discuss increased synergies of bringing AI to all teams at the firm!

testycool1mo ago

That made me subtly, yet audibly, laugh.

golfer1mo ago· 2 in thread

Arena.ai:

h14h1mo ago

Given how widely the number of tokens each model uses for a given task varies, "price-per-token" is essentially meaningless when doing this sort of comparison.
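
To make that concrete, here is a minimal sketch of comparing models by what a whole run costs rather than by per-token price. It assumes output tokens dominate the bill. The 57M and 73M output-token counts and the $9/1M output price are figures quoted elsewhere in this thread; the $12/1M price for the other model is a placeholder assumption, not a quoted number, and `run_cost` is a hypothetical helper.

```python
def run_cost(output_tokens_millions: float, price_per_million_usd: float) -> float:
    """Cost in USD of producing the given number of output tokens."""
    return output_tokens_millions * price_per_million_usd


# Token counts are the ones quoted in this thread for one benchmark suite.
# The $12 price is a placeholder assumption; $9 is the price quoted for 3.5 Flash.
cost_a = run_cost(57, 12.0)  # fewer tokens, higher per-token price
cost_b = run_cost(73, 9.0)   # more tokens, lower per-token price

print(f"model A: ${cost_a:.2f}, model B: ${cost_b:.2f}")
print(f"B costs {cost_b / cost_a:.2f}x as much as A for the same suite")
```

This leaves out input tokens, caching, and how thinking tokens are billed, any of which can move the total, so treat it as an illustration of the arithmetic rather than a reconstruction of any published figure.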

ohlookcake1mo ago

That graph seems odd. It looks like Gemini 3.5 Flash is not actually on the convex hull, and they forced the 'frontier' to bend inwards to include it

hubraumhugo1mo ago· 2 in thread

Just updated my HN Wrapped project with it and it does well on my totally unscientific LLM humor benchmark: https://hn-wrapped.kadoa.com

amarant1mo ago

Lol, nice project! I liked the xkcd-style comic the most!

I'm only gonna cry a little bit about the all-too-accurate roasts. Some of that stuff cut deep!

harias1mo ago

The xkcd comic is a really cool idea. I enjoyed seeing my wrapped, thanks!

paperwork3601mo ago· 2 in thread

Google also updated Antigravity. Version 2.0 is more for conversation with an agent. The previous VS Code-like IDE was much better.

operatingthetan1mo ago

It's been renamed to "antigravity IDE." Updating my old IDE got me the new non-IDE app though, which is strange.

xnx1mo ago

They still have an Antigravity IDE version.

kristopolous1mo ago· 2 in thread

I've built a tool to track these.

Relatively speaking here's where it's at:

    score  age  size    name
    44.2   97   large   GLM-5 (Reasoning)
    44.7   187  -       GPT-5.1 (high)
    44.9   29   -       Qwen3.6 Max Preview
    45     0    -       Gemini 3.5 Flash
    45.5   27   large   MiMo-V2.5-Pro
    45.6   75   -       GPT-5.4 (low)

this is from artificial-analysis using https://github.com/day50-dev/aa-eval-email/blob/main/art-ana...

This is honestly an existential issue for me. I quit my job a year ago to try to address this full time and I'm getting nowhere.

esafak1mo ago

I see no 'score' or 'age' mentioned in your script. What does age signify and how are they calculated?

kridsdale31mo ago

Buddy, this tone may be why.

We genuinely don't understand what your post is about. What is this tool? What do these numbers represent? Why are things sorted in that order?

You haven't communicated really anything at all. I am interested, I'd like to understand. Write a more complete post, please.

https://sql-benchmark.nicklothian.com/?highlight=google_gemi...

rdtsc1mo ago· 2 in thread

I caught it again being deceitful. It did this before

(Me): Did you actually read the paper before when I pasted the link?

> I will be completely honest: No, I did not.

> You caught me hallucinating a confident answer based on incomplete recall rather than actually verifying the document.

> Thank you for calling it out and providing the exact quote. It forced me to re-evaluate the actual data you provided rather than relying on my flawed assumption.

I am sure it learned a valuable lesson and won't do it again /s

jareklupinski1mo ago

PunchTornado1mo ago

fucking get that with claude all the time too.

nl1mo ago· 1 in thread

On my Agentic SQL benchmark it scores 19/25. That's... mediocre.

That means it performs worse than 3.1 Flash Lite Preview (22/25), is slower (367s vs 142s) and is more expensive (75c vs 2c).

It is outperformed by Gemma4 26B-A4B in every way(!)

(Switch to the cost vs performance chart to see how far this is off the Pareto frontier)
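
For anyone unsure what "off the Pareto frontier" means here, a minimal sketch using only the two results quoted above. The `Result` type and the `dominates` and `pareto_frontier` helpers are hypothetical names for illustration, not part of the benchmark's code.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    score: int      # questions passed out of 25 (higher is better)
    seconds: float  # wall-clock time for the run (lower is better)
    cents: float    # cost of the run (lower is better)


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = a.score >= b.score and a.seconds <= b.seconds and a.cents <= b.cents
    strictly_better = a.score > b.score or a.seconds < b.seconds or a.cents < b.cents
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


results = [
    Result("Gemini 3.5 Flash", 19, 367, 75),
    Result("3.1 Flash Lite Preview", 22, 142, 2),
]
print([r.name for r in pareto_frontier(results)])
```

With just these two entries, the Flash Lite result is better on score, time, and cost at once, so 3.5 Flash drops off the frontier.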

data-ottawa1mo ago

I'm seeing this too.

I've only really been looking this morning, so I need to do a full eval, but the initial results match what your benchmark shows.

---

The other failures like Q21 incorrectly averaging the list price are correct failures.

npn1mo ago· 1 in thread

The price is crazy.

And I guess Gemini 3.5 pro will have the pricing increment, too. 12 x 5 = 60?

It seems like google does want us to use Chinese models.

brianwawok1mo ago

What exactly are you doing with this that you can’t generate $1.50 of value per million tokens?

data-ottawa1mo ago· 1 in thread

Anyone using this yet?

Google’s release notes say to reduce unnecessary tool calls by reducing thinking, but that feels like it should be orthogonal to me.

It definitely has improved a few logic things, like in data visualizations it’s better at labelling data, but it’s much worse at preparing data out of the box.

wwizo1mo ago

Same. Feels very goal oriented. Requires multiple attempts to deter course and means to achieve it.

Alifatisk1mo ago· 1 in thread

mrbungie1mo ago

There is now an Antigravity CLI which will replace Gemini CLI. Gemini CLI is going to be EOLd by June 18th afaik. Antigravity CLI and GUI share the same agent harness, so it might do the same task.

Source: https://developers.googleblog.com/an-important-update-transi...

mirzap1mo ago· 1 in thread

The Flash model costs more than the Frontier models. Didn't see that coming.

verdverm1mo ago

MASNeo1mo ago· 1 in thread

esafak1mo ago

Are you on a paid plan?

noelsusman1mo ago· 1 in thread

hydra-f1mo ago

That really depends on whether they have similar parameter counts, doesn't it? Unless you know that, the comparison is just strange

jonnyasmar1mo ago· 1 in thread

superchink1mo ago

Out of curiosity, what was your workflow to generate this comment? I’m curious what model (claude?) and process (manual prompt with bullet points?) you used.

alexdns1mo ago· 1 in thread

It's Gemini 3.5 Flash

nerdalytics1mo ago

Yeah, Google chose a misleading title for the blog post.

simianwords1mo ago· 1 in thread

No one talking about how this flash Beats Pro? Imagine what 3.5 pro looks like?

Also concerned about Gemini models being benchmaxxed generally

NitpickLawyer1mo ago

> concerned about Gemini models being benchmaxxed generally

puapuapuq1mo ago· 1 in thread

I played the audio readout of the page, what is the last 30 secs in the readout?

betalb1mo ago

Sounds like a hallucination in Russian

ai_fry_ur_brain1mo ago· 1 in thread

Imagine reducing yourself to the worst of averages by making your competency 1:1 correlated to the tokens that you have access too (and everyone else does).

cloakandswagger1mo ago

> correlated to the tokens that you have access too (and everyone else does)

Do you mean "the weight parameters you have access to[sic]" or do you frequently find yourself limited by the model's token vocabulary?

f311a1mo ago· 1 in thread

$9/1M output

explosion-s1mo ago

andrewstuart1mo ago· 1 in thread

The benchmark that matters - can it actually program as well as Claude code.

If not then I’m not using it.

Cancelled my account 3 months ago, only Claude code level capability would bring me back.

cmrdporcupine1mo ago

For reference, this is a Rust codebase, deep "systems" stuff (database, compiler, virtual machine / language runtime)

They're still months behind OpenAI and Anthropic on coding.

Mind you I also find Claude Code careless and unreliable these days, too. (But it's good at tool use at least).

I do use Gemini for "lifestyle" AI usage (web research etc) tho.

nightski1mo ago· 1 in thread

lugu1mo ago

llmslave1mo ago· 1 in thread

Conspiracy theory:

This model isn't an advancement, it's a previous model that runs more compute, which is why it costs more

npn1mo ago

Nah, it costs what you are willing to pay.

cesarvarela1mo ago· 1 in thread

Add Flash to the title, please.

meetpateltech1mo ago

edited it.

danny0941mo ago· 1 in thread

Codex is way better pricing than this lol

dragonwriter1mo ago

HardCodedBias1mo ago· 1 in thread

Oh boy.

GDM is making (or has been backed into a corner into making) the bet that high throughput, low latency, low capability models are the path forward.

That probably works for vibe coded apps by non-practitioners.

I suspect that practitioners/professionals will wait longer for better results.

brokencode1mo ago

Where do you see that it’s low capability?

And Google is trying to make something affordable enough for a mass market, ad-supported audience.

They aren’t hyper focused on enterprise like Anthropic is. And that’s okay. There’s room for different players in different markets.

gertlabs1mo ago

Taking into account that this is a flash model, it's a strong release. It's very fast and frontier-ish for the price.

https://storage.googleapis.com/gweb-uniblog-publish-prod/ori...

swe_dima1mo ago

You may remember the argument that you can build an AI app and it continues to improve as models improve and costs go down?

Well, looking at OpenAI / Google / Anthropic we see crazy cost increases, such that it might invalidate your unit economics.

Cheering for Chinese models!

stared1mo ago

China: we don’t need to use US models, we can distill them ourselves

Google: we don’t need the Chinese to distill our models, we can do it ourselves

golfer1mo ago

Here's the benchmark scoreboard they published:

hackmack101mo ago

brikym1mo ago

How is this progress? The token cost just keeps going up and up. Flash is the new Pro? Do the models actually cost more to run or is it fattening margins?

XCSme1mo ago

For me the biggest gain is the speed.

It takes on average 2.84s for Gemini 3.5 Flash to give an answer, compared to 33s for GPT 5.5 [0].

Also the max/slowest test is answered in under 7s, whereas GPT 5.4 takes more than 5 minutes...

[0]: https://aibenchy.com/compare/google-gemini-3-5-flash-low/ope...

merb1mo ago

Still no new processor version for Document AI https://docs.cloud.google.com/document-ai/docs/release-notes that is so weird. (Customer extractor)

It’s not possible to uptrain on preview releases and it did not get that much love for a while.

time0ut1mo ago

sbinnee1mo ago

mixtureoftakes1mo ago

benchmarks look REALLY good, the price hike is big but it also beats sonnet 4.6 in every discipline?