Understanding, using, and finetuning Gemma

(lightning.ai)

118 points · rasbt · 2y ago · 48 comments

behnamoh · 2y ago · 13 in thread

Gemma (and Gemini) are heavily nerfed. Why are they in the news lately?

Also, Gemma is a +9B model. I think it's not okay that Google compared it with Mistral and Llama 2 (7B) models.

Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

All this hype seems to be backed by Google to boost their models whereas in practice, the models are not that good.

Google also made a big claim about Gemini 1.5 1M context window, but at the end of their article they said they'll limit it to 128K. So all that 1M flex was for nothing?

Not to mention their absurd approach to alignment in image creation.

gliptic · 2y ago

> Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

Are you talking about gemma.cpp? Then no, they didn't.

andy99 · 2y ago

Presumably he means this https://cloud.google.com/blog/products/application-developme...

The claim is correct but not related to gemma

sivakon · 2y ago

It’s objectively worse in my local tests compared to Mistral. Again, their model doesn’t include the MT-bench benchmark because it’s really, really bad at answering follow-up questions (this is also a problem in Ultra). Its reasoning is also pretty bad compared to Mistral.

b33j0r · 2y ago

I can’t get it to recognize the stop token consistently in the 7b models.

About 50% of the shots, I get a sentence and a half of beautiful poetry, then a codeswitch into kanji, and then ral ral ral ral ral 膳 ral 杯 ral ral

Until I kill the process. Not every time, but way more often than the other llamas (which is basically never, these days).

I think they underestimated the impact of training on bulleted lists. It seems to love those!

rasbt (OP) · 2y ago

> Gemma is a +9B model

Yes, that's correct. It's 9.3B parameters if you count the embedding layer and final projection layer separately. However, since they used weight tying, the adjusted count is 8.5B, as discussed in the article.

neodymiumphish · 2y ago

Which still rounds to 9B and is 21.4% larger.
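
The arithmetic in this exchange can be checked with a few lines of Python. This is only a rough sketch: the 9.3B, 8.5B, and 7B figures come from the comments above, while the vocabulary size and hidden width are assumed values that are not stated in the thread, and the helper names are made up for illustration.

```python
# Assumed values (not given in the thread): vocabulary size and hidden width.
VOCAB_SIZE = 256_000
HIDDEN_DIM = 3_072


def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Number of parameters in one vocab-by-hidden embedding matrix."""
    return vocab_size * hidden_dim


def tied_total(untied_total: float, vocab_size: int, hidden_dim: int) -> float:
    """With weight tying, the output projection reuses the embedding matrix,
    so one of the two copies drops out of the count."""
    return untied_total - embedding_params(vocab_size, hidden_dim)


def percent_larger(actual: float, nominal: float) -> float:
    """How much larger `actual` is than `nominal`, in percent."""
    return (actual / nominal - 1) * 100


untied = 9.3e9  # count quoted in the thread, embedding and projection separate
tied = tied_total(untied, VOCAB_SIZE, HIDDEN_DIM)

print(f"one embedding matrix: {embedding_params(VOCAB_SIZE, HIDDEN_DIM) / 1e9:.2f}B")
print(f"tied total:           {tied / 1e9:.1f}B")
print(f"8.5B vs. nominal 7B:  {percent_larger(8.5e9, 7e9):.1f}% larger")
```

Under these assumptions a single embedding matrix is roughly 0.79B parameters, which accounts for the gap between the 9.3B and 8.5B counts.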

htrp · 2y ago

> Google also took llama.cpp and used it in one of their Github repos without giving credit. Again, not cool.

They said it was inspired by llama

> This is inspired by vertically-integrated model implementations such as ggml, llama.c, and llama.rs.

from https://github.com/google/gemma.cpp

pests · 2y ago

Not gemma.cpp

He meant this:

https://cloud.google.com/blog/products/application-developme...

brucethemoose2 · 2y ago

Counterpoints:

- Local models are pretty easy to de-censor, if that's what you mean.

- ...Yeah, it should not be labeled as a 7B. It's sort of 7B class.

- The repo mentions they use the llama-cpp-python server.

- 1M context brute forced across TPUs is insanely expensive, I can see why Google reined it in.

But overall your message is not wrong. Google is hyping Gemma a ton when it's... well, not very remarkable. And they certainly could have made something niche and interesting, like a long context 8.5B model, a specialized model, a vastly more multilingual model, something to differentiate it from Mistral 7B 0.2.

d-z-m · 2y ago

> Also, Gemma is a +9B model. I think it's not okay that Google compared it with Mistral and Llama 2 (7B) models.

They say it's because they're not counting embedding parameters[0]. Although apparently even with the embedding parameters subtracted it still rounds to 8B, not 7B. From what I understand, rounding to the nearest B is the standard. Seems slightly disingenuous to call it 7B, but not a big deal IMO since I don't hear anyone saying this model is outperforming popular OSS 7Bs.

[0]: https://huggingface.co/google/gemma-7b/discussions/34

andy99 · 2y ago

Gemma has a 7B-parameter model, https://huggingface.co/google/gemma-7b; that's what I saw compared to Mistral.

(Edit: I'm wrong)

light_hue_1 · 2y ago

No it doesn't.

Gemma 7B is a 9B model. The name is a lie. Then they really played games with Gemma 2B as well.

I don't get how Google can be this incompetent and far behind everyone else. They have amazing people and the kinds of resources that almost no one else does but somehow need to resort to faking demos, blatant lies about model sizes, etc.

Google used to be the place everyone wanted to go. Someone at Google AI needs to be fired so they can start being productive again.

cyanydeez · 2y ago

the context window is entirely limited by VRAM size

do you even LLM?
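
To put rough numbers on the VRAM point, here is a back-of-the-envelope sketch of how the key/value cache grows with context length in a standard transformer. The layer count, number of KV heads, head dimension, and fp16 storage below are illustrative assumptions, not figures from the thread, and the function names are hypothetical. Weights and activations need memory on top of this.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def to_gib(n_bytes: int) -> float:
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Illustrative configuration (assumed, not taken from the thread).
config = dict(n_layers=28, n_kv_heads=16, head_dim=256)

for seq_len in (8_192, 131_072, 1_048_576):
    size = to_gib(kv_cache_bytes(seq_len, **config))
    print(f"{seq_len:>9} tokens -> {size:.1f} GiB of KV cache")
```

With this assumed configuration the cache grows linearly with context length, from a few GiB at 8K tokens to hundreds of GiB at 1M tokens, which is why memory is usually the first practical limit on context.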

brunooliv · 2y ago · 7 in thread

Anyone who uses these models for more than 10 min will immediately realize that they're really, really bad compared to other free, OSS models. Even Phi-2 was giving me "on par" results, except that it's a model in a different league.

Many models are being released now, which is good to keep OpenAI on their toes and not mess up, but, truth be told, I've yet to see _any_ OSS model that I can run on my machine being as good as ChatGPT 3 (not 3.5, not 4, but the original one from when everyone went crazy).

My hopes for consumer hardware ChatGPT-3.5 within 2024 probably lie with what Meta will keep building upon.

Google was great, once. Now, they're a mere bystander in the larger scheme of things. I think that's a good thing. Everything in the world is cyclic and ephemeral, and Google enjoyed their time while it lasted, but newer and better things are coming and will keep on coming.

PS: Completely unrelated, but Gmail is now the only Google product I actively use. I genuinely don't remember the last time I did a Google Search... When I need to do my own digging I use Phind these days.

Times are changing and that's great for tech and future generations joining the field and workforce!

brucethemoose2 · 2y ago

Yi 34 200K finetunes (like Tess 1.5), Deepseek Code 33B and Miqu 70B definitely outpace ChatGPT-3.5, at least for me.

They don't have the augmentations of being a service, but generally they are smarter, have a bigger context and (perhaps most importantly) are truly unbound.

I am on a single 3090 desktop, for reference. Admittedly, this is much more expensive now than it was a few months ago, with the insane prices used 3090s are going for now.

brunooliv · 2y ago

Damn, I see, how many tokens per sec do you get on that setup?

On a Macbook M2 I get ~10-12 t/sec, which is a tiny tad bit too slow for continued/daily use, but if I think it's worth it I might invest in a more powerful machine soon-ish!

CuriouslyC · 2y ago

If Mixtral isn't outperforming chatgpt 3 you're configuring it wrong. It gives somewhat terse answers by default, but you can prompt it to spit out wordy answers of the sort chatgpt has been aligned to prefer easily enough.

brunooliv · 2y ago

Mixtral, aka the 8x7B "sparse mixture of experts" one, is not the same as, e.g., Mistral-7B, which is still very, very good, just not quite hitting the mark on some things.

I still couldn't run Mixtral 8x7B on an M1 Macbook Pro with 32GB RAM, so maybe I am indeed doing it wrong? Or are there better quantized versions available now or..?

bradley13 · 2y ago

My initial impressions of mixtral (not mistral) are quite good. It runs fairly well on my PC, using ollama.

d-z-m · 2y ago

> I've yet to see _any_ OSS model that I can run on my machine being as good as ChatGPT 3 (not 3.5, not 4, but the original one from when everyone went crazy).

It depends on your machine I guess, but IMO there are definitely OSS models out there that rival the original ChatGPT offering for certain use cases (dolphin mixtral comes to mind). Having a model with RAG capability is going to make a huge difference in the quality of the answer, as well.

jerpint · 2y ago

The only OSS model I’ve been wowed by so far is CC mixtral which from limited usage gave me a vibe closer to gpt3.5 turbo

Solvency · 2y ago · 3 in thread

Can we just stop talking about Gemini/Gemma for at least two years, until it's improved? In fact, the two-year mark is a rather strategic recommendation, because I guarantee it'll become vaporware by then anyway with Google's track record. It's outrageously poorly performing.

breezeTrowel · 2y ago

How can it be vaporware if it's already been released?

Solvency · 2y ago

Pardon, I should have said "vaporizedware".

nurettin · 2y ago

It was "released", like Tesla's fully automated self-driving was "released".

lopkeny12ko · 2y ago · 2 in thread

Gemma, despite being developed by a company worth billions of dollars, is a phenomenally poor model.

I tried the open source release yesterday. I started with the input string "hello" and it responded "I am a new user to this forum and I am looking for 100000000000000..." with zeros repeating forever.

Ok, cool I guess. Looks like I'll be sticking with GPT-4.

fsmv · 2y ago

Did you use the raw model or the instruction-tuned one? 2B or 7B? You didn't give it much to go on.

SunlitCat · 2y ago

The Mistral model I tried when it came out produced "blog posts" as responses. I assume this somehow depends on where those models get much of their training data from (please correct me if I'm wrong).

brucethemoose2 · 2y ago · 1 in thread

What are HNers looking for in this article? The architectural differences, or how to run/finetune it?

BryanLegend · 2y ago

I was pleased to see the architectural differences highlighted. It's insightful to see how they're evolving.
