Implementation of Google's Griffin Architecture – RNN LLM

(github.com)

218 points · milliondreams · 2y ago · 38 comments

18 comments · 4 top-level

VHRanger · 2y ago · 8 in thread

Like RWKV and Mamba, this is mixing some RNN properties to avoid the issues transformers have.

However I'm curious about their scaling claims. They have a plot that shows how the model scales in training with the FLOPs you throw at it.

But the issue we should rather be concerned with is the wall time of training for a set amount of hardware.

Back in 2018, we could train medium-sized RNNs; the issues were the wall time of training and training stability.
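
To make the FLOPs-versus-wall-time distinction concrete, here is a minimal sketch. Every number in it is hypothetical, and the function name `wall_time_hours` is invented for illustration. It only shows that two models with the same FLOP budget can take very different wall time if one of them keeps the hardware much less busy.

```python
def wall_time_hours(total_flops, peak_flops_per_sec, utilization):
    """Wall-clock hours needed to spend `total_flops` at a given hardware utilization."""
    return total_flops / (peak_flops_per_sec * utilization) / 3600.0


# All values below are assumptions chosen for illustration, not measurements.
BUDGET = 1e21  # total training FLOPs, identical for both models
PEAK = 1e15    # aggregate peak FLOP/s of the cluster

# A model whose training parallelizes well over the sequence dimension.
parallel_model = wall_time_hours(BUDGET, PEAK, utilization=0.5)
# A model that must step through the sequence and leaves hardware idle.
sequential_model = wall_time_hours(BUDGET, PEAK, utilization=0.05)

print(f"parallel-friendly model: {parallel_model:.0f} h")
print(f"sequential model:        {sequential_model:.0f} h")
```

A scaling plot with FLOPs on the x-axis would treat these two runs as the same point, which is why wall time on fixed hardware is the more practical axis.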

whimsicalism · 2y ago

Transformers were also just better at the LM task than 2018 RNNs for an equal amount of training FLOPs.

VHRanger · 2y ago

Yeah, that's just the training stability part to my knowledge

foota · 2y ago

Do you know the downside with RWKV? Based on how they present it, it seems like the best thing since sliced bread, but I would have assumed that it would have been widely adopted if that were the case.

kouteiheika · 2y ago

The downside is that it's bad (like, really bad) on a certain subset of tasks. I once trained an RWKV v4 model on a machine translation task, and no matter how much I scaled it up it just didn't work at all, while an equivalent transformer did the job without a problem.

Intuitively this does make sense, because a transformer can at any time "look back" at the source sentence and at what it has previously generated (due to its attention mechanism) for every token it outputs, while an RNN like RWKV has to compress this into its internal state which is both lossy and limited in size.

I haven't looked at the new versions of RWKV (apparently we're at v6 now), but hopefully it performs better now. In the end I think that a hybrid architecture probably makes the most sense - have some sort of an attention mechanism for the near context, and an RNN-like state for far context, and that would give you the best of both worlds.
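
The "look back" versus "compress into a state" trade-off can be sketched as a memory count. This is a rough illustration under stated assumptions: plain multi-head attention that caches one key and one value vector of width `d_model` per token per layer, and a recurrent model that carries one state vector of width `d_state` per layer. The sizes and function names are hypothetical.

```python
def kv_cache_floats(seq_len, n_layers, d_model):
    """Floats cached by attention: one key and one value vector per token per layer."""
    return 2 * n_layers * seq_len * d_model


def recurrent_state_floats(n_layers, d_state):
    """Floats carried by a recurrent model: one fixed-size state per layer."""
    return n_layers * d_state


# Hypothetical sizes, for illustration only.
N_LAYERS, D_MODEL, D_STATE = 24, 2048, 2048

for seq_len in (1_000, 8_000, 64_000):
    kv = kv_cache_floats(seq_len, N_LAYERS, D_MODEL)
    rnn = recurrent_state_floats(N_LAYERS, D_STATE)
    print(f"{seq_len:>6} tokens: KV cache {kv:>13,} floats, recurrent state {rnn:,} floats")
```

The attention cache grows linearly with context, so nothing is forgotten but memory keeps rising. The recurrent state stays the same size however long the input is, which is where the lossy compression comes from.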

shawntan · 2y ago

Not sure if this is the type of answer you're looking for, but RWKV is not really recurrent the same way RNNs are recurrent. This quasi-recurrentness allows it and its comrades to use algorithms like parallel SCAN to achieve log N complexity when parallelised. But you pay for that in terms of state-tracking.

There's a cool talk here if you care to know the details: https://www.youtube.com/watch?v=4-VXe1yPDjk
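
As a rough sketch of the parallel scan idea mentioned above: assume the recurrence is a purely linear update `h_t = a_t * h_{t-1} + b_t` with scalar values. That is a simplification for illustration, not RWKV's or Griffin's exact formulation. Because composing two such steps is associative, all prefixes can be computed in about log2(N) rounds, where the updates inside one round are independent of each other. The function names here are made up for the example.

```python
def combine(first, second):
    """Compose two affine steps h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference implementation: N dependent steps, one after another."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_scan(a, b, h0=0.0):
    """Hillis-Steele inclusive scan. Each round only reads the previous round's
    values, so on parallel hardware a round could run in a single step."""
    elems = list(zip(a, b))
    n, step, rounds = len(elems), 1, 0
    while step < n:
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
        rounds += 1
    # elems[i] now holds the composed affine map for steps 0..i; apply it to h0.
    return [a_t * h0 + b_t for a_t, b_t in elems], rounds


a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
states, rounds = parallel_scan(a, b)
print(rounds, states)
```

The trick depends on there being no nonlinearity between `h_{t-1}` and `h_t`. A classic RNN applies one at every step, so it cannot be rearranged this way, which is one way to read the state-tracking cost described above.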

VHRanger · 2y ago

It seems only OK as a model? Looking at the LLM chat leaderboard it's 71st and the 14B version is worse than a lot of 7B models:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Also, llama.cpp makes inference accessible for a lot of people, and it's not available for RWKV.

Not to knock the model, I'm sure it's good. I also like that it's a successful example of citizen science.

It's just not popular enough to have the inference infrastructure transformers have, not established enough to attract enough money to get 60B+ models trained, and so on.

jimmyl02 · 2y ago

From what I know about RWKV, it's mostly a one man effort and doesn't have the same data pipeline / resources as most major labs. It's a bit unfortunate but I'm curious about the performance given the same training corpus as OpenAI's GPTs. Maybe some labs have tried internally but haven't released results? On the other hand it makes sense to invest more money into transformer training runs as they have been proven to work.

They really burst onto the scene and brought back RNNs in the world of transformers. The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme. I'd guess the downside is generalization performance, as there is a difference between doing well on benchmarks and being usable. Personally, I tried running the weights a long time ago when it was first released and the results weren't usable, but I'm sure there has been considerable progress since then.

GaggiX · 2y ago

The paper shows that the speed is comparable to transformer models, and faster at "long" sequence lengths like 8k.

spxneo · 2y ago · 3 in thread

I'm not smart enough to know the significance of this... is Griffin like Mamba?

VHRanger · 2y ago

Yes, like RWKV and Mamba, this is a new generation of models that are more like big RNNs than the pure transformers we have now.

stri8ed · 2y ago

Isn't that how previous models were, before the "Attention Is All You Need" paper?

boywitharupee · 2y ago

and is Griffin a state space model?

riku_iki · 2y ago · 2 in thread

I didn't get one detail: they selected a 6B transformer as the baseline and compared it to a 7B Griffin.

Why wouldn't they select equal-size models?

szundi · 2y ago

They probably had them for some reason and it was cheaper not to retrain one of them again

riku_iki · 2y ago

It's just that the performance comparison is misleading then: they report marginal improvements, which are expected just because of the difference in model size.

janwas · 2y ago · 1 in thread

For anyone interested in a C++ implementation, our github.com/google/gemma.cpp now supports this model.

JyrkiAlakuijala · 2y ago

Fun fact -- gemma.cpp uses highway, an amazing high performance computation library originally developed in the JPEG XL effort.
