However, I'm curious about their scaling claims. They have a plot that shows how the model scales in training with the FLOPs you throw at it.
But what we should really be concerned with is the wall-clock time of training on a fixed amount of hardware.
Back in 2018, we could already train medium-sized RNNs; the issues were the wall-clock time of training and training stability.
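To make the FLOPs vs. wall time distinction concrete, here is a rough back-of-the-envelope sketch. Every number in it is made up for illustration (the FLOP budget, device count, peak throughput, and both utilization figures are assumptions, not measurements of RWKV or any transformer). It only shows that the same FLOP budget can mean very different training times depending on how well the hardware is actually kept busy.

```python
def wall_time_hours(total_flops, n_devices, peak_flops_per_device, utilization):
    """Estimate training wall time in hours.

    utilization is the fraction of peak throughput actually achieved
    (between 0 and 1); it is the term a FLOPs-only plot hides.
    """
    effective_flops_per_second = n_devices * peak_flops_per_device * utilization
    return total_flops / effective_flops_per_second / 3600.0


# Hypothetical numbers, chosen only for illustration.
budget = 1e21          # total training FLOPs
devices = 8            # number of accelerators
peak = 1e14            # peak FLOP/s per accelerator

well_parallelized = wall_time_hours(budget, devices, peak, utilization=0.5)
poorly_parallelized = wall_time_hours(budget, devices, peak, utilization=0.1)

print(f"50% utilization: {well_parallelized:.0f} hours")
print(f"10% utilization: {poorly_parallelized:.0f} hours")
```

Two models that look identical on a loss-vs-FLOPs plot can still differ several times over in wall time if one of them parallelizes worse on the same hardware.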
Intuitively this does make sense, because a transformer can at any time "look back" at the source sentence and at what it has previously generated (due to its attention mechanism) for every token it outputs, while an RNN like RWKV has to compress this into its internal state which is both lossy and limited in size.
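A toy way to see the "lossy and limited in size" point is to count how much the model gets to remember as the context grows. This is a simplified accounting sketch, not RWKV's actual state layout or any real transformer's cache format; the function names, layer count, and dimensions are all hypothetical.

```python
def attention_cache_floats(n_tokens, n_layers, d_model):
    """Floats kept by a transformer-style key/value cache.

    Assumes one key vector and one value vector of size d_model
    per token per layer, so memory grows linearly with n_tokens.
    """
    return 2 * n_tokens * n_layers * d_model


def recurrent_state_floats(n_tokens, n_layers, d_state):
    """Floats kept by an RNN-style fixed-size state.

    n_tokens is accepted only to make the comparison explicit:
    the state size does not depend on it.
    """
    return n_layers * d_state


# Hypothetical model shape, for illustration only.
for n in (1_000, 10_000, 100_000):
    kv = attention_cache_floats(n, n_layers=24, d_model=2048)
    rnn = recurrent_state_floats(n, n_layers=24, d_state=2048)
    print(f"{n:>7} tokens: attention cache {kv:>14,} floats, recurrent state {rnn:,} floats")
```

The attention side can look up any earlier token exactly but pays for it in memory, while the recurrent side has to squeeze everything it has seen into the same fixed number of floats.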
I haven't looked at the new versions of RWKV (apparently we're at v6 now), but hopefully it performs better now. In the end I think that a hybrid architecture probably makes the most sense - have some sort of an attention mechanism for the near context, and an RNN-like state for far context, and that would give you the best of both worlds.
There's a cool talk here if you care to know the details: https://www.youtube.com/watch?v=4-VXe1yPDjk
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Also, llama.cpp makes inference accessible for a lot of people, and it's not available for RWKV.
Not to knock the model, I'm sure it's good. I also like that it's a successful example of citizen science.
It's just not popular enough to have the inference infrastructure transformers have, not established enough to attract enough money to get 60B+ models trained, and so on.
They really burst onto the scene and brought back RNNs in the world of transformers. The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme. I'd guess the issue is its generalizable performance, as there is a difference between doing well on benchmarks and being usable. Personally, I tried running the weights a long time ago, when it was first released, and the results weren't usable, but I'm sure there has been considerable progress since then.
Why wouldn't they select equal-size models?