Would you say the following understanding is correct?:
- You can fine-tune a model regardless of whether it has been quantized or not (as in the 4-bit versions of models made to fit in consumer-grade RAM sizes).
- You can fine-tune any model on any hardware, provided it fits into RAM. That means the 30B LLaMA-derived models, whose 4-bit quantized versions require about 19.5 GB of VRAM, can be fine-tuned on consumer-grade GPUs with 24 GB of VRAM (like the RTX 3090 and 4090).
On the second point, I'm not sure the memory requirements for training are the same as for inference, because you have to keep optimizer state around, which takes extra memory.
But nonetheless, training time improvements look interesting.
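To make the optimizer-state concern concrete, here is a rough back-of-envelope sketch. It assumes Adam as the optimizer (two fp32 moment estimates per trainable parameter) and an illustrative ~60M adapter parameter count for a LoRA fine-tune; neither number comes from the thread, they are just plausible assumptions:

```python
# Back-of-envelope: extra memory needed for optimizer state during training.
# Assumption: Adam keeps two fp32 states (m and v) per trainable parameter.
# With LoRA-style fine-tuning the base weights are frozen, so only the
# adapter parameters carry optimizer state.

def adam_state_bytes(trainable_params: int) -> int:
    # two fp32 moment estimates, 4 bytes each, per parameter
    return trainable_params * 2 * 4

full_ft = adam_state_bytes(30_000_000_000)  # full fine-tune of a 30B model
lora_ft = adam_state_bytes(60_000_000)      # hypothetical ~60M adapter params

print(f"full fine-tune optimizer state: {full_ft / 1e9:.0f} GB")   # 240 GB
print(f"LoRA fine-tune optimizer state: {lora_ft / 1e9:.2f} GB")   # 0.48 GB
```

This is why adapter-based fine-tuning can squeeze into the same 24 GB card that inference already nearly fills: the optimizer state scales with the trainable parameters, not the full model.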
e: Oh I see, the training time improvement is compared to a grid search over the LoRA rank, not to a single run.
I am not convinced that you shouldn't just train on the highest possible rank that you can with your compute budget. If you can train a DynLoRA with rank 8, why not just train a LoRA with that rank?
Maybe if the "optimal rank" of LoRA applies to any adaptation and you're interested in training multiple adaptations for different use cases?
Personally, I'm not convinced that the "best rank" isn't simply the highest one your compute budget allows.
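For reference, the cost of rank is easy to quantify: for a d_out x d_in weight matrix W, LoRA learns two small matrices B (d_out x r) and A (r x d_in), so the added parameter count grows only linearly in the rank r. The 6656-wide projection below is illustrative of a 30B-class model, not a figure from the thread:

```python
# Sketch: extra parameters a LoRA adapter adds to one weight matrix.
# W is d_out x d_in; LoRA adds B (d_out x r) and A (r x d_in).

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

# Illustrative hidden size for a 30B-class model (assumed, not from the thread)
for r in (4, 8, 16):
    print(f"rank {r:2d}: {lora_params(6656, 6656, r):,} adapter params")
```

Doubling the rank doubles the adapter size, which is tiny relative to the frozen base model either way; that linear cost is the intuition behind "just train at the highest rank your budget allows."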
It seems like they use a fixed-distribution controller for training. It’d be nice to see why it’s worth deviating from the original RL paradigm.