Would you say the following understanding is correct?:
- You can fine-tune a model regardless of whether it has been quantized or not (as in the 4-bit versions of models made to fit in consumer-grade RAM sizes).
- You can fine-tune any model on any hardware, provided it fits into RAM. That means the 30B LLaMA-derived models, whose 4-bit quantized versions require about 19.5 GB of VRAM, can be fine-tuned on consumer-grade GPUs with 24 GB of VRAM (like the RTX 3090 and 4090).
On the second point, I'm not sure the memory requirements for training are the same as for inference, because you have to keep optimizer state around, which takes extra memory.
But nonetheless, training time improvements look interesting.
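To make the optimizer-state concern concrete, here is a rough back-of-envelope sketch. It assumes Adam as the optimizer (two fp32 moment estimates per trainable parameter) and an illustrative ~60M adapter parameter count for a LoRA fine-tune; neither number comes from the thread, they are just plausible assumptions:

```python
# Back-of-envelope: extra memory needed for optimizer state during training.
# Assumption: Adam keeps two fp32 states (m and v) per trainable parameter.
# With LoRA-style fine-tuning the base weights are frozen, so only the
# adapter parameters carry optimizer state.

def adam_state_bytes(trainable_params: int) -> int:
    # two fp32 moment estimates, 4 bytes each, per parameter
    return trainable_params * 2 * 4

full_ft = adam_state_bytes(30_000_000_000)  # full fine-tune of a 30B model
lora_ft = adam_state_bytes(60_000_000)      # hypothetical ~60M adapter params

print(f"full fine-tune optimizer state: {full_ft / 1e9:.0f} GB")   # 240 GB
print(f"LoRA fine-tune optimizer state: {lora_ft / 1e9:.2f} GB")   # 0.48 GB
```

This is why adapter-based fine-tuning can squeeze into the same 24 GB card that inference already nearly fills: the optimizer state scales with the trainable parameters, not the full model.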
e: Oh I see, the training time improvement is compared to a grid search over the LoRA rank, not to a single run.
I am not convinced that you shouldn't just train on the highest possible rank that you can with your compute budget. If you can train a DynLoRA with rank 8, why not just train a LoRA with that rank?
Maybe if the "optimal rank" of LoRA applies to any adaptation and you're interested in training multiple adaptations for different use cases?
Personally, I'm not convinced that the "best rank" isn't simply the highest one your compute budget allows.
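For reference, the cost of rank is easy to quantify: for a d_out x d_in weight matrix W, LoRA learns two small matrices B (d_out x r) and A (r x d_in), so the added parameter count grows only linearly in the rank r. The 6656-wide projection below is illustrative of a 30B-class model, not a figure from the thread:

```python
# Sketch: extra parameters a LoRA adapter adds to one weight matrix.
# W is d_out x d_in; LoRA adds B (d_out x r) and A (r x d_in).

def lora_params(d_out: int, d_in: int, r: int) -> int:
    return d_out * r + r * d_in

# Illustrative hidden size for a 30B-class model (assumed, not from the thread)
for r in (4, 8, 16):
    print(f"rank {r:2d}: {lora_params(6656, 6656, r):,} adapter params")
```

Doubling the rank doubles the adapter size, which is tiny relative to the frozen base model either way; that linear cost is the intuition behind "just train at the highest rank your budget allows."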
It seems like they use a fixed-distribution controller for training. It’d be nice to see why it’s worth deviating from the original RL paradigm.