I have a PC that can run, e.g., Mistral Instruct 7B Q4 inference at around 30 tokens/s.
How expensive, in terms of compute and memory, would it be to run backpropagation in addition to inference?
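A rough way to gauge the compute side is the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per token, while a forward plus backward pass costs about 6 (the backward pass is roughly twice the forward). The sketch below is only a back-of-envelope estimate. It assumes roughly 7.24 billion parameters for Mistral 7B and a hypothetical 2,000-token conversation, and it ignores attention cost and whether the hardware is compute- or memory-bound.

```python
# Back-of-envelope FLOP estimate using the ~2N (forward) and ~6N (forward + backward)
# FLOPs-per-token rule of thumb, where N is the parameter count.

PARAMS = 7.24e9  # approximate parameter count of Mistral 7B (assumption)


def inference_flops(params: float, tokens: int) -> float:
    """Forward pass only: about 2 FLOPs per parameter per token."""
    return 2 * params * tokens


def training_flops(params: float, tokens: int) -> float:
    """Forward + backward: backward is ~2x the forward, so ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


if __name__ == "__main__":
    tokens = 2_000  # hypothetical length of one conversation
    inf = inference_flops(PARAMS, tokens)
    train = training_flops(PARAMS, tokens)
    print(f"inference:     {inf:.2e} FLOPs")
    print(f"training pass: {train:.2e} FLOPs")
    print(f"ratio:         {train / inf:.1f}x")
```

Token-by-token generation on a desktop is usually limited by memory bandwidth rather than raw FLOPs, so the wall-clock cost of a training pass will not necessarily track this 3x FLOP ratio.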
I'm aware that these models are typically trained on far more, and better, data than what a normal conversation provides. On the other hand, if I could fine-tune my local model a teeny tiny bit during or after each conversation I have with it anyway, after a while it would be perfectly customized for me.
I'm also aware that this could be problematic for models used by multiple users, but my intended use case is personal use by a single user.
AFAIK the weights you're training can’t stay quantized during backprop, so right there you’d need a ton of RAM.
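To put "a ton of RAM" into numbers, here is a rough sketch comparing the Q4 weights used for inference with full fine-tuning using mixed-precision Adam, which is commonly estimated at about 16 bytes per parameter. The parameter count (~7.24B) and the ~4.5 bits per quantized weight (including scales) are assumptions, and activations, KV cache and framework overhead are left out entirely.

```python
# Rough memory for weights plus optimizer state of a ~7B model; activations,
# KV cache and framework overhead are deliberately left out.

PARAMS = 7.24e9  # approximate parameter count of Mistral 7B (assumption)
GB = 1e9


def q4_inference_bytes(params: float, bits_per_weight: float = 4.5) -> float:
    """Quantized weights only; ~4.5 bits/weight is an assumed average including scales."""
    return params * bits_per_weight / 8


def full_finetune_adam_bytes(params: float) -> float:
    """Mixed-precision Adam: 16-bit weights (2) + 16-bit grads (2)
    + fp32 master weights (4) + fp32 Adam first and second moments (4 + 4)."""
    return params * (2 + 2 + 4 + 4 + 4)


if __name__ == "__main__":
    print(f"Q4 inference weights:       {q4_inference_bytes(PARAMS) / GB:6.1f} GB")
    print(f"full fine-tune (Adam, AMP): {full_finetune_adam_bytes(PARAMS) / GB:6.1f} GB")
```

That is roughly 4 GB versus more than 100 GB before counting activations. Parameter-efficient methods such as QLoRA avoid much of this by keeping the frozen base weights in 4-bit and training only small higher-precision adapter matrices, at the cost of not updating the full model.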
Backprop is faster because it can be parallelized across all the tokens of a training sequence, unlike token-by-token generation, but IIRC you need to hold an entire copy of the model for each backprop process.
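To see how one full copy per process adds up, here is a sketch that multiplies the per-replica footprint by the number of data-parallel workers. It reuses the assumed 16 bytes per parameter for mixed-precision Adam and, for contrast, an idealized sharded layout (the idea behind ZeRO / FSDP) where weights, gradients and optimizer state are split evenly across workers. Real sharding adds communication buffers and other overhead that this ignores.

```python
# Per-worker and total memory: full replicas per worker vs. idealized even sharding.

PARAMS = 7.24e9        # approximate parameter count of Mistral 7B (assumption)
BYTES_PER_PARAM = 16   # assumed mixed-precision Adam footprint per parameter
GB = 1e9


def replicated_bytes_per_worker(params: float) -> float:
    """Each worker holds its own full copy of weights, grads and optimizer state."""
    return params * BYTES_PER_PARAM


def sharded_bytes_per_worker(params: float, workers: int) -> float:
    """Idealized sharding: the same state split evenly across all workers."""
    return params * BYTES_PER_PARAM / workers


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        rep = replicated_bytes_per_worker(PARAMS) / GB
        shard = sharded_bytes_per_worker(PARAMS, workers) / GB
        print(f"{workers} workers: replicated {rep:6.1f} GB/worker "
              f"(total {rep * workers:6.1f} GB), sharded {shard:6.1f} GB/worker")
```

On a single PC there is only one worker anyway, so the replica cost mostly matters if you try to spread training over several GPUs.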