Yeah, you can't change the model much with low LRs. That's the point! It's the same reason you don't get continual learning if you just keep using low LRs (https://arxiv.org/abs/2403.08763): you need to really shake up the model if you want it to learn some genuinely better (i.e. different) internal representations that exploit the DenseNet (https://arxiv.org/abs/1608.06993) / LTG-BERT (https://arxiv.org/abs/2311.02265) arch you're using here.
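To make the "shake up the model" point concrete: the continual-pretraining paper above finds you have to *re-warm* the LR back up to a high peak and decay it again, not resume at the old tiny final LR. A minimal sketch of such a re-warm-then-cosine-decay schedule (the peak/min/warmup values are illustrative, not from anywhere in particular):

```python
import math

def rewarmed_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Re-warm the LR linearly up to peak_lr, then cosine-decay back to min_lr.

    Starting a new training phase at min_lr (instead of staying there)
    is what lets the optimizer actually move the representations.
    """
    if step < warmup_steps:
        # Linear re-warm from min_lr up to peak_lr.
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    # Cosine decay from peak_lr back down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The contrast with "just keep using low LRs" is the re-warm phase: without it, every step stays near min_lr and the model barely moves.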