I would expect it to (but I haven't thought about it too deeply so I could be extremely wrong). My thinking is as follows:
At the end of the day, all we're doing is maximum likelihood estimation: we're trying to find model parameters that define a probability distribution under which our observed data is as probable as possible. For the original GPT-2, this observed data is WebText, i.e. text scraped from outbound links on Reddit that received at least 3 karma. Since this data is so diverse, there isn't really any special structure for the model to pick up on beyond whatever structure exists in the English language itself.
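To make that concrete, here is a sketch of the objective in my own notation (the symbols $\mathcal{D}$, $\theta$, and $x_{<t}$ are my choice, not taken from the GPT-2 paper). For a corpus $\mathcal{D}$ of token sequences $x = (x_1, \dots, x_T)$:

```latex
\hat{\theta}
  = \arg\max_{\theta} \sum_{x \in \mathcal{D}} \log p_{\theta}(x)
  = \arg\max_{\theta} \sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})
```

Fine-tuning keeps this objective and only swaps out $\mathcal{D}$, starting the optimization from the previously learned parameters instead of a random initialization.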
However, when we fine-tune on RapGenius, the observed data is now songs. These songs have a certain structure to them such as stanzas, rhyming, etc. In order to maximize the likelihood of this data, the model must learn to model the structure.
Finally, if we further fine-tune on Beatles lyrics, the model is again trying to find parameters which maximize the likelihood of the data. So the model will try to match both the lyrics and the structure of Beatles songs. It's likely that the structure of Beatles songs is pretty similar to that of the other songs from RapGenius, so what will mostly change is the lyrics. Also, shifting toward Beatles lyrics seems to be the most straightforward way to maximize the likelihood, since the objective directly rewards making these particular lyrics more probable.
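Here is a toy sketch of that argument. It is not GPT-2: I'm assuming a smoothed bigram count model, and "fine-tuning" is stood in for by adding the new corpus's counts with extra weight (`fine_tune` and its `weight` parameter are hypothetical helpers I made up). The three corpora are invented placeholder lines, not real data. It only illustrates that each stage pulls the likelihood toward whatever data it is trained on, and says nothing about how a transformer learns song structure.

```python
import math
from collections import Counter


def bigram_counts(lines):
    """Count (previous token, current token) pairs, with boundary markers."""
    counts = Counter()
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


def fine_tune(counts, lines, weight=5):
    """Toy stand-in for fine-tuning: add the new corpus's counts with extra weight."""
    updated = Counter(counts)
    for pair, c in bigram_counts(lines).items():
        updated[pair] += weight * c
    return updated


def avg_log_likelihood(counts, lines, vocab_size, alpha=1.0):
    """Mean per-token log-probability of `lines` under an add-alpha smoothed bigram model."""
    context_totals = Counter()
    for (prev, _), c in counts.items():
        context_totals[prev] += c
    total, n = 0.0, 0
    for line in lines:
        tokens = ["<s>"] + line.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (counts[(prev, cur)] + alpha) / (context_totals[prev] + alpha * vocab_size)
            total += math.log(p)
            n += 1
    return total / n


# Invented placeholder corpora (assumptions for illustration only).
general = ["the report was published today", "the team won the game today"]
songs = ["baby i miss you tonight", "dance with me tonight baby"]
beatles = ["i need your love tonight", "your love is all i need"]

vocab = {"<s>", "</s>"} | {tok for line in general + songs + beatles for tok in line.split()}

base = bigram_counts(general)        # "pretraining" on diverse text
stage1 = fine_tune(base, songs)      # first fine-tune: generic songs
stage2 = fine_tune(stage1, beatles)  # second fine-tune: target lyrics

for name, model in [("base", base), ("after songs", stage1), ("after beatles", stage2)]:
    print(
        name,
        "songs:", round(avg_log_likelihood(model, songs, len(vocab)), 3),
        "beatles:", round(avg_log_likelihood(model, beatles, len(vocab)), 3),
    )
```

In this toy, each fine-tuning stage raises the average log-likelihood of the corpus it was just trained on, which is all the maximum likelihood argument above relies on.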
That being said, this is all just conjecture. It would be interesting to try out both methods and see if you get better results doing this two-step fine-tuning vs. the original fine-tuning (or just fine-tuning on RapGenius and then conditionally sampling Beatles songs, as @gwern suggested).
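One way to make "better results" measurable, assuming you hold out some Beatles lyrics and can get per-token natural-log probabilities for them from each fine-tuned model, is to compare perplexity. The helper names below (`perplexity`, `rank_by_perplexity`) are my own, and this is only a sketch: lower perplexity on held-out lyrics won't necessarily match which samples people judge to be better.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities on held-out text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_by_perplexity(log_probs_by_model):
    """Return (model name, perplexity) pairs sorted from lowest to highest perplexity."""
    scores = {name: perplexity(lps) for name, lps in log_probs_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Sanity check: a model that assigns probability 0.25 to every token
# should have a perplexity of about 4.
print(perplexity([math.log(0.25)] * 8))
```

You would call `rank_by_perplexity` with one list of held-out log-probabilities per training recipe (two-step, direct, RapGenius-only) and compare the resulting order against your own read of the samples.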