Finetuning GPT-2 to Generate Beatles Lyrics

(towardsdatascience.com)

55 points · eugenhotaj · 6y ago · 14 comments

kastnerkyle · 6y ago

Tricks in beam search to force rhyme schemes, or techniques like constrained Markov chains (cf. https://redylan.neocities.org/#/how-it-works/ and https://github.com/gabrielebarbieri/markovchain) can give really strong results in lyric / structured text generation.

Might be worth investigating if you are interested in this application.

gwern · 6y ago

Is beam search a good idea? Whenever anyone tries beam search on a neural language model like a char-RNN or GPT-2, it seems to generally either do little or make it much worse (by exacerbating the repetition problem), and get worse the more beams/computation you do: e.g. https://github.com/karpathy/char-rnn/issues/138 or https://arxiv.org/abs/1904.09751

yorwba · 6y ago

If I'm interpreting "Tricks in beam search to force rhyme schemes" correctly, the idea is to filter the beams and only keep those which correspond to the chosen scheme. You don't have to use beam search to be able to do that; you could also roll back the generation process whenever it doesn't rhyme and try again with a different alternative.
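A minimal sketch of the roll-back-and-retry idea, under loud assumptions: `sample_line` is a hypothetical stand-in that draws words uniformly from a toy vocabulary (a real setup would sample from the language model), and `rhymes` is a crude suffix match rather than a phonetic test.

```python
import random

def rhymes(a: str, b: str, k: int = 2) -> bool:
    """Crude rhyme test (assumption): different words sharing their last k letters."""
    a, b = a.lower(), b.lower()
    return a != b and a[-k:] == b[-k:]

def sample_line(rng: random.Random, vocab: list, length: int = 4) -> list:
    """Hypothetical stand-in for a language model: draws words uniformly."""
    return [rng.choice(vocab) for _ in range(length)]

def generate_couplet(rng: random.Random, vocab: list, max_tries: int = 1000):
    """Sample a first line, then resample the second line until it rhymes."""
    first = sample_line(rng, vocab)
    for _ in range(max_tries):
        # "Rolling back" here means discarding the failed line and sampling again.
        second = sample_line(rng, vocab)
        if rhymes(first[-1], second[-1]):
            return first, second
    return None  # gave up: no rhyming continuation found within the budget

vocab = ["day", "say", "night", "light", "love", "dove"]
couplet = generate_couplet(random.Random(0), vocab)
```

Rejecting whole lines is the simplest form; rolling back only the last few tokens would waste less computation, at the cost of more bookkeeping.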

eugenhotaj (OP) · 6y ago

I've seen some research [1] where the authors use beam search with an explicit diversity penalty to get around the repetition problem. They seem to get good results.

[1] https://arxiv.org/pdf/1610.02424.pdf

kastnerkyle · 6y ago

There are many flavors of beam search - I have found that for adding explicit checks and constraints (for example rhyme constraints or certain pivot words) the resulting proposals are generally a lot better. Even with simple Markov chains I see pretty diverse behavior depending on beam search style.

Some of the better ones I used were variants of diverse beam search and stochastic beam search, usually combined. The "classic" / pure variant has generally not been as useful in generative modeling for me; it tends to collapse to basically one or two effective candidates (with maybe some filler words changed) fairly quickly.

Also it seems to generally work better for me in conditional generation than in unconditional generation (e.g. charRNN / some uses of GPT-2). However, things like the "repetition problem" can be removed by construction if you are willing to hack in the beam search just a little bit. See https://badsamples.tumblr.com/post/160777871547/stochastic-s... (stochastic, diverse beam search w Markov iirc) vs https://badsamples.tumblr.com/post/160767248407/a-markov-arg... (fixed beam search, where I didn't try to remove repetition or anything special, same Markov setup)

Sometimes I also manipulate probabilities with masks and things directly, and that also combines fine with beam search in the experiments I have done.

Nucleus sampling works well, and if you don't want to control or constrain the output in unconditional generation I don't know that beam search really does much. But for conditional generation, or post-hoc hacks to get more control over a generator I find beam search variants really useful. Especially combined with a specifically conditional architecture.

For example, conditioning the language model on a particular bag-of-rhyme-words + (stochastic, probably) beam search to force rhyme pairs at the start and end of lines, probably further modified by input and output masks to "blank out" invalid tokens and tell the model which tokens will be blanked out. I've used some blend of these tricks in speech, music, and text experiments, and it can be helpful if you have structure that is important to replicate and a simple model with simple sampling just isn't replicating the necessary structure.
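A rough sketch of the output-mask part, assuming the model exposes one logit per vocabulary entry; the function name `masked_probs`, the toy vocabulary, and the logit values are illustrative and not taken from the experiments described above.

```python
import math

def masked_probs(logits, allowed):
    """Softmax over logits, with disallowed tokens forced to probability 0."""
    # Blank out invalid tokens by sending their logits to -infinity.
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("mask removes every token")
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary: only words rhyming with "above" stay valid.
vocab = ["love", "dove", "car", "night"]
logits = [2.0, 1.0, 3.0, 0.5]
allowed = [True, True, False, False]
probs = masked_probs(logits, allowed)
```

The same masked distribution can feed either plain sampling or a beam search scorer, which is why the two combine without trouble.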

EDIT: One practical reason to do this would be plagiarism detection, especially if fine-tuning on a small corpus. There are ways with guarantees by construction (https://www.researchgate.net/profile/Pierre_Roy2/publication...) but simple setups using beam searches and tries can also do constraint checks for n-grams of certain lengths. Concretely, set up tries for the 1-, 2-, 3-, ..., (n-1)-grams, which are considered "valid" transitions, then set up a "bad" trie for n-grams. Check these tries during generation, and throw out any candidate that matches the "bad" trie, even if it still matches the good ones.

See the line of Max Order work from the Sony CSL lab (formerly run by Francois Pachet) for some examples of this.
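A simplified sketch of the trie check, assuming word-level tokens and nested dicts as the trie. It only checks the order n-1 "good" trie rather than every shorter order, and the helper names are hypothetical, so treat it as an illustration of the idea rather than the Max Order algorithm itself.

```python
def build_trie(tokens, n):
    """Nested-dict trie containing every n-gram of the training tokens."""
    root = {}
    for i in range(len(tokens) - n + 1):
        node = root
        for tok in tokens[i:i + n]:
            node = node.setdefault(tok, {})
    return root

def in_trie(trie, gram):
    """True if the token sequence is a path starting at the trie root."""
    node = trie
    for tok in gram:
        if tok not in node:
            return False
        node = node[tok]
    return True

def is_allowed(candidate, good, bad, n):
    """Keep a candidate whose (n-1)-grams were all seen but which copies no full n-gram."""
    for i in range(len(candidate) - n + 1):
        if in_trie(bad, candidate[i:i + n]):
            return False  # verbatim n-gram from the corpus: treat as plagiarism
    for i in range(len(candidate) - n + 2):
        if not in_trie(good, candidate[i:i + n - 1]):
            return False  # transition never seen in training
    return True

corpus = "let it be and let me be".split()
n = 3
good = build_trie(corpus, n - 1)  # valid transitions (bigrams here)
bad = build_trie(corpus, n)       # forbidden verbatim trigrams
```

In a beam search, `is_allowed` would be applied to each extended hypothesis, and failing hypotheses dropped before the next step.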

gwern · 6y ago

His data formatting could be improved here. Title + authors would be better off denoted somehow, like using quotes, and the separate songs should be explicitly delimited using '<|endoftext|>' - looking at the samples in https://github.com/EugenHotaj/beatles/blob/master/gpt_2_gene... , GPT-2 does manage to mostly figure out that the songs are separate, but omitting '<|endoftext|>' makes it harder on GPT-2, more prone to run-ons (already a problem with GPT-2), and also makes prompting less effective (since you can't prompt it like '<|endoftext|>"On The Run" by John Lennon\n' to make it generate lyrics for a specific title & author). Also wouldn't be bad if he had included the specific commands + hyperparameters for the nshepperd repo he's apparently using, even if only the defaults along the lines of the examples in my own writeup ( https://www.gwern.net/GPT-2 ).

I'm not surprised that GPT-2-117M has memorized songs by the end of training; it's not a very large corpus of songs. Hard to learn and generalize well from it. If one were working more on this, it'd probably make sense to train on a much larger and more varied corpus of songs (with inline metadata properly formatted to allow controllable generation); something like RapGenius, maybe?

eugenhotaj (OP) · 6y ago

Hi, author here.

Yea I did the delimiting you mentioned when "training" a bigram model. For GPT-2 I was mostly interested in how well the model would be able to pick up signals from the raw data so I didn't do any kind of preprocessing at all (it's also not very fun ;)). I think it's interesting that the model was able to pick up titles, authors, starts/ends of songs on its own.

I didn't try generating specific songs but that's a good idea. Having the delimiters would probably improve things but feeding in "On the Run\nJohn Lennon" would work as well with the current approach.

Using the RapGenius corpus is also something interesting that I didn't think about. The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.

gwern · 6y ago

> it's also not very fun

Pish-posh! It's a single simple search-and-replace: replace '\n\n\n' with '\n<|endoftext|>\n' or so. For bonus, you can use regexp capture groups to rewrite the metadata simultaneously - something like '\n\n\n\(.*\)\n\(.*\)' → '\n<|endoftext|>\n"\1", by \2\n'.
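For illustration, the same rewrite sketched with Python's `re` module. It assumes the raw corpus separates songs with two blank lines and starts each song with a title line followed by an author line; the function name `add_delimiters` and the sample text are made up for the example.

```python
import re

# Two blank lines, then a title line, then an author line.
SONG_START = re.compile(r"\n\n\n(.*)\n(.*)")

def add_delimiters(raw: str) -> str:
    """Insert '<|endoftext|>' between songs and rewrite each header as '"Title", by Author'."""
    return SONG_START.sub(r'\n<|endoftext|>\n"\1", by \2\n', raw)

raw = "Song A\nJohn\nla la\n\n\nSong B\nPaul\nna na\n"
formatted = add_delimiters(raw)
```

Note that the first song in a file has no preceding blank lines, so its header is left untouched and would need to be handled separately.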

> The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.

You can do it either way: either train a single model on a multi-artist corpus and then simply prompt it appropriately, or train the single model and then further finetune on just the specific artist. I've tried both in various ways with GPT-2 and StyleGAN, and it's not clear which is best, although I hypothesize that the two-stage pretraining works best with very small corpuses, where in the multi-artist corpus single model, all the other artists might 'squeeze out' the desired artist (a kind of class imbalance), eliminating the transfer benefits.

With StyleGAN, a major benefit of the two-stage pretraining approach is that there's no easy way to 'condition' on a specific class or input; so with my anime face generator (https://www.gwern.net/Faces), when I wanted specific characters, I'd just finetune on that character alone because it's easy to select out just their data and create character-specific corpuses.

drusepth · 6y ago

>To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.

OT: Is that how fine-tuning actually works with GPT-2? It makes sense that it'd just be strengthening connections on the most-recently-fine-tuned corpus, with previous fine-tunes still around in some way.

Should you expect that first fine tune to pick up and solidify song structure, rhyme, etc, and the second fine tune to keep those concepts in place while muddying up other aspects like the specific lyrics used?

(Hope this doesn't come off as "you're wrong" or too off topic -- I'm just very interested and would love to read more about how all this works. :) )

lostmsu · 6y ago

Or any lyrics: http://billion.dev.losttech.software:2095/

And the blog article: https://habr.com/post/453232/ (also there's no paywall here)

eugenhotaj (OP) · 6y ago

Really cool stuff, thanks for sharing!

HeWhoLurksLate · 6y ago

Tomorrow, anybody?
