Might be worth investigating if you are interested in this application.
Some of the better ones I've used were variants of diverse beam search and stochastic beam search, usually combined. The "classic" / pure variant has generally not been as useful in generative modeling for me; it tends to collapse to basically one or two effective candidates (with maybe some filler words changed) fairly quickly.
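For anyone wondering what the "stochastic" part can look like, here is a minimal sketch of one common reading: instead of keeping the top-k continuations deterministically, perturb the log-probabilities with Gumbel noise and keep the top-k of the perturbed scores, which amounts to sampling k candidates without replacement. The function name `gumbel_top_k` and the toy distribution are made up for illustration, and this is only the candidate-selection step, not a full beam search and not necessarily the variant used above.

```python
import math
import random

def gumbel_top_k(log_probs, k, rng):
    """Pick k distinct indices by adding Gumbel noise to each log-prob
    and keeping the k largest perturbed scores."""
    noisy = []
    for idx, lp in enumerate(log_probs):
        # Clamp u away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((lp + gumbel, idx))
    noisy.sort(reverse=True)
    return [idx for _, idx in noisy[:k]]

# Toy next-token distribution over four tokens (hypothetical numbers).
log_probs = [math.log(p) for p in [0.5, 0.3, 0.15, 0.05]]
rng = random.Random(0)
picked = gumbel_top_k(log_probs, 2, rng)
```

Because the noise is redrawn at every step, low-probability continuations occasionally survive, which is one way to keep the beam from collapsing onto near-duplicates.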
Also it seems to generally work better for me in conditional generation than in unconditional generation (e.g. charRNN / some uses of GPT-2). However, things like the "repetition problem" can be removed by construction if you are willing to hack in the beam search just a little bit. See https://badsamples.tumblr.com/post/160777871547/stochastic-s... (stochastic, diverse beam search w Markov iirc) vs https://badsamples.tumblr.com/post/160767248407/a-markov-arg... (fixed beam search, where I didn't try to remove repetition or anything special, same Markov setup)
Sometimes I also manipulate probabilities with masks and things directly, and that also combines fine with beam search in the experiments I have done.
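A minimal sketch of that kind of masking, assuming the model hands back a plain list of next-token probabilities and you have a boolean mask of which tokens are currently allowed. The name `mask_and_renormalize` and the numbers are illustrative only.

```python
def mask_and_renormalize(probs, allowed):
    """Zero out disallowed tokens, then rescale the rest to sum to 1."""
    masked = [p if ok else 0.0 for p, ok in zip(probs, allowed)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("mask removed every token")
    return [p / total for p in masked]

# Toy example: the second token is invalid in the current context.
probs = [0.5, 0.3, 0.2]
allowed = [True, False, True]
masked = mask_and_renormalize(probs, allowed)
```

The masked distribution can then be fed to whatever candidate-selection step the beam search uses, so the two tricks compose without either knowing about the other.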
Nucleus sampling works well, and if you don't want to control or constrain the output in unconditional generation, I don't know that beam search really does much. But for conditional generation, or post-hoc hacks to get more control over a generator, I find beam search variants really useful, especially combined with a specifically conditional architecture.
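For reference, a rough sketch of nucleus (top-p) sampling in plain Python: keep the smallest set of most-likely tokens whose cumulative probability reaches `top_p`, renormalize, and sample from that set. The function names and the toy distribution are assumptions for illustration, not taken from any particular library.

```python
import random

def nucleus_filter(probs, top_p):
    """Map token index -> renormalized probability for the smallest
    set of most-likely tokens whose total mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        mass += p
        if mass >= top_p:
            break
    return {idx: p / mass for idx, p in kept}

def nucleus_sample(probs, top_p, rng):
    """Draw one token index from the nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution: with top_p=0.75 only the first two tokens survive.
probs = [0.5, 0.3, 0.15, 0.05]
nucleus = nucleus_filter(probs, 0.75)
token = nucleus_sample(probs, 0.75, random.Random(0))
```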
For example, conditioning the language model on a particular bag-of-rhyme-words + (stochastic, probably) beam search to force rhyme pairs at the start and end of lines, probably further modified by input and output masks to "blank out" invalid tokens and tell the model which tokens will be blanked out. I've used some blend of these tricks in speech, music, and text experiments, and it can be helpful if you have structure that is important to replicate and a simple model with simple sampling just isn't replicating the necessary structure.
EDIT: One practical reason to do this would be plagiarism detection, especially if fine-tuning on a small corpus. There are ways with guarantees by construction (https://www.researchgate.net/profile/Pierre_Roy2/publication...) but simple setups using beam searches and tries can also do constraint checks for n-grams of certain lengths. Concretely, set up tries for 1-2-3-4-...-nminus1-grams, which are considered "valid" transitions, then set up a "bad" trie for n-grams. Check these tries during generation, and throw out any candidate that matches the "bad" trie, even if it still matches the good ones.
See the line of Max Order work from the Sony CSL lab (formerly run by Francois Pachet) for some examples of this.
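A toy sketch of the trie check described above, using nested dicts as tries. It assumes word-level tokens, a single "good" order of 2 and a "bad" order of 3. The helper names (`build_trie`, `in_trie`, `allowed`) and the tiny corpus are made up, and this is only the simple filtering version, not the Max Order construction with guarantees.

```python
def build_trie(sequences, order):
    """Nested-dict trie holding every n-gram of length `order`."""
    root = {}
    for seq in sequences:
        for i in range(len(seq) - order + 1):
            node = root
            for tok in seq[i:i + order]:
                node = node.setdefault(tok, {})
    return root

def in_trie(trie, ngram):
    """True if the whole n-gram can be walked from the trie root."""
    node = trie
    for tok in ngram:
        if tok not in node:
            return False
        node = node[tok]
    return True

def allowed(candidate, good_trie, bad_trie, good_order, bad_order):
    """Keep a candidate if its last good_order tokens occur in the
    corpus but its last bad_order tokens do not (no long copies)."""
    if len(candidate) >= good_order and not in_trie(good_trie, candidate[-good_order:]):
        return False
    if len(candidate) >= bad_order and in_trie(bad_trie, candidate[-bad_order:]):
        return False
    return True

corpus = ["she loves you yeah yeah yeah".split(),
          "all you need is love".split()]
good = build_trie(corpus, 2)  # bigrams count as valid transitions
bad = build_trie(corpus, 3)   # trigrams copied from the corpus are rejected

ok = allowed("she loves you need".split(), good, bad, 2, 3)
copied = allowed("she loves you yeah".split(), good, bad, 2, 3)
unseen = allowed("she loves is".split(), good, bad, 2, 3)
```

In a beam search, `allowed` would be called on each extended candidate before it is scored, so violating candidates never enter the beam.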
I'm not surprised that GPT-2-117M has memorized songs by the end of training; it's not a very large corpus of songs. Hard to learn and generalize well from it. If one were working more on this, it'd probably make sense to train on a much larger and more varied corpus of songs (with inline metadata properly formatted to allow controllable generation); something like RapGenius, maybe?
Yeah, I did the delimiting you mentioned when "training" a bigram model. For GPT-2 I was mostly interested in how well the model would be able to pick up signals from the raw data, so I didn't do any kind of preprocessing at all (it's also not very fun ;)). I think it's interesting that the model was able to pick up titles, authors, starts/ends of songs on its own.
I didn't try generating specific songs but that's a good idea. Having the delimiters would probably improve things but feeding in "On the Run\nJohn Lennon" would work as well with the current approach.
Using the RapGenius corpus is also something interesting that I hadn't thought about. The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.
Pish-posh! It's a single simple search-and-replace: replace '\n\n\n' with '\n<|endoftext|>\n' or so. For a bonus, you can use regexp capture groups to rewrite the metadata simultaneously - something like '\n\n\n\(.*\)\n\(.*\)' → '\n<|endoftext|>\n"\1", by \2\n'.
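A sketch of that search-and-replace with Python's `re` module, assuming songs are separated by two blank lines and each song opens with a title line followed by an artist line. The sample text is invented, and the replacement drops the trailing newline so no extra blank line appears after the artist.

```python
import re

# Two blank lines, then a title line and an artist line.
SONG_BREAK = re.compile(r"\n\n\n(.*)\n(.*)")

def add_delimiters(text):
    """Insert <|endoftext|> between songs and rewrite the two
    metadata lines as a single '"Title", by Artist' line."""
    return SONG_BREAK.sub(r'\n<|endoftext|>\n"\1", by \2', text)

raw = ("Song A\nArtist A\nline one\nline two\n"
       "\n\n"
       "Song B\nArtist B\nline three\n")
cleaned = add_delimiters(raw)
```

Note the first song in the file has no preceding break, so its metadata is left untouched and would need a separate fix-up.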
> The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.
You can do it either way: either train a single model on a multi-artist corpus and then simply prompt it appropriately, or train the single model and then further finetune on just the specific artist. I've tried both in various ways with GPT-2 and StyleGAN, and it's not clear which is best, although I hypothesize that the two-stage pretraining works best with very small corpuses, where in the multi-artist corpus single model, all the other artists might 'squeeze out' the desired artist (a kind of class imbalance), eliminating the transfer benefits.
With StyleGAN, a major benefit of the two-stage pretraining approach is that there's no easy way to 'condition' on a specific class or input; so with my anime face generator (https://www.gwern.net/Faces), when I wanted specific characters, I'd just finetune on that character alone because it's easy to select out just their data and create character-specific corpuses.
OT: Is that how fine-tuning actually works with GPT-2? It makes sense that it'd just be strengthening connections on the most-recently-fine-tuned corpus, with previous fine-tunes still around in some way.
Should you expect that first fine tune to pick up and solidify song structure, rhyme, etc, and the second fine tune to keep those concepts in place while muddying up other aspects like the specific lyrics used?
(Hope this doesn't come off as "you're wrong" or too off topic -- I'm just very interested and would love to read more about how all this works. :) )
And the blog article: https://habr.com/post/453232/ (also there's no paywall here)