MusicGen is an LLM that runs on top of EnCodec tokens instead of working directly with audio. EnCodec is a neural audio compression algorithm that encodes audio as tokens from a codebook. It's a really clever trick!
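To make "tokens from a codebook" concrete, here is a toy sketch of plain vector quantization. It is not EnCodec's actual code: the real codec uses a learned encoder and residual vector quantization over several codebooks, whereas this uses one tiny hand-written codebook, and the names `nearest_code`, `encode`, and `decode` are made up for illustration.

```python
# Toy vector quantization: each audio "frame" (here a 2-D vector) is replaced
# by the index of its nearest codebook entry. All names and values below are
# hypothetical and chosen only for illustration.

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )

def encode(frames, codebook):
    """Map each frame to a discrete token (a codebook index)."""
    return [nearest_code(frame, codebook) for frame in frames]

def decode(tokens, codebook):
    """Map tokens back to their codebook vectors (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = encode(frames, codebook)
print(tokens)                    # [0, 3, 2]
print(decode(tokens, codebook))  # the quantized approximation of the frames
```

The language model then only has to predict the next integer in a sequence like `tokens`, which is the same shape of problem as predicting the next word.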
"We introduce a proper inductive bias of periodicity to the generator by applying a recently proposed periodic activation called Snake function (Liu et al., 2020), defined as fα(x) = x + 1 α sin2 (αx), where α is a trainable parameter that controls the frequency of the periodic component of the signal and larger α gives higher frequency. The use of sin2 (x) ensures monotonicity and renders it amenable to easy optimization. Liu et al. (2020) demonstrates this periodic activation exhibits an improved extrapolation capability for temperature and financial data prediction."
I think you just described Chicago acid house :-P
In this day and age, curiosity is not worth the risk.
  File "predict.py", line 211, in predict
    raise ValueError(
ValueError: Failed to generate a loop in the requested 60.23 bpm. Please try again.
EDIT: At 52 bpm (exact) it seems to work. What it generated would not sound good if looped, however. In terms of style, it sounded a little music-box-like, celesta or so (think of the beginning of the Harry Potter soundtrack), with some sustained strings and pizzicato strings. That would be appropriate, except the rhythm and chords are fairly random and I wouldn't exactly call this musical :)
Firstly, with temperature set to 2:
“Amen break with a bag of spanners” (140 bpm): If the amen break is in there, I can’t tell. There does seem to be a kind of harp/bell thing doing the melody, though.
“John Bonham with kettle drums” (90 bpm): Lots of guitar, subdued drums, but could definitely be late-period Zeppelin. Variation 2 is the exception: Zep at the start and end, long pause in the middle so John can drag his sticks along a LEGO oil tanker.
“John Bonham with kettle drums and angry cat” (90 bpm): We are now inside the oil tanker.
Now setting the temperature to 1:
“Hardfloor in Luton Primark” (90 bpm): The bpm setting was an accidental leftover from the previous experiment, and the result sounds much more Primark than Hardfloor.
“Portishead at cheezy funfair” (110 bpm): It’s a very folk-y funfair. Accordions? Organs? What the hell?
Hours of fun! Again, many thanks!
Edit: Am I misinterpreting the term "looper" here? It just made an output with a fade-out.
It worked better with "suomisaundi psychedelic trance spugedelic".
This is a neat idea in many ways.
Just seems like a fundamentally different problem than a photograph or painting.
Perhaps it's getting hugged to death...