The sentence embeddings are calculated using a Bidirectional Encoder Representations from Transformers (BERT) model. A pre-trained version of this network, trained on over 1 billion sentence pairs from the internet, is publicly available (thanks, Microsoft). The model transforms your description into a 768-dimensional list of numbers (a vector) that represents the contextual meaning of your sentence.
The model runs off a dataset of musical metadata for 35,000 songs. As a "chronically online music nerd", I knew where to find it. The metadata is very rich: it has a lot of useful columns like the genres, subgenres, and descriptions of tracks. The numerical data is binned into categorical values: "obscure" maps to a popularity between 0 and 10, "highly danceable" maps to a danceability between 80 and 100, and so on. The text data is turned into a coherent sentence: "this song's main genres are _____. this song is from the 80s. the name of this song is lovefool by the cardigans. etc"
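The binning step looks roughly like this (the thresholds and labels besides "obscure" and "highly danceable" are illustrative, not the project's exact cutoffs):

```python
# A minimal sketch of binning numerical metadata into descriptive sentences.
# Only the "obscure" and "highly danceable" buckets come from the post;
# the middle buckets and thresholds are made up for illustration.
def describe_popularity(popularity: int) -> str:
    if popularity <= 10:
        return "this song is obscure."
    elif popularity <= 50:
        return "this song is moderately popular."
    else:
        return "this song is popular."

def describe_danceability(danceability: int) -> str:
    if danceability >= 80:
        return "this song is highly danceable."
    elif danceability >= 40:
        return "this song is somewhat danceable."
    else:
        return "this song is not very danceable."

print(describe_popularity(7))      # this song is obscure.
print(describe_danceability(85))   # this song is highly danceable.
```

Turning numbers into words like this matters because the embedding model only understands text; "popularity: 7" carries much less contextual meaning than "this song is obscure."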
An arduous part of the project was describing each musical genre in depth, each with its own paragraph, so that the genre's actual contextual meaning is captured and not just "This song is a Hyperpop song" or "This song is Adult Contemporary". It was a big exercise in music history that tested my knowledge of music. I also learned a lot about musical genres like "Mongolian Throat Singing" and how it compares to "Gamelan Throat Singing".
I also put the song lyrics for each song through GPT-3 and asked it to summarize the lyrical themes. That's also embedded and used in NLPlaylist.
Each song in our metadata dataset now has a big paragraph describing it, built from its features. The paragraph is split into sentences, and the embedding of each sentence is computed. The final embedding for each song is then calculated by taking a weighted average over all of those sentence embeddings, together with the genre and lyrical embeddings.
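The pooling step can be sketched in a few lines of NumPy. The weights below are made up for illustration; the post doesn't specify the actual weighting scheme.

```python
import numpy as np

# Sketch of combining sentence, genre, and lyric embeddings into one
# song embedding via a weighted average. Weights are hypothetical.
def song_embedding(sentence_embs, genre_emb, lyric_emb,
                   w_sentence=1.0, w_genre=2.0, w_lyrics=1.5):
    embs = list(sentence_embs) + [genre_emb, lyric_emb]
    weights = [w_sentence] * len(sentence_embs) + [w_genre, w_lyrics]
    return np.average(np.stack(embs), axis=0, weights=weights)

# Toy 3-dimensional example (real embeddings are 768-dimensional):
sents = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
genre = np.array([0.0, 0.0, 1.0])
lyrics = np.array([1.0, 1.0, 1.0])
print(song_embedding(sents, genre, lyrics))
```

Because averaging is done per dimension, the result lives in the same 768-dimensional space as the sentence embeddings, so it can be compared directly against a query embedding later.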
To make your playlist, all that has to be done is compare the embedding of your query against all 35,000 embeddings in the dataset and return the 100 most similar tracks, using cosine similarity as the distance metric. Thank god we have computers.
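Here's what that retrieval step looks like with toy data (2-dimensional vectors instead of 768-dimensional ones, and three songs instead of 35,000):

```python
import numpy as np

# A minimal sketch of top-k retrieval by cosine similarity.
def top_k_cosine(query, catalog, k):
    # Normalize rows so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = c @ q
    # Indices of the k most similar rows, best match first.
    return np.argsort(-sims)[:k]

catalog = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
print(top_k_cosine(query, catalog, 2))  # → [0 2]
```

A full scan like this over 35,000 vectors is a single matrix-vector product, which is why it runs in well under a second on any modern machine.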
Once the 100 most similar candidate tracks are found, they are reranked using a "cross-encoder" trained on 215M question-answer pairs from various sources and domains, including StackExchange, Yahoo Answers, and Google & Bing search queries, to give the best matches.
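In outline, the reranking step looks like this. The `word_overlap` scorer below is a stand-in for the cross-encoder, which in a real pipeline would jointly score each (query, track description) pair, e.g. via `CrossEncoder.predict` from `sentence-transformers`:

```python
# Sketch of reranking: score every candidate against the query and sort.
# `score_fn` stands in for the cross-encoder; the toy scorer below just
# counts shared words and is only for illustration.
def rerank(query, candidates, score_fn, top_n=None):
    scored = [(score_fn(query, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_n]]

def word_overlap(query, description):
    return len(set(query.lower().split()) & set(description.lower().split()))

candidates = ["dreamy synth pop", "aggressive metal", "upbeat synth dance pop"]
print(rerank("upbeat synth pop", candidates, word_overlap))
```

The key design point is the two-stage shape: cheap cosine similarity narrows 35,000 songs down to 100, and only then does the expensive cross-encoder, which must run once per pair, score that short list.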