Singing would be an interesting experiment, but I don't see that here.
AI art gets rid of the technical skill step but the rest is there, although you may luck into something at random. If you’re using ControlNet on Stable Diffusion or training your own models you have a lot of control over the output as well.
I could easily find that music entertaining if it started playing the moment my character triggered a trap and suddenly, "the floor is lava" or my character enters a scene with the quest of winning over one of the love interests =)
The people we need to worry about are aallll of the people earning a living from everything else, like background music for indie games, ambient music, etc.
I don't believe that's a good example. Video game music is an important part of the gaming experience, but it's often taken for granted or overlooked.
https://www.youtube.com/watch?v=KCaya74_NHw
Or the changes shortly after 1:15, 2:15 and 2:40 in these extensions of Take On Me:
From all of the hype, I want to be impressed with results. Instead, we get these mediocre at best examples of what it can do. They are not good sales pitches to me.
ChatGPT is good in the sense that having it is better than not having it, especially with how bad Google has become. Audio generation will be good in the same way: some people don't need your "musical expertise", just some calm background music to use with a tutorial video without having YouTube take it down for copyright infringement.
some go to video game music concerts or to fan covers
While this does not seem to be the trend, I hope more gen AI in the audio and visual realms starts to produce more structured / symbolic output. For example, if I were Adobe I would be training models not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.
I've come across AI-generated music that outputs something like MIDI and controls synthesizers. Its audio quality was crystal-clear, but the music was boring. That's not to say the approach is a dead-end, of course -- and indeed, as a musician, the idea of that kind of output is exciting. But getting good data to train something that outputs separate MIDI-ish voices seems much harder than getting raw audio signals.
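To make the "MIDI-ish output driving a synthesizer" idea concrete, here is a minimal sketch. The event format `(midi_note, start_seconds, duration_seconds)` and the function names are assumptions for illustration, not how any particular model works. The point is that the audio stays clean because a deterministic synth renders it; the model would only have to pick the notes.

```python
import math

def midi_to_hz(note):
    """Standard equal-temperament mapping: MIDI note 69 is A4 = 440 Hz."""
    return 440.0 * 2 ** ((note - 69) / 12)

def render(events, sample_rate=8000):
    """Render symbolic note events to mono samples with a plain sine synth.

    events: list of (midi_note, start_seconds, duration_seconds) tuples.
    This tuple layout is a hypothetical stand-in for a model's symbolic output.
    """
    total = max(start + dur for _, start, dur in events)
    out = [0.0] * int(round(total * sample_rate))
    for note, start, dur in events:
        freq = midi_to_hz(note)
        first = int(round(start * sample_rate))
        for i in range(int(round(dur * sample_rate))):
            if first + i < len(out):
                # Overlapping notes simply sum into the same buffer.
                out[first + i] += 0.3 * math.sin(2 * math.pi * freq * i / sample_rate)
    return out

# Two half-second notes back to back: middle C, then the E above it.
samples = render([(60, 0.0, 0.5), (64, 0.5, 0.5)])
```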
It’s easy to forget this is all pretty new stuff and it still costs a lot to make the base models. But the techniques are (more or less) well documented and implementable with open source tools.
This is what we do with AI images: you can fix them in Photoshop, etc. You cannot do this for raw audio due to how music is produced.
I really like this idea: creating new tools for artists to use to create, rather than whatever we're accepting as use now. Current full-image creation is boring to me in the same way that choosing invisibility as a superpower is. Invisibility is ultimately going to slide into pervy tendencies, just as deepfakes will slide into that or some other inappropriate use.
Yes, AI is partly hype, but had someone told me this even two years ago, I wouldn't have believed it.
Anyway, I would very much rather run this sort of thing locally. You could just manually set your taste profile. Plus, music can be quite personal: imagine you start listening to too much music inspired by The Cure and suddenly Amazon starts advertising black makeup and antidepressants or something like that. It would be too disconcerting.
These days I also feel like my workout playlists might as well be randomly generated dance music.
Or I'd like to take a song I like and make it educational, like make it include the periodic table of elements.
We'll also publish a webapp where you can use the denoiser for free. Mail me if you want beta access to it (email in profile).
It won't be open-source though, although the paper will of course be public. It will also only reduce noise, and not reconstruct other aspects of audio quality. However, it can do so on any audio (in particular music), not just speech like Adobe Podcast, and it fully preserves the audio quality. It's designed exactly for the use case you want: to make noisy recordings sound professional.
The only weird thing is that it’s designed to be used in real time, but I’ve had some luck cleaning up voice recordings by replaying them through it via audio routing.
On one side the tech for literal denoising has stagnated a bit. It’s a very hard problem to remove all noise while keeping things like transients.
On the other side, AI is being rapidly developed for its ability to denoise by recreating the recording, just without the noise.
This combination was non-trivial as training old school DSP denoisers is not easily possible. We’ll describe the math needed in our paper. We hope our publication will help the wider community work not just on denoising but also tasks like automatic mixing.
This video from MKBHD's studio channel dives into this topic
Without Stability, all of AI would still be closed and opaque.
Kudos to stability.ai for achieving this as I am sure it took a lot of effort and this is a huge leap forward in terms of generation of audio by generative AI.
However as a musician (BMus and MMus at 2 different conservatoires) I think it's important to say that the job risk being experienced by creative writers will not be extending to musicians... yet.
It is the musical equivalent of a meandering paragraph.
It makes me wonder whether the music generation should be stratified -- a coarse model lays out where parts like verse and chorus are, what distinguishes them, how to transition, etc., and then a finer-grained model fills in the details.
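A toy sketch of that coarse-then-fine split, just to show the shape of the idea. Everything here is made up for illustration (the section names, the bar counts, the random "motif" filler); a real system would put a learned model in each stage. The coarse stage decides the form, and the fine stage fills each section, reusing material whenever a section repeats so that the second chorus actually sounds like the first.

```python
import random

SCALE = [60, 62, 64, 65, 67, 69, 71]  # C major as MIDI note numbers

def plan_structure():
    """Coarse stage: pick the section order and how many bars each one gets."""
    form = ["intro", "verse", "chorus", "verse", "chorus", "outro"]
    return [(name, 4 if name in ("intro", "outro") else 8) for name in form]

def fill_section(name, bars, rng, motifs):
    """Fine stage: fill a section with notes, one 4-note motif per bar.

    A section name seen before reuses its stored motif, which is what
    gives the piece recognizable repeated parts.
    """
    if name not in motifs:
        motifs[name] = [rng.choice(SCALE) for _ in range(4)]
    return motifs[name] * bars

def generate(seed=0):
    rng = random.Random(seed)
    motifs = {}
    return [(name, fill_section(name, bars, rng, motifs))
            for name, bars in plan_structure()]

song = generate(seed=0)
```

The useful property is that structure is decided before any notes exist, so the detail stage cannot wander off into a meandering paragraph.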
The position of the guitar in stereo is all over the place, higher frequency elements appear to come from the left while other parts are more centered.
It sounds like it can't handle lyrics or semantics that well so I suspect any genre where the lyricism is important would also be quite mushy and recognizably AI
Other cool things would be a way to generate a sampled instrument from a text description, or to generate a new track given a text description and all the previous tracks for other instruments. There could be a new generation of audio tools that let you generate placeholders or better for everything.
When AudioGen was announced this was my first question, but from what I've been able to test the model just ignores spatial audio prompts.
Unfortunately I haven't been able to find any online discussion of, or interest in, the importance / significance of spatial audio. Why not?
In this comment I listed a few instances:
So they compete with generative AI for a fixed number of jobs. The AI is cheaper and faster. Humans stop training to become artists.
Without new training data, the generative AI models stagnate. Progress in art stops globally, forever.
But for a brief glorious moment, we were able to say "huh, that's not bad".
For fine art, it’s a way for them to launder money and keep it out of bank accounts where it can be seized trivially.
For mass art, it’s about selling to enough rubes to make a profit.
Neither is impacted by a stagnation in art. If anything, they’re aided by it - suddenly the art you bought to launder money retains its value because it’s no longer the flavor of the week with the arts crowd.
I guess the thing that strikes me as so odd about the generative thing is all of the press releases presenting things like they're a final product, when judging by the quality of the results it's clearly pre-release beta at best, and more likely alpha. If a non-AI product released something that was so clearly not finished, it would be panned to no end for not working.
It sort of reminds me of the audio effects they use to indicate that you're incapacitated and things start distorting in a weird way.
Entertaining!
you don't need 45 or 90 straight seconds of a coherent song rendered. just need to dip into the 45 sec clip and cut out 4 seconds here, another 4 there. reroll those cuts through stable audio, keep rolling, keep rolling. cut up and get a pile of clips together. arrange, layer, voila - you saved money on paying royalties for sampling.
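rough sketch of the cut-and-arrange part, assuming the generated audio is already decoded into a flat list of mono samples (the function names and the tiny sample rate are made up for the example, and the "reroll through stable audio" step is left out because it depends on whatever tool you use):

```python
def cut_clips(samples, sample_rate, clip_seconds=4):
    """Chop a long render into back-to-back clips, dropping any short tail."""
    size = sample_rate * clip_seconds
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]

def arrange(clips, order):
    """Lay the chosen clips end to end in the given order."""
    out = []
    for index in order:
        out.extend(clips[index])
    return out

# Stand-in for a 45 second render at a toy rate of 10 samples per second.
sample_rate = 10
render = list(range(45 * sample_rate))
clips = cut_clips(render, sample_rate)
track = arrange(clips, [2, 0, 2, 5])
```

layering would be summing clips sample by sample instead of concatenating them.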
the lofi melodic sample on the stability page was passable. thought the bluegrass one sounded great actually. imagine being able to program bluegrass like rap.
edit: oof. fully trained on a licensed commercial dataset from AudioSparx. muzak in, muzak out.
Will be adding this to my SaaS side grift and introduce generated music you can listen to while you're chatting with your PDFs.
Can't wait for the next one.
Can produce longer content, and more genres and range of music. Isn't 48 kHz though.
Trying to do all of that in a single DNN, much less parameterize it usably, seems overly ambitious (or will be of more limited value ultimately).