Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion

dylan604 2y ago

Yeah, the "rock drums" example was like a student in a practice. I'll be impressed when it can sound like Danny Carey.

From all of the hype, I want to be impressed with results. Instead, we get these mediocre at best examples of what it can do. They are not good sales pitches to me.

chpatrick2y ago

Sir, your dog can talk!

Yes, but not very well.

hoosieree2y ago

Hate to break it to you but there is a vast market for mediocre content, and even Danny Carey has a bad day once in a while.

chaosbolt2y ago

This is the same kind of comment that got HN to seethe for months about how ChatGPT isn't the god programmer some clickbaity news sites claimed it was.

ChatGPT is good in the sense that having it is better than not having it, especially with how bad Google has become. Audio generation will also be good in this way: some people don't need your "musical expertise", just some calm background music to use with a tutorial video without having YouTube take it down for copyright infringement.

zone411 2y ago

Yes, while working on my AI Melodies Assistant project, it quickly became clear that generating pleasant but boring music isn't too difficult. To create a catchy tune, an element of surprise is essential. In the end, I was able to use it as an assistant to compose 60 melodies that I'm happy with (https://www.melodies.ai/)

Art9681 2y ago

The real test is when this stuff is out in the wild and no one tells you it’s AI and the thought doesn’t cross your mind. Of course it’s not impressive nor surprising when the answer was given up front.

dvngnt_2y ago

gamers don't want bad music either.

some go to video game music concerts or to fan covers

iandanforth2y ago· 13 in thread

The solo piano was interesting because of how clean it is. I can imagine going from that sample to a score without too much difficulty. Once it's in a symbolic format it becomes much more flexible and re-usable.

While this does not seem to be the trend, I hope more gen AI in the audio and visual realms starts to produce more structured / symbolic output. For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.

gabereiser2y ago

Yessssss! I thought about MusicGAN and Markov chains last night thinking “Why can’t we just codify all chords and use a GAN to generate markov chains on chords of a key and have AI generate instruments and waveform from those chains?” IANA researcher but in my head, that sounded logical.
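For readers unfamiliar with the Markov-chain half of that idea, here is a minimal sketch of a first-order chain over chords in one key. It is not what MusicGAN or any product mentioned in this thread does; the chord set, the transition weights, and the function name `generate_progression` are all illustrative assumptions, and in the commenter's proposal the table would come from a trained model rather than being written by hand.

```python
import random

# Hypothetical first-order transition table for a few chords in C major.
# The chords and weights are illustrative assumptions, not learned from data.
TRANSITIONS = {
    "C":  {"F": 0.4, "G": 0.4, "Am": 0.2},
    "F":  {"C": 0.5, "G": 0.5},
    "G":  {"C": 0.7, "Am": 0.3},
    "Am": {"F": 0.6, "G": 0.4},
}


def generate_progression(start, length, rng):
    """Walk the chain: each chord is drawn based only on the previous chord."""
    progression = [start]
    while len(progression) < length:
        options = TRANSITIONS[progression[-1]]
        next_chord = rng.choices(list(options), weights=list(options.values()))[0]
        progression.append(next_chord)
    return progression


rng = random.Random(0)  # seeded so repeated runs give the same progression
progression = generate_progression("C", 8, rng)
print(" ".join(progression))
```

A chain like this only remembers one chord back, which is one plausible reason such output tends to sound aimless over longer spans.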

TylerE2y ago

That's existed for decades. It's called Band in a Box. It's also cheezy as hell.

schazers2y ago

I strongly agree about generating "editables" rather than finalized media. In fact, that's why text generators are more useful than current media generators: text is editable by default. Here's a tweetstorm about it: https://x.com/jsonriggs/status/1694490308220964999?s=20

waffletower2y ago

Audio is definitely editable. While generative audio is new, I am hopeful that a host of interesting applications will emerge (audio2audio etc.) within its ecosystem. Promising signal separation (audio to stems) and pitch detection tools already exist for raw audio signals. If you want to force Stability to focus on symbolic representations (such as severely lossy MIDI), I hope you can instead first try adapting to tools that work fundamentally with rich audio signals. Perhaps there will be room for symbolic music AI, and perhaps Stability will even develop additional models that generate schematic music, but please please don't sacrifice audio generality for piano roll thinking alone. LoRAs will undoubtedly be usable to generate more schematic audio via the Stable Audio model -- I imagine they could easily be repurposed to develop sample libraries compatible with DAW (digital audio workstation), sequencer and tracker production workflows.

Jeff_Brown2y ago

That raises an interesting difference between cleaning AI-generated sound and cleaning ordinary recordings. In an ordinary recording, there is an objective reality to discover -- a certain collection of voices was summed to create a signal. With (most? the best?) existing AI audio generation, the waveform is created from whole cloth, and extracting voices from it is an act of creation, not just discovery.

I've come across AI-generated music that outputs something like MIDI and controls synthesizers. Its audio quality was crystal-clear, but the music was boring. That's not to say the approach is a dead-end, of course -- and indeed, as a musician, the idea of that kind of output is exciting. But getting good data to train something that outputs separate MIDI-ish voices seems much harder than getting raw audio signals.

fnordpiglet2y ago

Generative models can certainly create midi, but no one has done it yet. Given the technique is making video, audio, images, and language, all you need to do is train and build a model with an appropriate architecture.

It’s easy to forget this is all pretty new stuff and it still costs a lot to make the base models. But the techniques are (more or less) well documented and implementable with open source tools.

miohtama2y ago

Having music editable for human post production is necessary for most professional adoption. Generating MIDIs would make much more sense than generating raw audio.

This is what we do with AI images: you can fix them in Photoshop, etc. You cannot do this for raw audio due to how music is produced.

waffletower2y ago

Build or seek out a MIDI generating model. I hope Stable Audio is never the place for that. MIDI is deeply lossy and it would be a tragedy if it were the only music representation. Imagine if instead of phonographs, compact discs and streaming audio we only had piano rolls. What a loss indeed.

dylan604 2y ago

>For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.

I really like this idea. Creating new tools for artists to use to create rather than whatever we're accepting as use now. The use of current full image creation is boring to me in the same way the choice of invisibility as a super power is. The invisibility is ultimately going to slide into pervy tendencies, just like deep fakes will slide in the same way or some other inappropriate use.

waffletower2y ago

Hopefully, the entire industry will NOT move in such a schematic and lossy direction. Use separate tools to analyze audio streams please. Don't throw the timbre baby out with the bathwater. MusicGen utilizes a tokenized transformer model for music, which is attractive for symbolic translation use cases. However, the overall audio quality is far more lossy than the examples you hear from Stable Audio. I believe that symbolic representation should not be a foundational approach to adequately represent and generate rich audio signals.

tech_ken2y ago

I was wondering the same thing, definitely seems like generating the raw waveform runs into all kinds of weird issues (like they touched on in this post). I would imagine that training data would be a serious chokepoint here. Given how much discourse is currently kicking off around the intellectual property rights of just the final product (the mastered track), I can't imagine many musicians would be eager to share what is effectively the "proof of ownership" (track stems or MIDIs).

fnordpiglet2y ago

There are a lot of Lora models that are being made to generate textures, maps, diagrams, backgrounds, etc. You don’t need to wait for adobe, open source models like stable diffusion let you do whatever you think is useful. I’d look to the open source world for creative innovation. Adobe is just doing what’s on the product management roadmap.

rhelsing2y ago

I am approaching this from the symbolic angle via MIDI at neptunely (https://neptunely.com/)

hubraumhugo2y ago· 9 in thread

Now imagine Spotify using this to generate individual earworms for everybody based on their personal tastes (likes, playlists).

Yes, AI is partly hype, but had someone told me this even two years ago, I wouldn't have believed it.

joshspankit2y ago

This is why it’s vital that AI is openly available. Imagine a world where Spotify is the only company that can do that, and they use it to make sure they never pay royalties again.

bee_rider2y ago

How is Spotify for finding new music based on your tastes? I’ve only used Amazon and Pandora; Amazon is quite poor, Pandora is pretty good. I suspect (although, without proof) that if a service can’t suggest new music, it will have trouble generating new music as well.

Anyway, I very much would rather run this sort of thing locally. You could just manually set your taste profile. Plus, music can be quite personal, imagine you start listening to too much music inspired by The Cure and suddenly Amazon starts advertising black makeup and antidepressants or something like that, it would be too disconcerting.

liotier2y ago

Machine-generated music might be functionally equivalent to human-generated music, but that ignores the cultural role of art as a shared human experience - witness the liturgy of live music. That can't happen with music tailored to each listener, it can't happen without tracks that are fixed in time and can be referred to. I can imagine it well-accepted for dynamic music such as gaming soundtracks, but I suppose that machine generation will be mostly a production technique resulting in branded pieces.

Jeff_Brown2y ago

Saying that's impossible makes me immediately wonder whether it's not. There are already headphone dance parties. What if a musical act's output was being interpreted through genre lenses specific to each listener?

jimmygrapes2y ago

I know quite a few technically talented musicians who have next to no creativity in actually writing the music (aside from jazz style improv sessions). Most of them never really play live unless it's part of a similarly uncreative band of college friends. I wonder if having a catchy/complex AI generated song created for them to play live might be interesting to them. Gonna check in and see what they think.

broast2y ago

> such as gaming soundtracks

These days I also feel like my workout playlists might as well be randomly generated dance music.

ragazzina2y ago

Spotify does not need to generate a tailored earworm for me. It could already suggest songs that I like based on my personal taste out of their 100-million-songs catalog - and it's absolutely unable to do it.

zachthewf2y ago

Building a tailored earworm might actually be easier.

hospitalJail2y ago

I really want this. I have a band that I like, and I want more!

Or I'd like to take a song I like, and make it educational, like make it include the periodic table of elements.

kherud2y ago· 8 in thread

Thank you for sharing! On a tangent: I'm wondering if there are any good open source models/libraries to reconstruct audio quality. I'm thinking about an end-to-end open source alternative to something like Adobe Podcast [1] to make noisy recordings sound professional. Anecdotally it's supposed to be very good. In a recent search, I haven't found anything convincing. In my naive view this task seems much simpler than audio generation and the demand far bigger, since not everyone has a professional audio setup ready at all times.

[1] https://podcast.adobe.com/

earthnail2y ago

We've been researching an audio denoiser for music that we will present at the AES conference in October. Description page: https://tape.it/denoising

We'll also publish a webapp where you can use the denoiser for free. Mail me if you want beta access to it (email in profile).

It won't be open-source though, although the paper will of course be public. It will also only reduce noise, and not reconstruct other aspects of audio quality. However, it can do so on any audio (in particular music), not just speech like Adobe Podcast, and it fully preserves the audio quality. It's designed exactly for the use case you want: to make noisy recordings sound professional.

haywirez2y ago

Are you sure the demo sound files are correct on the website? Couldn't appreciate any glaringly obvious differences between the original and denoised with studio grade headphones here. Or, the originals aren't noisy enough.

white_beach2y ago

denoising seems to fail in the guitar and vocals example

https://youtu.be/o-kJ4_CuWzA

whywhywhywhy2y ago

It’s not open but Nvidia has RTX Voice for free if you have an Nvidia card.

Only weird thing is it’s designed to be used in real time, but I’ve had some luck cleaning up voice recordings replayed back through it via audio routing.

joshspankit2y ago

There seems to have been a fork in the road:

On one side the tech for literal denoising has stagnated a bit. It’s a very hard problem to remove all noise while keeping things like transients.

On the other side, AI is being rapidly developed for its ability to denoise by recreating the recording, just without the noise.

earthnail2y ago

In our denoiser (see other comment), we worked on combining these two forks. That’s how we can mathematically guarantee great audio quality.

This combination was non-trivial as training old school DSP denoisers is not easily possible. We’ll describe the math needed in our paper. We hope our publication will help the wider community work not just on denoising but also tasks like automatic mixing.

cosmok2y ago

I have had a lot of success with this: https://ultimatevocalremover.com/ for de-noising

spdif899 2y ago

This video from MKBHD's studio channel dives into this topic

naillo2y ago· 8 in thread

I keep thinking back to when we didn't have stabilityai and it was just google and meta teasing us with mouth watering papers but never letting us touch them. I'm so thankful stability exists.

Tenoke2y ago

Stability is great but Meta's MusicGen is available with code and weights while this isn't so that's a really odd place to make that comparison and complaint.

Taek2y ago

Before stable diffusion, nobody released weights at all. Meta et al only started sharing their models with the world when they realized how fast a developer ecosystem was building around the best models.

Without stability, all of AI would still be closed and opaque.

waffletower2y ago

Unfortunately MusicGen's output quality isn't strong enough. I applaud Meta for open sourcing it. The audio samples released for Stable Audio show much more promise. I look forward to code and model releases. I built out a Cog model for MusicGen and took it for a fairly extensive test drive and came back disappointed.

naillo2y ago

The way I see it regarding the point "but meta is also releasing models" is: there was one span of time between say 2014-2019 when ML was mostly just classifiers (nothing generative). People did open source those. Then there was a period between 2019-2023 when generative AI was possible. It's true that meta is releasing models in that space now finally. But there was an excruciating 3-4 year period between 2019 and 2022, when stable diffusion was finally made and released, which opened the floodgates to others doing so as well. But I'm eternally grateful to emad and stabilityai for opening the gates that had been titillatingly closed for 4 annoying years.

refulgentis2y ago

Nah it's not because without the releases of Stability/ChatGPT it'd be the same situation. Cool nihilism though

seydor2y ago

Stability def helped push things forward, it probably even showed them that open source is inevitable

Jeff_Brown2y ago

Is the source for this available? I found no mention of it on the page.

fnordpiglet2y ago

It says source is coming

jncfhnb2y ago· 8 in thread

The bluegrass one is super weird. I can’t identify exactly why.

zzbzq2y ago

I can identify a bunch of things. The chord structure jumps all over randomly in a genre that usually does the opposite. The banjo is clearly not an actual banjo being strummed/frailed, but a weird agglomeration of bright toned instruments including both frailed/scruggs banjo and dobro, and maybe harmonica and fiddle creeping in. The AI doesn't know it's making a combination of instruments, so where it's trained on instruments blending, it thinks it can produce pre-blended sounds. I guess maybe this is more like a return to being a child hearing music for the first time with no preconceptions or expectations.

ewan251 2y ago

I think the super weird part is that it's not great? I understand this is most likely very impressive technologically but musically it is disjointed, inconsistent and fake sounding. Most of the "music" examples have weird phrasing and confusing harmonic rhythm.

Kudos to stability.ai for achieving this as I am sure it took a lot of effort and this is a huge leap forward in terms of generation of audio by generative AI.

However as a musician (BMus and MMus at 2 different conservatoires) I think it's important to say that the job risk being experienced by creative writers will not be extending to musicians... yet.

jncfhnb2y ago

I feel like music composition is a fundamentally hard task for AI. Music production seems like it should be a lot easier but I haven’t seen that

jnwatson2y ago

The AI seems to understand 4/4 time but doesn't understand groupings of 4 measures into phrases. It definitely doesn't understand ABABACA or even the basic parts of a song.

It is the musical equivalent of a meandering paragraph.

Jeff_Brown2y ago

Absolutely. And all AI for music I have seen suffers that problem.

It makes me wonder whether the music generation should be stratified -- a coarse model lays out where parts like verse and chorus are, what distinguishes them, how to transition, etc., and then a finer-grained model fills in the details.

benesing2y ago

Also, the music is not bluegrass as much as it is old-time, a confusion that continually irritates old-time players.

smat2y ago

You are right it feels off.

The position of the guitar in stereo is all over the place, higher frequency elements appear to come from the left while other parts are more centered.

stef25 2y ago

Same for the death metal

jasbur2y ago· 7 in thread

It's interesting that the Death Metal was the hardest to reproduce. I conclude that it's the most fundamentally human of all genres.

dontreact2y ago

Well... they hardly tried all genres :)

It sounds like it can't handle lyrics or semantics that well so I suspect any genre where the lyricism is important would also be quite mushy and recognizably AI

Jeff_Brown2y ago

The Beatles seemed to be the hardest music for JukeBox to emulate.

Ninovdmark2y ago

The sound sample seemed to fit the 'vibe', but lacked any discernible definition. Could it be that it's too sonically dense to easily reproduce? Perhaps this could be improved with a more tailored training set.

awestroke2y ago

I think it was just the genre that was the least represented in the training data

emperorcj2y ago

dadabots here: haven't gotten good death metal with it. problem is there's not really much in the dataset.

emperorcj2y ago

(context: i make ai death metal & also i worked on stable audio). 100% it was a dataset problem. Diffusion models still work well when you train them on death metal: https://www.youtube.com/watch?v=rlsRMQzD_6Q

chankstein38 2y ago

It sounds more like break core haha

jacooper2y ago· 3 in thread

Everything sounds very convincing apart from the airplane pilot and the sound effects. They sound very weird, as if one is hallucinating

wiz21c2y ago

The airplane is super convincing as an encrypted Empire communication :-) (see Star Wars episode 5 IIRC)

sebzim4500 2y ago

The airplane one just sounds like a foreign language over a bad intercom, I think that could still be useful for some stuff.

xpe2y ago

Perhaps because generating good white noise requires randomness without autocorrelation or detectable patterns.

naillo2y ago· 2 in thread

This is gonna be great to finetune on. There's only so many boards of canada/aphex twin songs out there but I wish there were more and this will let us generate more.

52-6F-62 2y ago

This is not the way.

naillo2y ago

Why not? Mostly for private use in my case. SDXL has created some beautiful works of art in my experiments and I would love to have a similar experience in the music world.

skybrian2y ago· 2 in thread

As an amateur musician, I’d be more interested in these tools if, along with the text description, they took as input a melody or chord progression or performance data. Maybe ABC notation or a MIDI track? Anyone doing that?

Other cool things would be a way to generate a sampled instrument from a text description, or to generate a new track given a text description and all the previous tracks for other instruments. There could be a new generation of audio tools that let you generate placeholders or better for everything.

l33tman2y ago

The analogue from stable diffusion would be ControlNet, where you can train a superimposed model on auxiliary data, this should be possible to do with chords for example, just like you can do with human poses, 3D depth maps etc in stable diffusion using controlnet

emadm2y ago

It’s coming

https://news.ycombinator.com/item?id=37499067

TheAceOfHearts2y ago· 2 in thread

Does this model support / "understand" concepts of spatial audio? For example, something like "an alarm moving around you in a circle".

When AudioGen was announced this was my first question, but from what I've been able to test the model just ignores spatial audio prompts.

Unfortunately I haven't been able to find any discussion or interest in online discussion about the importance / significance of spatial audio. Why not?

cheald2y ago

My guess is that it's not a very interesting problem because it's not particularly difficult to add spatial dimensions to arbitrary audio - after all, it is already commonly done in video games. All you have to do is manipulate the multichannel outputs with an understanding of the spatial positioning of each channel's speaker location relative to the listener and some basic trig.
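The "basic trig" for the simplest two-speaker case can be sketched as a constant-power pan whose position follows the source's angle around the listener. This is a minimal illustration under stated assumptions, not how any game engine or the models discussed here do it; the function names `stereo_gains` and `circle_pan` are made up for this example. Plain stereo gains also cannot distinguish front from back, so a true "circle" needs more channels or HRTF filtering.

```python
import math


def stereo_gains(azimuth):
    """Constant-power pan gains for a source at `azimuth` radians.

    0 is straight ahead, +pi/2 is hard right, -pi/2 is hard left.
    Front and back collapse onto the same pan position in plain stereo.
    """
    pan = math.sin(azimuth)              # -1 (left) .. +1 (right)
    angle = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2
    return math.cos(angle), math.sin(angle)


def circle_pan(mono, sample_rate, seconds_per_orbit):
    """Apply time-varying gains so a mono signal sweeps around the listener."""
    left, right = [], []
    for n, sample in enumerate(mono):
        azimuth = 2.0 * math.pi * n / (sample_rate * seconds_per_orbit)
        gain_left, gain_right = stereo_gains(azimuth)
        left.append(sample * gain_left)
        right.append(sample * gain_right)
    return left, right


# A one-second 440 Hz test tone at a deliberately low sample rate.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate) for n in range(sample_rate)]
left, right = circle_pan(tone, sample_rate, seconds_per_orbit=1.0)
```

Because the two gains are the cosine and sine of the same angle, their squares always sum to 1, which keeps perceived loudness roughly steady as the source moves.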

Jerrrry2y ago

Dolby wouldn't appreciate it.

Jeff_Brown2y ago· 2 in thread

I still consider OpenAI's JukeBox (now at least 2 years old!) far and away the most creative music AI. But the combination of coherence, sound quality and creativity of this model is (to my knowledge) easily best in class.

4RealFreedom2y ago

The sound quality of Jukebox is muddled. There are many inconsistencies. The loudness of vocals and the quality of instruments really stand out and not in a good way. Hard to talk about creativity because it's so subjective but I've found it lacking in all AI music including JukeBox. Don't get me wrong - this tech is amazing.

Jeff_Brown2y ago

It's mushy and inconsistent, absolutely. But it also comes up with wild yet coherent changes that I've seen from nothing else.

At this comment I listed a few instances:

Cloudef2y ago· 2 in thread

Is the extreme metal music lacking from the training set? Why do the extreme metal examples always sound horrible?

Jeff_Brown2y ago

Metal is especially hard to mix in a way that keeps the voices distinct and clear. Maybe the training catalog includes a lot of low-budget metal.

rafaelero2y ago

Because metal sounds horrible.

2Gkashmiri2y ago· 1 in thread

So.... wait for a Llama for audio, train it on your own voice, and then call your friends by typing text into the software instead of actually saying the words? This is going to be nice for authentication, proving to a third party that you are yourself

systoll2y ago

Just wait for iOS17 https://youtu.be/oMt02DNbQlk

stared2y ago· 1 in thread

I would love to use it for background music when I am working. I have specific tastes that depend on the task, mood, energy level, and ambiance.

k12sosse2y ago

If you're not attuned to Cryo Chamber (label), check them out. Maybe not fitting all use-cases, but a strong and deep catalogue.

hoosieree2y ago· 1 in thread

Humans take a long time to get good at art; in the meantime they still have to eat.

So they compete with generative AI for a fixed number of jobs. The AI is cheaper and faster. Humans stop training to become artists.

Without new training data, the generative AI models stagnate. Progress in art stops globally, forever.

But for a brief glorious moment, we were able to say "huh, that's not bad".

phone8675309 2y ago

This is by design - the capital that backs modern art isn’t doing it for love of the art but for money.

For fine art, it’s a way for them to launder money and keep it out of bank accounts where it can be seized trivially.

For mass art, it’s about selling to enough rubes to make a profit.

Neither are impacted by a stagnation in art. If anything, they’re aided by it - suddenly the art you bought to launder money retains its value because it’s no longer the flavor of the week with the arts crowd.

skilled2y ago

Relevant,

https://news.ycombinator.com/item?id=37493741

https://www.stableaudio.com/

dylan604 2y ago

Just like all examples of generative "AI" I've seen, there's always some bit of uncanny valley vibe present. In the audio examples, there's always this weird distortion, as if really poorly compressed sources were used as training data. The sounds are muddled together, and rarely do I hear clean musical voices. It's just a smear of sounds coming together that our brains try really hard to turn into an "oh, that's a _____" situation. While the samples in TFA are probably the closest I've heard to date, the issue is still present.

I guess the thing that strikes me as so odd about the generative thing is all of the press releases presenting things like it's a final product, yet judging by the quality of the results it's clearly pre-release beta at best, and more likely alpha code. If a non-AI product released something that was so clearly not finished, it would be panned to no end for not working.

https://www.youtube.com/watch?v=MwtVkPKx3RA

cesaref2y ago

The death metal example reminded me of the continuous streaming death metal here:

moonchrome2y ago

It's funny how some of those examples give me this creepy uncanny valley feel for music (the lowfi hip hop example) - I've never experienced it this way before.

It sort of reminds me of the audio effects they use to indicate that you're incapacitated and things start distorting in a weird way.

Entertaining!

_sys49152 2y ago

gamechanging stuff for sample based rap producers. havent been able to log-in yet but i think a good benchmark to start off with is to see if it can replicate the 'al green' sound from the early 70s - very distinct sounding production - drumless and instrumental.

you dont need 45 or 90 straight seconds of a coherent song rendered. just need to dip in the 45 sec clip and cut out 4 seconds here, another 4 there. reroll those cuts through stable audio, keep rolling, keep rolling. cut up and get a pile of clips together. arrange, layer, voila - you saved money on paying royalties for sampling.

the lofi melodic sample on the stability page was passable. thought the bluegrass one sounded great actually. imagine being able to program bluegrass like rap.

edit: oof. fully trained on a licensed commercial dataset from AudioSparx. muzak in, muzak out.

shon2y ago

Ed Newton-Rex, VP of Audio at Stability, is speaking about how this was built at The AI Conference in 2 weeks. https://aiconference.com

colesantiago2y ago

This is yet another amazing release from Stability AI.

Will be adding this to my SaaS side grift and introduce generated music you can listen to while you're chatting with your PDFs.

Can't wait for the next one.

chandhoo2y ago

Another alternative: http://www.word.band

Can produce longer content, and more genres and range of music. Isn't 48 kHz though.

junon2y ago

This is the most impressive of this genre of SD audio so far, by a long shot. Really impressive!

smrtinsert2y ago

As a hobby musician, why don't they start with instrument samples? Every sampler user out there would love a button press generate sample on the fly as a plugin. It would blow away gigs and gigs of ridiculous duplicative or near duplicate samples.

londons_explore2y ago

Seems nobody has really cracked vocals with songs...

randcraw2y ago

I wonder if it makes sense to generate a combo of instruments rather than individual voices and then combine those with an arranger DNN. I would think it'd be much easier to capture each instrument's transients and dynamics that way, much less allow more subtlety in how they combine, like allowing the lead voice to shift among instruments, or even let the listener choose how each voice expresses stylistically and how they should combine.

Trying to do all of that in a single DNN, much less parameterize it usably, seems overly ambitious (or will be of more limited value ultimately).

stainablesteel2y ago

i want something that can take in a song and transform it into a different genre

PcChip2y ago

it's funny how they're all very impressive except the death metal

gyumjibashyan2y ago

This is crazy tech!

coldcode2y ago· 14 in thread

Singing would be an interesting experiment, but I don't see that here.

Auracle2y ago

riskable2y ago

Maschinesky2y ago

I'm surprised that you even leap to people like Hans Zimmer and others.

The people we need to worry about are aallll of the people earning a living from everything else, like background music for indie games, ambient music etc.

cwillu2y ago

mpsprd2y ago

>where quality is not important, like in games

I don't believe that's a good example. Video game music is an important part of the gaming experience, but it's often taken for granted or overlooked.

PatronBernard2y ago

Jeff_Brown2y ago

OpenAI's jukebox -- now 3 years old -- is creative and non-repetitive. Witness, for instance, its jam on Uptown Funk here:

https://www.youtube.com/watch?v=KCaya74_NHw

Or the changes shortly after 1:15, 2:15 and 2:40 in these extensions of Take On Me:

https://www.youtube.com/watch?v=_3yOrUJ0SzY

dylan6042y ago

Yeah, the "rock drums" example was like a student in a practice. I'll be impressed when it can sound like Danny Carey.

From all of the hype, I want to be impressed with results. Instead, we get these mediocre at best examples of what it can do. They are not good sales pitches to me.

chpatrick2y ago

Sir, your dog can talk!

Yes, but not very well.

hoosieree2y ago

Hate to break it to you but there is a vast market for mediocre content, and even Danny Carey has a bad day once in a while.

chaosbolt2y ago

This is the same kind of comment that got HN to seeth for months about how ChatGPT isn't the god programmer some clickbaity news sites claimed it was.

zone4112y ago

Art96812y ago

dvngnt_2y ago

gamers don't want bad music either.

some go to video game music concerts or to fan covers

iandanforth2y ago· 13 in thread

gabereiser2y ago

TylerE2y ago

That's existed for decades. It's called Band in a Box. It's also cheezy as hell.

schazers2y ago

waffletower2y ago

3 more replies

Jeff_Brown2y ago

fnordpiglet2y ago

It’s easy to forget this is all pretty new stuff and it still costs a lot to make the base models. But the techniques are (more or less) well documented and implementable with open source tools.

miohtama2y ago

Having music editable for human post production is necessary for most professional adoption. Generating MIDIs would make much more sense than generating raw audio.

This is what we do with AI images: you can fix them in Photoshop, etc. You cannot do this for raw audio due to how music is produced.

waffletower2y ago

dylan6042y ago

waffletower2y ago

tech_ken2y ago

fnordpiglet2y ago

rhelsing2y ago

I am approaching this from the symbolic angle via MIDI at neptunely (https://neptunely.com/)

hubraumhugo2y ago· 9 in thread

Now imagine Spotify using this to generate individual earworms for everybody based on their personal tastes (likes, playlists).

Yes, AI is partly hype, but had someone told me this even two years ago, I wouldn't have believed it.

joshspankit2y ago

This is why it’s vital that AI is openly available. Imagine a world where Spotify is the only company that can do that, and they use it to make sure they never pay royalties again.

bee_rider2y ago

liotier2y ago

Jeff_Brown2y ago

jimmygrapes2y ago

broast2y ago

> such as gaming soundtracks

These days I also feel like my workout playlists might as well be randomly generated dance music.

ragazzina2y ago

zachthewf2y ago

Building a tailored earworm might actually be easier.

hospitalJail2y ago

I really want this. I have a band that I like, and I want more!

Or I'd like to take a song I like and make it educational, like make it include the periodic table of elements.

kherud2y ago· 8 in thread

[1] https://podcast.adobe.com/

earthnail2y ago

We've been researching an audio denoiser for music that we will present at the AES conference in October. Description page: https://tape.it/denoising

We'll also publish a webapp where you can use the denoiser for free. Mail me if you want beta access to it (email in profile).

haywirez2y ago

white_beach2y ago

denoising seems to fail in the guitar and vocals example

https://youtu.be/o-kJ4_CuWzA

whywhywhywhy2y ago

It’s not open, but Nvidia has RTX Voice for free if you have an Nvidia card.

The only weird thing is that it’s designed to be used in real time, but I’ve had some luck cleaning up voice recordings by replaying them through it via audio routing.

joshspankit2y ago

There seems to have been a fork in the road:

On one side the tech for literal denoising has stagnated a bit. It’s a very hard problem to remove all noise while keeping things like transients.

On the other side, AI is being rapidly developed for its ability to denoise by recreating the recording, just without the noise.

earthnail2y ago

In our denoiser (see other comment), we worked on combining these two forks. That’s how we can mathematically guarantee great audio quality.

cosmok2y ago

I have had a lot of success with this: https://ultimatevocalremover.com/ for de-noising

spdif8992y ago

This video from MKBHD's studio channel dives into this topic

naillo2y ago· 8 in thread

I keep thinking back to when we didn't have stabilityai and it was just google and meta teasing us with mouth watering papers but never letting us touch them. I'm so thankful stability exists.

Tenoke2y ago

Stability is great, but Meta's MusicGen is available with code and weights while this isn't, so that's a really odd place to make that comparison and complaint.

Taek2y ago

Without stability, all of AI would still be closed and opaque.

waffletower2y ago

naillo2y ago

refulgentis2y ago

Nah it's not because without the releases of Stability/ChatGPT it'd be the same situation. Cool nihilism though

seydor2y ago

Stability def helped push things forward; it probably even showed them that open source is inevitable.

Jeff_Brown2y ago

Is the source for this available? I found no mention of it on the page.

fnordpiglet2y ago

It says source is coming

jncfhnb2y ago· 8 in thread

The bluegrass one is super weird. I can’t identify exactly why.

zzbzq2y ago

ewan2512y ago

Kudos to stability.ai for achieving this as I am sure it took a lot of effort and this is a huge leap forward in terms of generation of audio by generative AI.

However as a musician (BMus and MMus at 2 different conservatoires) I think it's important to say that the job risk being experienced by creative writers will not be extending to musicians... yet.

jncfhnb2y ago

I feel like music composition is a fundamentally hard task for AI. Music production seems like it should be a lot easier, but I haven’t seen that.

jnwatson2y ago

The AI seems to understand 4/4 time but doesn't understand groupings of 4 measures into phrases. It definitely doesn't understand ABABACA or even the basic parts of a song.

It is the musical equivalent of a meandering paragraph.

Jeff_Brown2y ago

Absolutely. And all AI for music I have seen suffers that problem.

benesing2y ago

Also, the music is not bluegrass as much as it is old-time, a confusion that continually irritates old-time players.

smat2y ago

You are right it feels off.

The position of the guitar in stereo is all over the place, higher frequency elements appear to come from the left while other parts are more centered.

stef252y ago

Same for the death metal

jasbur2y ago· 7 in thread

It's interesting that the Death Metal was the hardest to reproduce. I conclude that it's the most fundamentally human of all genres.

dontreact2y ago

Well... they hardly tried all genres :)

It sounds like it can't handle lyrics or semantics that well, so I suspect any genre where the lyricism is important would also be quite mushy and recognizably AI.

Jeff_Brown2y ago

The Beatles seemed to be the hardest music for JukeBox to emulate.

Ninovdmark2y ago

awestroke2y ago

I think it was just the genre that was the least represented in the training data

emperorcj2y ago

dadabots here: haven't gotten good death metal with it. problem is there's not really much in the dataset.

emperorcj2y ago

chankstein382y ago

It sounds more like break core haha

jacooper2y ago· 3 in thread

Everything looks very convincing apart from the airplane pilot and the sound effects. They sound very weird, as if one is hallucinating.

wiz21c2y ago

The airplane is super convincing as an encrypted Empire communication :-) (see Star Wars episode 5 IIRC)

sebzim45002y ago

The airplane one just sounds like a foreign language over a bad intercom, I think that could still be useful for some stuff.

xpe2y ago

Perhaps because generating good white noise requires randomness without autocorrelation or detectable patterns.

naillo2y ago· 2 in thread

This is gonna be great to finetune on. There are only so many Boards of Canada/Aphex Twin songs out there, but I wish there were more, and this will let us generate more.

52-6F-622y ago

This is not the way.

naillo2y ago

Why not? Mostly for private use in my case. SDXL has created some beautiful works of art in my experiments and I would love to have a similar experience in the music world.

skybrian2y ago· 2 in thread

l33tman2y ago

emadm2y ago

It’s coming

https://news.ycombinator.com/item?id=37499067

TheAceOfHearts2y ago· 2 in thread

Does this model support / "understand" concepts of spatial audio? For example, something like "an alarm moving around you in a circle".

When AudioGen was announced this was my first question, but from what I've been able to test the model just ignores spatial audio prompts.

Unfortunately, I haven't been able to find any online discussion of, or interest in, the importance / significance of spatial audio. Why not?

cheald2y ago

Jerrrry2y ago

Dolby wouldn't appreciate it.

Jeff_Brown2y ago· 2 in thread

4RealFreedom2y ago

Jeff_Brown2y ago

It's mushy and inconsistent, absolutely. But it also comes up with wild yet coherent changes that I've seen from nothing else.

At this comment I listed a few instances:

Cloudef2y ago· 2 in thread

Is extreme metal missing from the training set? Why do the extreme metal examples always sound horrible?

Jeff_Brown2y ago

Metal is especially hard to mix in a way that keeps the voices distinct and clear. Maybe the training catalog includes a lot of low-budget metal.

rafaelero2y ago

Because metal sounds horrible.

2Gkashmiri2y ago· 1 in thread

systoll2y ago

Just wait for iOS17 https://youtu.be/oMt02DNbQlk

stared2y ago· 1 in thread

I would love to use it for background music when I am working. I have specific tastes that depend on the task, mood, energy level, and ambiance.

k12sosse2y ago

If you're not attuned to Cryo Chamber (label), check them out. Maybe not fitting all use-cases, but a strong and deep catalogue.

hoosieree2y ago· 1 in thread

Humans take a long time to get good at art; in the meantime they still have to eat.

So they compete with generative AI for a fixed number of jobs. The AI is cheaper and faster. Humans stop training to become artists.

Without new training data, the generative AI models stagnate. Progress in art stops globally, forever.

But for a brief glorious moment, we were able to say "huh, that's not bad".

phone86753092y ago

This is by design - the capital that backs modern art isn’t doing it for love of the art but for money.

For fine art, it’s a way for them to launder money and keep it out of bank accounts where it can be seized trivially.

For mass art, it’s about selling to enough rubes to make a profit.

skilled2y ago

Relevant,

https://news.ycombinator.com/item?id=37493741

https://www.stableaudio.com/

dylan6042y ago