I'm still early in development, and I would love to build this into an actual product that can be integrated into existing DAWs, or even turn it into a musical framework of its own for AR and VR experiences.
If you're interested in working on it or if you simply want to know more, feel free to contact me.
This is only useful if you actually do something with the extra dimensions, e.g. lifting with a kernel. Duplicating the exact same data into more dimensions is not helpful.
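To make that concrete, here is a minimal sketch on toy XOR-style data. Everything in it (the points, the `duplicate` and `lift` helpers, the small perceptron) is made up for illustration and has nothing to do with the demo's actual code. Copying the coordinates into extra dimensions leaves the data just as inseparable, while adding the product feature that a degree-2 polynomial kernel would use makes it linearly separable.

```python
# Toy XOR-style data: no straight line separates the two classes in 2D.
POINTS = [(-1, -1), (1, 1), (-1, 1), (1, -1)]
LABELS = [1, 1, -1, -1]


def duplicate(point):
    """Copy the same coordinates into extra dimensions (adds no information)."""
    x, y = point
    return (x, y, x, y)


def lift(point):
    """Add the product x*y, one of the features of a degree-2 polynomial kernel."""
    x, y = point
    return (x, y, x * y)


def perceptron_converges(features, labels, max_epochs=100):
    """Return True if a plain perceptron finds a separating hyperplane."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for feat, label in zip(features, labels):
            score = sum(w * f for w, f in zip(weights, feat)) + bias
            if label * score <= 0:  # misclassified, or sitting on the boundary
                weights = [w + label * f for w, f in zip(weights, feat)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the data is separated
            return True
    return False


duplicated_ok = perceptron_converges([duplicate(p) for p in POINTS], LABELS)
lifted_ok = perceptron_converges([lift(p) for p in POINTS], LABELS)
print("duplicated:", duplicated_ok, "lifted:", lifted_ok)
```

The duplicated version can never converge, because a linear function of (x, y, x, y) is still just a linear function of (x, y).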
Moreover, music and dancing are two sides of the same coin and in styles such as hip hop and popping you have 'beat-killers'[0] who basically represent music through their motion.
If you build an appropriate spatial language, you can compose music from dancing or you can build a form of generalized dancing. I've explored this concept artistically[1],[2] and I'm currently working on formalizing it and making it computational.
[0] https://www.youtube.com/watch?v=IeNn4XG0QHo [1] https://imgur.com/gallery/pfrvIot [2] https://imgur.com/gallery/MoDVr6v
I personally have problems saying two different letters differently, and I just recorded myself trying to say both letters. Sure enough, there is very little difference in the way they are both displayed. I will ask other people to say these letters and see if I can spot the difference. Then I should be able to play around with differing lip, mouth, tongue, and larynx positions to see how close I can come.
Any tips for highlighting subtle differences in specific places would be appreciated!
However, I don't know if I have the time to do it, as there are so many concepts I'm trying to channel through code. Hopefully, as I build more computational tools, I will increase my bandwidth.
Providing visual tools for phonetic assistance is definitely something I've had in mind for a while and solving the dataset building problem should solve that along the way.
As of now, you can: 1) set the amount_sphere_tube slider to 1, 2) decrease the playback_rate to 0.1 (in the GUI on the right), 3) on the media controller, go to options (the naming might vary depending on the browser) and set the playback rate to normal or x1. This should give you maximal temporal resolution.
One way features can be further highlighted is by generating embeddings from Magenta DDSP[0]. Speech sounds are fairly complex though, so I don't know how models built on music data would generalize to them. I think the tech to do this is there, but for the time being it seems to be scattered across fairly siloed fields. I also tried to use live voice recordings, but there are latency issues with the subset of the Web Audio API I'm currently using. However, there definitely is value in having live, spatial feedback on pronunciation.
[0] https://github.com/magenta/magenta-js/tree/master/music#ddsp
Jot my email address down anyway, and if you ever get around to working on those speech tools I'll be thrilled to help.
Try loading smaller samples or audio snippets.
I got two different tracks to work, and it's clear that one was a lot harder to process than the other. The second one took noticeably more time to start, and the CPU utilization was higher as well. They were both instrumental tracks in the same format and around the same length. The one simpler to process was the instrumental of Britney Spears' Baby One More Time. The harder one was Porter Robinson's Divinity.
Neither track produced an effect similar to the one in the demo video, but both were interesting regardless. They both looked like how I imagine sound waves echo and bounce around if contained in a cube shape.
I appreciate the notebook writeup where you described the goals, because the visualization wasn't inherently intuitive with the sound. I chose much more complex tones than your demo. I imagine the feature extraction is much easier on isolated sounds. This reminds me a lot of MilkDrop, so I was expecting it to be closer to that but in 3D. That was probably a misunderstanding on my part of the goals for this.
I think exposing more parameters about how features get mapped and scaled would be really helpful in making it feel more intuitive. Zooming the cube in and out is nice but didn't seem to help convey more information with the tracks I chose. If anything it got in the way because on my computer the zoom sensitivity was very very high.
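As a rough idea of what "exposed parameters" could look like, here is a small sketch of a configurable feature-to-visual mapping. The `FeatureMapping` class and all of its field names are hypothetical, not taken from the demo; the only assumption is that each extracted feature is a number that gets scaled into some visual range.

```python
import math
from dataclasses import dataclass


@dataclass
class FeatureMapping:
    """Hypothetical user-tunable mapping from a feature value to a visual parameter."""
    in_min: float            # lowest feature value expected (must be > 0 for log_scale)
    in_max: float            # highest feature value expected
    out_min: float = 0.0     # visual parameter at in_min
    out_max: float = 1.0     # visual parameter at in_max
    log_scale: bool = False  # useful for frequency-like features

    def apply(self, value: float) -> float:
        # Clamp so out-of-range inputs cannot blow up the visuals.
        v = min(max(value, self.in_min), self.in_max)
        if self.log_scale:
            t = math.log(v / self.in_min) / math.log(self.in_max / self.in_min)
        else:
            t = (v - self.in_min) / (self.in_max - self.in_min)
        return self.out_min + t * (self.out_max - self.out_min)


# Example: map a frequency in Hz to a radius between 0 and 1 on a log scale,
# and a loudness value between 0 and 10 to a brightness between 0 and 100.
freq_to_radius = FeatureMapping(in_min=20.0, in_max=20000.0, log_scale=True)
loudness_to_brightness = FeatureMapping(in_min=0.0, in_max=10.0, out_max=100.0)
print(freq_to_radius.apply(440.0), loudness_to_brightness.apply(5.0))
```

Sliders for the input range, output range, and linear/log switch of each feature would let people tune the picture to the track they loaded.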
I look forward to seeing where this goes.
This demo is meant for small audio samples. My initial goal was to use it to visualize and compare drum samples by looking at their 'spatial signature'.
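For anyone who wants a crude numeric counterpart to comparing short samples, here is a sketch using the spectral centroid (the "center of mass" of the spectrum). This is a stand-in feature I picked for illustration, not the 'spatial signature' the demo computes, and the function names are made up. It uses a naive DFT, so it is only practical for very short snippets.

```python
import cmath
import math


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of a short mono signal, via a naive DFT."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        magnitudes.append(abs(acc))
    total = sum(magnitudes)
    if total == 0:
        return 0.0  # silence has no meaningful centroid
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total


def sine(freq, sample_rate, n):
    """Synthetic test tone standing in for a real sample."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


low = spectral_centroid(sine(100, 1000, 100), 1000)
high = spectral_centroid(sine(300, 1000, 100), 1000)
print(low, high)  # the brighter tone gets the higher centroid
```

A kick would typically land much lower on this scale than a hi-hat, which is the kind of difference a spatial layout could make visible.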
Right now, I'm using arbitrarily defined 'shapes' (sphere and tube), but the goal is to recover those from real-world data. Unfortunately, building the appropriate dataset and the model to go with it is currently out of my reach, but I know that's more or less how my brain learned these audio/spatial associations.
In order for this to work on complete tracks, I would need to add source separation, transcription and some form of information compression.
Additionally, I'm working on ways to deal with richer sounds, by laying them out in space or by splitting them into 'voices of unison'. Here's a demo of what it would look like: https://imgur.com/gallery/gkFPXXu
There are two directions I could take this: either making music from interacting with spatial representations, or building spatial representations from music. I don't have the bandwidth to do both alone, so I would love to work on it with other developers.
I found that a slightly-to-the-side, closely zoomed-in isometric view best shows the finer waves and cadences being visualized.
You should provide example tracks.
10/10