One practical reality it doesn't share is that your audio processing (or generation) code is often running on a bus shared with a ton of other modules, so you don't have the luxury of using "5.6ms" as your deadline for a 5.6ms buffer. Your responsibility, often, is just to be as performant as reasonably possible so that everything on the bus can be processed in those 5.6ms. The pressure is usually much higher than the buffer length suggests.
A much different experience from embedded programming, where 99% occupancy is no problem at all.
Assuming the context is a desktop OS (which is the context of TFA), I think that the main source of non-determinism is scheduling jitter (the time between the ideal start of your computation, and the time when the OS gives you the CPU to start the computation). Of course if you can't arrange exclusive or max-priority access to a CPU core you're also going to be competing with other processes. Then there is non-deterministic execution time on most modern CPUs due to cache timing effects, superscalar out-of-order instruction scheduling, inter-core synchronisation, and so on. So yeah, you're going to need some margin unless you're on dedicated hardware with deterministic compute (e.g. a DSP chip).
Most people learning audio programming aren't making a standalone audio app where they do all the processing, or at least not an interesting one. They're usually either making something like a plugin that ends up in somebody else's bus/graph, or something like a game or application that creates a bus/graph and shoves a bunch of different stuff into it.
I would love to see a modern take on the real-world risk of various operations that are technically nondeterministic. I wouldn’t be surprised if there are cases where the risk of >1ms latency is like 1e-30, and dogmatically following this advice might be overkill.
It depends on your appetite for risk and the cost of failure.
A big part of the problem is that general purpose computing systems (operating systems and hardware) are not engineered as real-time systems and there are rarely vendor guarantees with respect to real-time behavior. Under such circumstances, my position is that you need to code defensively. For example, if your operating system memory allocator does not guarantee a worst-case bound on execution time, do not use it in a real-time context.
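For illustration, a minimal sketch of that defensive style in Rust (the `Processor` type and the gain value are hypothetical, not from the comment above): do all allocation when the processor is constructed, and only indexing and arithmetic on the audio thread.

```rust
// Hypothetical real-time-safe processor: all allocation happens in `new`,
// none in `process`, so the audio thread never calls the OS allocator.
struct Processor {
    scratch: Vec<f32>, // preallocated once, never resized on the audio thread
}

impl Processor {
    fn new(max_block: usize) -> Self {
        // Allocation happens here, off the audio thread, sized for the
        // largest block the host can ever hand us.
        Processor { scratch: vec![0.0; max_block] }
    }

    // Called on the audio thread: indexing and arithmetic only, no allocation.
    fn process(&mut self, input: &[f32], output: &mut [f32]) {
        let n = input.len().min(self.scratch.len()).min(output.len());
        for i in 0..n {
            self.scratch[i] = input[i] * 0.5; // e.g. apply a fixed gain
            output[i] = self.scratch[i];
        }
    }
}

fn main() {
    let mut p = Processor::new(512);
    let input = [1.0_f32; 64];
    let mut output = [0.0_f32; 64];
    p.process(&input, &mut output);
    println!("first sample: {}", output[0]);
}
```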
I tend to agree, but...
From my recollection of using Zoom -- it has this bizarre but workable recovery method for network interruptions. Either the server or the client keeps some amount of the last input audio in a buffer. Then if the server detects connection problems at time 't', it grabs the buffer from t - 1 seconds all the way until the server detects better connectivity. Then it starts a race, playing back that stretch of the buffer to all clients at something like 1.5x speed. From what I remember, this algo typically wins the race and saves the client from having to repeat themselves.
That's not happening inside a DSP routine. But my point is that some clever engineer(s) at Zoom realized that missing deadlines in audio delivery does not necessarily mean "hosed." I'm also going to rankly speculate that every other video conferencing tool hard-coupled missing deadlines with "hosed," and that's why Zoom is the only one where I've ever experienced the benefit of that feature.
Indeed, like all real-time systems you need to think in terms of worst-case time complexity, not amortized complexity.
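A concrete (hypothetical) illustration in Rust: `Vec::push` is amortized O(1), but any single push may reallocate and copy the whole buffer, which is exactly the worst case a real-time thread can't absorb. Reserving capacity up front moves that cost off the hot path.

```rust
fn main() {
    // Amortized-friendly, worst-case-hostile: capacity grows (reallocating
    // and copying) at unpredictable pushes as the Vec fills up.
    let mut growing: Vec<u32> = Vec::new();
    let mut reallocations = 0;
    let mut last_cap = growing.capacity();
    for i in 0..1000 {
        growing.push(i);
        if growing.capacity() != last_cap {
            reallocations += 1; // this particular push paid the worst-case cost
            last_cap = growing.capacity();
        }
    }

    // Real-time-friendly: reserve once, so no push ever reallocates.
    let mut reserved: Vec<u32> = Vec::with_capacity(1000);
    let cap_before = reserved.capacity();
    for i in 0..1000 {
        reserved.push(i);
    }

    assert!(reallocations > 0);
    assert_eq!(reserved.capacity(), cap_before);
    println!("reallocations without reserve: {}", reallocations);
}
```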
An audio glitch is very annoying by comparison, especially if the application is a live musical instrument or something like that. Even the choppy rocket motor sounds of Kerbal Space Program (caused by garbage collector pauses) are infuriating.
It's kind of the difference between soft and hard real-time systems. Most audio applications don't strictly qualify as hard real time (where missing a deadline is as bad as a total failure), but failing a deadline is still much worse than in graphics.
You might sometimes build an app where (through your operating system) you connect directly with an input device and/or output device and then do all the audio processing yourself. In this case, you'd more or less control the whole bus and all the code processing samples on it and have a fairly true sense of your deadline. (The OS and drivers would still be introducing some overhead for mixing or resampling, etc, but that's generally of small concern and hard to avoid)
Often, though, you're either going to be building a bus and applying your own effects and some others (from your OS, from team members, from third party plugins/libraries, etc) or you're going to be writing some kind of effect/generator that gets inserted into somebody else's bus in something like a DAW or game. In all these cases, you need to assume that all processing code that isn't yours needs all the time that you can leave for it and just make your own code as efficient as is reasonable.
A bus is a shared medium of communication[1]. Often, busses are time-division multiplexed[2], so if you want to use the bus, but another module is already using it, you need to wait.
For example, if your audio buffers are ultimately submitted to a sound card over a PCI bus, the submission may need to wait for any ongoing transactions on the PCI bus, such as messages to a graphics card.
[1]: https://en.wikipedia.org/wiki/Bus_(computing)
[2]: https://en.wikipedia.org/wiki/Time-division_multiplexing
The cpal library in Rust is excellent for developing cross-platform desktop applications. I'm currently maintaining this library:
https://github.com/chaosprint/asak
It's a cross-platform audio recording/playback CLI tool with TUI. The source code is very simple to read. PRs are welcomed and I really hope Linux users can help to test and review new PRs :)
When developing Glicol (https://glicol.org), I documented my experience of "fighting" with real-time audio in the browser in this paper:
https://webaudioconf.com/_data/papers/pdf/2021/2021_8.pdf
Throughout the process, Paul Adenot's work was immensely helpful. I highly recommend his blog:
https://blog.paul.cx/post/profiling-firefox-real-time-media-...
I am currently writing a wasm audio module system, and hope to publish it here soon.
If your tempo drifts, then you're not going to hear the rhythm correctly. If you have a bit of latency on your instrument, it's like turning on a delay pedal where the only signal coming through is the delay.
One might assume that if you just follow audio programming guides then you can do all this, but you still need to have your system set up to handle real-time audio, in addition to your program.
It's all noticeable.
As a former developer of real time software, the usage of "real time" to mean "fast" makes me cringe a bit whenever I read it. If there's a TCP/IP stack in the middle of something, it's probably not "real time."
"real time" means there's a deadline. Soft real time means missing the deadline is a problem, possibly a bug, and quite bad. Hard real time means the "dead" part of "deadline" could be literal, either in terms of your program (a missed deadline is an irrecoverable error) or the humans that need the program to make the deadline are no longer alive.
Modern computers are ridiculously fast; relatively speaking, you don't need many resources to calculate a missile trajectory. So by "simply" making 100% sure that some calculation runs at a fixed rate, even with a GC cycle that has a deterministic upper bound (e.g. it walks the whole, non-resizable heap, but it will always take at most n seconds), you can pass the requirements. Though a desktop computer pretty much already gives up the hard part of hard real time, due to all the stuff that makes it fast: memory caching, CPU pipelining, branch prediction, ordinary OS scheduling, etc.
I suppose it's hard to make guarantees with different environments and hardware, but I realized when we (non-realtime people) ship software we don't really have guarantees for when our functions run.
- You are at the mercy of the browser. If browser engineers mess up the audio thread or garbage collection, even the most resilient web audio app breaks. It happens.
- Security mitigations prevent or restrict use of some useful APIs. For example, SharedArrayBuffer and high resolution clocks.
It's worth noting that these are practically the only cases where extreme real-time audio programming measures are necessary.
If you're making, for example, a video game the requirements aren't actually that steep. You can trivially trade latency for consistency. You don't need to do all your audio processing inside a 5ms window. You need to provide an audio buffer every 5 milliseconds. You can easily queue up N buffers to smooth out any variance.
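That trade can be made explicit: with N buffers of B frames queued at sample rate R, the added worst-case latency is N·B/R. A small sketch (the specific numbers are illustrative, not from the comment above):

```rust
// Latency added by a queue of `num_buffers` buffers, each holding
// `frames_per_buffer` frames, at a given sample rate.
fn queued_latency_ms(num_buffers: u32, frames_per_buffer: u32, sample_rate: u32) -> f64 {
    (num_buffers as f64 * frames_per_buffer as f64 / sample_rate as f64) * 1000.0
}

fn main() {
    // One 256-frame buffer at 48 kHz is ~5.3 ms of latency; queuing 4 of
    // them smooths out jitter at the cost of ~21.3 ms total.
    println!("{:.1} ms", queued_latency_ms(1, 256, 48_000));
    println!("{:.1} ms", queued_latency_ms(4, 256, 48_000));
}
```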
Highly optimized competitive video games average like ~100ms of audio latency [1]. Some slightly better. Some in the 150ms and even 200ms range. Input latency is hyper optimized, but people rarely pay attention to audio latency. My testing indicates that ~50ms is sufficient.
Audio programming is fun. But you can inject latency to smooth out jitter in almost all use cases that don't involve a live musical instrument.
Yes, background sound in games can be handled with very large buffers, but most players expect music-performance-like latency for action-driven sound.
Musicians have keenly trained ears. I would imagine they're much more sensitive to audio latency than even a pro gamer, never mind the average Joe off the street.
Where latency really matters is when you have a musical instrument that plays a sound and it's connected to a monitor. If those sounds are separated by more than 8ms or so the difference will be super noticeable to anyone, including Joe off the street.
I'd be interested for someone to run a user study on MIDI keyboard latency. I'd bet $3.50 that anything under 40 milliseconds would be sufficient. Maybe 30 milliseconds. I'd be utterly shocked if it needed to be 8 milliseconds. And I'd be extremely shocked if every popular MIDI keyboard on the market actually hit that level of latency.
I would love to see a UI system that has predictable low-latency real-time perf, so you could confidently achieve something like single frame latency on 144Hz display.
A graphics micro-stutter not so much.
> I'm not aware of any UI toolkits designed with real-time in mind.
What would be the point? The human eye can only notice so much FPS (gamers might disagree with their 244 FPS displays).
The insight is that with two threads contending on one lock, there are efficient ways to build the lock that minimize CPU use on the non-realtime thread.
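One common shape of that idea, sketched in Rust (this is just one pattern, not necessarily the lock design the parent has in mind): the real-time thread only ever `try_lock`s and falls back to a cached value, so it never blocks behind the other thread.

```rust
use std::sync::{Arc, Mutex};

// Shared parameters updated by a UI/control thread.
struct Params {
    gain: f32,
}

// Runs on the audio thread. If the control thread happens to hold the
// lock, we skip the update and keep using the value cached from a
// previous callback instead of blocking.
fn audio_callback(shared: &Arc<Mutex<Params>>, cached_gain: &mut f32, buf: &mut [f32]) {
    if let Ok(p) = shared.try_lock() {
        *cached_gain = p.gain;
    }
    for s in buf.iter_mut() {
        *s *= *cached_gain;
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(Params { gain: 0.5 }));
    let mut cached = 1.0_f32;
    let mut buf = [1.0_f32; 4];
    audio_callback(&shared, &mut cached, &mut buf);
    println!("{:?}", buf);
}
```

Dedicated wait-free structures (SPSC ring buffers, triple buffers) push this further, but the try-lock-with-fallback pattern already keeps the audio thread from ever waiting.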
DDMF's VirtualAudioStream does that. It allows you to create virtual audio devices with chains of arbitrary VST plugins, and there are thousands of free and paid VST plugins for everything. I'm using VirtualAudioStream to put a Waves noise-cancelling plugin and a good compressor between my mic and Zoom. It increases latency, of course.
I think so. TBH I'm quite new to the world of DSPs so I don't know the right terminology. The purpose of the DSP (which I should've mentioned in my original post now that I think of it) is to tweak the speakers on my laptop - there are for example ways to "fake" bass (through missing harmonics), or have dynamically changing bass. I'll have a look at VirtualAudioStream, thanks for the recommendation.
Text based: SuperCollider, Csound, Chuck
What are you even talking about, man.
Obviously the happy case is when all the audio processing is done in a DSP where scheduling is deterministic, but it's rare to be able to count on that. Part of the problem is that modern computers are so fast that people expect them to handle audio tasks without breathing hard. But that speed is usually measured as throughput rather than worst-case latency.
The advice I'd give to anybody building audio today is to relentlessly measure all potential sources of scheduling jitter end-to-end. Once you know that, it becomes clearer how to address it.