Creating ad hoc microphone arrays from personal devices (2019)

(microsoft.com)

186 points · tomstokes · 6y ago · 53 comments

crazygringo6y ago· 16 in thread

This is a really interesting technical concept.

Capturing high-quality audio in a meeting room for videoconferencing is a notoriously complicated problem.

Microphones are crazy sensitive and pick up things like footsteps and conversations outside the door, shuffling feet and tapping on keyboards, and construction and HVAC noise like you wouldn't believe.

So filtering those things out, and then capturing the best quality audio from the current speaker, and trying to get everyone's voice at roughly the same volume whether they're sitting directly across from the microphone or are piping up from the corner of the room...

...and do this all while cancelling 100% of the echo that might be coming from two or three speakers at once...

...it's an insanely hard problem. Beamforming microphones absolutely help in a huge way, because if you know the speaker's voice is coming from 45° then knowing that any sound coming from any other angle can be removed is a really helpful piece of info.

Now, with beamforming microphones, the precise relative location and direction of each mic is known. The idea of creating one big beamforming mic for the room out of people's individual mics is... insanely hard, but super cool.
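To make the "known geometry" case concrete, here is a minimal delay-and-sum sketch. It is not from the article: the mic spacing, sample rate, speed of sound, whole-sample delays, and the helper names (`steering_delays`, `delay_and_sum`) are all assumptions chosen for illustration, and the "voices" are just white noise.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature air
SAMPLE_RATE = 16000     # Hz, assumed

def steering_delays(mic_x, angle_deg):
    """Arrival delay in whole samples at each mic for a far-field source.

    mic_x holds mic positions in metres along a line. The angle is measured
    from the array axis, so 90 degrees is broadside (all delays equal).
    """
    cos_a = math.cos(math.radians(angle_deg))
    raw = [-x * cos_a / SPEED_OF_SOUND * SAMPLE_RATE for x in mic_x]
    base = min(raw)
    return [round(d - base) for d in raw]

def delay(signal, d):
    """Delay a signal by d whole samples, zero-padding the front."""
    return [0.0] * d + signal[:len(signal) - d]

def delay_and_sum(channels, delays):
    """Undo each channel's steering delay, then average the channels."""
    n = len(channels[0]) - max(delays)
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]

def power(x):
    return sum(v * v for v in x) / len(x)

rng = random.Random(0)
mic_x = [0.0, 0.1, 0.2, 0.3]            # hypothetical 4-mic line array
talker = [rng.gauss(0, 1) for _ in range(4000)]
noise = [rng.gauss(0, 1) for _ in range(4000)]
d_talk = steering_delays(mic_x, 45)     # the speaker at 45 degrees
d_noise = steering_delays(mic_x, 120)   # an interferer elsewhere

# Each mic hears both sources, each with its own direction-dependent delay.
channels = [
    [a + b for a, b in zip(delay(talker, dt), delay(noise, dn))]
    for dt, dn in zip(d_talk, d_noise)
]

steered = delay_and_sum(channels, d_talk)
residual = [s - t for s, t in zip(steered, talker)]
print("interferer power before:", power(noise))
print("leftover power after steering:", power(residual))
```

Steering at 45° lines the talker up across all four channels while the interferer lands at a different offset in each, so averaging keeps the talker and knocks the interferer down. Without the mic positions there is nothing to feed `steering_delays`, which is the hard part with people's own devices.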

It's interesting to me that this article is about measuring the quality of voice transcription, rather than about the quality of audio in an actual meeting. But I suppose the voice transcription quality measurement is simply a proxy for the speaker audio quality generally, no?

This could actually be a huge step forward in not needing videoconferencing equipment in meeting rooms. So far, one of the biggest reasons has actually been dealing with echo and feedback -- when people are in the same call with multiple devices in the same room, it tends to end badly. But if the audio processing is designed for that... the results could actually be quite amazing.

And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.

vxNsr6y ago

> And it's well-known that the "bowling alley" visual of meeting participants (camera at the end of a long conference table) isn't ideal. If each participant has their own laptop camera on themselves, it could be a vastly better experience for remote participants.

My company pushes us to have any conference that will include remote people from our desks, even if some or most of the attendees are in the same physical location. It means that no audio is dropped because of too much cross-talk and that all attendees are on the same footing. Only real issue is that we don’t automatically get headsets; you need to request/expense it.

gregmac6y ago

Yeah, this is just a much better way to conduct meetings.

I've been in many meeting rooms where there's a single projector/tv, and the person controlling it only shows either the remote cameras OR their screen (while they're sharing), so that isolates the remote people even more. (I've also been the remote person in this situation, and it definitely feels more like being an occasionally noisy fly on the wall than a full participant).

Everyone also gets their full desktop (big/multiple monitors, full keyboard, etc).

It'll be interesting to see what happens post-lockdown.. will the people miss the benefits of "one remote = all remote" and have more empathy for remote people, or will we go back to the same old?

wnoise6y ago

> have any conference that will include remote people from our desks,

This is great for the participants, but absolute hell for everyone else in open offices, or even shared offices.

irjustin6y ago

Really late to the party, but I love this concept. I feel like this would be really difficult in an open office/shared office.

I enjoy team based offices, 7-10 man rooms. Even in there, this would probably be a nightmare unless you had this tech running in real time so you don't get microphone crosstalk/echo.

Nonetheless, I really like the spirit of the system.

Zenst6y ago

> Capturing high-quality audio in a meeting room for videoconferencing is a notoriously complicated problem

Not from my experience of 20 years ago setting up VC systems; the biggest issue was video: making sure the lighting was good, with a plain wall behind (sky blue was a good colour for that).

Audio-wise, there were many desk-standing mics (can't recall the main brand, but there were a few).

Did have one issue once when setting up a connection to a remote French company: there was no audio from their end. It turned out that the technician at the other end was sat on the end of the table in front of the camera, and was also sat upon the mic that was on the table. Soon solved, but still most funny.

Back then we had VC systems that could roll into a room on a cart and worked well. PictureTel, IIRC, was one solution back then, and Polycom was another that soon overtook them, as well as making wonderful conferencing microphones.

But as bandwidth got cheaper and more accessible, many meeting rooms that would be too noisy visually for VC became accessible and the need for dedicated rooms drifted away for more client usage.

Though audio from my experience back then was the easy part.

montroser6y ago

Quality video is definitely hard too, but it's just not as important.

If we have beautiful, well-lit video feeds of every participant, but no one can hear what they're saying -- that's a deal breaker. The other way around, if we have clean, crisp audio from everyone and inconsistent video, at least the conversation can still move forward.

crazygringo6y ago

What I meant was it's a complicated problem for the software and microphone engineers. Not for installation! :)

m4636y ago

"Please, before we start the meeting, can everyone in the room allow app microphone access for the best experience?"

qmmmur6y ago

It's a hard and interesting signals problem with surely many other benefits but surely money would be better spent just buying better mics and audio gear for an office.

qppo6y ago

Or better gear that uses it.

https://www.shure.com/en-US/products/microphones/mxa910

airstrike6y ago

> This could actually be a huge step forward in not needing videoconferencing equipment in meeting rooms. So far, one of the biggest reasons has actually been dealing with echo and feedback -- when people are in the same call with multiple devices in the same room, it tends to end badly. But if the audio processing is designed for that... the results could actually be quite amazing.

It seems to me that these days the simpler solution to both these problems is to just have people use airpods.

crazygringo6y ago

That doesn't work for multiple people in the same room all wearing AirPods. Everyone's mic picks up everyone's voice, not just the "real" speaker.

And a lot of meetings have most (e.g. 10) people in a single room, with another handful (e.g. 5) of remote participants.

LegitShady6y ago

Perhaps this is why simpler isn't always better. You aren't really solving the problem and the cost of your solution outweighs its "simplicity".

15 people around the table in a conference room and each person wearing airpods (thus being connected to their own device) is an expensive solution with a lot of points of failure.

eeZah7Ux6y ago

> it's an insanely hard problem

Not much: each participant has a pair of (amplitude, phase) values for each microphone. Filtering human voices and correlating sources to find the phase is not new.

crazygringo6y ago

That's not how amplitude and phase work, you don't "set them" on a microphone.

Filtering sources isn't new, of course, and doing it manually fiddling around with a recording you already made is one thing.

But doing it automatically in real-time on equipment with unknown characteristics that leaves zero ghost signal behind... is, yes, insanely hard.

jcims6y ago

Agree. It's even easy to do manually in Audacity with recorded tracks. There are probably some ML innovations here (maybe isolating the voice signal from the background in a way that lets you label the phase info sufficiently to correlate it) but the main innovation that I can see is packaging it in a way that's useful in this context.

itchyjunk6y ago· 10 in thread

There are obvious(?) privacy issues and what not here. But ignoring all that for a second, it does sound pretty cool to be able to leverage all the little computers we walk around with.

Think of all those shitty little video clips people take at a concert. Could all those be combined to make some high quality panoramic video? Probably a lot of other cool applications that I can't even comprehend for now. What a time to be alive.

crazygringo6y ago

> Think of all those shitty little video clips people take at a concert. Could all those be combined to make some high quality panoramic video?

People definitely have created very cool, close to seeming professionally-produced, full-length concert films out of tens of separate YouTube uploads. It takes a lot of editing skill though.

An actual 3D panorama is vastly harder, but remember Microsoft PhotoSynth? That rendered a 3D point cloud out of hundreds/thousands of tourist photos, and positioned the photos from where they were taken.

kick6y ago

Panoramic no (panoramas work on a single axis), but interesting despite that.

One of the applications for a thing like that would be creating 3D environments of concerts and other historical events that were fairly accurate from any angle, though, which could have some pretty interesting effects (could you imagine how interesting it would be if you could watch old concerts of dead artists, or a politician's speech from a hundred years ago, with 6 degrees of freedom, "accurate to the millimeter!" or something?) and outcomes and so on. Much more interesting historical record-keeping.

jcims6y ago

The word(s) for your 3d application is photogrammetry and/or videogrammetry.

stallmanite6y ago

It was more premeditated than what you describe but the Beastie Boys handed out hundreds of cameras before a show in Madison Square Garden in 2006 and then post-processed the resulting footage into a pretty epic concert film.

https://en.m.wikipedia.org/wiki/Awesome;_I_Fuckin'_Shot_That!

eeZah7Ux6y ago

Phones and servers in the offices have more than enough processing power to do this, without the cloud surveillance service.

Also, it's been done by sound engineers for decades.

ape46y ago

Sounds like they aren't using any computing power of the little computing devices.

ComodoHacker6y ago

Looks like they're going to leverage only our microphones, not the computers. Just another plausible way to suck even more data into the cloud.

m4636y ago

"Going forward, we will fund business ideas that: allow microphone access, allow camera access, allow location services, allow calendar access, allow..."

It's like webex - it turns on 24x7 microphone access "to detect nearby video devices"

amelius6y ago

> What a time to be alive.

This technology was already available in the 60s with Kalman filters.

adrianmonk6y ago

The key word here is "available". Computers were available in the 1960s as well, but there wasn't one in everybody's pocket.

The innovation here seems to be primarily about making this available in circumstances where it wouldn't otherwise be.

Zenst6y ago· 2 in thread

Interesting, doable and from my experience of this area, need a reference sound to calibrate, though that calibration could be ongoing for such things like this.

Gets down to matching a single sound and working out the timing of that sound from the multiple sources. Then you also need to factor in the frequency response as well.

That last part would be important for handling things like devices picking up vibrations from the table they sit on. Remember that phones don't have a rubber base to isolate them from the table, so any vibration of that surface would propagate into the device and its microphone. Then there's the whole aspect of varying devices and, with that, varying microphone quality and device housings. So calibrating at some level would be key for this to work.

It's doable, though, and processing-wise you could even run a master device and handle the processing there, removing the server aspect: some of the processing would be done on each local device and passed on to the main device for correlating. Certainly some phones have the power to handle this and replace the server. But that would be more work/effort and is something we may well see later on. It does make it harder to sell a bit of server processing software, though.

Though one test I'd like to see this system handle would be how well it filters out those vibrations.

After all you don't want to hear somebody writing or putting a cup or other object down whilst somebody else is talking.

I'd also wonder what type of jitter tolerances they are working with across those devices and how that scales with devices/jitter - does jitter increase after so many devices.
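The "matching a single sound and working out the timing" step can be sketched with a plain cross-correlation. This is only an illustration under assumed conditions: synthetic white-noise "speech", a whole-sample delay, and a hypothetical helper named `best_lag`. It says nothing about how the article's system actually aligns devices.

```python
import random

def best_lag(reference, other, max_lag):
    """Return the lag in samples at which `other` best matches `reference`.

    A positive result means `other` heard the sound later than `reference`.
    """
    def score(lag):
        total = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += r * other[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=score)

rng = random.Random(1)
sound = [rng.gauss(0, 1) for _ in range(2000)]
true_delay = 23  # samples; assumed extra travel time to the second device

# Device A hears the sound directly; device B hears it later and quieter.
# Both add their own independent noise.
device_a = [s + rng.gauss(0, 0.3) for s in sound]
delayed = [0.0] * true_delay + sound[:len(sound) - true_delay]
device_b = [0.6 * s + rng.gauss(0, 0.3) for s in delayed]

estimated = best_lag(device_a, device_b, max_lag=50)
print("estimated delay in samples:", estimated)
```

Per-device frequency response, table vibration, and clock jitter would all smear that correlation peak in practice, which is presumably where the calibration questions above come in.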

ftio6y ago

Could you do the reference sound beyond the range of human hearing so that you could do it continuously?

Zenst6y ago

Nope, as different frequencies propagate at different speeds.

However, the initial greeting at the start of the meeting would be good enough to cover that. Though some feedback and constant recalibration would be ideal and doable. That covers things like people entering the room and briefly changing the room's acoustics with the door open. Then somebody closes a blind, and things like that; even somebody moving a coffee cup on the table would have (whilst small) an impact upon the acoustics. Though in that last instance, somebody moving a cup nearer a device would have a bigger impact upon that single source.

Though easiest way would be having a sound source on the main camera that did a simple frequency sweep - if you wanted to use a reference point sound source for calibration. You may even get away with single calibration then, though dynamic calibration and using the meeting itself to constantly recalibrate, whilst more effort, would give a better result.

But be interesting seeing this in action and how they handle aspects like that.

Indeed, thinking it thru you could have each device as it joins into the meeting do a calibration tone sweep that the other devices would pick up. That approach may well be better as you could get a more accurate map of all microphones in relation to each other that way. So initial login/join of the devices would handle that aspect nicely.
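As a rough sketch of the calibration-sweep idea: one device plays a known frequency sweep, and a listening device slides that same sweep over its recording to find where it arrives. Everything specific here is an assumption made for illustration (the 500 Hz to 4 kHz range, the 50 ms length, the sample rate, the helper names), and turning the arrival sample into a distance only works if both devices agree on when the sweep started, which ad hoc devices generally would not without extra synchronisation.

```python
import math
import random

SAMPLE_RATE = 16000     # Hz, assumed
SPEED_OF_SOUND = 343.0  # m/s, assumed

def linear_sweep(f_start, f_end, seconds):
    """Sine sweep whose frequency rises linearly from f_start to f_end."""
    n = round(seconds * SAMPLE_RATE)
    rate = (f_end - f_start) / seconds
    out = []
    for i in range(n):
        t = i / SAMPLE_RATE
        out.append(math.sin(2 * math.pi * (f_start * t + 0.5 * rate * t * t)))
    return out

def find_arrival(recording, sweep):
    """Slide the known sweep over the recording; return the best start index."""
    best_index, best_score = 0, float("-inf")
    for start in range(len(recording) - len(sweep) + 1):
        score = sum(s * recording[start + i] for i, s in enumerate(sweep))
        if score > best_score:
            best_index, best_score = start, score
    return best_index

rng = random.Random(2)
sweep = linear_sweep(500.0, 4000.0, 0.05)
true_offset = 47  # samples of travel time, assumed for the simulation

# What a listening device records: silence, an attenuated sweep, silence,
# all with some background noise on top.
recording = [0.0] * true_offset + [0.5 * s for s in sweep] + [0.0] * 153
recording = [v + rng.gauss(0, 0.2) for v in recording]

arrival = find_arrival(recording, sweep)
# Only meaningful under the shared-start-time assumption described above.
distance_m = arrival / SAMPLE_RATE * SPEED_OF_SOUND
print("arrival sample:", arrival, "implied distance (m):", distance_m)
```

A sweep is a convenient reference because it covers many frequencies at once and correlates sharply with itself, so the same recording could also be used to compare each microphone's frequency response.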

kohtatsu6y ago· 2 in thread

Would be cool if Microsoft gave more shits about privacy.

Edit: This would be cool if I trusted Microsoft to properly handle privacy.

moron4hire6y ago

I trust MS to not sell my data to every random jabroni on the net more than I trust Google.

airstrike6y ago

While I agree, that's also an incredibly low bar

geokon6y ago· 1 in thread

Does anyone have any insight into why neural nets are used for the "blind" beamforming? I don't have first hand experience with machine learning, but this just doesn't seem to me like a machine learning type of problem. I get it's not trivial, but it seems like there should be an analytic solution - more or less

crazygringo6y ago

Acoustics are modified in extremely non-linear ways depending on the shape of the room, bodies within it, materials, acoustic reflection, acting differently at different frequencies, and so on.

In theory if the entire 3D layout and material properties were known in advance you could get clear audio analytically. But reverse-engineering the 3D layout and materials from existing audio is essentially impossible.

So machine learning is used to find approximate solutions that work.

stuaxo6y ago· 1 in thread

Oh, I wanted this years ago when phones had terrible microphones and audio codecs.

The idea was that at a gig loads of people would record and you could reconstruct a much better recording.

dannypgh6y ago

I'd assume a lot of the losses would be the same across all devices - e.g. GSM and associated preprocessing will result in dynamic compression in a uniform way regardless of placement, no? It's an interesting idea but it seems like you'd need a mixture of different compression types.

pjc506y ago

My employer calls this "far field" audio, and has a number of hardware/firmware solutions: https://www.cirrus.com/products/cs48lv41f/ (we're also very secretive, so I can't really discuss it beyond the public website)

The specific improvement Microsoft are touting is blind beamforming, without knowing where the microphones are located relative to each other. Regular beamforming is already in use in some products.
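For contrast with regular beamforming, here is a deliberately simple stand-in for the "blind" case: the mic positions are never used, each channel's offset is estimated from the audio itself, and the aligned channels are averaged. This is a classical align-and-average sketch with hypothetical helper names and synthetic signals, not the method Microsoft describes; it also assumes a single talker and whole-sample delays.

```python
import random

def estimate_lag(reference, other, max_lag):
    """Lag in samples of `other` relative to `reference`, by cross-correlation."""
    def score(lag):
        return sum(r * other[i + lag]
                   for i, r in enumerate(reference)
                   if 0 <= i + lag < len(other))
    return max(range(-max_lag, max_lag + 1), key=score)

def blind_align_and_average(channels, max_lag):
    """Align every channel to channel 0 using estimated lags, then average."""
    lags = [estimate_lag(channels[0], ch, max_lag) for ch in channels]
    lo, hi = min(lags), max(lags)
    n = len(channels[0])
    out = [
        sum(ch[i + lag] for ch, lag in zip(channels, lags)) / len(channels)
        for i in range(-lo, n - hi)
    ]
    return out, lags

def delay(signal, d):
    """Delay a signal by d whole samples, zero-padding the front."""
    return [0.0] * d + signal[:len(signal) - d]

def power(x):
    return sum(v * v for v in x) / len(x)

rng = random.Random(3)
speech = [rng.gauss(0, 1) for _ in range(1500)]
device_delays = [5, 17, 9]  # unknown to the algorithm; used only to simulate

# Three devices at unknown positions, each adding its own noise.
channels = [
    [s + rng.gauss(0, 0.8) for s in delay(speech, d)]
    for d in device_delays
]

combined, lags = blind_align_and_average(channels, max_lag=20)
clean = delay(speech, device_delays[0])
noise_single = [c - s for c, s in zip(channels[0], clean)]
noise_combined = [c - s for c, s in zip(combined, clean)]
print("estimated lags relative to device 0:", lags)
print("noise power, one device vs combined:",
      power(noise_single), power(noise_combined))
```

Averaging three aligned channels cuts independent noise to roughly a third here, but only because the simulation has clean integer delays and no reverberation; real rooms and overlapping speakers are where this simple approach falls apart.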

peter_d_sherman6y ago

Excerpt:

"While the idea sounds simple, it requires overcoming many technical challenges to be effective. The audio quality of devices varies significantly. The speech signals captured by different microphones are not aligned with each other. The number of devices and their relative positions are unknown. For these reasons and others, consolidating the information streams from multiple independent devices in a coherent way is much more complicated than it may seem. In fact, although the concept of ad hoc microphone arrays dates back to the beginning of this century, to our knowledge it has not been realized as a product or public prototype so far."

Thoughts:

There's something deep here, not with respect to microphones and speech transcription (although I wish Microsoft and whoever else attempts to wrestle with those problems the greatest of success!)

There's a related deep problem in physics here.

If we consider signals that emanate from outer space, let's say they're from the big bang, or heck, let's just say they're from one of our past-the-edge-of-this-solar-system satellites -- that wants to communicate back to earth.

Well, due to the incredible distances involved, the signal will get garbled in various ways...

So here's the $64,000 question:

When that signal from deep space gets garbled, isn't it possible that it turns into various other signals, at various different other frequencies and wavelengths?

In other words, space itself, over long distances, acts as a prism (not really, but as an easy way to wrap your mind around this concept), for radio, and other electromagnetic waves...

Now, if you want to reconstruct the original message at these long distances, you must be able to reconstruct garbled radio (and other em) waves, which are moving at different frequencies, and may even arrive at the destination at different rates of speed with various time shifts...

Basically, you've got to take those pieces -- move them to the correct frequency, time correct them, speed them up or slow them down, sync them, and overlay them -- to reconstruct the original message...

That's the greater question in physics -- the ability to do all of that, with em signals from a long way off in space...

The article referenced -- is the microphone/audio/slow speed equivalent -- of that larger problem...

pabs36y ago

This reminds me of this open source project (and its predecessor manyears and open hardware projects 8/16soundsusb).

https://github.com/introlab/odas https://github.com/introlab/manyears https://github.com/introlab/16SoundsUSB

Website of the team behind these:

https://introlab.3it.usherbrooke.ca/

stragies6y ago

I look forward to exploring that github source drop.

andrewfromx6y ago

wow i just added https://news.ycombinator.com/item?id=22956082 a few days ago, on point no?
