ProBeat: The question of cloud AI or edge AI is far from settled

Microsoft Teams and Google Meet

This week, I did a deep dive into Google Meet’s noise cancellation, a couple of months after detailing Microsoft Teams’ noise suppression. Both use supervised learning. Both try to filter out typing, vacuum cleaners, and rustling bags while keeping speech, singing, and laughter. Sure, Google Meet cancels out musical instruments while Microsoft Teams keeps them, but other than that they’re nearly identical. At least they look that way, until you peek under the hood.
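
For readers wondering what “supervised learning” means here: both systems learn from pairs of noisy and clean audio. The sketch below is a generic illustration of that setup, not either company’s model; the tiny network, the feature size, and the synthetic training data are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Generic supervised denoising setup: the model sees a noisy spectrogram
# frame and learns to recover the clean one. Sizes here are arbitrary.
N_BINS = 257  # e.g., one side of a 512-point FFT

model = nn.Sequential(
    nn.Linear(N_BINS, 256),
    nn.ReLU(),
    nn.Linear(256, N_BINS),
    nn.Sigmoid(),  # predict a 0-1 suppression mask over frequency bins
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in training data: "clean speech" frames plus synthetic noise.
clean = torch.rand(64, N_BINS)
noise = torch.rand(64, N_BINS) * 0.3
noisy = clean + noise

for step in range(100):
    mask = model(noisy)          # per-bin mask: keep speech, squash noise
    estimate = mask * noisy      # apply the mask to the noisy input
    loss = nn.functional.mse_loss(estimate, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In production, the interesting differences sit around this loop: what data the model is trained on, how small it has to be, and, as it turns out, where it runs.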

The timing is no coincidence either — collaboration and video conferencing tools have never been more important than during the age of the coronavirus, when millions have to learn and work from home. Google and Microsoft are putting their machine learning chops to the test in the hopes of one-upping Zoom and crushing Slack. Google Meet and Microsoft Teams use AI to remove background noise in real time so you hear only speech on a meeting call. And yet what struck me after I interviewed their respective product leads is how differently the companies are approaching the same problem.

Here’s the simple version: Google put its machine learning model in the cloud, while Microsoft put its machine learning model on the edge. But there’s more to it than that — let me quote the product leads directly.

Here is Serge Lachapelle, G Suite director of product management:

Our job has always been passing through the cloud as quickly as possible. But now with these TensorFlow processors, and basically the way that our infrastructure is built, we discovered that we could do media manipulation in real time and add sometimes only around 20 milliseconds of delay. So that’s the road we took.

Here is Robert Aichner, Microsoft Teams group program manager:

A lot of the machine learning happens in the cloud. So for speech recognition, for example, you speak into the microphone, that’s sent to the cloud. The cloud has huge compute, and then you run these large models to recognize your speech. For us, since it’s real-time communication, I need to process every frame. Let’s say it’s 10 or 20 millisecond frames. I need to now process that within that time so that I can send that immediately to you. I can’t send it to the cloud, wait for some noise suppression, and send it back.
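
To make that frame budget concrete, here’s a minimal sketch of the constraint Aichner is describing. This is not Teams’ actual pipeline; the 20-millisecond frame size, the sample rate, and the `suppress_noise` stub are all assumptions for illustration.

```python
import time
import numpy as np

SAMPLE_RATE = 16_000       # assumed wideband speech rate
FRAME_MS = 20              # assumed frame size, per Aichner's 10-20 ms example
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per frame

def suppress_noise(frame: np.ndarray) -> np.ndarray:
    """Placeholder for an on-device noise suppression model.
    A real implementation would run a small neural network here."""
    return frame  # pass-through stub

def process_stream(frames):
    """Clean each frame within its own duration, or the audio falls behind."""
    for frame in frames:
        start = time.perf_counter()
        clean = suppress_noise(frame)
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > FRAME_MS:
            # The model took longer than the frame it was cleaning:
            # the call's audio now lags and the buffer keeps growing.
            print(f"budget blown: {elapsed_ms:.1f} ms > {FRAME_MS} ms")
        yield clean

# Demo: three silent frames through the loop.
frames = (np.zeros(FRAME_SAMPLES, dtype=np.float32) for _ in range(3))
for _ in process_stream(frames):
    pass
```

The point of the sketch is the budget check: unlike a speech recognition request that can wait on the cloud, the suppression model can’t take longer than the frame it’s cleaning, or audio starts to lag.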

That latency question also leads to a question of cost. Every additional network hop adds latency, and doing heavy server-side processing for every call drives up cost.
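
A rough back-of-envelope comparison makes the trade-off visible. Every number below is an assumption for the sake of illustration, not a measurement from either company; Lachapelle’s own figure is that Google’s cloud path sometimes adds only around 20 milliseconds.

```python
# Back-of-envelope latency for cleaning one audio frame.
# All figures below are illustrative assumptions.
on_device_inference_ms = 5   # assumed edge model time per frame
uplink_ms = 10               # assumed extra hop to a datacenter
server_inference_ms = 5      # assumed server-side model time
downlink_extra_ms = 5        # assumed extra return leg

edge_added = on_device_inference_ms
cloud_added = uplink_ms + server_inference_ms + downlink_extra_ms

print(f"edge path adds  ~{edge_added} ms and uses the caller's CPU")
print(f"cloud path adds ~{cloud_added} ms and uses servers Google pays for")
```

The sums are small either way; the real divergence is who pays for the compute, which is exactly where the two product leads land next.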

Google’s Lachapelle, on cost:

There’s a cost associated with it. Absolutely. But in our modeling, we felt that this just moves the needle so much that this is something we need to do. And it’s a feature that we will be bringing at first to our paying G Suite customers. As we see how much it’s being used and we continue to improve it, hopefully we’ll be able to bring it to a larger and larger group of users.

Microsoft’s Aichner, on cost:

You want to make sure that you push as much of the compute to the endpoint of the user because there isn’t really any cost involved in that. You already have your laptop or your PC or your mobile phone, so now let’s do some additional processing. As long as you’re not overloading the CPU, that should be fine.

But then there are other trade-offs to consider.

Google’s Lachapelle, on speed:

Doing this without slowing things down is so important because that’s basically what a big chunk of our team does — try to optimize everything for speed, all the time. We can’t introduce features that slow things down. And so I would say that just optimizing the code so that it becomes as fast as possible is probably more than half of the work. More than creating the model, more than the whole machine learning part. It’s just like optimize, optimize, optimize. That’s been the hardest hurdle.

Microsoft’s Aichner, on battery life:

Yeah, battery life, we are obviously paying attention to that too. We don’t want you now to have much lower battery life just because we added some noise suppression. That’s definitely another requirement we have when we are shipping. We need to make sure that we are not regressing there.

At first glance, these different approaches make sense. It’s right there in the companies’ respective DNA. Google was born in the internet age, while Microsoft pioneered the software era. Microsoft is traditionally about software installed locally, while Google is all about apps hosted in the cloud. This is Microsoft Office versus G Suite in a nutshell.

Still, it’s never that simple. Sure, Office dwarfs G Suite, but Microsoft Azure is more successful than Google Cloud. Meanwhile, Google Chrome won so thoroughly that Edge is now based on Chromium.

But I digress. In building out noise filtering for their respective video calling solutions, Google and Microsoft took decidedly different approaches. Google went with the cloud to bring the same experience to everyone, cost be damned. Microsoft went with the edge to bring the best experience to everyone, complexity be damned.

Both Lachapelle and Aichner acknowledge to me that they may have to change their approach based on how the rollout of each feature goes. It’s too early to say which solution is superior, or whether there will even be a winner. If, however, one of these companies backpedals, there will be a clear loser: either the cloud or the edge.

ProBeat is a column in which Emil rants about whatever crosses him that week.
