Microsoft’s AI generates voices that sing in Chinese and English

Image Credit: LuckyImages/Shutterstock


Researchers at Zhejiang University and Microsoft claim they’ve developed an AI system, DeepSinger, that can generate singing voices in multiple languages by training on data from music websites. In a paper published on the preprint server Arxiv.org, they describe the novel approach, which leverages a specially designed component to capture the timbre of singers from noisy singing data.

The work, like OpenAI’s music-generating Jukebox AI, has obvious commercial implications. Music artists are often pulled in for pick-up sessions to address mistakes, changes, or additions after a recording finishes. AI-assisted voice synthesis could eliminate the need for these, saving time and money on the part of the singers’ employers. But there’s a darker side: It could also be used to create deepfakes that stand in for musicians, making it seem as though they sang lyrics they never did, or to put them out of work. In what could be a sign of legal battles to come, Jay-Z’s Roc Nation label recently filed copyright notices against videos that used AI to make him rap Billy Joel’s “We Didn’t Start the Fire.”

As the researchers explain, singing voices have more complicated patterns and rhythms than normal speaking voices. Synthesizing them requires information to control the duration and the pitch, which makes the task challenging. Plus, there aren’t many publicly available singing training data sets, and songs used in training must be manually analyzed at the lyrics and audio level.

DeepSinger ostensibly clears these hurdles with a pipeline comprising several data-mining and data-modeling steps. First, the system crawls popular songs performed by top singers in multiple languages from a music website. It then separates the singing voices from the accompaniments with an open source music separation tool called Spleeter before segmenting the audio into sentences. Next, DeepSinger extracts the singing duration of each phoneme (a unit of sound that distinguishes one word from another) in the lyrics. After filtering the lyrics and singing voices according to confidence scores generated by a model, the system taps the aforementioned timbre-capturing component to handle imperfect or distorted training data.
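The confidence-based filtering step can be illustrated with a minimal Python sketch. It is not the researchers' code: the `Segment` structure, the `filter_segments` helper, the example lyrics, and the 0.5 threshold are all hypothetical, and the scores are assumed to fall between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One sentence-level clip paired with its lyric line (hypothetical structure)."""
    song_id: str
    lyric: str
    confidence: float  # score from an alignment model, assumed to lie in [0, 1]


def filter_segments(segments, threshold=0.5):
    """Keep only segments whose confidence score meets the (assumed) threshold."""
    return [s for s in segments if s.confidence >= threshold]


# Made-up example data: the second segment has a lyric that does not match its audio.
segments = [
    Segment("song-a", "far away from home", 0.91),
    Segment("song-a", "mismatched lyric line", 0.12),
    Segment("song-b", "another sentence", 0.50),
]
kept = filter_segments(segments)
print([s.lyric for s in kept])
```

A higher threshold would trade training data volume for cleaner lyric-to-audio pairs.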

Here are a few samples it produced. The second is in the style of Groove Coverage’s Melanie Munch, singing a lyric from “Far Away From Home.”

In experiments, DeepSinger crawled tens of thousands of songs from the internet in Chinese, Cantonese, and English that were filtered for length and normalized with respect to volume range. Those with poor voice quality or lyrics that didn’t belong in the songs were discarded, netting a training data set — the Singing-Wild data set — containing 92 hours of songs sung by 89 singers.
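The article does not say how the length filter or volume normalization was implemented. As a rough illustration only, the Python sketch below uses simple peak normalization and a duration window. The function names, the 0.9 target peak, and the 1-to-10-second bounds are assumptions, not values from the paper.

```python
def normalize_peak(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak (assumed value)."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty clip: nothing to scale.
        return list(samples)
    scale = target_peak / peak
    return [x * scale for x in samples]


def within_length(samples, sample_rate, min_s=1.0, max_s=10.0):
    """Return True if the clip duration falls inside the (assumed) allowed window."""
    duration = len(samples) / sample_rate
    return min_s <= duration <= max_s


# Made-up clip: three samples are far too short to pass the length filter.
clip = [0.1, -0.45, 0.3]
print(normalize_peak(clip))
print(within_length(clip, sample_rate=16000))
```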

The researchers report that from lyrics, duration, pitch information, and reference audio, DeepSinger can synthesize singing voices that are high quality in terms of both pitch accuracy and “voice naturalness.” They calculate the quantitative pitch accuracy of its songs to be higher than 85% across all three languages. In a user study involving 20 people, the mean opinion score gap between DeepSinger-generated songs and the original training audio was just 0.34 to 0.76.
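To make the two metrics concrete, here is a small Python sketch of one common way to compute them. The paper's exact definitions are not given in this article, so the 50-cent tolerance, the voiced-frame convention (a pitch of 0 means unvoiced), and the function names are assumptions.

```python
import math


def cents(f_ref, f_est):
    """Pitch difference between two frequencies in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(f_est / f_ref)


def pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames whose estimate is within the tolerance.

    Frames with a reference pitch of 0 are treated as unvoiced and skipped;
    an unvoiced estimate on a voiced reference frame counts as a miss.
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(
        1 for r, e in voiced if e > 0 and abs(cents(r, e)) <= tolerance_cents
    )
    return hits / len(voiced)


def mos_gap(ratings_reference, ratings_synth):
    """Difference between the mean opinion scores of reference and synthesized audio."""
    mean_ref = sum(ratings_reference) / len(ratings_reference)
    mean_syn = sum(ratings_synth) / len(ratings_synth)
    return mean_ref - mean_syn
```

Under this convention, a smaller MOS gap means listeners rated the synthesized singing closer to the recorded audio.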

In the future, the researchers plan to take advantage of more sophisticated AI-based technologies like WaveNet and jointly train the various submodels within DeepSinger for improved voice quality.
