SoundStorm

Efficient Parallel Audio Generation

Abstract. We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.

Official demo page: https://google-research.github.io/seanet/soundstorm/examples

My implementation: https://github.com/lifeiteng/SoundStorm

This page only presents reproduced results.

Model Overview

Unlike the paper, which conditions SoundStorm on the semantic tokens of AudioLM, I trained SoundStorm directly on phoneme sequences. The model is still training (less than one epoch so far), and the codec is not ideal.
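The confidence-based parallel decoding mentioned in the abstract can be illustrated with a small sketch. This is not the code from the paper or from the repository above. It is a simplified, MaskGIT-style loop over a single token sequence, written under several assumptions: the `MASK_ID` sentinel, the cosine masking schedule, greedy candidate selection (sampling is a common alternative), and the `toy_model` stand-in for a bidirectional network are all hypothetical choices made for illustration. Conditioning on phonemes or semantic tokens and the multiple residual codebook levels of a neural codec are left out.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel meaning "position not decoded yet"


def cosine_schedule(step, num_steps):
    """Fraction of positions left masked after `step` (1-based). Assumed schedule."""
    return float(np.cos(0.5 * np.pi * step / num_steps))


def parallel_decode(predict_logits, seq_len, num_steps=8):
    """Fill a fully masked sequence in a fixed number of parallel steps.

    `predict_logits(tokens)` must return an array of shape (seq_len, vocab_size).
    """
    tokens = np.full(seq_len, MASK_ID, dtype=np.int64)
    for step in range(1, num_steps + 1):
        masked = tokens == MASK_ID
        if not masked.any():
            break

        # Predict every position at once, then turn logits into probabilities.
        logits = predict_logits(tokens)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)

        # Greedy candidates; their probability serves as the confidence score.
        candidates = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        confidence[~masked] = np.inf  # already decoded positions are never re-masked

        # Decide how many positions stay masked, always decoding at least one.
        num_masked_next = int(np.floor(cosine_schedule(step, num_steps) * seq_len))
        num_masked_next = min(num_masked_next, int(masked.sum()) - 1)

        # Accept all candidates, then re-mask the least confident ones.
        tokens = np.where(masked, candidates, tokens)
        if num_masked_next > 0:
            remask = np.argsort(confidence)[:num_masked_next]
            tokens[remask] = MASK_ID
    return tokens


def toy_model(tokens, vocab_size=16):
    """Stand-in for a bidirectional network: returns random logits."""
    rng = np.random.default_rng(int((tokens != MASK_ID).sum()))
    return rng.normal(size=(len(tokens), vocab_size))


decoded = parallel_decode(toy_model, seq_len=20, num_steps=4)
print(decoded)
```

The point of the sketch is that the number of model calls is fixed by `num_steps` rather than by the sequence length, which is where the speed-up over token-by-token autoregressive decoding comes from.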

LibriSpeech Samples

Text	Speaker Prompt	Ground Truth	Official VALL-E	Unofficial VALL-E LibriTTS Model	Unofficial SoundStorm
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.
And lay me down in thy cold bed and leave my shining lot.
Number ten, fresh nelly is waiting on you, good night husband.
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Acoustic Environment Maintenance

VALL-E can synthesize personalized speech while maintaining the acoustic environment of the speaker prompt. The audio and transcriptions are sampled from the Fisher dataset.

Text	Speaker Prompt	Ground Truth	Official VALL-E	Unofficial VALL-E LibriTTS Model	Unofficial SoundStorm
I think it's like you know um more convenient too.
Um we have to pay have this security fee just in case she would damage something but um.
Everything is run by computer but you got to know how to think before you can do a computer.
As friends thing I definitely I've got more male friends.

Speaker’s Emotion Maintenance

VALL-E can synthesize personalized speech while maintaining the emotion in the speaker prompt. The audio prompts are sampled from the Emotional Voices Database.