Abstract. We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
Official demo page: https://google-research.github.io/seanet/soundstorm/examples
My implementation: https://github.com/lifeiteng/SoundStorm
This page only shows reproduced results.
Model Overview
Unlike the paper, I trained SoundStorm directly on phoneme sequences. The model is still training (less than one epoch so far), and the codec is not ideal.
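The confidence-based parallel decoding mentioned in the abstract can be illustrated with a small sketch. This is not the code from my repository or from the paper: it is a toy, MaskGIT-style loop over a single token level. `toy_model`, `MASK`, the cosine schedule, and the greedy token choice are all assumptions made for illustration. A real model would be a bidirectional network conditioned on the input tokens and would also have to handle the codec's multiple token levels.

```python
import math
import numpy as np

MASK = -1  # hypothetical id marking a position that is not decoded yet


def toy_model(tokens, vocab_size, rng):
    """Stand-in for a bidirectional network (assumption).

    Returns a (seq_len, vocab_size) array of per-position probabilities.
    Here they are random; a real model would attend over `tokens`.
    """
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in at most `num_steps` parallel passes."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for step in range(num_steps):
        masked = tokens == MASK
        if not masked.any():
            break

        probs = toy_model(tokens, vocab_size, rng)
        candidates = probs.argmax(axis=-1)  # greedy choice, for simplicity
        confidence = probs.max(axis=-1)
        confidence[~masked] = -np.inf       # never re-select decoded positions

        # Cosine schedule (assumption): fraction of the sequence that should
        # still be masked after this step. It reaches 0 on the last step.
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_keep_masked = int(math.floor(ratio * seq_len))
        num_to_fill = max(1, masked.sum() - num_keep_masked)

        # Commit the most confident candidates; the rest stay masked.
        fill_idx = np.argsort(-confidence)[:num_to_fill]
        tokens[fill_idx] = candidates[fill_idx]

    return tokens
```

Calling `parallel_decode(seq_len=16, vocab_size=8, num_steps=4)` returns 16 token ids with no masked position left. The number of forward passes is bounded by `num_steps` rather than by the sequence length, which is where the speed-up over token-by-token autoregressive decoding comes from.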
LibriSpeech Samples
Text | Speaker Prompt | Ground Truth | Official VALL-E | Unofficial VALL-E LibriTTS Model | Unofficial SoundStorm |
---|---|---|---|---|---|
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission. | |||||
And lay me down in thy cold bed and leave my shining lot. | |||||
Number ten, fresh nelly is waiting on you, good night husband. | |||||
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | |||||
Acoustic Environment Maintenance
VALL-E can synthesize personalized speech while maintaining the acoustic environment of the speaker prompt. The audio and transcriptions are sampled from the Fisher dataset.
Text | Speaker Prompt | Ground Truth | Official VALL-E | Unofficial VALL-E LibriTTS Model | Unofficial SoundStorm |
---|---|---|---|---|---|
I think it's like you know um more convenient too. | |||||
Um we have to pay have this security fee just in case she would damage something but um. | |||||
Everything is run by computer but you got to know how to think before you can do a computer. | |||||
As friends thing I definitely I've got more male friends. | |||||
Speaker’s Emotion Maintenance
VALL-E can synthesize personalized speech while maintaining the emotion in the speaker prompt. The audio prompts are sampled from the Emotional Voices Database.
Text | Emotion | Speaker Prompt | Official VALL-E | Unofficial VALL-E LibriTTS Model | Unofficial SoundStorm |
---|---|---|---|---|---|
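We have to reduce the number of plastic bags. | Anger | ||||
We have to reduce the number of plastic bags. | Sleepy | ||||
We have to reduce the number of plastic bags. | Neutral | ||||
We have to reduce the number of plastic bags. | Amused | ||||
We have to reduce the number of plastic bags. | Disgusted | ||||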
Ethics Statement
To avoid abuse, well-trained models will not be provided.