
SpeechFlow is a self-supervised generative model pre-trained on unlabeled speech. It shares a similar architecture and objective function with Voicebox, and can be fine-tuned for different tasks at a significantly lower cost. Read the paper for more details.

Zero-shot TTS

With 62.5x less labeled data, SpeechFlow can be fine-tuned to perform zero-shot TTS at a quality level comparable to Voicebox.

Target Text

Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings

Target Text

They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission

Target Text

And lay me down in thy cold bed and leave my shining lot

Target Text

And the whole night the tree stood still and in deep thought

Target Text

Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid

Target Text

The army found the people in poverty and left them in comparative wealth

Target Text

Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech

Target Text

He was in deep converse with the clerk and entered the hall holding him by the arm

Target Text

Number ten fresh nelly is waiting on you good night husband

Speech Separation

SpeechFlow can also be fine-tuned to separate overlapped speech.

Speech Enhancement

SpeechFlow can also be fine-tuned to remove noise from speech.