Through in-context learning, Voicebox can synthesize speech in any audio style, taking as input a reference audio sample of the desired style and the text to synthesize. The speech it produces is consistent with the reference in every aspect, including voice, background noise, and speaking style.
Target Text: Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings
Target Text: They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission
Target Text: And lay me down in thy cold bed and leave my shining lot
Target Text: And the whole night the tree stood still and in deep thought
Target Text: Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid
Target Text: The army found the people in poverty and left them in comparative wealth
Target Text: Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech
Target Text: He was in deep converse with the clerk and entered the hall holding him by the arm
Target Text: Number ten fresh nelly is waiting on you good night husband
Target Text: Rather a hypothetical question colonel but i should say it might be a fifty fifty proposition
Target Text: How much wood could a woodchuck chuck if a woodchuck could chuck wood
Target Text: When feline magicians enchant the city and crafty canine illusionists work to restore balance, don’t miss the uproarious clash in ‘magic and mischief: the paws of mystery.’
Target Text: Peter piper picked a peck of pickled peppers.
Target Text: Voicebox is the swiss army knife of text to speech acing multiple languages, changing voice styles, and dishing out custom samples.
Target Text: In a land where cat pirates sail the high seas and dog buccaneers chase their tails, embark on a swashbuckling comedy adventure in ‘furry buccaneers: the quest for the golden bone.’