We compare several ordinary differential equation (ODE) solver configurations on the zero-shot TTS task. The midpoint method is used for fixed-step inference. With the number of function evaluations (NFE) equal to 2, generating 10 seconds of audio takes 0.35 seconds on average.
α denotes the classifier-free guidance weight. When α = 0, each midpoint step requires 2 function evaluations. When α > 0, each step requires 4, because every evaluation of the guided vector field needs both a conditional and an unconditional forward pass.
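The NFE accounting above can be illustrated with a minimal sketch. It assumes the common classifier-free guidance form (1 + α)·v_cond − α·v_uncond and integration over t ∈ [0, 1]; neither is stated in this section. The names `guided_field`, `midpoint_solve`, and `toy_field` are hypothetical, and `toy_field` is only a stand-in for the real model.

```python
import numpy as np

def toy_field(x, t, cond):
    # Hypothetical stand-in for the learned vector field (not the real model).
    target = 0.0 if cond is None else cond
    return target - x

def guided_field(v, x, t, cond, alpha, counter):
    # One evaluation of the guided field; counter[0] tracks model forward passes.
    counter[0] += 1
    v_cond = v(x, t, cond)
    if alpha == 0:
        return v_cond  # no guidance: a single forward pass suffices
    counter[0] += 1
    v_uncond = v(x, t, None)
    # Assumed guidance form: (1 + alpha) * conditional - alpha * unconditional.
    return (1 + alpha) * v_cond - alpha * v_uncond

def midpoint_solve(v, x0, cond, alpha, num_steps):
    # Fixed-step midpoint method on t in [0, 1]; returns (final state, NFE).
    x = np.asarray(x0, dtype=float)
    h = 1.0 / num_steps
    counter = [0]
    for i in range(num_steps):
        t = i * h
        k1 = guided_field(v, x, t, cond, alpha, counter)  # slope at step start
        k2 = guided_field(v, x + 0.5 * h * k1, t + 0.5 * h, cond, alpha, counter)  # slope at midpoint
        x = x + h * k2
    return x, counter[0]

_, nfe_plain = midpoint_solve(toy_field, np.zeros(3), 1.0, alpha=0.0, num_steps=1)
_, nfe_guided = midpoint_solve(toy_field, np.zeros(3), 1.0, alpha=0.7, num_steps=1)
print(nfe_plain, nfe_guided)  # 2 4
```

Under these assumptions, NFE = 2 corresponds to a single midpoint step without guidance, and the same NFE budget with α > 0 would cover only half as many steps.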