Recent conversational AI research has demonstrated that high-quality, human-like audio can be generated automatically from text. For example, you can use Tacotron 2 and WaveGlow to convert text into high-quality, natural-sounding speech in real time. You can also use FastPitch to generate mel spectrograms in parallel, achieving a good speedup compared to Tacotron 2. However, current text-to-speech models do not give you enough control over how the generated speech sounds, because they disregard the acoustic properties of the voice. Such non-textual information, which conveys meaning and human expressiveness, is difficult to express because it is unlabeled. For example, it is unclear how to label audio samples that share the same text but differ in emphasis or emotion.