By DeepBrain AI Deep Learning Team: Colin

Abstract

You may have experienced changing the guidance voice while using an AI speaker or navigation system. I, for one, set my navigation voice to that of my favorite actor, Yoo In-na. As speech synthesis technology has been incorporated into many parts of daily life, such as personal assistants, news broadcasts, and voice directions, it has become important to synthesize speech in a variety of voices. There is also a growing demand to use not only other people's voices but one's own voice as an AI voice; in speech synthesis research this is called custom voice synthesis.

Today, we will look at a text-to-speech (TTS) model called AdaSpeech, which was proposed for custom voice synthesis. Custom voices are mainly generated by adapting a pre-trained source TTS model to the target user's voice. The user's speech data available for this adaptation is usually small, for the user's convenience, and with so little data it is very difficult to make the generated voice sound natural and similar to the original voice.

There are two main problems in training neural networks for custom voice.

First, a given user's voice often has acoustic conditions different from those of the speech data the source TTS model was trained on. For example, speakers vary in prosody, style, emotion, loudness, and recording environment, and the resulting differences in the speech data can hurt the generalization of the source model, leading to poor adaptation quality.

Second, when adapting the source TTS model to a new voice, there is a trade-off between the number of fine-tuned parameters and voice quality. Existing studies have approached adaptation by fine-tuning the entire model or part of it (especially the decoder), fine-tuning only the speaker embedding used to distinguish speakers in multi-speaker speech synthesis, or training a separate speaker encoder module, often assuming that the source speech and the adaptation data come from the same domain. In other words, the more parameters you adapt, the better the quality you can produce, but also the higher the memory usage and the higher the cost of deploying the model. As a result, these approaches are problematic in practice: either too many parameters must be stored per speaker, or the quality is unsatisfactory.

AdaSpeech is a TTS model that can efficiently generate new users' (speakers') voices with high quality while addressing the problems above. Its pipeline is largely divided into three stages: pre-training, fine-tuning, and inference, and two techniques are used to solve the existing difficulties.
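The trade-off between the number of adapted parameters and per-speaker deployment cost can be made concrete with a toy sketch. This is not the AdaSpeech implementation; the model structure and parameter counts below are made-up placeholders, purely to show why fine-tuning the whole model (or the whole decoder) for every new speaker is expensive compared with adapting only a small speaker-specific component such as a speaker embedding.

```python
# Toy sketch (NOT the AdaSpeech implementation): how many parameters must be
# stored per new speaker under different adaptation strategies.
# Hypothetical parameter groups of a source TTS model (group name -> #params);
# the counts are illustrative placeholders, not real model sizes.
SOURCE_TTS_PARAMS = {
    "phoneme_encoder": 8_000_000,
    "decoder": 12_000_000,
    "speaker_embedding": 256,
}

def adapted_params(finetuned_groups):
    """Parameters that must be saved per new speaker when only the listed
    groups are fine-tuned; all other groups stay shared across speakers."""
    return sum(SOURCE_TTS_PARAMS[g] for g in finetuned_groups)

# Strategy 1: fine-tune the whole model -> best quality, largest per-speaker cost.
whole_model = adapted_params(SOURCE_TTS_PARAMS)
# Strategy 2: fine-tune only the decoder -> still millions of params per speaker.
decoder_only = adapted_params(["decoder"])
# Strategy 3: fine-tune only the speaker embedding -> tiny cost, lower quality.
embedding_only = adapted_params(["speaker_embedding"])

print(whole_model, decoder_only, embedding_only)  # -> 20000256 12000000 256
```

Every new speaker multiplies the per-speaker storage, so serving thousands of custom voices with strategy 1 or 2 quickly becomes impractical, while strategy 3 is cheap but tends to produce lower quality. This is exactly the tension AdaSpeech aims to resolve.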