TY - JOUR
T1 - TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Ju, Yoon Cheol
AU - Kim, Il Hwan
AU - Yang, Hong Sun
AU - Kim, Ji Hoon
AU - Kim, Byeong Yeol
AU - Maiti, Soumi
AU - Watanabe, Shinji
N1 - Funding Information:
We thank Hyung Yong Kim, Jihwan Park, Yushin Lim, Yunkyu Lim, and Shukjae Choi for helpful discussions and advice in the preparation of this paper.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
AB - Three research directions have recently advanced the text-to-speech (TTS) field: end-to-end architectures, prosody control modeling, and on-the-fly duration alignment for non-autoregressive models. However, these three directions have yet to be addressed together in a single solution. Current studies are limited either by a lack of control over prosody or by the inefficient training inherent in a two-stage TTS pipeline. We propose TriniTTS, a pitch-controllable end-to-end TTS system without an external aligner that generates natural speech while addressing all of these issues at once. Its end-to-end architecture eliminates the training inefficiency of the two-stage TTS pipeline. Moreover, it learns a latent vector representing the data distribution of speech by performing three tasks (alignment search, pitch estimation, and waveform generation) simultaneously. Experimental results demonstrate that TriniTTS enables prosody modeling with user input parameters to generate deterministic speech, while synthesizing speech comparable to that of the state-of-the-art VITS. Furthermore, eliminating the normalizing flow modules used in VITS increases inference speed by 28.84% in a CPU environment and by 29.16% in a GPU environment.
KW - end-to-end architecture
KW - pitch control
KW - speech synthesis
KW - text-to-speech
KW - TriniTTS
UR - http://www.scopus.com/inward/record.url?scp=85140081086&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140081086&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-925
DO - 10.21437/Interspeech.2022-925
M3 - Conference article
AN - SCOPUS:85140081086
SN - 2308-457X
VL - 2022-September
SP - 16
EP - 20
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Y2 - 18 September 2022 through 22 September 2022
ER -