TY - GEN
T1 - Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody
T2 - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018
AU - Henter, Gustav Eje
AU - Lorenzo-Trueba, Jaime
AU - Wang, Xin
AU - Kondo, Mariko
AU - Yamagishi, Junichi
N1 - Funding Information:
This work was partially supported by MEXT KAKENHI Grant Numbers 15H02729 and 17K12720.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/9/10
Y1 - 2018/9/10
N2 - We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm 'cyborg speech' as it combines human and machine speech parameters. Segmentally accented speech is produced by interpolating specific quinphone linguistic features towards phones from the other language that represent non-native mispronunciations. Experiments on synthetic American-English-accented Japanese speech show that subjective synthesis quality matches monolingual synthesis, that natural pitch is maintained, and that naturalistic phone substitutions generate output that is perceived as having an American foreign accent, even though only non-accented training data was used.
AB - We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm 'cyborg speech' as it combines human and machine speech parameters. Segmentally accented speech is produced by interpolating specific quinphone linguistic features towards phones from the other language that represent non-native mispronunciations. Experiments on synthetic American-English-accented Japanese speech show that subjective synthesis quality matches monolingual synthesis, that natural pitch is maintained, and that naturalistic phone substitutions generate output that is perceived as having an American foreign accent, even though only non-accented training data was used.
KW - DNN
KW - Foreign accent
KW - Multilingual speech synthesis
KW - Phonetic manipulation
UR - http://www.scopus.com/inward/record.url?scp=85054288945&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054288945&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2018.8462470
DO - 10.1109/ICASSP.2018.8462470
M3 - Conference contribution
AN - SCOPUS:85054288945
SN - 9781538646588
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 4799
EP - 4803
BT - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 15 April 2018 through 20 April 2018
ER -