Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody

Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Mariko Kondo, Junichi Yamagishi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We describe a new application of deep-learning-based speech synthesis, namely multilingual speech synthesis for generating controllable foreign accent. Specifically, we train a DBLSTM-based acoustic model on non-accented multilingual speech recordings from a speaker native in several languages. By copying durations and pitch contours from a pre-recorded utterance of the desired prompt, natural prosody is achieved. We call this paradigm 'cyborg speech' as it combines human and machine speech parameters. Segmentally accented speech is produced by interpolating specific quinphone linguistic features towards phones from the other language that represent non-native mispronunciations. Experiments on synthetic American-English-accented Japanese speech show that subjective synthesis quality matches monolingual synthesis, that natural pitch is maintained, and that naturalistic phone substitutions generate output that is perceived as having an American foreign accent, even though only non-accented training data was used.
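The segmental interpolation described in the abstract can be illustrated with a small sketch. It assumes, hypothetically, that each slot of a quinphone is a one-hot phone-identity vector over a joint phone inventory, and that accent strength is a single weight `alpha`. The inventory `PHONES`, the phone labels `r_ja` and `r_en`, and all function names are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence

# Hypothetical joint phone inventory; the paper's actual phone set is not given here.
PHONES = ["a", "i", "u", "r_ja", "r_en", "sil"]


def one_hot(phone: str) -> List[float]:
    """Return a one-hot phone-identity vector over PHONES."""
    if phone not in PHONES:
        raise ValueError(f"unknown phone: {phone}")
    return [1.0 if p == phone else 0.0 for p in PHONES]


def encode_quinphone(context: Sequence[str]) -> List[List[float]]:
    """Encode a five-phone context window as five one-hot vectors."""
    if len(context) != 5:
        raise ValueError("a quinphone needs exactly five phones")
    return [one_hot(p) for p in context]


def interpolate_phone(
    features: List[List[float]], position: int, target_phone: str, alpha: float
) -> List[List[float]]:
    """Move one slot of the quinphone towards target_phone.

    alpha = 0.0 keeps the original phone; alpha = 1.0 substitutes it fully.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    target = one_hot(target_phone)
    out = [list(v) for v in features]  # copy so the input is left unchanged
    out[position] = [
        (1.0 - alpha) * s + alpha * t for s, t in zip(features[position], target)
    ]
    return out


# Usage: shift the centre phone halfway from a Japanese /r/ to an English /r/.
native = encode_quinphone(["sil", "a", "r_ja", "i", "sil"])
accented = interpolate_phone(native, position=2, target_phone="r_en", alpha=0.5)
```

In a full system the interpolated features would be passed to the acoustic model, while durations and pitch contours would be copied from a recorded utterance as the abstract describes; neither step is shown in this sketch.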

Original language: English
Title of host publication: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4799-4803
Number of pages: 5
Volume: 2018-April
ISBN (Print): 9781538646588
DOI: https://doi.org/10.1109/ICASSP.2018.8462470
Publication status: Published - 2018 Sep 10
Event: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Calgary, Canada
Duration: 2018 Apr 15 to 2018 Apr 20

Keywords

  • DNN
  • Foreign accent
  • Multilingual speech synthesis
  • Phonetic manipulation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Henter, G. E., Lorenzo-Trueba, J., Wang, X., Kondo, M., & Yamagishi, J. (2018). Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings (Vol. 2018-April, pp. 4799-4803). [8462470] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2018.8462470

@inproceedings{f84959d50e6a4cd39d7462f9f9c90d27,
title = "Cyborg speech: Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody",
keywords = "DNN, Foreign accent, Multilingual speech synthesis, Phonetic manipulation",
author = "Henter, {Gustav Eje} and Jaime Lorenzo-Trueba and Xin Wang and Mariko Kondo and Junichi Yamagishi",
year = "2018",
month = "9",
day = "10",
doi = "10.1109/ICASSP.2018.8462470",
language = "English",
isbn = "9781538646588",
volume = "2018-April",
pages = "4799--4803",
booktitle = "2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY  - GEN
T1  - Cyborg speech
T2  - Deep multilingual speech synthesis for generating segmental foreign accent with natural prosody
AU  - Henter, Gustav Eje
AU  - Lorenzo-Trueba, Jaime
AU  - Wang, Xin
AU  - Kondo, Mariko
AU  - Yamagishi, Junichi
PY  - 2018/9/10
Y1  - 2018/9/10
KW  - DNN
KW  - Foreign accent
KW  - Multilingual speech synthesis
KW  - Phonetic manipulation
UR  - http://www.scopus.com/inward/record.url?scp=85054288945&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85054288945&partnerID=8YFLogxK
U2  - 10.1109/ICASSP.2018.8462470
DO  - 10.1109/ICASSP.2018.8462470
M3  - Conference contribution
AN  - SCOPUS:85054288945
SN  - 9781538646588
VL  - 2018-April
SP  - 4799
EP  - 4803
BT  - 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -