Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

Katsuki Inoue, Sunao Hara, Masanobu Abe, Tomoki Hayashi, Ryuichi Yamamoto, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recently, end-to-end text-to-speech (TTS) models have achieved a remarkable performance, however, requiring a large amount of paired text and speech data for training. On the other hand, we can easily collect unpaired dozen minutes of speech recordings for a target speaker without corresponding text data. To make use of such accessible data, the proposed method leverages the recent great success of state-of-the-art end-to-end automatic speech recognition (ASR) systems and obtains corresponding transcriptions from pretrained ASR models. Although these models could only provide text output instead of intermediate linguistic features like phonemes, end-to-end TTS can be well trained with such raw text data directly. Thus, the proposed method can greatly simplify a speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems. The experimental results show that our proposed method achieved comparable performance to a paired data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.

Original languageEnglish
Title of host publication2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages7634-7638
Number of pages5
ISBN (Electronic)9781509066315
DOIs
Publication statusPublished - 2020 May
Externally publishedYes
Event2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: 2020 May 42020 May 8

Publication series

NameICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume2020-May
ISSN (Print)1520-6149

Conference

Conference2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
CountrySpain
CityBarcelona
Period20/5/420/5/8

Keywords

  • End-to-End speech synthesis
  • joint training of ASR-TTS
  • sequence-to-sequence
  • speaker adaptation
  • unpaired data

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Fingerprint Dive into the research topics of 'Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models'. Together they form a unique fingerprint.

Cite this