Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders

Shigeki Karita, Shinji Watanabe, Tomoharu Iwata, Marc Delcroix, Atsunori Ogawa, Tomohiro Nakatani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We introduce speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model, in order to improve ASR performance using large speech-only and text-only training datasets. To build the speech and text autoencoders, we leverage state-of-the-art ASR and text-to-speech (TTS) encoder-decoder architectures. These autoencoders learn features from speech-only and text-only datasets by switching the encoders and decoders used in the ASR and TTS models. At the same time, a multi-task loss encourages the encoded features to remain compatible with the ASR and TTS models. We further anticipate that joint training with TTS can itself improve ASR performance, because both the ASR and TTS models learn transformations between speech and text. In experiments with this semi-supervised end-to-end ASR/TTS training, retraining a model initially trained on a small paired subset of the LibriSpeech corpus with a large unpaired subset of the corpus reduced the character error rate from 10.4% to 8.4% and the word error rate from 20.6% to 18.0%.
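The setup described in the abstract can be illustrated with a toy sketch: a speech encoder and a text encoder map into a shared latent space, and two decoders map back out, so the four encoder-decoder pairings give the ASR, TTS, speech-autoencoder, and text-autoencoder terms of a multi-task loss. The PyTorch sketch below is a minimal illustration of that loss switching, not the paper's implementation; all module types and sizes, the equal-length alignment assumption, and the loss weights are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedASRTTS(nn.Module):
    """Toy model: two encoders and two decoders over one shared latent space."""

    def __init__(self, n_mels=80, vocab=30, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.speech_enc = nn.GRU(n_mels, hidden, batch_first=True)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.speech_dec = nn.Linear(hidden, n_mels)  # latent -> mel frames
        self.text_dec = nn.Linear(hidden, vocab)     # latent -> char logits

    def encode_speech(self, mel):   # (B, T, n_mels) -> (B, T, hidden)
        h, _ = self.speech_enc(mel)
        return h

    def encode_text(self, tokens):  # (B, T) -> (B, T, hidden)
        h, _ = self.text_enc(self.embed(tokens))
        return h


def multitask_loss(model, mel, tokens, paired, w=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of ASR, TTS, speech-AE, and text-AE losses.

    Unpaired batches contribute only the autoencoder terms; paired batches
    add the two cross-modal terms. For brevity, speech and text sequences
    are assumed pre-aligned to equal length, which sidesteps the
    attention-based decoding a real ASR/TTS system would use.
    """
    hs = model.encode_speech(mel)   # shared latent from speech
    ht = model.encode_text(tokens)  # shared latent from text
    # Autoencoder paths: each modality reconstructed from its own latent.
    loss = w[2] * F.mse_loss(model.speech_dec(hs), mel)
    loss = loss + w[3] * F.cross_entropy(
        model.text_dec(ht).transpose(1, 2), tokens)
    if paired:
        # Cross-modal paths: ASR (speech -> text) and TTS (text -> speech).
        loss = loss + w[0] * F.cross_entropy(
            model.text_dec(hs).transpose(1, 2), tokens)
        loss = loss + w[1] * F.mse_loss(model.speech_dec(ht), mel)
    return loss


# Usage with fake, pre-aligned data.
model = SharedASRTTS()
mel = torch.randn(4, 50, 80)            # 4 utterances, 50 mel frames each
tokens = torch.randint(0, 30, (4, 50))  # 50 aligned character labels
multitask_loss(model, mel, tokens, paired=True).backward()
```

In an actual semi-supervised loop, speech-only batches would contribute only the speech-autoencoder term and text-only batches only the text-autoencoder term; the sketch takes both modalities in one call purely to keep the example short.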

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6166-6170
Number of pages: 5
ISBN (Electronic): 9781479981311
DOI: 10.1109/ICASSP.2019.8682890
Publication status: Published - 2019 May 1
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 2019 May 12 - 2019 May 17

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 19/5/12 - 19/5/17

Keywords

  • autoencoder
  • encoder-decoder
  • semi-supervised learning
  • speech recognition

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Karita, S., Watanabe, S., Iwata, T., Delcroix, M., Ogawa, A., & Nakatani, T. (2019). Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 6166-6170). [8682890] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8682890
