Back-Translation-Style Data Augmentation for end-to-end ASR

Tomoki Hayashi, Shinji Watanabe, Yu Zhang, Tomoki Toda, Takaaki Hori, Ramon Astudillo, Kazuya Takeda

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that utilizes a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model that predicts, from a sequence of characters, the sequence of hidden states extracted by a pre-trained E2E-ASR encoder. Using hidden states as the target instead of acoustic features makes it possible to achieve faster attention learning and to reduce computational cost, thanks to sub-sampling in the E2E-ASR encoder. In addition, unlike acoustic features, hidden states avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.
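The data flow described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: `toy_encoder` and `toy_text_to_encoder` are hypothetical stand-ins for the pre-trained E2E-ASR encoder and the trained text-to-encoder model, the sub-sampling factor of 4 is an assumed value, and emitting one pseudo hidden state per character is a simplification. The sketch only shows how encoder states from paired speech and generated states from unpaired text could end up in one training set for the decoder.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
HiddenSeq = List[Vector]         # one vector per (sub-sampled) encoder step
Example = Tuple[HiddenSeq, str]  # (decoder input states, target transcript)


def toy_encoder(frames: Sequence[Vector], subsample: int = 4) -> HiddenSeq:
    """Hypothetical stand-in for the pre-trained encoder.

    It only mimics sub-sampling by keeping every `subsample`-th frame
    (the factor 4 is an assumption, not a value from the abstract).
    """
    return [list(frame) for frame in frames[::subsample]]


def toy_text_to_encoder(text: str, dim: int = 3) -> HiddenSeq:
    """Hypothetical stand-in for the trained text-to-encoder model.

    It emits one made-up vector per character; a real model would predict
    a hidden-state sequence whose length it decides itself.
    """
    return [[float(ord(ch) % 7)] * dim for ch in text]


def build_decoder_training_set(
    paired: Sequence[Tuple[Sequence[Vector], str]],
    unpaired_text: Sequence[str],
    encoder: Callable[[Sequence[Vector]], HiddenSeq],
    text_to_encoder: Callable[[str], HiddenSeq],
) -> List[Example]:
    """Combine real and generated hidden states for decoder retraining."""
    # Real examples: hidden states extracted from paired speech.
    real = [(encoder(frames), text) for frames, text in paired]
    # Augmented examples: hidden states generated from text alone.
    generated = [(text_to_encoder(text), text) for text in unpaired_text]
    return real + generated


# Usage: one paired utterance of 100 frames plus two text-only sentences.
frames = [[0.0, 0.0, 0.0] for _ in range(100)]
paired = [(frames, "hello")]
unpaired = ["speech", "text only"]
dataset = build_decoder_training_set(
    paired, unpaired, toy_encoder, toy_text_to_encoder
)
```

With the assumed factor of 4, the 100 input frames shrink to 25 encoder states, which hints at why predicting hidden states gives shorter targets than predicting acoustic frames.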

Original language: English
Title of host publication: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 426-433
Number of pages: 8
ISBN (Electronic): 9781538643341
DOI: https://doi.org/10.1109/SLT.2018.8639619
Publication status: Published - 2019 Feb 11
Externally published: Yes
Event: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Athens, Greece
Duration: 2018 Dec 18 to 2018 Dec 21

Publication series

Name: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

Conference

Conference: 2018 IEEE Spoken Language Technology Workshop, SLT 2018
Country: Greece
City: Athens
Period: 18/12/18 to 18/12/21

Keywords

  • automatic speech recognition
  • back-translation
  • data augmentation
  • end-to-end

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Linguistics and Language

Cite this

Hayashi, T., Watanabe, S., Zhang, Y., Toda, T., Hori, T., Astudillo, R., & Takeda, K. (2019). Back-Translation-Style Data Augmentation for end-to-end ASR. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings (pp. 426-433). [8639619] (2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT.2018.8639619

Back-Translation-Style Data Augmentation for end-to-end ASR. / Hayashi, Tomoki; Watanabe, Shinji; Zhang, Yu; Toda, Tomoki; Hori, Takaaki; Astudillo, Ramon; Takeda, Kazuya.

2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. p. 426-433 8639619 (2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings).

Hayashi, T, Watanabe, S, Zhang, Y, Toda, T, Hori, T, Astudillo, R & Takeda, K 2019, Back-Translation-Style Data Augmentation for end-to-end ASR. in 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings, 8639619, 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings, Institute of Electrical and Electronics Engineers Inc., pp. 426-433, 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, 18/12/18. https://doi.org/10.1109/SLT.2018.8639619
Hayashi T, Watanabe S, Zhang Y, Toda T, Hori T, Astudillo R et al. Back-Translation-Style Data Augmentation for end-to-end ASR. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 426-433. 8639619. (2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings). https://doi.org/10.1109/SLT.2018.8639619
Hayashi, Tomoki ; Watanabe, Shinji ; Zhang, Yu ; Toda, Tomoki ; Hori, Takaaki ; Astudillo, Ramon ; Takeda, Kazuya. / Back-Translation-Style Data Augmentation for end-to-end ASR. 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 426-433 (2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings).
@inproceedings{f1af7e2ff50b4c74b573e07b9a4b3c66,
title = "Back-Translation-Style Data Augmentation for end-to-end ASR",
abstract = "In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that utilizes a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model that predicts, from a sequence of characters, the sequence of hidden states extracted by a pre-trained E2E-ASR encoder. Using hidden states as the target instead of acoustic features makes it possible to achieve faster attention learning and to reduce computational cost, thanks to sub-sampling in the E2E-ASR encoder. In addition, unlike acoustic features, hidden states avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.",
keywords = "automatic speech recognition, back-translation, data augmentation, end-to-end",
author = "Tomoki Hayashi and Shinji Watanabe and Yu Zhang and Tomoki Toda and Takaaki Hori and Ramon Astudillo and Kazuya Takeda",
year = "2019",
month = "2",
day = "11",
doi = "10.1109/SLT.2018.8639619",
language = "English",
series = "2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "426--433",
booktitle = "2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings",

}

TY - GEN

T1 - Back-Translation-Style Data Augmentation for end-to-end ASR

AU - Hayashi, Tomoki

AU - Watanabe, Shinji

AU - Zhang, Yu

AU - Toda, Tomoki

AU - Hori, Takaaki

AU - Astudillo, Ramon

AU - Takeda, Kazuya

PY - 2019/2/11

Y1 - 2019/2/11

AB - In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that utilizes a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model that predicts, from a sequence of characters, the sequence of hidden states extracted by a pre-trained E2E-ASR encoder. Using hidden states as the target instead of acoustic features makes it possible to achieve faster attention learning and to reduce computational cost, thanks to sub-sampling in the E2E-ASR encoder. In addition, unlike acoustic features, hidden states avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.

KW - automatic speech recognition

KW - back-translation

KW - data augmentation

KW - end-to-end

UR - http://www.scopus.com/inward/record.url?scp=85063102812&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063102812&partnerID=8YFLogxK

U2 - 10.1109/SLT.2018.8639619

DO - 10.1109/SLT.2018.8639619

M3 - Conference contribution

T3 - 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

SP - 426

EP - 433

BT - 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -