TY - GEN
T1 - Back-Translation-Style Data Augmentation for End-to-End ASR
AU - Hayashi, Tomoki
AU - Watanabe, Shinji
AU - Zhang, Yu
AU - Toda, Tomoki
AU - Hori, Takaaki
AU - Astudillo, Ramon
AU - Takeda, Kazuya
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2019/2/11
Y1 - 2019/2/11
N2 - In this paper, we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that utilizes a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model that predicts a sequence of hidden states, extracted by a pre-trained E2E-ASR encoder, from a sequence of characters. Using hidden states as the target instead of acoustic features enables faster attention learning and reduces computational cost, thanks to the sub-sampling performed in the E2E-ASR encoder; moreover, unlike acoustic features, the hidden states avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.
AB - In this paper, we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR) that utilizes a large amount of text not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model that predicts a sequence of hidden states, extracted by a pre-trained E2E-ASR encoder, from a sequence of characters. Using hidden states as the target instead of acoustic features enables faster attention learning and reduces computational cost, thanks to the sub-sampling performed in the E2E-ASR encoder; moreover, unlike acoustic features, the hidden states avoid the need to model speaker dependencies. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is retrained using the generated hidden states as additional training data. Experimental evaluation on the LibriSpeech dataset demonstrates that the proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.
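N1 - The abstract describes a concrete pipeline: train a text-to-encoder model against hidden states of a frozen, pre-trained ASR encoder, then synthesize pseudo hidden states from unpaired text to retrain the decoder. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation; all class, function, and parameter names (TextToEncoder, tte_step, synthesize_states, enc_dim, etc.) are assumptions, and it simplifies the paper's attention-based seq2seq text-to-encoder into a length-aligned regression for brevity.

    # Hypothetical sketch of the text-to-encoder (TTE) idea from the abstract.
    # Assumption: predictions and encoder-state targets share the same length;
    # the paper instead uses an attention-based seq2seq TTE model.
    import torch
    import torch.nn as nn

    class TextToEncoder(nn.Module):
        """Predicts encoder hidden states (dim enc_dim) from character IDs."""
        def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, enc_dim=320):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, enc_dim)

        def forward(self, char_ids):
            x = self.embed(char_ids)   # (B, T_char, emb_dim)
            h, _ = self.rnn(x)         # (B, T_char, 2*hidden_dim)
            return self.proj(h)        # (B, T_char, enc_dim)

    def tte_step(tte, optimizer, char_ids, enc_states):
        # Regress onto hidden states extracted by the frozen, pre-trained ASR
        # encoder; these are sub-sampled, hence much shorter than raw acoustics.
        loss = nn.functional.mse_loss(tte(char_ids), enc_states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def synthesize_states(tte, unpaired_char_ids):
        # Generate pseudo hidden states from unpaired text; these are then mixed
        # into the data used to retrain the E2E-ASR decoder.
        tte.eval()
        return tte(unpaired_char_ids)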
KW - automatic speech recognition
KW - back-translation
KW - data augmentation
KW - end-to-end
UR - http://www.scopus.com/inward/record.url?scp=85063102812&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063102812&partnerID=8YFLogxK
U2 - 10.1109/SLT.2018.8639619
DO - 10.1109/SLT.2018.8639619
M3 - Conference contribution
AN - SCOPUS:85063102812
T3 - 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
SP - 426
EP - 433
BT - 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2018 IEEE Spoken Language Technology Workshop, SLT 2018
Y2 - 18 December 2018 through 21 December 2018
ER -