TY - JOUR
T1 - Semi-supervised sequence-to-sequence ASR using unpaired speech and text
AU - Baskar, Murali Karthick
AU - Watanabe, Shinji
AU - Astudillo, Ramon
AU - Hori, Takaaki
AU - Burget, Lukáš
AU - Černocký, Jan
N1 - Funding Information:
This work was supported by the Czech Ministry of Education, Youth and Sports from the National Program of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602" and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA) MATERIAL program, via Air Force Research Laboratory (AFRL) contract # FA8650-17-C-9118 and U.S. DARPA LORELEI contract No. HR0011-15-C-0115. The views and conclusions contained herein are those of the authors and should not be interpreted as official policies, either expressed or implied, of ODNI, IARPA, AFRL or the U.S. Government. The work was also supported by Facebook (Research Award on Speech and Audio Technology for Voice Interaction and Video Understanding).
Publisher Copyright:
Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
N2 - Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge of interest in unsupervised and semi-supervised training of such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses that can leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In particular, this work proposes a new semi-supervised loss combining an end-to-end differentiable ASR→TTS loss with a TTS→ASR loss. The method is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER. We provide extensive results analyzing the impact of data quantity and of the speech and text modalities, and show consistent gains across the WSJ and LibriSpeech corpora. Our code is provided in ESPnet to reproduce the experiments.
AB - Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge of interest in unsupervised and semi-supervised training of such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses that can leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In particular, this work proposes a new semi-supervised loss combining an end-to-end differentiable ASR→TTS loss with a TTS→ASR loss. The method is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER. We provide extensive results analyzing the impact of data quantity and of the speech and text modalities, and show consistent gains across the WSJ and LibriSpeech corpora. Our code is provided in ESPnet to reproduce the experiments.
KW - ASR
KW - Cycle consistency
KW - End-to-end
KW - Semi-supervised
KW - Sequence-to-sequence
KW - TTS
KW - Unsupervised
UR - http://www.scopus.com/inward/record.url?scp=85074698215&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074698215&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-3167
DO - 10.21437/Interspeech.2019-3167
M3 - Conference article
AN - SCOPUS:85074698215
SN - 2308-457X
VL - 2019-September
SP - 3790
EP - 3794
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -