TY - JOUR
T1 - Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration
AU - Karita, Shigeki
AU - Soplin, Nelson Enrique Yalta
AU - Watanabe, Shinji
AU - Delcroix, Marc
AU - Ogawa, Atsunori
AU - Nakatani, Tomohiro
N1 - Publisher Copyright:
Copyright © 2019 ISCA
PY - 2019
Y1 - 2019
N2 - The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve, and that integration with the naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system realizes significant improvements in various ASR tasks. For example, it reduced the WERs from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on the TED-LIUM by introducing CTC and LM integration into the Transformer baseline.
AB - The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effective joint training and decoding methods. To realize a faster and more accurate ASR system, we combine Transformer and the advances in RNN-based ASR. In our experiments, we found that the training of Transformer is slower than that of RNN as regards the learning curve, and that integration with the naive language model (LM) is difficult. To address these problems, we integrate connectionist temporal classification (CTC) with Transformer for joint training and decoding. This approach makes training faster than with RNNs and assists LM integration. Our proposed ASR system realizes significant improvements in various ASR tasks. For example, it reduced the WERs from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on the TED-LIUM by introducing CTC and LM integration into the Transformer baseline.
KW - Connectionist temporal classification
KW - Language model
KW - Speech recognition
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85074720013&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074720013&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2019-1938
DO - 10.21437/Interspeech.2019-1938
M3 - Conference article
AN - SCOPUS:85074720013
VL - 2019-September
SP - 1408
EP - 1412
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019
Y2 - 15 September 2019 through 19 September 2019
ER -