TY - JOUR
T1 - Insertion-based modeling for end-to-end automatic speech recognition
AU - Fujita, Yuya
AU - Watanabe, Shinji
AU - Omachi, Motoi
AU - Chang, Xuankai
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - End-to-end (E2E) models have gained attention in the research field of automatic speech recognition (ASR). Many E2E models proposed so far assume left-to-right autoregressive generation of the output token sequence, with the exception of connectionist temporal classification (CTC) and its variants. However, left-to-right decoding cannot consider the future output context and is not always optimal for ASR. One class of non-left-to-right models, the non-autoregressive Transformer (NAT), has been intensively investigated in neural machine translation (NMT) research. One NAT model, mask-predict, has been applied to ASR, but it requires heuristics or an additional component to estimate the length of the output token sequence. This paper proposes to apply another type of NAT, insertion-based models originally proposed for NMT, to ASR tasks. Insertion-based models resolve the above issue of mask-predict and can generate an output sequence in an arbitrary order. In addition, we introduce a new formulation of joint training of insertion-based models and CTC. This formulation reinforces CTC by making it dependent on insertion-based token generation in a non-autoregressive manner. We conducted experiments on three public benchmarks and achieved performance competitive with a strong autoregressive Transformer under a similar decoding condition.
AB - End-to-end (E2E) models have gained attention in the research field of automatic speech recognition (ASR). Many E2E models proposed so far assume left-to-right autoregressive generation of the output token sequence, with the exception of connectionist temporal classification (CTC) and its variants. However, left-to-right decoding cannot consider the future output context and is not always optimal for ASR. One class of non-left-to-right models, the non-autoregressive Transformer (NAT), has been intensively investigated in neural machine translation (NMT) research. One NAT model, mask-predict, has been applied to ASR, but it requires heuristics or an additional component to estimate the length of the output token sequence. This paper proposes to apply another type of NAT, insertion-based models originally proposed for NMT, to ASR tasks. Insertion-based models resolve the above issue of mask-predict and can generate an output sequence in an arbitrary order. In addition, we introduce a new formulation of joint training of insertion-based models and CTC. This formulation reinforces CTC by making it dependent on insertion-based token generation in a non-autoregressive manner. We conducted experiments on three public benchmarks and achieved performance competitive with a strong autoregressive Transformer under a similar decoding condition.
KW - End-to-end
KW - Non-autoregressive
KW - Speech recognition
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85098156028&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098156028&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1619
DO - 10.21437/Interspeech.2020-1619
M3 - Conference article
AN - SCOPUS:85098156028
SN - 2308-457X
VL - 2020-October
SP - 3660
EP - 3664
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -