We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations as the output length. On the other hand, non-autoregressive models can simultaneously generate tokens within a constant number of iterations, which results in significant inference time reduction and better suits end-to-end ASR model for real-world scenarios. In this work, Mask CTC model is trained using a Transformer encoder-decoder with joint training of mask prediction and CTC. During inference, the target sequence is initialized with the greedy CTC outputs and low-confidence tokens are masked based on the CTC probabilities. Based on the conditional dependence between output tokens, these masked low-confidence tokens are then predicted conditioning on the high-confidence tokens. Experimental results on different speech recognition tasks show that Mask CTC outperforms the standard CTC model (e.g., 17.9% ? 12.1% WER on WSJ) and approaches the autoregressive model, requiring much less inference time using CPUs (0.07 RTF in Python implementation). All of our codes are publicly available at https://github.com/espnet/espnet.
|ジャーナル||Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH|
|出版ステータス||Published - 2020|
|イベント||21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China|
継続期間: 2020 10 25 → 2020 10 29
ASJC Scopus subject areas