TY - JOUR
T1 - Non-Autoregressive Transformer for Speech Recognition
AU - Chen, Nanxin
AU - Watanabe, Shinji
AU - Villalba, Jesus
AU - Zelasko, Piotr
AU - Dehak, Najim
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Very deep transformers outperform conventional bi-directional long short-term memory networks for automatic speech recognition (ASR) by a significant margin. However, being autoregressive models, their computational complexity is still a prohibitive factor in their deployment into production systems. To address this problem, we study two different non-autoregressive transformer structures for ASR: Audio-Conditional Masked Language Model (A-CMLM) and Audio-Factorized Masked Language Model (A-FMLM). During training, the decoder input tokens are randomly replaced by special mask tokens, and the network is optimized to predict the masked tokens by taking both the unmasked context tokens and the input speech into consideration. During inference, we start from all masked tokens, and the network iteratively predicts the missing tokens based on partial results. A new decoding strategy is proposed as an example, which proceeds from the most confident predictions to the rest. Results on Mandarin (AISHELL), Japanese (CSJ), and English (LibriSpeech) benchmarks show that training such a non-autoregressive network for ASR is promising. In particular, on AISHELL the proposed method outperformed the Kaldi ASR system and matched the performance of the state-of-the-art autoregressive transformer with a 7× speedup.
AB - Very deep transformers outperform conventional bi-directional long short-term memory networks for automatic speech recognition (ASR) by a significant margin. However, being autoregressive models, their computational complexity is still a prohibitive factor in their deployment into production systems. To address this problem, we study two different non-autoregressive transformer structures for ASR: Audio-Conditional Masked Language Model (A-CMLM) and Audio-Factorized Masked Language Model (A-FMLM). During training, the decoder input tokens are randomly replaced by special mask tokens, and the network is optimized to predict the masked tokens by taking both the unmasked context tokens and the input speech into consideration. During inference, we start from all masked tokens, and the network iteratively predicts the missing tokens based on partial results. A new decoding strategy is proposed as an example, which proceeds from the most confident predictions to the rest. Results on Mandarin (AISHELL), Japanese (CSJ), and English (LibriSpeech) benchmarks show that training such a non-autoregressive network for ASR is promising. In particular, on AISHELL the proposed method outperformed the Kaldi ASR system and matched the performance of the state-of-the-art autoregressive transformer with a 7× speedup.
KW - Neural networks
KW - non-autoregressive
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85098783807&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098783807&partnerID=8YFLogxK
U2 - 10.1109/LSP.2020.3044547
DO - 10.1109/LSP.2020.3044547
M3 - Article
AN - SCOPUS:85098783807
VL - 28
SP - 121
EP - 125
JO - IEEE Signal Processing Letters
JF - IEEE Signal Processing Letters
SN - 1070-9908
M1 - 9292943
ER -