TY - JOUR
T1 - End-to-end ASR with adaptive span self-attention
AU - Chang, Xuankai
AU - Subramanian, Aswin Shanmugam
AU - Guo, Pengcheng
AU - Watanabe, Shinji
AU - Fujita, Yuya
AU - Omachi, Motoi
N1 - Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Transformers have demonstrated state-of-the-art performance on many tasks in natural language processing and speech processing. One of the key components of the Transformer is self-attention, which attends to the whole input sequence at every layer. However, the computational and memory cost of self-attention is quadratic in the input sequence length, which is a major concern in automatic speech recognition (ASR), where the input sequence can be very long. In this paper, we propose to use adaptive span self-attention for ASR tasks, a technique originally proposed for language modeling. Our method enables the network to learn an appropriate window size and position for each layer and head, and our newly introduced scheme can further control the window size depending on the future and past contexts. Thus, it reduces both the computational complexity and the memory cost from quadratic in the input length to an adaptive linear order. We show the effectiveness of the proposed method on several ASR tasks, and the proposed adaptive span methods consistently improve performance over conventional fixed-span methods.
AB - Transformers have demonstrated state-of-the-art performance on many tasks in natural language processing and speech processing. One of the key components of the Transformer is self-attention, which attends to the whole input sequence at every layer. However, the computational and memory cost of self-attention is quadratic in the input sequence length, which is a major concern in automatic speech recognition (ASR), where the input sequence can be very long. In this paper, we propose to use adaptive span self-attention for ASR tasks, a technique originally proposed for language modeling. Our method enables the network to learn an appropriate window size and position for each layer and head, and our newly introduced scheme can further control the window size depending on the future and past contexts. Thus, it reduces both the computational complexity and the memory cost from quadratic in the input length to an adaptive linear order. We show the effectiveness of the proposed method on several ASR tasks, and the proposed adaptive span methods consistently improve performance over conventional fixed-span methods.
KW - Adaptive
KW - End-to-end
KW - Self-attention
KW - Speech recognition
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85098227176&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098227176&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2816
DO - 10.21437/Interspeech.2020-2816
M3 - Conference article
AN - SCOPUS:85098227176
VL - 2020-October
SP - 3595
EP - 3599
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -