TY - JOUR
T1 - Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition
AU - Shinohara, Yusuke
AU - Watanabe, Shinji
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models for end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy matter. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce latency, latency reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance may arise because none of the prior methods explicitly models and reduces the latency. In this paper, we propose a new training method that explicitly models and reduces the latency of sequence transducer models. First, we define the expected latency at each diagonal line of the lattice and show that its gradient can be computed efficiently within the forward-backward algorithm. We then augment the transducer loss with this expected latency so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of a causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, outperforming the conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.
AB - Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models for end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy matter. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce latency, latency reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance may arise because none of the prior methods explicitly models and reduces the latency. In this paper, we propose a new training method that explicitly models and reduces the latency of sequence transducer models. First, we define the expected latency at each diagonal line of the lattice and show that its gradient can be computed efficiently within the forward-backward algorithm. We then augment the transducer loss with this expected latency so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of a causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, outperforming the conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.
KW - end-to-end
KW - latency
KW - sequence transducer
KW - speech recognition
KW - streaming
UR - http://www.scopus.com/inward/record.url?scp=85140049368&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140049368&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10989
DO - 10.21437/Interspeech.2022-10989
M3 - Conference article
AN - SCOPUS:85140049368
SN - 2308-457X
VL - 2022-September
SP - 2098
EP - 2102
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -