TY - GEN
T1 - A purely end-to-end system for multi-speaker speech recognition
AU - Seki, Hiroshi
AU - Hori, Takaaki
AU - Watanabe, Shinji
AU - Le Roux, Jonathan
AU - Hershey, John R.
N1 - Funding Information:
This work was done while H. Seki, Ph.D. candidate at Toyohashi University of Technology, Japan, was an intern at MERL.
Publisher Copyright:
© 2018 Association for Computational Linguistics
PY - 2018
Y1 - 2018
N2 - Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
AB - Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
UR - http://www.scopus.com/inward/record.url?scp=85063098725&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063098725&partnerID=8YFLogxK
U2 - 10.18653/v1/p18-1244
DO - 10.18653/v1/p18-1244
M3 - Conference contribution
AN - SCOPUS:85063098725
T3 - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
SP - 2620
EP - 2630
BT - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Y2 - 15 July 2018 through 20 July 2018
ER -