TY - JOUR
T1 - End-to-end neural speaker diarization with permutation-free objectives
AU - Fujita, Yusuke
AU - Kanda, Naoyuki
AU - Horiguchi, Shota
AU - Nagamatsu, Kenji
AU - Watanabe, Shinji
N1 - Publisher Copyright:
Copyright © 2019, The Authors. All rights reserved.
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2019/9/12
Y1 - 2019/9/12
N2 - In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduces a permutation-free objective function to directly minimize diarization errors without being suffered from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of the benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved diarization error rate of 12.28%, while a conventional clustering-based system produced diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachispeech/EEND.
AB - In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduces a permutation-free objective function to directly minimize diarization errors without being suffered from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of the benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved diarization error rate of 12.28%, while a conventional clustering-based system produced diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided 25.6% relative improvement on the CALLHOME dataset. Our source code is available online at https://github.com/hitachispeech/EEND.
KW - End-to-end speaker diarization
KW - Neural network
KW - Overlapping speech
KW - Permutation-free scheme
UR - http://www.scopus.com/inward/record.url?scp=85094076358&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85094076358&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:85094076358
JO - Nuclear Physics A
JF - Nuclear Physics A
SN - 0375-9474
ER -