TY - JOUR
T1 - Joint speaker diarization and speech recognition based on region proposal networks
AU - Huang, Zili
AU - Delcroix, Marc
AU - Garcia, Leibny Paola
AU - Watanabe, Shinji
AU - Raj, Desh
AU - Khudanpur, Sanjeev
N1 - Funding Information:
This work was partially supported by grants from Nanyang Technological University, and the Government of Israel.
Publisher Copyright:
© 2021 Elsevier Ltd
PY - 2022/3
Y1 - 2022/3
N2 - Speaker diarization, the process of partitioning an input audio stream into homogeneous segments according to the speaker identity, is an important task for speech processing. The standard clustering-based diarization pipeline (1) segments the whole utterance into small chunks, (2) extracts speaker embedding for each chunk, and (3) groups the chunks into clusters, where each cluster represents one speaker. It has two major disadvantages: first, it contains several individually optimized modules in the pipeline, and second, it cannot handle overlapping speech. To address these issues, we proposed region proposal network-based speaker diarization (RPNSD) (Huang et al., 2020). In this paper, we perform a detailed study of the RPNSD system, and make two important contributions. First, we report its diarization performance on additional datasets and empirically investigate the impact of different system settings. Second, we integrate an automatic speech recognition (ASR) component into the RPNSD system and propose a new framework called RPN-JOINT that simultaneously performs diarization and ASR. Our experiments reveal that (1) the RPNSD system can consistently achieve diarization results that are competitive with state-of-the-art methods, and (2) the RPN-JOINT system offers several advantages over the conventional cascade of diarization and ASR systems.
AB - Speaker diarization, the process of partitioning an input audio stream into homogeneous segments according to the speaker identity, is an important task for speech processing. The standard clustering-based diarization pipeline (1) segments the whole utterance into small chunks, (2) extracts speaker embedding for each chunk, and (3) groups the chunks into clusters, where each cluster represents one speaker. It has two major disadvantages: first, it contains several individually optimized modules in the pipeline, and second, it cannot handle overlapping speech. To address these issues, we proposed region proposal network-based speaker diarization (RPNSD) (Huang et al., 2020). In this paper, we perform a detailed study of the RPNSD system, and make two important contributions. First, we report its diarization performance on additional datasets and empirically investigate the impact of different system settings. Second, we integrate an automatic speech recognition (ASR) component into the RPNSD system and propose a new framework called RPN-JOINT that simultaneously performs diarization and ASR. Our experiments reveal that (1) the RPNSD system can consistently achieve diarization results that are competitive with state-of-the-art methods, and (2) the RPN-JOINT system offers several advantages over the conventional cascade of diarization and ASR systems.
KW - Faster R-CNN
KW - Multi-speaker speech recognition
KW - Region proposal network
KW - Speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85119598120&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119598120&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101316
DO - 10.1016/j.csl.2021.101316
M3 - Article
AN - SCOPUS:85119598120
SN - 0885-2308
VL - 72
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101316
ER -