TY - JOUR
T1 - Speaker-conditional chain model for speech separation and extraction
AU - Shi, Jing
AU - Xu, Jiaming
AU - Fujita, Yusuke
AU - Watanabe, Shinji
AU - Xu, Bo
N1 - Funding Information:
This work was supported by the Major Project for New Generation of AI (Grant No. 2018AAA0100400) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB32070000).
Publisher Copyright:
© 2020 ISCA
PY - 2020
Y1 - 2020
AB - Speech separation has been extensively explored to tackle the cocktail party problem. However, these studies are still far from achieving sufficient generalization capability for real scenarios. In this work, we propose a general strategy named the Speaker-Conditional Chain Model to process complex speech recordings. In the proposed method, our model first infers the identities of a variable number of speakers from the observation based on a sequence-to-sequence model. Then, it takes the information of the inferred speakers as conditions to extract their speech sources. Because the speaker information is predicted from the whole observation, our model helps solve both conventional speech separation and speaker extraction for multi-round long recordings. Experiments on standard fully-overlapped speech separation benchmarks show results comparable with prior studies, while our proposed model adapts better to multi-round long recordings.
KW - Cocktail party problem
KW - Speaker extraction
KW - Speech separation
UR - http://www.scopus.com/inward/record.url?scp=85098154003&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098154003&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-2418
DO - 10.21437/Interspeech.2020-2418
M3 - Conference article
AN - SCOPUS:85098154003
VL - 2020-October
SP - 2707
EP - 2711
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -