TY - GEN
T1 - Deep Speech Extraction with Time-Varying Spatial Filtering Guided by Desired Direction Attractor
AU - Nakagome, Yu
AU - Togami, Masahito
AU - Ogawa, Tetsuji
AU - Kobayashi, Tetsunori
N1 - Funding Information:
This research was supported by the NII CRIS collaborative research program operated by NII CRIS and LINE Corporation.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
AB - In this investigation, a deep neural network (DNN)-based speech extraction method is proposed to enhance a speech signal propagating from the desired direction. The proposed method integrates knowledge of a sound propagation model and the time-varying characteristics of a speech source into a DNN-based separation framework. This approach outputs a separated speech source using time-varying spatial filtering, which achieves superior speech extraction performance compared with time-invariant spatial filtering. Because the gradients of all modules can be calculated, back-propagation can be performed to maximize the speech quality of the output signal in an end-to-end manner. The guiding information is also modeled based on the sound propagation model, which facilitates disentangled representations of the target speech source and the noise signals. The experimental results demonstrate that the proposed method can extract the target speech source more accurately than conventional DNN-based speech source separation and conventional speech extraction using time-invariant spatial filtering.
KW - attractor
KW - direction-of-arrival information
KW - end-to-end speech source separation
KW - time-varying spatial filtering
UR - http://www.scopus.com/inward/record.url?scp=85089224929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089224929&partnerID=8YFLogxK
DO - 10.1109/ICASSP40776.2020.9053629
M3 - Conference contribution
AN - SCOPUS:85089224929
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 671
EP - 675
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -