Deep beamforming networks for multi-channel speech recognition

Xiong Xiao, Shinji Watanabe, Hakan Erdogan, Liang Lu, John Hershey, Michael L. Seltzer, Guoguo Chen, Yu Zhang, Michael Mandel, Dong Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

60 Citations (Scopus)

Abstract

Despite the significant progress in speech recognition enabled by deep neural networks, poor performance persists in some scenarios. In this work, we focus on far-field speech recognition, which remains challenging due to high levels of noise and reverberation in the captured speech signals. We propose to represent the stages of acoustic processing, including beamforming, feature extraction, and acoustic modeling, as three components of a single unified computational network. The parameters of a frequency-domain beamformer are first estimated by a network based on features derived from the microphone channels. These filter coefficients are then applied to the array signals to form an enhanced signal. Conventional features are then extracted from this signal and passed to a second network that performs acoustic modeling for classification. The parameters of both the beamforming and acoustic modeling networks are trained jointly using back-propagation with a common cross-entropy objective function. In experiments on the AMI meeting corpus, we observed improvements by pre-training each sub-network with a network-specific objective function before joint training of both networks. The proposed method obtained a 3.2% absolute word error rate reduction compared to a conventional pipeline of independent processing stages.
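
The sketch below illustrates the architecture the abstract describes, in PyTorch: a beamforming sub-network predicts complex filter-and-sum weights per channel and frequency, the weights are applied to the multi-channel STFT to form an enhanced spectrum, log-power features of that spectrum feed an acoustic-model sub-network, and a single cross-entropy loss is back-propagated through both. All layer sizes, feature dimensions, and the choice of input features for the beamforming network are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative PyTorch sketch of the unified network described above.
# Dimensions, layer sizes, and input features are assumptions for
# illustration, not the paper's exact configuration.
import torch
import torch.nn as nn

class BeamformingNet(nn.Module):
    """Predicts complex filter-and-sum weights w[c, f] from array features."""
    def __init__(self, in_dim, num_channels, num_freqs, hidden=512):
        super().__init__()
        self.num_channels, self.num_freqs = num_channels, num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * num_channels * num_freqs),  # real + imag parts
        )

    def forward(self, feats):                       # feats: (B, in_dim)
        w = self.mlp(feats).view(-1, 2, self.num_channels, self.num_freqs)
        return torch.complex(w[:, 0], w[:, 1])      # (B, C, F)

class AcousticNet(nn.Module):
    """Frame classifier over features of the enhanced signal."""
    def __init__(self, num_freqs, num_states, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, x):                           # x: (N, F)
        return self.mlp(x)                          # (N, num_states) logits

def filter_and_sum(stft, weights):
    # stft: (B, C, F, T) complex STFT of the array; weights: (B, C, F) complex.
    # Enhanced spectrum: Y[f, t] = sum_c conj(w[c, f]) * X[c, f, t].
    return torch.einsum('bcf,bcft->bft', weights.conj(), stft)

bf = BeamformingNet(in_dim=40, num_channels=8, num_freqs=257)
am = AcousticNet(num_freqs=257, num_states=4000)
opt = torch.optim.Adam(list(bf.parameters()) + list(am.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

def joint_train_step(array_feats, stft, frame_labels):
    """One joint step: cross-entropy gradients flow through both networks."""
    w = bf(array_feats)                             # (B, C, F) filter weights
    y = filter_and_sum(stft, w)                     # (B, F, T) enhanced STFT
    logp = torch.log(y.abs() ** 2 + 1e-8)           # simple log-power features
    logits = am(logp.transpose(1, 2).reshape(-1, logp.shape[1]))
    loss = ce(logits, frame_labels.reshape(-1))     # labels: (B, T) long
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

In the paper's recipe, each sub-network is first pre-trained with its own objective before the joint cross-entropy phase; the sketch shows only the joint step.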

Original language: English
Title of host publication: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5745-5749
Number of pages: 5
Volume: 2016-May
ISBN (Electronic): 9781479999880
DOIs: https://doi.org/10.1109/ICASSP.2016.7472778
Publication status: Published - 2016 May 18
Externally published: Yes
Event: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China
Duration: 2016 Mar 20 – 2016 Mar 25

Other

Other: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Country: China
City: Shanghai
Period: 16/3/20 – 16/3/25

Keywords

  • deep neural networks
  • direction of arrival
  • filter-and-sum beamforming
  • microphone arrays
  • speech recognition

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Electrical and Electronic Engineering

Cite this

Xiao, X., Watanabe, S., Erdogan, H., Lu, L., Hershey, J., Seltzer, M. L., ... Yu, D. (2016). Deep beamforming networks for multi-channel speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings (Vol. 2016-May, pp. 5745-5749). [7472778] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2016.7472778
