Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition

Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

25 Citations (Scopus)

Abstract

Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves a 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set.
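The abstract describes the architecture only at a high level; the sketch below illustrates the overall data flow it implies, in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the frame length, number of filter taps, layer sizes, the stand-in feature extractor (a learned projection), and the exact mechanism for feeding the acoustic model's hidden units back to the beamformer are all hypothetical choices made for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveBeamformingASR(nn.Module):
    """Minimal sketch: an LSTM adaptive beamformer jointly trained with a
    deep LSTM acoustic model. All sizes are illustrative assumptions."""

    def __init__(self, channels=6, frame_len=400, taps=25,
                 feat_dim=80, hidden=512, n_senones=2000):
        super().__init__()
        self.taps = taps
        # Beamforming LSTM: sees the raw multichannel frame plus the acoustic
        # model's previous hidden output, emits per-frame filter coefficients.
        self.bf_lstm = nn.LSTM(channels * frame_len + hidden, hidden,
                               batch_first=True)
        self.bf_out = nn.Linear(hidden, channels * taps)
        # Stand-in feature extractor (the front end is not specified in the
        # abstract): a learned projection of the beamformed frame.
        self.feat = nn.Linear(frame_len, feat_dim)
        # Deep LSTM acoustic model predicting senone posteriors.
        self.am = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.senone = nn.Linear(hidden, n_senones)

    def forward(self, x):
        # x: (batch, frames, channels, frame_len) raw multichannel frames
        B, T, C, N = x.shape
        h_bf, h_am = None, None
        am_prev = x.new_zeros(B, 1, self.senone.in_features)
        logits = []
        for t in range(T):
            frame = x[:, t]                                   # (B, C, N)
            bf_in = torch.cat([frame.reshape(B, 1, C * N), am_prev], dim=-1)
            o, h_bf = self.bf_lstm(bf_in, h_bf)
            g = self.bf_out(o).reshape(B * C, 1, self.taps)   # filter coeffs
            # Filter-and-sum beamforming: filter each channel with its
            # predicted coefficients (grouped conv, one group per
            # batch-channel pair), then sum across channels.
            y = F.conv1d(frame.reshape(1, B * C, N), g,
                         groups=B * C, padding=self.taps // 2)
            y = y.reshape(B, C, -1)[..., :N].sum(dim=1)       # (B, N)
            f = self.feat(y).unsqueeze(1)                     # (B, 1, feat_dim)
            o_am, h_am = self.am(f, h_am)
            am_prev = o_am        # feed AM hidden units back to the beamformer
            logits.append(self.senone(o_am.squeeze(1)))
        # (B, T, n_senones): train jointly with cross-entropy on senone labels
        return torch.stack(logits, dim=1)

Under these assumptions, joint training reduces to minimizing cross-entropy between the stacked senone logits and frame-level senone alignments, so the beamformer is optimized directly for recognition rather than for a separate signal-level criterion.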

Original language: English
Title of host publication: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 271-275
Number of pages: 5
ISBN (Electronic): 9781509041176
DOIs: https://doi.org/10.1109/ICASSP.2017.7952160
Publication status: Published - 2017 Jun 16
Externally published: Yes
Event: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - New Orleans, United States
Duration: 2017 Mar 5 - 2017 Mar 9

Other

Other: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017
Country: United States
City: New Orleans
Period: 17/3/5 - 17/3/9

Keywords

  • beamforming
  • LSTM
  • multichannel
  • speech recognition

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Meng, Z., Watanabe, S., Hershey, J. R., & Erdogan, H. (2017). Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings (pp. 271-275). [7952160] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2017.7952160
