Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend

Takaaki Hori*, Zhuo Chen, Hakan Erdogan, John R. Hershey, Jonathan Le Roux, Vikramjit Mitra, Shinji Watanabe

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

25 Citations (Scopus)


This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the model, from front-end speech enhancement to language modeling. Three different types of beamforming are used to combine the multi-microphone signals into a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, from which stacked mel-frequency cepstral coefficient (MFCC) features are extracted. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy data training and speaker adaptive training, whereas at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduced the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER on the real test data, an 84.7% relative reduction from the baseline of 32.99% WER and a 44.5% relative reduction from our official CHiME-3 challenge result of 9.1% WER. Furthermore, this final result is better than the best result (5.8% WER) reported in the CHiME-3 challenge.
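The abstract does not specify which three beamformers the system uses, but the general idea of combining multi-microphone signals into one higher-quality signal can be illustrated with the simplest variant, delay-and-sum beamforming. The sketch below is a minimal, generic illustration (not the paper's implementation); the function name, sampling rate, and the assumption that per-channel delays are estimated elsewhere (e.g. by cross-correlation against a reference microphone) are all mine.

```python
import numpy as np

def delay_and_sum(channels, delays, sr=16000):
    """Minimal delay-and-sum beamformer (illustrative sketch).

    channels: list of equal-length 1-D numpy arrays, one per microphone
    delays:   per-channel time delays in seconds, estimated elsewhere
              (e.g. by cross-correlation against a reference channel)
    sr:       sampling rate in Hz
    """
    out = np.zeros_like(channels[0], dtype=float)
    for x, d in zip(channels, delays):
        shift = int(round(d * sr))
        # Align each channel by its estimated delay, then average:
        # coherent speech adds constructively while spatially
        # uncorrelated noise partially cancels.
        out += np.roll(x, -shift)
    return out / len(channels)
```

In a real system the delays come from a time-difference-of-arrival estimator, and more advanced beamformers (e.g. MVDR) additionally weight channels using noise statistics; this sketch only shows the alignment-and-average core.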

Original language: English
Pages (from-to): 401-418
Number of pages: 18
Journal: Computer Speech and Language
Publication status: Published - Nov 2017
Externally published: Yes


Keywords

  • Beamforming
  • CHiME-3
  • Noise robust feature
  • Robust speech recognition
  • System combination

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Human-Computer Interaction


