Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend

Takaaki Hori, Zhuo Chen, Hakan Erdogan, John R. Hershey, Jonathan Le Roux, Vikramjit Mitra, Shinji Watanabe

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the pipeline, from front-end speech enhancement to language modeling. Three different types of beamforming are used to combine the multi-microphone signals into a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, from which stacked mel-frequency cepstral coefficient (MFCC) features are extracted. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy-data training and speaker adaptive training, while at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduces the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER on the real test data, an 84.7% relative reduction from the 32.99% WER baseline and a 44.5% relative reduction from our official CHiME-3 challenge result of 9.1% WER. This final result also surpasses the best result (5.8% WER) reported in the CHiME-3 challenge.
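The abstract describes the front-end only at a high level. As an illustration of the simplest beamforming idea it mentions, the following is a minimal delay-and-sum sketch in Python/NumPy. It is not the paper's actual front-end (the paper combines three beamformer types with an LSTM enhancement network), and the function and parameter names are hypothetical.

    import numpy as np

    def delay_and_sum(signals, delays_s, fs):
        """Minimal delay-and-sum beamformer (illustrative sketch only,
        not the method used in the paper).

        signals  : (num_mics, num_samples) array of microphone recordings
        delays_s : per-microphone steering delays in seconds, e.g. estimated
                   against a reference channel (hypothetical input)
        fs       : sampling rate in Hz
        """
        num_mics, num_samples = signals.shape
        out = np.zeros(num_samples)
        for m in range(num_mics):
            shift = int(round(delays_s[m] * fs))  # delay in whole samples
            shifted = np.zeros(num_samples)
            if shift >= 0:  # channel lags the reference: advance it
                shifted[:num_samples - shift] = signals[m, shift:]
            else:           # channel leads the reference: retard it
                shifted[-shift:] = signals[m, :num_samples + shift]
            out += shifted
        # Averaging the time-aligned channels reinforces the coherent speech
        # component while spatially incoherent noise partially cancels.
        return out / num_mics

Delay-and-sum is the weakest beamformer in common use; the system described here relies on more powerful variants, but the underlying principle of combining channels into a single higher-quality signal is the same.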

Original language: English
Journal: Computer Speech and Language
ISSN: 0885-2308
Publisher: Academic Press Inc.
DOI: 10.1016/j.csl.2017.01.013
Publication status: Accepted/In press - 2016 Apr 11
Externally published: Yes

Keywords

  • Beamforming
  • CHiME-3
  • Noise robust feature
  • Robust speech recognition
  • System combination

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Human-Computer Interaction

Cite this

Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend. / Hori, Takaaki; Chen, Zhuo; Erdogan, Hakan; Hershey, John R.; Le Roux, Jonathan; Mitra, Vikramjit; Watanabe, Shinji.

In: Computer Speech and Language, 11.04.2016.

Research output: Contribution to journal › Article
