An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Emmanuel Vincent, Shinji Watanabe, Aditya Arie Nugraha, Jon Barker, Ricard Marxer

Research output: Contribution to journal › Article

53 Citations (Scopus)

Abstract

Speech enhancement and automatic speech recognition (ASR) are most often evaluated in matched (or multi-condition) settings where the acoustic conditions of the training data match (or cover) those of the test data. Few studies have systematically assessed the impact of acoustic mismatches between training and test data, especially concerning recent speech enhancement and state-of-the-art ASR techniques. In this article, we study this issue in the context of the CHiME-3 dataset, which consists of sentences spoken by talkers in challenging noisy environments and recorded using a 6-channel tablet-based microphone array. We provide a critical analysis of the results published on this dataset for various signal enhancement, feature extraction, and ASR backend techniques and perform new experiments to separately assess the impact of different noise environments, of different numbers and positions of microphones, and of simulated vs. real data on speech enhancement and ASR performance. We show that, with the exception of minimum variance distortionless response (MVDR) beamforming, most algorithms perform consistently on real and simulated data and can benefit from training on simulated data. We also find that training on different noise environments and different microphones barely affects ASR performance, especially when several environments are present in the training data: only the number of microphones has a significant impact. Based on these results, we introduce the CHiME-4 Speech Separation and Recognition Challenge, which revisits the CHiME-3 dataset and makes it more challenging by reducing the number of microphones available for testing.
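
The abstract singles out minimum variance distortionless response (MVDR) beamforming as the one technique that behaves differently on real and simulated data. As a reminder of what MVDR computes, here is a minimal NumPy sketch of the textbook per-frequency solution w = Φ⁻¹d / (dᴴΦ⁻¹d), where Φ is the spatial noise covariance matrix and d the steering vector toward the target speaker. The function name and the synthetic inputs are ours for illustration only; this is not the implementation evaluated in the paper.

    import numpy as np

    def mvdr_weights(noise_cov, steering):
        """Textbook per-frequency MVDR beamformer: w = Phi^-1 d / (d^H Phi^-1 d).

        noise_cov: (M, M) Hermitian spatial covariance matrix of the noise
        steering:  (M,)   steering vector toward the target speaker
        """
        x = np.linalg.solve(noise_cov, steering)  # Phi^-1 d
        return x / (steering.conj() @ x)          # scale so that w^H d = 1

    # Toy check with 6 channels, the size of the CHiME-3 tablet array.
    rng = np.random.default_rng(0)
    M = 6
    A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
    noise_cov = A @ A.conj().T + 1e-3 * np.eye(M)   # Hermitian, positive definite
    steering = np.exp(-2j * np.pi * rng.random(M))  # unit-modulus phase model
    w = mvdr_weights(noise_cov, steering)
    print(np.abs(w.conj() @ steering))              # distortionless constraint: 1.0

In practice the weights are applied per short-time Fourier transform bin, and both Φ and d must themselves be estimated from the data, which is where real and simulated conditions can diverge.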

Original language: English
Journal: Computer Speech and Language
ISSN: 0885-2308
Publisher: Academic Press Inc.
DOI: 10.1016/j.csl.2016.11.005
Publication status: Accepted/In press - 2016 Apr 25
Externally published: Yes

Keywords

  • Microphone array
  • Robust ASR
  • Speech enhancement
  • Train/test mismatch

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Software
  • Human-Computer Interaction

Cite this

An analysis of environment, microphone and data simulation mismatches in robust speech recognition. / Vincent, Emmanuel; Watanabe, Shinji; Nugraha, Aditya Arie; Barker, Jon; Marxer, Ricard.

In: Computer Speech and Language, 25.04.2016.
