Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

Szu Jui Chen, Aswin Shanmugam Subramanian, Hainan Xu, Shinji Watanabe

Research output: Contribution to journal › Conference article

3 Citations (Scopus)

Abstract

This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) a state-of-the-art system whose simplified single-system design is comparable to the complicated top systems in the challenge, and 2) a publicly available and reproducible recipe through the main repository of the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) criterion, trained on data augmented with all six microphone channels plus the enhanced data after beamforming. Finally, we use an LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74% WER on the real test set in the 6-channel track, which corresponds to 2nd place in the challenge. In addition, the proposed baseline recipe includes four different speech enhancement measures for the simulation test set: the short-time objective intelligibility measure (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ), and the speech distortion ratio (SDR). Thus, the recipe also provides an experimental platform for speech enhancement studies with these performance measures.
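The mask-based generalized eigenvalue (GEV) beamforming described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's Kaldi recipe: the function name `gev_beamformer` and the array shapes are assumptions, and the masks are taken as given (in the paper they come from a BLSTM mask estimator). Per frequency bin, mask-weighted spatial covariance matrices for speech and noise are estimated, and the beamforming filter is the principal generalized eigenvector maximizing the speech-to-noise power ratio.

```python
# Minimal sketch of mask-based GEV beamforming (illustrative, not the
# paper's implementation). Masks are assumed given, e.g. from a BLSTM.
import numpy as np
from scipy.linalg import eigh

def gev_beamformer(obs, speech_mask, noise_mask):
    """obs: (F, C, T) complex STFT of C microphone channels.
    speech_mask, noise_mask: (F, T) real masks in [0, 1]."""
    F, C, T = obs.shape
    out = np.empty((F, T), dtype=complex)
    for f in range(F):
        X = obs[f]  # (C, T)
        # Mask-weighted spatial covariance (cross-power spectral density)
        phi_s = (speech_mask[f] * X) @ X.conj().T / max(speech_mask[f].sum(), 1e-8)
        phi_n = (noise_mask[f] * X) @ X.conj().T / max(noise_mask[f].sum(), 1e-8)
        phi_n += 1e-6 * np.eye(C)  # regularize for numerical stability
        # GEV: maximize (w^H Phi_s w) / (w^H Phi_n w); eigh returns
        # eigenvalues ascending, so the last eigenvector is the solution.
        _, vecs = eigh(phi_s, phi_n)
        w = vecs[:, -1]
        out[f] = w.conj() @ X  # apply the beamforming filter
    return out
```

The output is a single-channel enhanced STFT, which in the paper's pipeline feeds the TDNN acoustic model (GEV filters are defined up to a per-bin scale, so practical systems add a normalization step not shown here).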

Original language: English
Pages (from-to): 1571-1575
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1262
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 - 2018 Sep 6


Keywords

  • Lattice-free MMI
  • LSTM language modeling
  • Mask-based beamforming
  • Noise robustness
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline. / Chen, Szu Jui; Subramanian, Aswin Shanmugam; Xu, Hainan; Watanabe, Shinji.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 1571-1575.

Research output: Contribution to journal › Conference article

@article{4ac07c46b8a94671b0624f0a9608adcf,
title = "Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline",
abstract = "This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) state-of-the-art system with a simplified single system comparable to the complicated top systems in the challenge, 2) publicly available and reproducible recipe through the main repository in the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) trained with augmented all six microphones plus the enhanced data after beamforming. Finally, we use a LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74{\%} WER for the real test set in the 6-channel track, which corresponds to the 2nd place in the challenge. In addition, the proposed baseline recipe includes four different speech enhancement measures, short-time objective intelligibility measure (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ) and speech distortion ratio (SDR) for the simulation test set. Thus, the recipe also provides an experimental platform for speech enhancement studies with these performance measures.",
keywords = "Lattice-free MMI, LSTM language modeling, Mask-based beamforming, Noise robustness, Speech recognition",
author = "Chen, {Szu Jui} and Subramanian, {Aswin Shanmugam} and Hainan Xu and Shinji Watanabe",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1262",
language = "English",
volume = "2018-September",
pages = "1571--1575",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR

T1 - Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline

AU - Chen, Szu Jui

AU - Subramanian, Aswin Shanmugam

AU - Xu, Hainan

AU - Watanabe, Shinji

PY - 2018/1/1

Y1 - 2018/1/1

N2 - This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) state-of-the-art system with a simplified single system comparable to the complicated top systems in the challenge, 2) publicly available and reproducible recipe through the main repository in the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) trained with augmented all six microphones plus the enhanced data after beamforming. Finally, we use a LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74% WER for the real test set in the 6-channel track, which corresponds to the 2nd place in the challenge. In addition, the proposed baseline recipe includes four different speech enhancement measures, short-time objective intelligibility measure (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ) and speech distortion ratio (SDR) for the simulation test set. Thus, the recipe also provides an experimental platform for speech enhancement studies with these performance measures.

AB - This paper describes a new baseline system for automatic speech recognition (ASR) in the CHiME-4 challenge to promote the development of noisy ASR in speech processing communities by providing 1) state-of-the-art system with a simplified single system comparable to the complicated top systems in the challenge, 2) publicly available and reproducible recipe through the main repository in the Kaldi speech recognition toolkit. The proposed system adopts generalized eigenvalue beamforming with bidirectional long short-term memory (LSTM) mask estimation. We also propose to use a time delay neural network (TDNN) based on the lattice-free version of the maximum mutual information (LF-MMI) trained with augmented all six microphones plus the enhanced data after beamforming. Finally, we use a LSTM language model for lattice and n-best re-scoring. The final system achieved 2.74% WER for the real test set in the 6-channel track, which corresponds to the 2nd place in the challenge. In addition, the proposed baseline recipe includes four different speech enhancement measures, short-time objective intelligibility measure (STOI), extended STOI (eSTOI), perceptual evaluation of speech quality (PESQ) and speech distortion ratio (SDR) for the simulation test set. Thus, the recipe also provides an experimental platform for speech enhancement studies with these performance measures.

KW - Lattice-free MMI

KW - LSTM language modeling

KW - Mask-based beamforming

KW - Noise robustness

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85054975722&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054975722&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2018-1262

DO - 10.21437/Interspeech.2018-1262

M3 - Conference article

AN - SCOPUS:85054975722

VL - 2018-September

SP - 1571

EP - 1575

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -