Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

Zhuo Chen, Shinji Watanabe, Hakan Erdogan, John R. Hershey

Research output: Contribution to journal › Article

39 Citations (Scopus)

Abstract

Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.
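The multi-task integration described in the abstract can be understood as one shared network trained against a weighted combination of the speech-enhancement (SE) objective and the ASR objective. The sketch below is illustrative only: the interpolation weight `alpha` and the helper name `multitask_loss` are assumptions for exposition, not the paper's actual formulation or weighting scheme.

```python
def multitask_loss(se_loss: float, asr_loss: float, alpha: float = 0.5) -> float:
    """Combine per-task losses into a single multi-task training objective.

    alpha = 1.0 trains the enhancement task only; alpha = 0.0 trains the
    recognition task only; intermediate values trade the two off, which is
    the core idea behind the multi-task hybrid architecture.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * se_loss + (1.0 - alpha) * asr_loss


# Example: with alpha = 0.25 the ASR loss dominates the combined objective.
combined = multitask_loss(se_loss=2.0, asr_loss=4.0, alpha=0.25)
```

In practice the two losses would be computed from a shared LSTM trunk with separate output heads (an enhancement mask head and an acoustic-model head), and gradients from both heads would flow back into the shared layers.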

Original language: English
Pages (from-to): 3274-3278
Number of pages: 5
Journal: Unknown Journal
Volume: 2015-January
Publication status: Published - 2015
Externally published: Yes

Keywords

  • Integration
  • LSTM
  • Noisy speech recognition
  • Sequence training
  • Speech enhancement

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks. / Chen, Zhuo; Watanabe, Shinji; Erdogan, Hakan; Hershey, John R.

In: Unknown Journal, Vol. 2015-January, 2015, p. 3274-3278.

Research output: Contribution to journal › Article

@article{f3715c7c6afd4a30b862c62c1a6d0236,
title = "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks",
abstract = "Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.",
keywords = "Integration, LSTM, Noisy speech recognition, Sequence training, Speech enhancement",
author = "Zhuo Chen and Shinji Watanabe and Hakan Erdogan and Hershey, {John R.}",
year = "2015",
language = "English",
volume = "2015-January",
pages = "3274--3278",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
publisher = "International Speech Communication Association",

}

TY - JOUR

T1 - Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks

AU - Chen, Zhuo

AU - Watanabe, Shinji

AU - Erdogan, Hakan

AU - Hershey, John R.

PY - 2015

Y1 - 2015

N2 - Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.

AB - Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.

KW - Integration

KW - LSTM

KW - Noisy speech recognition

KW - Sequence training

KW - Speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=84959084802&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959084802&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84959084802

VL - 2015-January

SP - 3274

EP - 3278

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -