Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR

Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, Björn Schuller

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

142 Citations (Scopus)

Abstract

We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used ‘naïvely’ as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
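
As a rough illustration of the approach described in the abstract (an LSTM front-end that estimates a time-frequency mask and is trained on a speech reconstruction objective), the following minimal PyTorch sketch may be helpful; the layer sizes, feature dimensions, and exact loss are illustrative assumptions rather than the authors' configuration.

import torch
import torch.nn as nn

class LSTMEnhancer(nn.Module):
    # Hypothetical model: maps noisy magnitude spectra to a [0, 1] mask per frequency bin.
    def __init__(self, n_freq=100, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, n_freq) magnitude spectrogram features
        h, _ = self.lstm(noisy_mag)
        return torch.sigmoid(self.mask(h))

def reconstruction_loss(mask, noisy_mag, clean_mag):
    # "Speech reconstruction" objective: penalize the error between the masked
    # noisy spectrum and the clean speech spectrum, not the mask itself.
    return ((mask * noisy_mag - clean_mag) ** 2).mean()

# Toy training step on random tensors, only to show the shapes involved.
model = LSTMEnhancer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.rand(8, 200, 100)   # 8 utterances, 200 frames, 100 frequency bins
clean = torch.rand(8, 200, 100)
loss = reconstruction_loss(model(noisy), noisy, clean)
opt.zero_grad()
loss.backward()
opt.step()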

Original language: English
Title of host publication: Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Proceedings
Publisher: Springer Verlag
Pages: 91-99
Number of pages: 9
Volume: 9237
ISBN (Print): 978-3-319-22481-7
DOI: 10.1007/978-3-319-22482-4_11
Publication status: Published - 2015
Externally published: Yes
Event: 12th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2015 - Liberec, Czech Republic
Duration: 2015 Aug 25 - 2015 Aug 28

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9237
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 12th International Conference on Latent Variable Analysis and Signal Separation, LVA/ICA 2015
Country: Czech Republic
City: Liberec
Period: 15/8/25 - 15/8/28

Fingerprint

  • Speech enhancement
  • Automatic speech recognition
  • Recurrent neural networks
  • Long short-term memory
  • Acoustic noise
  • Word error rate
  • Feature-level fusion

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., & Schuller, B. (2015). Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Proceedings (Vol. 9237, pp. 91-99). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9237). Springer Verlag. https://doi.org/10.1007/978-3-319-22482-4_11

@inproceedings{7c4685bb91804a2a93e86f28c8a25199,
title = "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR",
abstract = "We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used ‘na{\"i}vely’ as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76{\%} average word error rate, which is, to our knowledge, the best score to date.",
author = "Felix Weninger and Hakan Erdogan and Shinji Watanabe and Emmanuel Vincent and {Le Roux}, Jonathan and Hershey, {John R.} and Bj{\"o}rn Schuller",
year = "2015",
doi = "10.1007/978-3-319-22482-4_11",
language = "English",
isbn = "9783319224817",
volume = "9237",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "91--99",
booktitle = "Latent Variable Analysis and Signal Separation - 12th International Conference, LVA/ICA 2015, Proceedings",
address = "Germany",

}
