Speech enhancement using end-to-end speech recognition objectives

Aswin Shanmugam Subramanian, Xiaofei Wang, Murali Karthick Baskar, Shinji Watanabe, Toru Taniguchi, Dung Tran, Yuya Fujita

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Speech enhancement systems, which denoise and dereverberate distorted signals, are usually optimized based on signal reconstruction objectives, including maximum likelihood and minimum mean square error. However, emergent end-to-end neural methods make it possible to optimize the speech enhancement system with more application-oriented objectives. For example, we can jointly optimize speech enhancement and automatic speech recognition (ASR) using only ASR error minimization criteria. The major contribution of this paper is to investigate how a system optimized based on the ASR objective improves the speech enhancement quality on various signal-level metrics in addition to the ASR word error rate (WER) metric. We use a recently developed multichannel end-to-end (ME2E) system, which integrates neural dereverberation, beamforming, and attention-based speech recognition within a single neural network. Additionally, we propose to extend the dereverberation subnetwork of ME2E by dynamically varying the filter order in linear prediction using reinforcement learning, and to extend the beamforming subnetwork by incorporating the estimation of a speech distortion factor. The experiments reveal how well different signal-level metrics correlate with the WER metric, and verify that learning-based speech enhancement can be realized by end-to-end ASR training objectives without using parallel clean and noisy data.
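
For readers unfamiliar with the WER metric mentioned in the abstract, the following is a minimal illustrative sketch of how a word error rate is commonly computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It is not taken from the paper; the function name `word_error_rate` and the example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the edit count is divided by the reference length, the value can exceed 1.0 when the hypothesis contains many inserted words.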

Original language: English
Title of host publication: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 234-238
Number of pages: 5
ISBN (Electronic): 9781728111230
DOIs: https://doi.org/10.1109/WASPAA.2019.8937250
Publication status: Published - 2019 Oct
Externally published: Yes
Event: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019 - New Paltz, United States
Duration: 2019 Oct 20 → 2019 Oct 23

Publication series

Name: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
Volume: 2019-October
ISSN (Print): 1931-1168
ISSN (Electronic): 1947-1629

Conference

Conference: 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019
Country: United States
City: New Paltz
Period: 19/10/20 → 19/10/23

Keywords

  • neural beamformer
  • neural dereverberation
  • speech enhancement
  • speech recognition
  • training objectives

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Computer Science Applications

Cite this

Subramanian, A. S., Wang, X., Baskar, M. K., Watanabe, S., Taniguchi, T., Tran, D., & Fujita, Y. (2019). Speech enhancement using end-to-end speech recognition objectives. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019 (pp. 234-238). [8937250] (IEEE Workshop on Applications of Signal Processing to Audio and Acoustics; Vol. 2019-October). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/WASPAA.2019.8937250