Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives

Aswin Shanmugam Subramanian, Chao Weng, Meng Yu, Shi Xiong Zhang, Yong Xu, Shinji Watanabe, Dong Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Target speech extraction is a specific case of source separation in which auxiliary information, such as the location of the target speaker or pre-saved anchor speech examples, is used to resolve the permutation ambiguity. Traditionally, such systems are optimized with signal reconstruction objectives. Recently, end-to-end automatic speech recognition (ASR) methods have made it possible to optimize source separation systems with only a transcription-based objective. This paper proposes a method to jointly optimize a location-guided target speech extraction module and a speech recognition module using only ASR error minimization criteria. Experimental comparisons with corresponding conventional pipeline systems verify that this task can be realized with end-to-end ASR training objectives, without using parallel clean data. We show promising target speech recognition results on mixtures of two speakers and noise, and discuss interesting properties of the proposed system in terms of speech enhancement/separation objectives and word error rates. Finally, we design a system that takes both location and anchor speech as input at the same time and show that performance can be further improved.

Original language: English
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7299-7303
Number of pages: 5
ISBN (Electronic): 9781509066315
DOIs: https://doi.org/10.1109/ICASSP40776.2020.9053692
Publication status: Published - 2020 May
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: 2020 May 4 to 2020 May 8

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2020-May
ISSN (Print): 1520-6149

Conference

Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country: Spain
City: Barcelona
Period: 20/5/4 to 20/5/8

Keywords

  • end-to-end speech recognition
  • neural beamformer
  • target speech extraction

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Subramanian, A. S., Weng, C., Yu, M., Zhang, S. X., Xu, Y., Watanabe, S., & Yu, D. (2020). Far-Field Location Guided Target Speech Extraction Using End-to-End Speech Recognition Objectives. In 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings (pp. 7299-7303). [9053692] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2020-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP40776.2020.9053692