An investigation of neural uncertainty estimation for target speaker extraction equipped RNN transducer

Jiatong Shi, Chunlei Zhang*, Chao Weng, Shinji Watanabe, Meng Yu, Dong Yu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Target-speaker speech recognition aims to recognize the speech of an enrolled speaker in an environment with background noise and interfering speakers. This study presents a joint framework that combines time-domain target speaker extraction with a recurrent neural network transducer (RNN-T) for speech recognition. To alleviate the adverse effects on the speech recognition back-end of residual noise and artifacts introduced by the target speaker extraction module, we explore training the target speaker extraction module and the RNN-T jointly. We find that a multi-stage training strategy, which pre-trains and fine-tunes each module before joint training, is crucial for stabilizing the training process. In addition, we propose a novel neural uncertainty estimation method that leverages useful information from the target speaker extraction module (i.e., speaker identity uncertainty and speech enhancement uncertainty) to further improve the back-end speech recognizer. Compared to a recognizer with a target speech extraction front-end, our experiments show that joint training and the neural uncertainty module reduce the character error rate (CER) by 7% and 17% relative, respectively, on multi-talker simulation data. The multi-condition experiments indicate that our method can reduce CER by 9% relative in the noisy condition without losing performance in the clean condition. We also observe consistent improvements in further evaluation on real-world data based on vehicular speech.
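
The abstract reports its gains as relative CER reductions. As a minimal sketch of how such a figure is typically computed, the Python below derives CER from character-level edit distance and then the relative reduction between two systems. The function names and the example CER values (0.100 and 0.083) are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical CER values, for illustration only (not taken from the paper).
baseline_cer = 0.100
improved_cer = 0.083
print(f"relative CER reduction: {relative_reduction(baseline_cer, improved_cer):.0%}")
```

Under this definition, a "17% relative" reduction means the new CER is 83% of the baseline CER, not that the CER dropped by 17 absolute percentage points.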

Original language: English
Article number: 101327
Journal: Computer Speech and Language
Publication status: Published - 2022 May
Externally published: Yes

Keywords

  • Target-speaker speech extraction
  • Target-speaker speech recognition
  • Uncertainty estimation

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Human-Computer Interaction
