Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1 % from that of TS-ASR given oracle speaker embeddings. Furthermore, our method can solve speaker diarization simultaneously as a by-product and achieved better DER than that of the conventional clustering-based speaker diarization method based on i-vector.

Original languageEnglish
Title of host publication2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages31-38
Number of pages8
ISBN (Electronic)9781728103068
DOIs
Publication statusPublished - 2019 Dec
Externally publishedYes
Event2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 2019 Dec 152019 Dec 18

Publication series

Name2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
CountrySingapore
CitySingapore
Period19/12/1519/12/18

Keywords

  • deep learning
  • multi-Talker speech recognition
  • speaker diarization

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
  • Linguistics and Language
  • Communication

Fingerprint Dive into the research topics of 'Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models'. Together they form a unique fingerprint.

  • Cite this

    Kanda, N., Horiguchi, S., Fujita, Y., Xue, Y., Nagamatsu, K., & Watanabe, S. (2019). Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings (pp. 31-38). [9004009] (2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ASRU46091.2019.9004009