Online End-To-End Neural Diarization with Speaker-Tracing Buffer

Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker permutation problem, because speaker regions may be assigned incorrectly across the recording. To circumvent this inconsistency, we propose a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames in the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we train SA-EEND with variable chunk sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. Experiments combining online SA-EEND with variable chunk-size training achieved diarization error rates (DERs) of 12.54% for CALLHOME and 20.77% for CSJ with an actual latency of 1.4 s.
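The buffer mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `model` callable, and the toy `FlippingModel` are hypothetical, and the frame-selection rule (keep the most recent frames) is an assumption, since the abstract does not say how buffer frames are chosen. The sketch only shows the general idea of stacking buffered frames with the current chunk and picking the speaker ordering whose outputs on the buffered frames correlate best with the outputs stored for them.

```python
from itertools import permutations

import numpy as np


def correlation(a, b):
    """Pearson correlation of two flattened arrays (0.0 if either is constant)."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def best_permutation(buf_out_prev, buf_out_new):
    """Speaker ordering of buf_out_new that best matches buf_out_prev.

    Both arrays have shape (frames, speakers). Every ordering is tried,
    which is only practical for a small number of speakers.
    """
    n_spk = buf_out_prev.shape[1]
    best, best_score = tuple(range(n_spk)), -np.inf
    for perm in permutations(range(n_spk)):
        score = correlation(buf_out_prev, buf_out_new[:, list(perm)])
        if score > best_score:
            best, best_score = perm, score
    return best


def online_diarize(frames, model, chunk_size, buffer_size):
    """Process frames chunk by chunk, keeping speaker order consistent.

    `model` is any callable mapping (frames, features) to (frames, speakers)
    activities; its speaker order may differ from call to call.
    """
    buf_in = np.empty((0, frames.shape[1]))  # buffered input frames
    buf_out = None                           # outputs stored for those frames
    results = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        stacked = np.vstack([buf_in, chunk])
        out = model(stacked)
        n_buf = len(buf_in)
        if n_buf > 0:
            # Align the new outputs with the ordering stored in the buffer.
            perm = best_permutation(buf_out, out[:n_buf])
            out = out[:, list(perm)]
        results.append(out[n_buf:])
        # Assumption: keep the most recent frames; the abstract does not
        # specify the selection strategy.
        buf_in = stacked[-buffer_size:]
        buf_out = out[-buffer_size:]
    return np.vstack(results)


class FlippingModel:
    """Hypothetical stand-in for a diarization network: it echoes its input
    but swaps the two speaker columns on every second call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x[:, ::-1] if self.calls % 2 == 0 else x
```

With the toy model, the per-chunk column swaps are undone as long as the buffered frames contain at least one frame in which the two speakers differ; a buffer holding only silence or fully overlapped speech carries no permutation information, which is presumably why the choice of buffered frames matters.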

Original language: English
Title of host publication: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 841-848
Number of pages: 8
ISBN (Electronic): 9781728170664
DOIs
Publication status: Published - 2021 Jan 19
Event: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Virtual, Shenzhen, China
Duration: 2021 Jan 19 to 2021 Jan 22

Publication series

Name: 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings

Conference

Conference: 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Country: China
City: Virtual, Shenzhen
Period: 21/1/19 to 21/1/22

Keywords

  • end-to-end
  • Online speaker diarization
  • self-attention
  • speaker-tracing buffer

ASJC Scopus subject areas

  • Linguistics and Language
  • Language and Linguistics
  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Hardware and Architecture
