Frame-Level Phoneme-Invariant Speaker Embedding for Text-Independent Speaker Recognition on Extremely Short Utterances

Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Marc Delcroix, Tetsuji Ogawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

This paper investigates a phoneme-invariant speaker embedding approach for speaker recognition on extremely short utterances. Intuitively, phonemes are nuisance information for the text-independent speaker recognition task, since the content of the speech is usually mismatched between enrollment and test time. However, many studies have shown that incorporating phoneme information is quite effective in improving the performance of speaker recognition systems. One reasonable explanation for this counter-intuitive result is that the pooling mechanism of segment-based speaker embedding can focus on the specific phonemes that contain rich speaker information, and phoneme information may help it do so. From this insight, we hypothesize that the pooling mechanism and phoneme-aware training are harmful when extracting speaker embeddings from extremely short utterances. To verify this hypothesis, an adversarial framework is introduced to remove phoneme variability from the frame-wise speaker embeddings. Experimental results on the LibriSpeech corpus confirm that our frame-wise, phoneme-adversarial approach outperforms the conventional segment-wise, phoneme-aware approach for short utterances of less than about 1.4 seconds.
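
The abstract does not say how the adversarial framework is implemented. One common way to remove a nuisance factor from an embedding is gradient reversal: the encoder is updated to lower the speaker loss while raising the phoneme loss. The sketch below illustrates that idea for a single frame with a toy linear encoder and linear softmax heads. It is an assumption-laden illustration, not the authors' model: the names `encoder_grad`, `W`, `S`, `P`, and the weight `lam` are all hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def head_grad_wrt_input(H, e, label):
    """Gradient of softmax cross-entropy w.r.t. embedding e for a linear head H."""
    p = softmax(matvec(H, e))
    err = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    # d loss / d e = H^T (p - onehot)
    return [sum(H[i][j] * err[i] for i in range(len(H))) for j in range(len(e))]


def encoder_grad(W, S, P, x, speaker, phoneme, lam=1.0):
    """Gradient used to update the encoder W for one frame x.

    Assumed setup (hypothetical): frame embedding e = W x, speaker head S,
    phoneme head P. The phoneme gradient is sign-flipped (gradient reversal),
    so a descent step on W lowers the speaker loss and raises the phoneme loss.
    """
    e = matvec(W, x)
    g_spk = head_grad_wrt_input(S, e, speaker)
    g_pho = head_grad_wrt_input(P, e, phoneme)
    g_e = [a - lam * b for a, b in zip(g_spk, g_pho)]
    # d loss / d W = outer(g_e, x)
    return [[g * xi for xi in x] for g in g_e]
```

Setting `lam=0.0` recovers plain speaker-only training, which makes the effect of the reversed phoneme term easy to isolate. In this scheme the phoneme head itself would still be trained to minimize its own loss; only the gradient flowing back into the encoder is reversed.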

Original language: English
Title of host publication: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6799-6803
Number of pages: 5
ISBN (Electronic): 9781509066315
Publication status: Published - 2020 May
Event: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Barcelona, Spain
Duration: 2020 May 4 to 2020 May 8

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2020-May
ISSN (Print): 1520-6149

Conference

Conference: 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Country: Spain
City: Barcelona
Period: 20/5/4 to 20/5/8

Keywords

  • adversarial learning
  • deep neural networks
  • phoneme-invariant feature
  • speaker embedding
  • text-independent speaker recognition

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
