TY - GEN
T1 - Frame-Level Phoneme-Invariant Speaker Embedding for Text-Independent Speaker Recognition on Extremely Short Utterances
AU - Tawara, Naohiro
AU - Ogawa, Atsunori
AU - Iwata, Tomoharu
AU - Delcroix, Marc
AU - Ogawa, Tetsuji
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - This paper investigates a phoneme-invariant speaker embedding approach for speaker recognition on extremely short utterances. Intuitively, phonemes are nuisance information for the text-independent speaker recognition task, since the contents of the speech are usually mismatched between enrollment and testing. However, many studies have shown that incorporating phoneme information is quite effective in improving the performance of speaker recognition systems. One reasonable explanation for this counter-intuitive result is that the pooling mechanism of segment-based speaker embedding can focus on the specific phonemes that contain rich speaker information, and phoneme information may aid this focusing. From this insight, we hypothesize that the pooling mechanism and phoneme-aware training are harmful when extracting speaker embeddings from extremely short utterances. To verify this hypothesis, an adversarial framework is introduced to remove phoneme variability from the frame-wise speaker embeddings. The experimental results on the LibriSpeech corpus confirm that our frame-wise, phoneme-adversarial approach outperforms the conventional segment-wise, phoneme-aware approach for short utterances of less than about 1.4 seconds.
AB - This paper investigates a phoneme-invariant speaker embedding approach for speaker recognition on extremely short utterances. Intuitively, phonemes are nuisance information for the text-independent speaker recognition task, since the contents of the speech are usually mismatched between enrollment and testing. However, many studies have shown that incorporating phoneme information is quite effective in improving the performance of speaker recognition systems. One reasonable explanation for this counter-intuitive result is that the pooling mechanism of segment-based speaker embedding can focus on the specific phonemes that contain rich speaker information, and phoneme information may aid this focusing. From this insight, we hypothesize that the pooling mechanism and phoneme-aware training are harmful when extracting speaker embeddings from extremely short utterances. To verify this hypothesis, an adversarial framework is introduced to remove phoneme variability from the frame-wise speaker embeddings. The experimental results on the LibriSpeech corpus confirm that our frame-wise, phoneme-adversarial approach outperforms the conventional segment-wise, phoneme-aware approach for short utterances of less than about 1.4 seconds.
KW - adversarial learning
KW - deep neural networks
KW - phoneme-invariant feature
KW - speaker embedding
KW - text-independent speaker recognition
UR - http://www.scopus.com/inward/record.url?scp=85089242859&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089242859&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9053871
DO - 10.1109/ICASSP40776.2020.9053871
M3 - Conference contribution
AN - SCOPUS:85089242859
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6799
EP - 6803
BT - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -