Model-based speaker normalization methods for speech recognition

Masaki Naito, Li Deng, Yoshinori Sagisaka

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

A speaker normalization method using a speech generation model is proposed in order to achieve high-performance speaker adaptation with a small amount of adaptation data. The speaker- and phoneme-dependent vocal tract area function is approximated by the corresponding area function produced by the articulatory model of a standard speaker, combined with phoneme-independent feature quantities of the vocal-tract shape of the normalized target speaker as estimated from the formant frequencies of two vowels. The frequency warping functions are determined from the formant frequencies of speech calculated from the vocal-tract area functions thus obtained, and normalization of the uttered speech is performed by stretching the speech spectrum in the frequency-axis direction. Continuous phoneme recognition experiments using phoneme connection rules show that the recognition error using a gender-dependent model is reduced by about 30% in the proposed method and that recognition performance superior to that of vocal-tract length normalization is obtained. The recognition performance of the proposed method is also equivalent to that of speaker adaptation by moving vector field smoothing (VFS) using 10 phonetically balanced sentences, showing that high-performance speaker adaptation using a small amount of adaptation data can be achieved by the proposed method.
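
The abstract's final normalization step, stretching the spectrum along the frequency axis, can be illustrated with a minimal sketch. This is not the paper's method: the paper derives its warping functions from formant frequencies computed from vocal-tract area functions, whereas the sketch below assumes a single-parameter piecewise-linear warp of the kind often used in vocal-tract length normalization. The names `warp_frequency`, `warp_spectrum`, `alpha`, and `break_ratio` are hypothetical and chosen only for illustration.

```python
def warp_frequency(f, alpha, f_max, break_ratio=0.85):
    """Map frequency f (Hz) with a piecewise-linear warp (an assumed stand-in).

    Below the breakpoint the frequency is scaled by alpha; above it a second
    linear segment is used so that f_max still maps to f_max.
    """
    f_break = break_ratio * f_max * min(1.0, 1.0 / alpha)
    if f <= f_break:
        return alpha * f
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (f - f_break)


def warp_spectrum(spectrum, alpha, f_max):
    """Resample a spectrum sampled uniformly on [0, f_max] along a warped axis.

    Output bin k takes the value of the input spectrum at the warped
    frequency of bin k, using linear interpolation between input bins.
    Requires at least two bins.
    """
    n = len(spectrum)
    step = f_max / (n - 1)
    warped = []
    for k in range(n):
        src = warp_frequency(k * step, alpha, f_max) / step  # fractional input bin
        lo = min(int(src), n - 2)
        frac = src - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped


# Usage: a toy spectrum with a single peak, warped with alpha = 1.2.
toy = [0.0] * 101
toy[60] = 1.0
normalized = warp_spectrum(toy, alpha=1.2, f_max=8000.0)
```

With `alpha` greater than 1, spectral peaks move toward lower frequencies in the output; with `alpha` less than 1 they move toward higher frequencies. In the proposed method the warp would instead be determined per speaker from the estimated vocal-tract shape.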

Original language: English
Pages (from-to): 45-56
Number of pages: 12
Journal: Electronics and Communications in Japan, Part II: Electronics (English translation of Denshi Tsushin Gakkai Ronbunshi)
Volume: 86
Issue number: 2
DOIs: 10.1002/ecjb.10119
Publication status: Published - 2003 Feb
Externally published: Yes

Fingerprint

speech recognition
phonemes
sentences
Probability density function
vowels
Stretching
smoothing
Experiments

Keywords

  • Articulatory model
  • Frequency warping
  • Speaker normalization
  • Vocal tract shape
  • Vocal-tract area functions

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

@article{d48c3a8becde45b19b333388fe90a89c,
title = "Model-based speaker normalization methods for speech recognition",
abstract = "A speaker normalization method using a speech generation model is proposed in order to achieve high-performance speaker adaptation with a small amount of adaptation data. The speaker- and phoneme-dependent vocal tract area function is approximated by the corresponding area function produced by the articulatory model of a standard speaker, combined with phoneme-independent feature quantities of the vocal-tract shape of the normalized target speaker as estimated from the formant frequencies of two vowels. The frequency warping functions are determined from the formant frequencies of speech calculated from the vocal-tract area functions thus obtained, and normalization of the uttered speech is performed by stretching the speech spectrum in the frequency-axis direction. Continuous phoneme recognition experiments using phoneme connection rules show that the recognition error using a gender-dependent model is reduced by about 30{\%} in the proposed method and that recognition performance superior to that of vocal-tract length normalization is obtained. The recognition performance of the proposed method is also equivalent to that of speaker adaptation by moving vector field smoothing (VFS) using 10 phonetically balanced sentences, showing that high-performance speaker adaptation using a small amount of adaptation data can be achieved by the proposed method.",
keywords = "Articulatory model, Frequency warping, Speaker normalization, Vocal tract shape, Vocal-tract area functions",
author = "Masaki Naito and Li Deng and Yoshinori Sagisaka",
year = "2003",
month = feb,
doi = "10.1002/ecjb.10119",
language = "English",
volume = "86",
pages = "45--56",
journal = "Electronics and Communications in Japan, Part II: Electronics (English translation of Denshi Tsushin Gakkai Ronbunshi)",
issn = "8756-663X",
publisher = "Scripta Technica",
number = "2",

}

TY - JOUR

T1 - Model-based speaker normalization methods for speech recognition

AU - Naito, Masaki

AU - Deng, Li

AU - Sagisaka, Yoshinori

PY - 2003/2

Y1 - 2003/2

N2 - A speaker normalization method using a speech generation model is proposed in order to achieve high-performance speaker adaptation with a small amount of adaptation data. The speaker- and phoneme-dependent vocal tract area function is approximated by the corresponding area function produced by the articulatory model of a standard speaker, combined with phoneme-independent feature quantities of the vocal-tract shape of the normalized target speaker as estimated from the formant frequencies of two vowels. The frequency warping functions are determined from the formant frequencies of speech calculated from the vocal-tract area functions thus obtained, and normalization of the uttered speech is performed by stretching the speech spectrum in the frequency-axis direction. Continuous phoneme recognition experiments using phoneme connection rules show that the recognition error using a gender-dependent model is reduced by about 30% in the proposed method and that recognition performance superior to that of vocal-tract length normalization is obtained. The recognition performance of the proposed method is also equivalent to that of speaker adaptation by moving vector field smoothing (VFS) using 10 phonetically balanced sentences, showing that high-performance speaker adaptation using a small amount of adaptation data can be achieved by the proposed method.

AB - A speaker normalization method using a speech generation model is proposed in order to achieve high-performance speaker adaptation with a small amount of adaptation data. The speaker- and phoneme-dependent vocal tract area function is approximated by the corresponding area function produced by the articulatory model of a standard speaker, combined with phoneme-independent feature quantities of the vocal-tract shape of the normalized target speaker as estimated from the formant frequencies of two vowels. The frequency warping functions are determined from the formant frequencies of speech calculated from the vocal-tract area functions thus obtained, and normalization of the uttered speech is performed by stretching the speech spectrum in the frequency-axis direction. Continuous phoneme recognition experiments using phoneme connection rules show that the recognition error using a gender-dependent model is reduced by about 30% in the proposed method and that recognition performance superior to that of vocal-tract length normalization is obtained. The recognition performance of the proposed method is also equivalent to that of speaker adaptation by moving vector field smoothing (VFS) using 10 phonetically balanced sentences, showing that high-performance speaker adaptation using a small amount of adaptation data can be achieved by the proposed method.

KW - Articulatory model

KW - Frequency warping

KW - Speaker normalization

KW - Vocal tract shape

KW - Vocal-tract area functions

UR - http://www.scopus.com/inward/record.url?scp=0037318542&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037318542&partnerID=8YFLogxK

U2 - 10.1002/ecjb.10119

DO - 10.1002/ecjb.10119

M3 - Article

AN - SCOPUS:0037318542

VL - 86

SP - 45

EP - 56

JO - Electronics and Communications in Japan, Part II: Electronics (English translation of Denshi Tsushin Gakkai Ronbunshi)

JF - Electronics and Communications in Japan, Part II: Electronics (English translation of Denshi Tsushin Gakkai Ronbunshi)

SN - 8756-663X

IS - 2

ER -