Prior-shared feature and model space speaker adaptation by consistently employing MAP estimation

Seong Jun Hahm, Shinji Watanabe, Atsunori Ogawa, Masakiyo Fujimoto, Takaaki Hori, Atsushi Nakamura

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

This paper describes the development of a speaker adaptation method that improves speech recognition performance regardless of the amount of adaptation data. To that end, we propose consistently employing maximum a posteriori (MAP)-based Bayesian estimation for both feature space normalization and model space adaptation. Namely, constrained structural maximum a posteriori linear regression (CSMAPLR) is first performed in the feature space to compensate for speaker characteristics, and SMAPLR is then performed in the model space to capture the remaining speaker characteristics. A prior distribution stabilizes the parameter estimation, especially when the amount of adaptation data is small. In the proposed method, CSMAPLR and SMAPLR are performed on the same acoustic model, so the dimension-dependent variations of the feature and model spaces can be similar. Because these dimension-dependent variations of the transformation matrix are explained well by the prior distribution, sharing the same prior distribution between CSMAPLR and SMAPLR appropriately regularizes the parameter estimation in both spaces. Experiments on large vocabulary continuous speech recognition using the Corpus of Spontaneous Japanese (CSJ) and the MIT OpenCourseWare corpus (MIT-OCW) confirm the effectiveness of the proposed method compared with other conventional adaptation methods, with and without speaker adaptive training.
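The core idea of MAP-regularized transform estimation can be illustrated with a minimal sketch. This is not the paper's CSMAPLR/SMAPLR algorithm (which uses structural priors over a regression tree and HMM state statistics); it is a generic MAP estimate of a single affine feature transform with a prior centered on the identity, where the hypothetical parameter `tau` plays the role of the prior weight. The function name and data are illustrative only.

```python
import numpy as np

def map_affine_transform(x, y, tau=10.0):
    """MAP estimate of an affine transform mapping rows of x to rows of y.

    The prior is centered on the identity transform with zero bias;
    tau controls how strongly the prior regularizes the estimate.
    """
    n, d = x.shape
    x_aug = np.hstack([x, np.ones((n, 1))])             # append bias column
    w_prior = np.hstack([np.eye(d), np.zeros((d, 1))])  # prior mean: [I | 0]
    # MAP solution: (X'X + tau*I)^{-1} (X'Y + tau * W_prior')
    a = x_aug.T @ x_aug + tau * np.eye(d + 1)
    b = x_aug.T @ y + tau * w_prior.T
    return np.linalg.solve(a, b).T                      # shape (d, d+1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))   # tiny "adaptation set": only 5 frames
y = x + 0.1                   # true mapping: identity plus a small shift
w = map_affine_transform(x, y, tau=100.0)
# With a strong prior and little data, w stays close to the prior mean [I | 0]
# instead of overfitting the 5 samples, mirroring the abstract's point that
# the prior stabilizes estimation when adaptation data is scarce.
```

As `tau` goes to zero the estimate reverts to ordinary least squares, which with so few samples would fit the data exactly; the prior trades that variance for a pull toward the identity transform.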

Original language: English
Pages (from-to): 415-431
Number of pages: 17
Journal: Speech Communication
ISSN: 0167-6393
Publisher: Elsevier
Volume: 55
Issue number: 3
DOI: 10.1016/j.specom.2012.12.002
Publication status: Published - 2013
Externally published: Yes

Keywords

  • Feature space normalization
  • Model space adaptation
  • Prior distribution sharing
  • Speaker adaptation
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Communication
  • Software
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Modelling and Simulation

Cite this

Hahm, S. J., Watanabe, S., Ogawa, A., Fujimoto, M., Hori, T., & Nakamura, A. (2013). Prior-shared feature and model space speaker adaptation by consistently employing MAP estimation. Speech Communication, 55(3), 415-431. https://doi.org/10.1016/j.specom.2012.12.002
