Variational Bayesian estimation and clustering for speech recognition

Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda

Research output: Contribution to journal › Article

76 Citations (Scopus)

Abstract

In this paper, we propose variational Bayesian estimation and clustering for speech recognition (VBEC), which is based on the variational Bayesian (VB) approach. VBEC is a total Bayesian framework: all speech recognition procedures (acoustic modeling and speech classification) are based on VB posterior distributions, unlike the maximum likelihood (ML) approach, which relies on point estimates of ML parameters. This total Bayesian framework offers two major advantages over the ML approach for mitigating over-training: it can select an appropriate model structure without any condition on the size of the data set, and it can classify categories robustly using a predictive posterior distribution. Exploiting these advantages, VBEC 1) automatically constructs acoustic models along two separate dimensions, namely clustering triphone hidden Markov model states and determining the number of Gaussians, and 2) enables robust speech classification based on Bayesian predictive classification using VB posterior distributions. The capabilities of these VBEC functions were confirmed in large-vocabulary continuous speech recognition experiments on read and spontaneous speech tasks. The experiments confirmed that VBEC automatically constructed accurate acoustic models and classified speech robustly, i.e., it mitigated over-training effects and achieved high word accuracies.
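
To make the two mechanisms described in the abstract concrete, the following is a minimal schematic sketch in standard variational Bayesian notation; the symbols (O for training data, Z for hidden state and mixture assignments, Θ for acoustic-model parameters, m for a candidate model structure, X and W for a test utterance and a word sequence) are chosen here for illustration and need not match the paper's exact formulation.

\[
\mathcal{F}(m) \;=\; \mathbb{E}_{q(Z,\Theta)}\!\left[\log \frac{p(O, Z, \Theta \mid m)}{q(Z,\Theta)}\right] \;\le\; \log p(O \mid m),
\qquad
\hat{m} \;=\; \arg\max_{m}\, \mathcal{F}(m).
\]

The VB lower bound maximized during training thus doubles as a model-selection score for the state clustering and the number of Gaussians, without any asymptotic assumption on the amount of data. Classification then uses the VB posterior q(Θ) in place of a point estimate:

\[
\hat{W} \;=\; \arg\max_{W}\; p(W)\int p(X \mid \Theta, W)\, q(\Theta)\, d\Theta,
\]

which is the Bayesian predictive classification rule referred to in the abstract.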

Original language: English
Pages (from-to): 365-381
Number of pages: 17
Journal: IEEE Transactions on Speech and Audio Processing
Volume: 12
Issue number: 4
DOIs: 10.1109/TSA.2004.828640
Publication status: Published - Jul 2004
Externally published: Yes

Keywords

  • Acoustic model selection
  • Bayesian predictive classification
  • Speech recognition
  • Total Bayesian framework
  • Variational Bayes

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Watanabe, Shinji; Minami, Yasuhiro; Nakamura, Atsushi; Ueda, Naonori. Variational Bayesian estimation and clustering for speech recognition. In: IEEE Transactions on Speech and Audio Processing, Vol. 12, No. 4, 07.2004, p. 365-381.
