Gibbs sampling based multi-scale mixture model for speaker clustering

Shinji Watanabe, Daichi Mochihashi, Takaaki Hori, Atsushi Nakamura

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

9 Citations (Scopus)

Abstract

The aim of this work is to apply a sampling approach to speech modeling and to propose a Gibbs sampling based Multi-scale Mixture Model (M3). The proposed approach focuses on the multi-scale property of speech dynamics, i.e., dynamics in speech can be observed on, for instance, short-time acoustical, linguistic-segmental, and utterance-wise temporal scales. M3 is an extension of the Gaussian mixture model and can be regarded as a hierarchical mixture model in which the mixture components at each time scale change at intervals of the corresponding time unit. We derive a fully Bayesian treatment of the multi-scale mixture model based on Gibbs sampling. The advantage of the proposed model is that each speaker cluster can be modeled precisely with a Gaussian mixture model, unlike conventional single-Gaussian based speaker clustering (e.g., using the Bayesian Information Criterion (BIC)). In addition, Gibbs sampling offers the potential to avoid a serious local optimum problem. Speaker clustering experiments confirmed these advantages and showed a significant improvement over the conventional BIC based approaches.
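The two-scale idea in the abstract (one speaker label per utterance, one mixture-component label per frame) can be illustrated with a small Gibbs sampler. This is only a sketch under strong simplifying assumptions that are not taken from the paper: 1-D features, a known shared variance, fixed uniform mixture weights, and a Gaussian prior on the component means. The function name `gibbs_two_scale` and all parameter names are hypothetical, and the sampler is not the authors' derivation.

```python
import math
import random


def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sample_index(log_weights, rng):
    """Draw index i with probability proportional to exp(log_weights[i])."""
    top = max(log_weights)
    probs = [math.exp(v - top) for v in log_weights]
    r = rng.random() * sum(probs)
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def gibbs_two_scale(utterances, n_speakers=2, n_components=2, n_sweeps=20,
                    var=1.0, prior_mean=0.0, prior_var=25.0, seed=0):
    """Toy Gibbs sampler: speaker label per utterance, component per frame."""
    rng = random.Random(seed)
    log_w = -math.log(n_components)  # assumption: fixed uniform weights
    # Initialise every component mean with a draw from the prior.
    means = [[rng.gauss(prior_mean, math.sqrt(prior_var))
              for _ in range(n_components)] for _ in range(n_speakers)]
    speaker_of = [0] * len(utterances)
    for _ in range(n_sweeps):
        counts = [[0] * n_components for _ in range(n_speakers)]
        sums = [[0.0] * n_components for _ in range(n_speakers)]
        for u, frames in enumerate(utterances):
            # Utterance scale: sample the speaker label with the frame-level
            # component labels summed out (uniform prior over speakers).
            log_post = [sum(log_sum_exp([log_w + log_normal(x, m, var)
                                         for m in means[s]]) for x in frames)
                        for s in range(n_speakers)]
            s = sample_index(log_post, rng)
            speaker_of[u] = s
            # Frame scale: sample a component label for each frame.
            for x in frames:
                k = sample_index([log_w + log_normal(x, m, var)
                                  for m in means[s]], rng)
                counts[s][k] += 1
                sums[s][k] += x
        # Parameters: conjugate Gaussian update for every component mean.
        for s in range(n_speakers):
            for k in range(n_components):
                precision = 1.0 / prior_var + counts[s][k] / var
                post_mean = (prior_mean / prior_var + sums[s][k] / var) / precision
                means[s][k] = rng.gauss(post_mean, math.sqrt(1.0 / precision))
    return speaker_of, means


# Usage with synthetic data: four utterances of 20 frames each.
data_rng = random.Random(1)
utts = [[data_rng.gauss(c, 1.0) for _ in range(20)]
        for c in (-5.0, -5.0, 5.0, 5.0)]
labels, component_means = gibbs_two_scale(utts)
```

Each sweep draws the speaker label of an utterance from its conditional distribution, then the frame-level component labels, then the means. Sampling rather than greedily maximizing at each step is what gives a sampler of this kind a chance to leave a poor local configuration, which is the property the abstract points to.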

Original language: English
Title of host publication: 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings
Pages: 4524-4527
Number of pages: 4
DOI: https://doi.org/10.1109/ICASSP.2011.5947360
Publication status: Published - 2011
Externally published: Yes
Event: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Prague
Duration: 2011 May 22 to 2011 May 27

Other

Other: 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011
City: Prague
Period: 11/5/22 to 11/5/27

Fingerprint

Sampling
Linguistics
Experiments

Keywords

  • Fully Bayesian approach
  • Gaussian mixture
  • Gibbs sampling
  • multi-scale mixture model
  • speaker clustering

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Electrical and Electronic Engineering

Cite this

Watanabe, S., Mochihashi, D., Hori, T., & Nakamura, A. (2011). Gibbs sampling based multi-scale mixture model for speaker clustering. In 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings (pp. 4524-4527). [5947360] https://doi.org/10.1109/ICASSP.2011.5947360

@inproceedings{e9c66af2eb4a48a4a4694462c4da18b1,
title = "Gibbs sampling based multi-scale mixture model for speaker clustering",
abstract = "The aim of this work is to apply a sampling approach to speech modeling, and propose a Gibbs sampling based Multi-scale Mixture Model (M3). The proposed approach focuses on the multi-scale property of speech dynamics, i.e., dynamics in speech can be observed on, for instance, short-time acoustical, linguistic-segmental, and utterance-wise temporal scales. M3 is an extension of the Gaussian mixture model and is considered a hierarchical mixture model, where mixture components in each time scale will change at intervals of the corresponding time unit. We derive a fully Bayesian treatment of the multi-scale mixture model based on Gibbs sampling. The advantage of the proposed model is that each speaker cluster can be precisely modeled based on the Gaussian mixture model unlike conventional single-Gaussian based speaker clustering (e.g., using the Bayesian Information Criterion (BIC)). In addition, Gibbs sampling offers the potential to avoid a serious local optimum problem. Speaker clustering experiments confirmed these advantages and obtained a significant improvement over the conventional BIC based approaches.",
keywords = "Fully Bayesian approach, Gaussian mixture, Gibbs sampling, multi-scale mixture model, speaker clustering",
author = "Shinji Watanabe and Daichi Mochihashi and Takaaki Hori and Atsushi Nakamura",
year = "2011",
doi = "10.1109/ICASSP.2011.5947360",
language = "English",
isbn = "9781457705397",
pages = "4524--4527",
booktitle = "2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings",

}

TY - GEN

T1 - Gibbs sampling based multi-scale mixture model for speaker clustering

AU - Watanabe, Shinji

AU - Mochihashi, Daichi

AU - Hori, Takaaki

AU - Nakamura, Atsushi

PY - 2011

Y1 - 2011

AB - The aim of this work is to apply a sampling approach to speech modeling, and propose a Gibbs sampling based Multi-scale Mixture Model (M3). The proposed approach focuses on the multi-scale property of speech dynamics, i.e., dynamics in speech can be observed on, for instance, short-time acoustical, linguistic-segmental, and utterance-wise temporal scales. M3 is an extension of the Gaussian mixture model and is considered a hierarchical mixture model, where mixture components in each time scale will change at intervals of the corresponding time unit. We derive a fully Bayesian treatment of the multi-scale mixture model based on Gibbs sampling. The advantage of the proposed model is that each speaker cluster can be precisely modeled based on the Gaussian mixture model unlike conventional single-Gaussian based speaker clustering (e.g., using the Bayesian Information Criterion (BIC)). In addition, Gibbs sampling offers the potential to avoid a serious local optimum problem. Speaker clustering experiments confirmed these advantages and obtained a significant improvement over the conventional BIC based approaches.

KW - Fully Bayesian approach

KW - Gaussian mixture

KW - Gibbs sampling

KW - multi-scale mixture model

KW - speaker clustering

UR - http://www.scopus.com/inward/record.url?scp=80051606569&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80051606569&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2011.5947360

DO - 10.1109/ICASSP.2011.5947360

M3 - Conference contribution

SN - 9781457705397

SP - 4524

EP - 4527

BT - 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings

ER -