Statistical voice conversion based on noisy channel model

Daisuke Saito, Shinji Watanabe, Atsushi Nakamura, Nobuaki Minematsu

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

This paper describes a novel voice conversion framework that makes effective use of both a joint density model and a speaker model. In voice conversion studies, approaches based on a Gaussian mixture model (GMM) of the probability density of joint source-target feature vectors are widely used to estimate a transform function between the two speakers. However, to achieve sufficient quality, these approaches require a parallel corpus containing many utterances with the same linguistic content spoken by both speakers. In addition, joint density GMM methods often suffer from overtraining when the amount of training data is small. To address these problems, we propose a voice conversion framework that integrates the target speaker's GMM with the joint density model using a noisy channel model. The proposed method trains the joint density model on a few parallel utterances and, independently, the speaker model on nonparallel data from the target speaker. Because only those few parallel utterances are needed from the source speaker, this eases the burden on the source speaker. Experiments demonstrate the effectiveness of the proposed method, especially when the parallel corpus is small.
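
The integration described above can be read through the standard noisy channel decomposition. The abstract does not give the paper's equations, so the following is only a sketch under assumptions: $x$ and $y$ are hypothetical symbols for the source and target feature vectors, $P(y)$ stands for the target speaker GMM, and $P(x \mid y)$ stands for a channel term obtained from the joint density model.

```latex
\hat{y} = \arg\max_{y} P(y \mid x)
        = \arg\max_{y} \frac{P(x \mid y)\, P(y)}{P(x)}
        = \arg\max_{y} P(x \mid y)\, P(y)
```

Under this reading, the two factors can be estimated from different data: the channel term from the few parallel utterances, and the prior from nonparallel speech of the target speaker.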

Original language: English
Article number: 6156420
Pages (from-to): 1784-1794
Number of pages: 11
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 20
Issue number: 6
DOI: 10.1109/TASL.2012.2188628
Publication status: Published - 2012
Externally published: Yes

Keywords

  • Joint density model
  • noisy channel model
  • probabilistic integration
  • speaker model
  • voice conversion (VC)

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Statistical voice conversion based on noisy channel model. / Saito, Daisuke; Watanabe, Shinji; Nakamura, Atsushi; Minematsu, Nobuaki.

In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 6, 6156420, 2012, p. 1784-1794.

@article{41b5bbad2fd740d998c216702b8f47b6,
title = "Statistical voice conversion based on noisy channel model",
keywords = "Joint density model, noisy channel model, probabilistic integration, speaker model, voice conversion (VC)",
author = "Daisuke Saito and Shinji Watanabe and Atsushi Nakamura and Nobuaki Minematsu",
year = "2012",
doi = "10.1109/TASL.2012.2188628",
language = "English",
volume = "20",
pages = "1784--1794",
journal = "IEEE Transactions on Audio, Speech and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "6",

}

TY - JOUR

T1 - Statistical voice conversion based on noisy channel model

AU - Saito, Daisuke

AU - Watanabe, Shinji

AU - Nakamura, Atsushi

AU - Minematsu, Nobuaki

PY - 2012

Y1 - 2012

KW - Joint density model

KW - noisy channel model

KW - probabilistic integration

KW - speaker model

KW - voice conversion (VC)

UR - http://www.scopus.com/inward/record.url?scp=84859768504&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859768504&partnerID=8YFLogxK

U2 - 10.1109/TASL.2012.2188628

DO - 10.1109/TASL.2012.2188628

M3 - Article

AN - SCOPUS:84859768504

VL - 20

SP - 1784

EP - 1794

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

SN - 1558-7916

IS - 6

M1 - 6156420

ER -