Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer

Marc Delcroix, Shinji Watanabe, Tomohiro Nakatani, Atsushi Nakamura

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

A conventional approach to noise-robust speech recognition consists of employing a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit the recognition performance improvement. In this paper, we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework relies on recent proposals for increasing robustness by replacing the point estimate of the enhanced features with a distribution that has a dynamic (i.e., time-varying) feature variance. We recently proposed a model for the dynamic feature variance that consists of a dynamic feature variance root, obtained from the pre-processor, multiplied by a weight representing the pre-processor uncertainty; adaptation data are used to optimize this uncertainty weight. The formulation of the method is general and could be used with any speech enhancement pre-processor. However, we observed that, in the case of noise reduction based on spectral subtraction or related approaches, adaptation could fail because the model poorly represents the actual dynamic feature variance. The dynamic feature variance changes according to the level of the speech sound, which varies with the HMM states. Therefore, we propose improving the model by introducing HMM state dependency. We achieve this by using a cluster-based representation, i.e., the Gaussians of the acoustic model are grouped into clusters, and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks demonstrate the generality of the proposed integration scheme and show that the proposed extension improves performance with various speech enhancement pre-processors.
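The abstract describes scoring enhanced features against a Gaussian whose variance is inflated by a dynamic, cluster-dependent term: the pre-processor supplies a per-frame feature variance root, which is scaled by an uncertainty weight tied to the cluster of the acoustic-model Gaussian. The following is a minimal numerical sketch of that idea for diagonal covariances, not the authors' implementation; the function name and the variable names (`alpha`, `enh_var_root`) are assumptions made here for illustration.

```python
import numpy as np

def log_likelihood(y, mean, var, enh_var_root, alpha):
    """Log-likelihood of an enhanced feature vector y under a diagonal
    Gaussian N(mean, var) whose variance is inflated by the dynamic
    feature variance: total variance = var + alpha * enh_var_root.

    enh_var_root : per-frame dynamic feature variance root from the
                   speech enhancement pre-processor (illustrative)
    alpha        : uncertainty weight for the cluster to which this
                   acoustic-model Gaussian belongs (illustrative)
    """
    total_var = var + alpha * enh_var_root
    return -0.5 * np.sum(
        np.log(2.0 * np.pi * total_var) + (y - mean) ** 2 / total_var
    )
```

With `alpha = 0` this reduces to the conventional point-estimate score; a larger `alpha` flattens the likelihood, so uncertain frames influence decoding less. In the cluster-based extension, `alpha` would be looked up per Gaussian cluster rather than shared globally.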

Original language: English
Pages (from-to): 350-368
Number of pages: 19
Journal: Computer Speech and Language
Volume: 27
Issue number: 1
DOI: 10.1016/j.csl.2012.07.001
Publication status: Published - Jan 2013
Externally published: Yes


Keywords

  • Model adaptation
  • Robust speech recognition
  • Speech enhancement
  • Variance compensation

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Theoretical Computer Science

Cite this

Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer. / Delcroix, Marc; Watanabe, Shinji; Nakatani, Tomohiro; Nakamura, Atsushi.

In: Computer Speech and Language, Vol. 27, No. 1, 01.2013, p. 350-368.

Research output: Contribution to journal › Article

@article{4e5ee9f6ae4c4694bcaeca726c949a15,
title = "Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer",
abstract = "A conventional approach to noise robust speech recognition consists of employing a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit recognition performance improvement. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework relies on recent proposals for increasing robustness by replacing the point estimate of the enhanced features with a distribution with a dynamic (i.e. time varying) feature variance. We have recently proposed a model for the dynamic feature variance consisting of a dynamic feature variance root obtained from the pre-processor, which is multiplied by a weight representing the pre-processor uncertainty, and that uses adaptation data to optimize the pre-processor uncertainty weight. The formulation of the method is general and could be used with any speech enhancement pre-processor. However, we observed that in case of noise reduction based on spectral subtraction or related approaches, adaptation could fail because the proposed model is weak at representing well the actual dynamic feature variance. The dynamic feature variance changes according to the level of speech sound, which varies with the HMM states. Therefore, we propose improving the model by introducing HMM state dependency. We achieve this by using a cluster-based representation, i.e. the Gaussians of the acoustic model are grouped into clusters and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks prove the generality of the proposed integration scheme and show that the proposed extension improves the performance with various speech enhancement pre-processors.",
keywords = "Model adaptation, Robust speech recognition, Speech enhancement, Variance compensation",
author = "Marc Delcroix and Shinji Watanabe and Tomohiro Nakatani and Atsushi Nakamura",
year = "2013",
month = "1",
doi = "10.1016/j.csl.2012.07.001",
language = "English",
volume = "27",
pages = "350--368",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "1",
}

TY - JOUR

T1 - Cluster-based dynamic variance adaptation for interconnecting speech enhancement pre-processor and speech recognizer

AU - Delcroix, Marc

AU - Watanabe, Shinji

AU - Nakatani, Tomohiro

AU - Nakamura, Atsushi

PY - 2013/1

Y1 - 2013/1

N2 - A conventional approach to noise robust speech recognition consists of employing a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit recognition performance improvement. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework relies on recent proposals for increasing robustness by replacing the point estimate of the enhanced features with a distribution with a dynamic (i.e. time varying) feature variance. We have recently proposed a model for the dynamic feature variance consisting of a dynamic feature variance root obtained from the pre-processor, which is multiplied by a weight representing the pre-processor uncertainty, and that uses adaptation data to optimize the pre-processor uncertainty weight. The formulation of the method is general and could be used with any speech enhancement pre-processor. However, we observed that in case of noise reduction based on spectral subtraction or related approaches, adaptation could fail because the proposed model is weak at representing well the actual dynamic feature variance. The dynamic feature variance changes according to the level of speech sound, which varies with the HMM states. Therefore, we propose improving the model by introducing HMM state dependency. We achieve this by using a cluster-based representation, i.e. the Gaussians of the acoustic model are grouped into clusters and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks prove the generality of the proposed integration scheme and show that the proposed extension improves the performance with various speech enhancement pre-processors.

AB - A conventional approach to noise robust speech recognition consists of employing a speech enhancement pre-processor prior to recognition. However, such a pre-processor usually introduces artifacts that limit recognition performance improvement. In this paper we discuss a framework for improving the interconnection between speech enhancement pre-processors and a recognizer. The framework relies on recent proposals for increasing robustness by replacing the point estimate of the enhanced features with a distribution with a dynamic (i.e. time varying) feature variance. We have recently proposed a model for the dynamic feature variance consisting of a dynamic feature variance root obtained from the pre-processor, which is multiplied by a weight representing the pre-processor uncertainty, and that uses adaptation data to optimize the pre-processor uncertainty weight. The formulation of the method is general and could be used with any speech enhancement pre-processor. However, we observed that in case of noise reduction based on spectral subtraction or related approaches, adaptation could fail because the proposed model is weak at representing well the actual dynamic feature variance. The dynamic feature variance changes according to the level of speech sound, which varies with the HMM states. Therefore, we propose improving the model by introducing HMM state dependency. We achieve this by using a cluster-based representation, i.e. the Gaussians of the acoustic model are grouped into clusters and a different pre-processor uncertainty weight is associated with each cluster. Experiments with various pre-processors and recognition tasks prove the generality of the proposed integration scheme and show that the proposed extension improves the performance with various speech enhancement pre-processors.

KW - Model adaptation

KW - Robust speech recognition

KW - Speech enhancement

KW - Variance compensation

UR - http://www.scopus.com/inward/record.url?scp=84867336669&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84867336669&partnerID=8YFLogxK

U2 - 10.1016/j.csl.2012.07.001

DO - 10.1016/j.csl.2012.07.001

M3 - Article

AN - SCOPUS:84867336669

VL - 27

SP - 350

EP - 368

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

IS - 1

ER -