Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features

Yuuki Tachioka, Shinji Watanabe

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Speech enhancement is an important front-end technique for improving automatic speech recognition (ASR) in noisy environments. However, erroneous noise suppression during speech enhancement often introduces additional distortions into the speech signal, which degrades ASR performance. To compensate for these distortions, ASR needs to consider the uncertainty of the enhanced features, which can be achieved by taking the expectation of the ASR decoding/training process with respect to a probabilistic representation of the input features. However, unlike with a Gaussian mixture model, it is difficult for a deep neural network (DNN) to compute this expectation analytically because of its nonlinear activations. This paper proposes efficient Monte Carlo approximation methods for this expectation calculation to realize DNN-based uncertainty decoding and training. It first models the uncertainty of the input features by linear interpolation between the original and enhanced feature vectors with a random interpolation coefficient. By sampling input features from this stochastic process during training, the DNN learns to generalize over the variations of the enhanced features. Our method also samples input features during decoding and integrates the multiple recognition hypotheses obtained from the samples. Experiments on reverberated noisy speech recognition tasks (the second CHiME and REVERB challenges) show the effectiveness of our techniques.
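The stochastic representation described in the abstract can be sketched as follows. This is an illustrative sketch only: the function names, the uniform distribution for the interpolation coefficient, and posterior averaging (in place of the paper's hypothesis-level integration) are all assumptions, not the paper's exact formulation.

```python
import numpy as np

def sample_uncertain_features(noisy, enhanced, rng):
    """Draw one stochastic input by linearly interpolating between the
    original (noisy) and enhanced feature vectors with a random
    coefficient.  alpha ~ U(0, 1) is an assumption; the abstract only
    states that the interpolation coefficient is random."""
    alpha = rng.uniform(0.0, 1.0)
    return noisy + alpha * (enhanced - noisy)

def decode_with_uncertainty(noisy, enhanced, posterior_fn, rng, n_samples=10):
    """Monte Carlo approximation of the expected DNN output: evaluate the
    network (posterior_fn) on several sampled inputs and average the
    results.  The paper integrates multiple recognition hypotheses; plain
    posterior averaging is used here for illustration."""
    posteriors = [
        posterior_fn(sample_uncertain_features(noisy, enhanced, rng))
        for _ in range(n_samples)
    ]
    return np.mean(posteriors, axis=0)
```

In training, `sample_uncertain_features` would be applied to each minibatch so the DNN sees perturbed versions of the enhanced features; in decoding, `decode_with_uncertainty` averages over several such samples.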

Original language: English
Pages (from-to): 3541-3545
Number of pages: 5
Journal: Unknown Journal
Volume: 2015-January
Publication status: Published - 2015
Externally published: Yes

Keywords

  • Deep neural networks
  • Noise-robust speech recognition
  • Stochastic process of enhanced features
  • Uncertainty training/decoding

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features. / Tachioka, Yuuki; Watanabe, Shinji.

In: Unknown Journal, Vol. 2015-January, 2015, p. 3541-3545.

Research output: Contribution to journal › Article

@article{22bc2583fd6c41b7b5c2e45529c13773,
title = "Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features",
abstract = "Speech enhancement is an important front-end technique for improving automatic speech recognition (ASR) in noisy environments. However, erroneous noise suppression during speech enhancement often introduces additional distortions into the speech signal, which degrades ASR performance. To compensate for these distortions, ASR needs to consider the uncertainty of the enhanced features, which can be achieved by taking the expectation of the ASR decoding/training process with respect to a probabilistic representation of the input features. However, unlike with a Gaussian mixture model, it is difficult for a deep neural network (DNN) to compute this expectation analytically because of its nonlinear activations. This paper proposes efficient Monte Carlo approximation methods for this expectation calculation to realize DNN-based uncertainty decoding and training. It first models the uncertainty of the input features by linear interpolation between the original and enhanced feature vectors with a random interpolation coefficient. By sampling input features from this stochastic process during training, the DNN learns to generalize over the variations of the enhanced features. Our method also samples input features during decoding and integrates the multiple recognition hypotheses obtained from the samples. Experiments on reverberated noisy speech recognition tasks (the second CHiME and REVERB challenges) show the effectiveness of our techniques.",
keywords = "Deep neural networks, Noise-robust speech recognition, Stochastic process of enhanced features, Uncertainty training/decoding",
author = "Yuuki Tachioka and Shinji Watanabe",
year = "2015",
language = "English",
volume = "2015-January",
pages = "3541--3545",
journal = "Unknown Journal",

}

TY - JOUR

T1 - Uncertainty training and decoding methods of deep neural networks based on stochastic representation of enhanced features

AU - Tachioka, Yuuki

AU - Watanabe, Shinji

PY - 2015

Y1 - 2015

N2 - Speech enhancement is an important front-end technique for improving automatic speech recognition (ASR) in noisy environments. However, erroneous noise suppression during speech enhancement often introduces additional distortions into the speech signal, which degrades ASR performance. To compensate for these distortions, ASR needs to consider the uncertainty of the enhanced features, which can be achieved by taking the expectation of the ASR decoding/training process with respect to a probabilistic representation of the input features. However, unlike with a Gaussian mixture model, it is difficult for a deep neural network (DNN) to compute this expectation analytically because of its nonlinear activations. This paper proposes efficient Monte Carlo approximation methods for this expectation calculation to realize DNN-based uncertainty decoding and training. It first models the uncertainty of the input features by linear interpolation between the original and enhanced feature vectors with a random interpolation coefficient. By sampling input features from this stochastic process during training, the DNN learns to generalize over the variations of the enhanced features. Our method also samples input features during decoding and integrates the multiple recognition hypotheses obtained from the samples. Experiments on reverberated noisy speech recognition tasks (the second CHiME and REVERB challenges) show the effectiveness of our techniques.

AB - Speech enhancement is an important front-end technique for improving automatic speech recognition (ASR) in noisy environments. However, erroneous noise suppression during speech enhancement often introduces additional distortions into the speech signal, which degrades ASR performance. To compensate for these distortions, ASR needs to consider the uncertainty of the enhanced features, which can be achieved by taking the expectation of the ASR decoding/training process with respect to a probabilistic representation of the input features. However, unlike with a Gaussian mixture model, it is difficult for a deep neural network (DNN) to compute this expectation analytically because of its nonlinear activations. This paper proposes efficient Monte Carlo approximation methods for this expectation calculation to realize DNN-based uncertainty decoding and training. It first models the uncertainty of the input features by linear interpolation between the original and enhanced feature vectors with a random interpolation coefficient. By sampling input features from this stochastic process during training, the DNN learns to generalize over the variations of the enhanced features. Our method also samples input features during decoding and integrates the multiple recognition hypotheses obtained from the samples. Experiments on reverberated noisy speech recognition tasks (the second CHiME and REVERB challenges) show the effectiveness of our techniques.

KW - Deep neural networks

KW - Noise-robust speech recognition

KW - Stochastic process of enhanced features

KW - Uncertainty training/decoding

UR - http://www.scopus.com/inward/record.url?scp=84959129666&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959129666&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:84959129666

VL - 2015-January

SP - 3541

EP - 3545

JO - Unknown Journal

JF - Unknown Journal

ER -