Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Atsunori Ogawa, Takaaki Hori, Shinji Watanabe, Masakiyo Fujimoto, Takuya Yoshioka, Takanobu Oba, Yotaro Kubo, Mehrez Souden, Seong Jun Hahm, Atsushi Nakamura

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

Research on noise-robust speech recognition has mainly focused on relatively stationary noise, which may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise, that is, spatial (directional), spectral, and temporal information. This is realized with a model-based speech enhancement pre-processor consisting of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.

Original language: English
Article number: 547
Pages (from-to): 851-873
Number of pages: 23
Journal: Computer Speech and Language
Volume: 27
Issue number: 3
DOIs: https://doi.org/10.1016/j.csl.2012.07.006
Publication status: Published - 2013 May 3
Externally published: Yes

Fingerprint

Speech enhancement
Speech recognition
Acoustic waves
Modeling
Acoustic Model
Acoustics
Robust Speech Recognition
Human Performance
Adaptive Dynamics
Speech
Sound
Linear regression
Acoustic noise
Maximum likelihood
Time-varying
Enhancement
Model-based

Keywords

  • Dynamic variance adaptation
  • Example-based speech enhancement
  • Model adaptation
  • Model-based speech enhancement
  • Robust ASR

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Theoretical Computer Science

Cite this

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds. / Delcroix, Marc; Kinoshita, Keisuke; Nakatani, Tomohiro; Araki, Shoko; Ogawa, Atsunori; Hori, Takaaki; Watanabe, Shinji; Fujimoto, Masakiyo; Yoshioka, Takuya; Oba, Takanobu; Kubo, Yotaro; Souden, Mehrez; Hahm, Seong Jun; Nakamura, Atsushi.

In: Computer Speech and Language, Vol. 27, No. 3, 547, 03.05.2013, p. 851-873.

Delcroix, M, Kinoshita, K, Nakatani, T, Araki, S, Ogawa, A, Hori, T, Watanabe, S, Fujimoto, M, Yoshioka, T, Oba, T, Kubo, Y, Souden, M, Hahm, SJ & Nakamura, A 2013, 'Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds', Computer Speech and Language, vol. 27, no. 3, 547, pp. 851-873. https://doi.org/10.1016/j.csl.2012.07.006
Delcroix, Marc; Kinoshita, Keisuke; Nakatani, Tomohiro; Araki, Shoko; Ogawa, Atsunori; Hori, Takaaki; Watanabe, Shinji; Fujimoto, Masakiyo; Yoshioka, Takuya; Oba, Takanobu; Kubo, Yotaro; Souden, Mehrez; Hahm, Seong Jun; Nakamura, Atsushi. / Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds. In: Computer Speech and Language. 2013; Vol. 27, No. 3. pp. 851-873.
@article{80d417d5cb914a9db78c80379ab6f701,
title = "Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds",
abstract = "Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements, a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.",
keywords = "Dynamic variance adaptation, Example-based speech enhancement, Model adaptation, Model-based speech enhancement, Robust ASR",
author = "Marc Delcroix and Keisuke Kinoshita and Tomohiro Nakatani and Shoko Araki and Atsunori Ogawa and Takaaki Hori and Shinji Watanabe and Masakiyo Fujimoto and Takuya Yoshioka and Takanobu Oba and Yotaro Kubo and Mehrez Souden and Hahm, {Seong Jun} and Atsushi Nakamura",
year = "2013",
month = "5",
day = "3",
doi = "10.1016/j.csl.2012.07.006",
language = "English",
volume = "27",
pages = "851--873",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "3",

}

TY - JOUR

T1 - Speech recognition in living rooms

T2 - Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds

AU - Delcroix, Marc

AU - Kinoshita, Keisuke

AU - Nakatani, Tomohiro

AU - Araki, Shoko

AU - Ogawa, Atsunori

AU - Hori, Takaaki

AU - Watanabe, Shinji

AU - Fujimoto, Masakiyo

AU - Yoshioka, Takuya

AU - Oba, Takanobu

AU - Kubo, Yotaro

AU - Souden, Mehrez

AU - Hahm, Seong Jun

AU - Nakamura, Atsushi

PY - 2013/5/3

Y1 - 2013/5/3

AB - Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements, a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.

KW - Dynamic variance adaptation

KW - Example-based speech enhancement

KW - Model adaptation

KW - Model-based speech enhancement

KW - Robust ASR

UR - http://www.scopus.com/inward/record.url?scp=84887395149&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84887395149&partnerID=8YFLogxK

U2 - 10.1016/j.csl.2012.07.006

DO - 10.1016/j.csl.2012.07.006

M3 - Article

AN - SCOPUS:84887395149

VL - 27

SP - 851

EP - 873

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

IS - 3

M1 - 547

ER -