CENSREC-1-AV: An audio-visual corpus for noisy bimodal speech recognition

Satoshi Tamura*, Chiyomi Miyajima, Norihide Kitaoka, Takeshi Yamada, Satoru Tsuge, Tetsuya Takiguchi, Kazumasa Yamamoto, Takanobu Nishiura, Masato Nakayama, Yuki Denda, Masakiyo Fujimoto, Shigeki Matsuda, Tetsuji Ogawa, Shingo Kuroiwa, Kazuya Takeda, Satoshi Nakamura

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

18 Citations (Scopus)

Abstract

This paper introduces CENSREC-1-AV, an audio-visual speech corpus for noisy speech recognition. CENSREC-1-AV consists of an audio-visual database and a baseline bimodal speech recognition system that uses both audio and visual information. The database contains 3,234 training utterances from 42 speakers and 1,963 test utterances from 51 speakers. Each utterance consists of a speech signal together with color and infrared images of the region around the speaker's mouth. The baseline system is provided so that users can evaluate their own bimodal speech recognizers against it. In the baseline system, multi-stream HMMs are trained on the training data. A preliminary experiment evaluated the baseline on acoustically noisy test data. The results show a relative error reduction of roughly 35% in low-SNR conditions compared with an audio-only ASR method.
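As a reading aid, the two quantities the abstract relies on can be sketched in a few lines of Python. This is a minimal illustration and not the CENSREC-1-AV baseline itself: the function names are hypothetical, the stream weights are assumed to sum to 1 (a common convention for multi-stream HMMs, not confirmed by the abstract), and the error rates in the usage note are invented for the arithmetic only.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline error removed by the new system.

    Both arguments are error rates in the same unit (for example, percent).
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


def multistream_log_likelihood(
    log_b_audio: float, log_b_visual: float, audio_weight: float
) -> float:
    """Combine per-stream state log-likelihoods with stream weights.

    Assumption: the visual weight is 1 - audio_weight, so the weights sum to 1.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


if __name__ == "__main__":
    # Invented numbers, not results from the paper.
    print(relative_error_reduction(40.0, 26.0))
    print(multistream_log_likelihood(-10.0, -20.0, 0.7))
```

With the invented error rates above, an audio-only error of 40% dropping to 26% corresponds to a relative reduction of about 0.35, which is how a figure such as "roughly 35%" is usually computed. Lowering `audio_weight` shifts the combined score toward the visual stream, which is the usual motivation for stream weighting when the audio is noisy.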

Original language: English
Publication status: Published - 2010
Event: 2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010 - Hakone, Japan
Duration: 2010 Sep 30 to 2010 Oct 3

Conference

Conference: 2010 International Conference on Auditory-Visual Speech Processing, AVSP 2010
Country/Territory: Japan
City: Hakone
Period: 10/9/30 to 10/10/3

Keywords

  • audio-visual database
  • bimodal speech recognition
  • eigenface
  • noise robustness
  • optical flow

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing
  • Otorhinolaryngology
