Automatic speech recognition improved by two-layered audio-visual integration for robot audition

Takami Yoshida, Kazuhiro Nakadai, Hiroshi G. Okuno

Research output: Conference contribution

31 Citations (Scopus)

Abstract

Robot audition requires robust, high-performance automatic speech recognition (ASR), because people usually communicate with each other by speaking. This paper presents a two-layered audio-visual integration that makes ASR more robust against the speaker's distance, interfering talkers, and environmental noise. It consists of Audio-Visual Voice Activity Detection (AV-VAD) and Audio-Visual Speech Recognition (AVSR). Because the performance of VAD strongly affects that of ASR, the AV-VAD layer integrates several AV features using a Bayesian network to robustly detect voice activity, that is, the duration of the speaker's utterance. The AVSR layer integrates the reliability estimates of acoustic features and of visual features using a missing-feature theory method. Audio features are weighted more heavily in a clean acoustic environment, and visual features more heavily in a noisy one. This integration in the AVSR layer can cope with dynamically changing acoustic or visual environments. The proposed AV-integrated ASR is implemented on HARK, our open-source robot audition software, with an 8-channel microphone array. Empirical results show that our system improves ASR results by 9.9 and 16.7 points with and without microphone array processing, respectively, and also improves robustness against several auditory and visual noise conditions.
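
The weighting behaviour described in the abstract (audio trusted more in clean conditions, vision trusted more in noise) can be illustrated with a generic stream-weighting sketch. This is not the paper's missing-feature theory method, which the abstract does not detail. It is a simpler, commonly used weighted sum of per-stream log-likelihoods. The function names, the linear SNR-to-weight ramp, and its 0 dB and 20 dB endpoints are all assumptions made for illustration.

```python
from typing import Dict


def audio_weight_from_snr(snr_db: float,
                          low_db: float = 0.0,
                          high_db: float = 20.0) -> float:
    """Map an SNR estimate to an audio stream weight in [0, 1].

    Hypothetical linear ramp: at or below low_db the audio stream is
    ignored, at or above high_db only the audio stream is used.
    """
    if high_db <= low_db:
        raise ValueError("high_db must be greater than low_db")
    ratio = (snr_db - low_db) / (high_db - low_db)
    return min(1.0, max(0.0, ratio))


def fuse_scores(audio_ll: Dict[str, float],
                visual_ll: Dict[str, float],
                audio_weight: float) -> Dict[str, float]:
    """Combine per-candidate log-likelihoods with complementary weights."""
    visual_weight = 1.0 - audio_weight
    return {
        label: audio_weight * audio_ll[label] + visual_weight * visual_ll[label]
        for label in audio_ll
    }


def best_candidate(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest fused score."""
    return max(scores, key=scores.get)


# Illustrative (made-up) log-likelihoods for two candidate words.
audio_ll = {"yes": -1.0, "no": -5.0}    # audio stream favours "yes"
visual_ll = {"yes": -4.0, "no": -2.0}   # visual stream favours "no"

clean = best_candidate(
    fuse_scores(audio_ll, visual_ll, audio_weight_from_snr(30.0)))
noisy = best_candidate(
    fuse_scores(audio_ll, visual_ll, audio_weight_from_snr(-5.0)))
```

With these made-up scores, the fused decision follows the audio stream at high SNR and the visual stream at low SNR, which is the qualitative behaviour the abstract attributes to the AVSR layer.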

Original language: English
Title of host publication: 9th IEEE-RAS International Conference on Humanoid Robots, HUMANOIDS09
Pages: 604-609
Number of pages: 6
DOI
Publication status: Published - 2009
Externally published: Yes
Event: 9th IEEE-RAS International Conference on Humanoid Robots, HUMANOIDS09 - Paris
Duration: 2009-12-07 to 2009-12-10

Other

Other: 9th IEEE-RAS International Conference on Humanoid Robots, HUMANOIDS09
City: Paris
Period: 2009-12-07 to 2009-12-10

ASJC Scopus subject areas

  • Computer Science(all)
