Voice Activity Detection based on Fusion of Audio and Visual Information

Shin'ichi Takeuchi, Takashi Hashiba, Satoshi Tamura, Satoru Hayamizu

Research output: Paper (peer-reviewed)

25 Citations (Scopus)

Abstract

In this paper, we propose a multi-modal voice activity detection (VAD) system that uses audio and visual information. Audio-only VAD systems are typically not robust to acoustic noise. Incorporating visual information, for example features extracted from mouth images, can improve robustness, since the visual information is not affected by acoustic noise. In multi-modal speech signal processing, there are two methods for fusing the audio and visual information: feature fusion, which concatenates the audio and visual features into a single feature vector, and decision fusion, which employs separate audio-only and visual-only classifiers and then fuses their unimodal decisions. We investigate the effectiveness of these methods and also compare model-based and model-free approaches to VAD. Experimental results show that feature fusion is generally more effective, and that decision fusion performs better with model-free methods.
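The two fusion strategies described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature values, the 0.5 threshold, and the equal decision weights are hypothetical placeholders, and real systems would use trained classifiers (e.g. GMMs) rather than simple thresholds.

```python
import numpy as np

def feature_fusion(audio_feat, visual_feat):
    """Feature fusion: concatenate audio and visual feature vectors
    into one joint vector, to be fed to a single classifier."""
    return np.concatenate([audio_feat, visual_feat])

def decision_fusion(audio_score, visual_score, w=0.5, threshold=0.5):
    """Decision fusion: each modality produces its own speech score;
    the unimodal scores are combined (here, a weighted sum) and the
    fused score is thresholded to give the final speech/non-speech
    decision. Weight w and threshold are illustrative values."""
    fused = w * audio_score + (1.0 - w) * visual_score
    return fused >= threshold

# Toy per-frame features (hypothetical): MFCC-like audio features
# and mouth-region visual features for one frame.
audio_feat = np.array([0.8, 0.1, 0.3])
visual_feat = np.array([0.6, 0.2])

joint = feature_fusion(audio_feat, visual_feat)  # 5-dimensional joint vector

# Per-modality speech scores (hypothetical classifier outputs in [0, 1]).
is_speech = decision_fusion(audio_score=0.9, visual_score=0.7)
```

In feature fusion a single model must learn the joint audio-visual distribution, while decision fusion keeps the modalities independent until the final step, which makes it easy to reweight a modality (e.g. down-weighting audio in heavy acoustic noise).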

Original language: English
Pages: 151-154
Number of pages: 4
Publication status: Published - 2009
Externally published: Yes
Event: 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009 - Norwich, United Kingdom
Duration: 10 Sep 2009 - 13 Sep 2009

Conference

Conference: 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009
Country/Territory: United Kingdom
City: Norwich
Period: 10/9/09 - 13/9/09

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing
  • Otorhinolaryngology
