Voice Activity Detection based on Fusion of Audio and Visual Information

Shin'ichi Takeuchi, Takashi Hashiba, Satoshi Tamura, Satoru Hayamizu

Research output: Contribution to conference › Paper › peer-review

25 Citations (Scopus)

Abstract

In this paper, we propose a multi-modal voice activity detection (VAD) system that uses audio and visual information. Audio-only VAD systems are typically not robust to acoustic noise. Incorporating visual information, for example features extracted from mouth images, can improve robustness, since the visual information is not affected by acoustic noise. In multi-modal (speech) signal processing, there are two methods for fusing the audio and visual information: feature fusion, which concatenates the audio and visual features, and decision fusion, which employs audio-only and visual-only classifiers and then fuses the unimodal decisions. We investigate the effectiveness of these two methods and also compare model-based and model-free approaches to VAD. Experimental results show that feature fusion is generally more effective, and that decision fusion performs better when model-free methods are used.
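The two fusion strategies contrasted in the abstract can be sketched as follows. This is a toy illustration, not the paper's actual system: the per-frame feature lists, the equal default weighting, and the 0.5 decision threshold are all assumptions made for the example.

```python
def feature_fusion(audio_feats, visual_feats):
    """Feature fusion: concatenate the audio and visual feature
    vectors for each frame, producing one joint vector per frame
    that a single classifier would then consume."""
    return [a + v for a, v in zip(audio_feats, visual_feats)]

def decision_fusion(audio_score, visual_score, w_audio=0.5):
    """Decision fusion: combine the speech scores of separate
    audio-only and visual-only classifiers (here, a weighted sum
    with an assumed threshold of 0.5) into one speech/non-speech
    decision per frame."""
    score = w_audio * audio_score + (1.0 - w_audio) * visual_score
    return score >= 0.5  # True -> frame labeled as speech
```

For example, `feature_fusion([[1.0, 2.0]], [[3.0]])` yields a single joint vector `[[1.0, 2.0, 3.0]]`, whereas `decision_fusion(0.9, 0.2)` lets a confident audio classifier outvote an uncertain visual one.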

Original language: English
Pages: 151-154
Number of pages: 4
Publication status: Published - 2009
Event: 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009 - Norwich, United Kingdom
Duration: 2009 Sep 10 - 2009 Sep 13

Conference

Conference: 2009 International Conference on Auditory-Visual Speech Processing, AVSP 2009
Country/Territory: United Kingdom
City: Norwich
Period: 09/9/10 - 09/9/13

Keywords

  • AVVAD
  • multi-modal
  • voice activity detection

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing
  • Otorhinolaryngology
