An improvement in audio-visual voice activity detection for automatic speech recognition

Takami Yoshida*, Kazuhiro Nakadai, Hiroshi G. Okuno

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Noise-robust Automatic Speech Recognition (ASR) is essential for robots that are expected to communicate with humans in daily environments. In such environments, Voice Activity Detection (VAD) strongly affects ASR performance because there is a great deal of both acoustic and visual noise. In this paper, we improve Audio-Visual VAD for our two-layered audio-visual integration framework for ASR by using hangover processing based on erosion and dilation. We implemented the proposed method in our audio-visual speech recognition system for a robot. Empirical results show the effectiveness of the proposed method in terms of VAD performance.
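The abstract's hangover processing can be illustrated with morphological erosion and dilation applied to a binary frame-level VAD decision sequence. The sketch below is an assumption-laden illustration, not the authors' exact implementation: the window width, the closing-then-opening order, and the helper names are all choices made here for clarity.

```python
# Illustrative sketch (not the paper's exact implementation) of hangover
# processing for VAD via 1-D binary morphological erosion and dilation.
# Input: a list of per-frame binary decisions (1 = speech, 0 = non-speech).

def dilate(frames, width):
    """A frame becomes 1 if ANY frame within +/- `width` is 1 (fills short gaps)."""
    n = len(frames)
    return [1 if any(frames[max(0, i - width):min(n, i + width + 1)]) else 0
            for i in range(n)]

def erode(frames, width):
    """A frame stays 1 only if ALL frames within +/- `width` are 1 (removes blips)."""
    n = len(frames)
    return [1 if all(frames[max(0, i - width):min(n, i + width + 1)]) else 0
            for i in range(n)]

def hangover(frames, width=1):
    """Morphological closing (dilate then erode) bridges brief non-speech
    gaps inside an utterance; the subsequent opening (erode then dilate)
    discards isolated short false detections."""
    closed = erode(dilate(frames, width), width)
    return dilate(erode(closed, width), width)

# Example: a short gap inside a speech segment and one isolated blip.
raw = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
smoothed = hangover(raw, width=1)
# The gap at index 3 is filled; the isolated detection at index 11 is removed.
```

In practice the window width would be tuned in frames to match the expected pause length between words, trading off missed short pauses against fragmented utterances.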

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 51-61
Number of pages: 11
Volume: 6096 LNAI
Edition: PART 1
DOIs
Publication status: Published - 2010
Externally published: Yes
Event: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligence Systems, IEA/AIE 2010 - Cordoba
Duration: 2010 Jun 1 – 2010 Jun 4

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6096 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligence Systems, IEA/AIE 2010
City: Cordoba
Period: 10/6/1 – 10/6/4

Keywords

  • Audio-Visual integration
  • Speech Recognition
  • Voice Activity Detection

ASJC Scopus subject areas

  • Computer Science(all)
  • Theoretical Computer Science
