Real-time speaker localization and speech separation by audio-visual integration

Kazuhiro Nakadai, Ken Ichi Hidai, Hiroshi G. Okuno, Hiroaki Kitano

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Citations (Scopus)

Abstract

Robot audition in the real world must cope with motor and other noise caused by the robot's own movements, in addition to environmental noise and reverberation. This paper reports how auditory processing is improved by audio-visual integration with active movements. The key idea resides in hierarchical integration of auditory and visual streams to disambiguate auditory or visual processing. The system runs in real time using distributed processing on four PCs connected by Gigabit Ethernet. The system, implemented in an upper-torso humanoid, tracks multiple talkers and extracts speech from a mixture of sounds. The performance of epipolar-geometry-based sound source localization and of sound source separation by active and adaptive direction-pass filtering is also reported.
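The kind of sound source localization the abstract alludes to can be illustrated with a minimal sketch. Note that this is not the paper's epipolar-geometry method; it is a generic interaural-time-difference (ITD) estimate via cross-correlation, under assumed parameters (a hypothetical `estimate_azimuth` helper, 0.18 m microphone spacing, far-field source):

```python
import numpy as np

def estimate_azimuth(left, right, fs, mic_distance=0.18, speed_of_sound=343.0):
    # Cross-correlate the two channels to find the interaural time
    # difference (ITD), i.e. the lag at which they align best.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # lag in samples
    itd = lag / fs                            # lag in seconds
    # Far-field approximation: sin(theta) = itd * c / d.
    # Clip so arcsin stays in its domain despite noisy peaks.
    s = np.clip(itd * speed_of_sound / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Synthetic check: delay one channel of a noise burst by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
sig = rng.standard_normal(2048)
left = sig
right = np.concatenate([np.zeros(5), sig[:-5]])  # right channel lags by 5 samples
az = estimate_azimuth(left, right, fs)
```

A direction-pass filter of the sort the paper describes would then pass only the time-frequency components whose estimated direction falls near this azimuth; the sketch above covers only the direction-estimation step.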

Original language: English
Title of host publication: Proceedings - IEEE International Conference on Robotics and Automation
Pages: 1043-1049
Number of pages: 7
Volume: 1
Publication status: Published - 2002
Externally published: Yes
Event: 2002 IEEE International Conference on Robotics and Automation - Washington, DC, United States
Duration: 2002 May 11 - 2002 May 15

Other

Other: 2002 IEEE International Conference on Robotics and Automation
Country: United States
City: Washington, DC
Period: 02/5/11 - 02/5/15

Keywords

  • Audio-visual integration
  • Multiple speaker tracking
  • Robot audition
  • Sound source localization
  • Sound source separation

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering

Cite this

Nakadai, K., Hidai, K. I., Okuno, H. G., & Kitano, H. (2002). Real-time speaker localization and speech separation by audio-visual integration. In Proceedings - IEEE International Conference on Robotics and Automation (Vol. 1, pp. 1043-1049).
