Real-time auditory and visual multiple-object tracking for humanoids

Kazuhiro Nakadai, Ken Ichi Hidai, Hiroshi Mizoguchi, Hiroshi G. Okuno, Hiroaki Kitano

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

87 Citations (Scopus)

Abstract

This paper presents real-time auditory and visual tracking of multiple objects by a humanoid in real-world environments. Real-time processing is crucial for sensorimotor tasks in tracking, and multiple-object tracking is crucial for real-world applications. Tracking multiple sound sources requires perceiving a mixture of sounds and cancelling the motor noise caused by body movements; however, real-time processing of this kind has not previously been reported. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound-source direction are extracted from 48 kHz sampled sound by an active audition system with motor-noise cancellation. Visual streams with face ID and 3D position are extracted from a single camera by combining skin-color extraction, correlation-based matching, and multiple-scale image generation. Auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing run asynchronously on separate PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200 ms, which is imposed by visual tracking and network latency.
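The abstract's stream-association step — pairing auditory streams (known only by sound-source direction) with visual streams (face ID plus 3D position) by comparing spatial locations — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the record layout, the `azimuth_of`/`associate` helpers, and the 10-degree matching threshold are all assumptions.

```python
import math

# Hypothetical stream records; field names are illustrative, not from the paper.
# An auditory stream carries only a sound-source azimuth (degrees);
# a visual stream carries a face ID and a 3D position in camera coordinates.

def azimuth_of(position):
    """Horizontal direction (degrees) of a 3D point (x, y, z), z pointing forward."""
    x, _, z = position
    return math.degrees(math.atan2(x, z))

def associate(auditory_azimuths, visual_streams, threshold_deg=10.0):
    """Pair each auditory azimuth with the nearest visual stream,
    provided their directions differ by less than threshold_deg."""
    pairs = []
    for az in auditory_azimuths:
        best = min(
            visual_streams,
            key=lambda v: abs(az - azimuth_of(v["position"])),
            default=None,
        )
        if best and abs(az - azimuth_of(best["position"])) < threshold_deg:
            pairs.append((az, best["face_id"]))
    return pairs

visual = [
    {"face_id": "A", "position": (0.5, 0.0, 2.0)},   # roughly 14 deg to the right
    {"face_id": "B", "position": (-1.0, 0.0, 2.0)},  # roughly -27 deg to the left
]
print(associate([15.0, -90.0], visual))  # only the 15-deg source finds a match
```

A nearest-direction match with a fixed angular threshold is the simplest plausible reading of "associated by comparing the spatial location"; the paper's actual association logic may weigh additional cues such as stream history.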

Original language: English
Title of host publication: IJCAI International Joint Conference on Artificial Intelligence
Pages: 1425-1432
Number of pages: 8
Publication status: Published - 2001
Externally published: Yes
Event: 17th International Joint Conference on Artificial Intelligence, IJCAI 2001 - Seattle, WA, United States
Duration: 2001 Aug 4 – 2001 Aug 10

Other

Other: 17th International Joint Conference on Artificial Intelligence, IJCAI 2001
Country: United States
City: Seattle, WA
Period: 01/8/4 – 01/8/10

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Nakadai, K., Hidai, K. I., Mizoguchi, H., Okuno, H. G., & Kitano, H. (2001). Real-time auditory and visual multiple-object tracking for humanoids. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1425-1432).

@inproceedings{52b20201523b4352868cd1e015f629b8,
title = "Real-time auditory and visual multiple-object tracking for humanoids",
abstract = "This paper presents real-time auditory and visual tracking of multiple objects by a humanoid in real-world environments. Real-time processing is crucial for sensorimotor tasks in tracking, and multiple-object tracking is crucial for real-world applications. Tracking multiple sound sources requires perceiving a mixture of sounds and cancelling the motor noise caused by body movements; however, real-time processing of this kind has not previously been reported. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound-source direction are extracted from 48 kHz sampled sound by an active audition system with motor-noise cancellation. Visual streams with face ID and 3D position are extracted from a single camera by combining skin-color extraction, correlation-based matching, and multiple-scale image generation. Auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing run asynchronously on separate PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200 ms, which is imposed by visual tracking and network latency.",
author = "Kazuhiro Nakadai and Hidai, {Ken Ichi} and Hiroshi Mizoguchi and Okuno, {Hiroshi G.} and Hiroaki Kitano",
year = "2001",
language = "English",
pages = "1425--1432",
booktitle = "IJCAI International Joint Conference on Artificial Intelligence",

}

TY - GEN

T1 - Real-time auditory and visual multiple-object tracking for humanoids

AU - Nakadai, Kazuhiro

AU - Hidai, Ken Ichi

AU - Mizoguchi, Hiroshi

AU - Okuno, Hiroshi G.

AU - Kitano, Hiroaki

PY - 2001

Y1 - 2001

AB - This paper presents real-time auditory and visual tracking of multiple objects by a humanoid in real-world environments. Real-time processing is crucial for sensorimotor tasks in tracking, and multiple-object tracking is crucial for real-world applications. Tracking multiple sound sources requires perceiving a mixture of sounds and cancelling the motor noise caused by body movements; however, real-time processing of this kind has not previously been reported. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound-source direction are extracted from 48 kHz sampled sound by an active audition system with motor-noise cancellation. Visual streams with face ID and 3D position are extracted from a single camera by combining skin-color extraction, correlation-based matching, and multiple-scale image generation. Auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing run asynchronously on separate PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200 ms, which is imposed by visual tracking and network latency.

UR - http://www.scopus.com/inward/record.url?scp=84880877816&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880877816&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84880877816

SP - 1425

EP - 1432

BT - IJCAI International Joint Conference on Artificial Intelligence

ER -