Human-robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking

Hiroshi G. Okuno, Kazuhiro Nakadai, Ken Ichi Hidai, Hiroshi Mizoguchi, Hiroaki Kitano

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

Sound is essential for enhancing visual experience and human-robot interaction, yet most research and development effort has gone into sound generation, speech synthesis and speech recognition. Auditory scene analysis has received little attention because real-time perception of a mixture of sounds is difficult. Recently, Nakadai et al. developed real-time auditory and visual multiple-talker tracking technology. In this paper, this technology is applied to human-robot verbal and non-verbal interaction, including a receptionist robot and a companion robot at a party. The system includes face identification, speech recognition, focus-of-attention control and a sensorimotor task for tracking multiple talkers. It is implemented on an upper-torso humanoid called SIG, and talker tracking is achieved by distributed processing on three nodes connected by a 100Base-TX network, with an overall tracking delay of 200 ms. Focus-of-attention is controlled by associating auditory and visual streams, using the sound source direction and talker position as cues. Once an association is established, the humanoid keeps its face turned towards the associated talker.
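
The audio-visual association step described in the abstract can be pictured with a minimal sketch (plain Python). This is illustrative only and is not the authors' implementation: the stream classes, the 10-degree threshold and the function names are assumptions made for exposition. Each auditory stream carries an estimated sound-source direction and each visual stream a detected talker position; a pair is associated when the two directions are close, and the head is then turned towards the associated talker.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class AuditoryStream:
    azimuth_deg: float          # estimated sound-source direction

@dataclass
class VisualStream:
    talker_id: str
    azimuth_deg: float          # direction of the detected face

def associate(auditory: List[AuditoryStream],
              visual: List[VisualStream],
              max_diff_deg: float = 10.0) -> List[Tuple[AuditoryStream, VisualStream]]:
    """Pair each auditory stream with the closest visual stream whose
    direction lies within max_diff_deg (threshold is an assumption)."""
    pairs = []
    for a in auditory:
        best, best_diff = None, max_diff_deg
        for v in visual:
            diff = abs(a.azimuth_deg - v.azimuth_deg)
            if diff <= best_diff:
                best, best_diff = v, diff
        if best is not None:
            pairs.append((a, best))
    return pairs

def focus_of_attention(pairs) -> Optional[float]:
    """Once an association exists, return the head-pan direction of the
    associated talker so the robot keeps facing that person."""
    if pairs:
        _, talker = pairs[0]
        return talker.azimuth_deg
    return None

# Example: a talker heard at 30 deg and seen at 28 deg is associated,
# and the head is commanded to pan to 28 deg.
heard = [AuditoryStream(azimuth_deg=30.0)]
seen = [VisualStream(talker_id="guest-1", azimuth_deg=28.0)]
print(focus_of_attention(associate(heard, seen)))   # -> 28.0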

Original language: English
Pages (from-to): 115-130
Number of pages: 16
Journal: Advanced Robotics
Volume: 17
Issue number: 2
DOI: 10.1163/156855303321165088
Publication status: Published - 2003
Externally published: Yes

Fingerprint

  • Acoustic waves
  • Robots
  • Speech recognition
  • Human robot interaction
  • Speech synthesis
  • Processing

Keywords

  • Active audition
  • Auditory tracking
  • Non-verbal interaction
  • Real-time tracking
  • Robot audition

ASJC Scopus subject areas

  • Control and Systems Engineering

Cite this

Human-robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking. / Okuno, Hiroshi G.; Nakadai, Kazuhiro; Hidai, Ken Ichi; Mizoguchi, Hiroshi; Kitano, Hiroaki.

In: Advanced Robotics, Vol. 17, No. 2, 2003, p. 115-130.

Research output: Contribution to journal › Article

Okuno, Hiroshi G.; Nakadai, Kazuhiro; Hidai, Ken Ichi; Mizoguchi, Hiroshi; Kitano, Hiroaki. / Human-robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking. In: Advanced Robotics. 2003; Vol. 17, No. 2, pp. 115-130.
@article{9eec2d8fea0b468aa9b5368244d6aeae,
title = "Human-robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking",
abstract = "Sound is essential for enhancing visual experience and human-robot interaction, yet most research and development effort has gone into sound generation, speech synthesis and speech recognition. Auditory scene analysis has received little attention because real-time perception of a mixture of sounds is difficult. Recently, Nakadai et al. developed real-time auditory and visual multiple-talker tracking technology. In this paper, this technology is applied to human-robot verbal and non-verbal interaction, including a receptionist robot and a companion robot at a party. The system includes face identification, speech recognition, focus-of-attention control and a sensorimotor task for tracking multiple talkers. It is implemented on an upper-torso humanoid called SIG, and talker tracking is achieved by distributed processing on three nodes connected by a 100Base-TX network, with an overall tracking delay of 200 ms. Focus-of-attention is controlled by associating auditory and visual streams, using the sound source direction and talker position as cues. Once an association is established, the humanoid keeps its face turned towards the associated talker.",
keywords = "Active audition, Auditory tracking, Non-verbal interaction, Real-time tracking, Robot audition",
author = "Okuno, {Hiroshi G.} and Kazuhiro Nakadai and Hidai, {Ken Ichi} and Hiroshi Mizoguchi and Hiroaki Kitano",
year = "2003",
doi = "10.1163/156855303321165088",
language = "English",
volume = "17",
pages = "115--130",
journal = "Advanced Robotics",
issn = "0169-1864",
publisher = "Taylor and Francis Ltd.",
number = "2",

}

TY - JOUR

T1 - Human-robot non-verbal interaction empowered by real-time auditory and visual multiple-talker tracking

AU - Okuno, Hiroshi G.

AU - Nakadai, Kazuhiro

AU - Hidai, Ken Ichi

AU - Mizoguchi, Hiroshi

AU - Kitano, Hiroaki

PY - 2003

Y1 - 2003

N2 - Sound is essential for enhancing visual experience and human-robot interaction, yet most research and development effort has gone into sound generation, speech synthesis and speech recognition. Auditory scene analysis has received little attention because real-time perception of a mixture of sounds is difficult. Recently, Nakadai et al. developed real-time auditory and visual multiple-talker tracking technology. In this paper, this technology is applied to human-robot verbal and non-verbal interaction, including a receptionist robot and a companion robot at a party. The system includes face identification, speech recognition, focus-of-attention control and a sensorimotor task for tracking multiple talkers. It is implemented on an upper-torso humanoid called SIG, and talker tracking is achieved by distributed processing on three nodes connected by a 100Base-TX network, with an overall tracking delay of 200 ms. Focus-of-attention is controlled by associating auditory and visual streams, using the sound source direction and talker position as cues. Once an association is established, the humanoid keeps its face turned towards the associated talker.

AB - Sound is essential for enhancing visual experience and human-robot interaction, yet most research and development effort has gone into sound generation, speech synthesis and speech recognition. Auditory scene analysis has received little attention because real-time perception of a mixture of sounds is difficult. Recently, Nakadai et al. developed real-time auditory and visual multiple-talker tracking technology. In this paper, this technology is applied to human-robot verbal and non-verbal interaction, including a receptionist robot and a companion robot at a party. The system includes face identification, speech recognition, focus-of-attention control and a sensorimotor task for tracking multiple talkers. It is implemented on an upper-torso humanoid called SIG, and talker tracking is achieved by distributed processing on three nodes connected by a 100Base-TX network, with an overall tracking delay of 200 ms. Focus-of-attention is controlled by associating auditory and visual streams, using the sound source direction and talker position as cues. Once an association is established, the humanoid keeps its face turned towards the associated talker.

KW - Active audition

KW - Auditory tracking

KW - Non-verbal interaction

KW - Real-time tracking

KW - Robot audition

UR - http://www.scopus.com/inward/record.url?scp=0037219742&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0037219742&partnerID=8YFLogxK

U2 - 10.1163/156855303321165088

DO - 10.1163/156855303321165088

M3 - Article

VL - 17

SP - 115

EP - 130

JO - Advanced Robotics

JF - Advanced Robotics

SN - 0169-1864

IS - 2

ER -