Improved binaural sound localization and tracking for unknown time-varying number of speakers

Ui Hyun Kim, Hiroshi G. Okuno

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

A binaural sound source localization (SSL) and multisource tracking method has been developed based on generalized cross-correlation (GCC) weighted by the phase transform (PHAT). Accurate binaural audition is important for giving robots and other systems inexpensive, widely applicable auditory capabilities. Conventional SSL based on the GCC-PHAT method is degraded by the low resolution of its time-difference-of-arrival estimates, by the interference created when sound waves reach a microphone from two directions around the robot head, and by impaired performance when there are multiple speakers. The low-resolution problem is solved by using a maximum-likelihood-based SSL method in the frequency domain. The multipath interference problem is avoided by incorporating a new time delay factor into the GCC-PHAT method under the assumption of a spherical robot head. Performance with multiple speakers is improved by a multisource speech tracking method consisting of voice activity detection (VAD) and K-means clustering. The standard K-means clustering algorithm is extended to track an unknown, time-varying number of speakers by adding two steps that automatically increase the number of clusters and eliminate clusters containing incorrect direction estimates. Experiments conducted on the SIG-2 humanoid robot show that this method outperforms the conventional SSL method; it reduces localization errors by 18.1° on average and by over 37° in the side directions. It also tracks multiple speakers in real time with tracking errors below 4.35°.
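
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of conventional GCC-PHAT delay estimation between two microphone signals. It is not the authors' improved method: it omits the maximum-likelihood refinement, the spherical-head time delay factor, VAD, and clustering. The function names (`dft`, `gcc_phat_delay`), the naive DFT, the circular-shift test signal, and the `max_lag` parameter are all assumptions made for illustration.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for short demo frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(left, right, max_lag):
    """Estimate the lag (in samples) by which `right` trails `left`.

    A positive result means the sound reached the left microphone first.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    n = len(left)
    spec_left = dft(left)
    spec_right = dft(right)
    weighted = []
    for a, b in zip(spec_left, spec_right):
        cross = b * a.conjugate()            # cross-power spectrum
        mag = abs(cross)
        # PHAT weighting: keep only the phase, discard the magnitude.
        weighted.append(cross / mag if mag > 1e-12 else 0j)
    # Circular cross-correlation; index d (mod n) corresponds to lag d.
    cc = [v.real for v in dft(weighted, inverse=True)]
    return max(range(-max_lag, max_lag + 1), key=lambda d: cc[d % n])


# Demo with a synthetic frame: the right channel is the left channel
# delayed by 3 samples (circular shift, so the frame edges wrap around).
rng = random.Random(0)
left = [rng.uniform(-1.0, 1.0) for _ in range(64)]
right = left[-3:] + left[:-3]
print(gcc_phat_delay(left, right, max_lag=8))
```

Because the delay is read off an integer-sample correlation peak, the resolution of the estimate is bounded by the sampling period, which is the low-resolution limitation of the conventional method that the abstract describes.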

Original language: English
Pages (from-to): 1161-1173
Number of pages: 13
Journal: Advanced Robotics
Volume: 27
Issue number: 15
DOI: 10.1080/01691864.2013.812177
Publication status: Published - 2013 Jul
Externally published: Yes

Fingerprint

Acoustic waves
Robots
Correlation methods
Audition
Microphones
Clustering algorithms
Maximum likelihood
Time delay
Experiments

Keywords

  • Binaural sound localization
  • Human-robot interaction
  • Multisource sound tracking

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Human-Computer Interaction
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Improved binaural sound localization and tracking for unknown time-varying number of speakers. / Kim, Ui Hyun; Okuno, Hiroshi G.

In: Advanced Robotics, Vol. 27, No. 15, 07.2013, p. 1161-1173.

@article{94f708f059ef4d52a425902eee088ad9,
title = "Improved binaural sound localization and tracking for unknown time-varying number of speakers",
abstract = "A method based on the generalized cross-correlation (GCC) method weighted by the phase transform (PHAT) has been developed for binaural sound source localization (SSL) and tracking of multiple sound sources. Accurate binaural audition is important for applying inexpensive and widely applicable auditory capabilities to robots and systems. Conventional SSL based on the GCC-PHAT method is degraded by low resolution of the time difference of arrival estimation, by the interference created when the sound waves arrive at a microphone from two directions around the robot head, and by impaired performance when there are multiple speakers. The low-resolution problem is solved by using a maximum-likelihood-based SSL method in the frequency domain. The multipath interference problem is avoided by incorporating a new time delay factor into the GCC-PHAT method with assuming a spherical robot head. The performance when there are multiple speakers was improved by using a multisource speech tracking method consisting of voice activity detection (VAD) and K-means clustering. The standard K-means clustering algorithm was extended to enable tracking of an unknown time-varying number of speakers by adding two additional steps that increase the number of clusters automatically and eliminate clusters containing incorrect direction estimations. Experiments conducted on the SIG-2 humanoid robot show that this method outperforms the conventional SSL method; it reduces localization errors by 18.1° on average and by over 37° in the side directions. It also tracks multiple speakers in real time with tracking errors below 4.35°.",
keywords = "Binaural sound localization, Human-robot interaction, Multisource sound tracking",
author = "Kim, {Ui Hyun} and Okuno, {Hiroshi G.}",
year = "2013",
month = "7",
doi = "10.1080/01691864.2013.812177",
language = "English",
volume = "27",
pages = "1161--1173",
journal = "Advanced Robotics",
issn = "0169-1864",
publisher = "Taylor and Francis Ltd.",
number = "15",
}

TY - JOUR

T1 - Improved binaural sound localization and tracking for unknown time-varying number of speakers

AU - Kim, Ui Hyun

AU - Okuno, Hiroshi G.

PY - 2013/7

Y1 - 2013/7

N2 - A method based on the generalized cross-correlation (GCC) method weighted by the phase transform (PHAT) has been developed for binaural sound source localization (SSL) and tracking of multiple sound sources. Accurate binaural audition is important for applying inexpensive and widely applicable auditory capabilities to robots and systems. Conventional SSL based on the GCC-PHAT method is degraded by low resolution of the time difference of arrival estimation, by the interference created when the sound waves arrive at a microphone from two directions around the robot head, and by impaired performance when there are multiple speakers. The low-resolution problem is solved by using a maximum-likelihood-based SSL method in the frequency domain. The multipath interference problem is avoided by incorporating a new time delay factor into the GCC-PHAT method with assuming a spherical robot head. The performance when there are multiple speakers was improved by using a multisource speech tracking method consisting of voice activity detection (VAD) and K-means clustering. The standard K-means clustering algorithm was extended to enable tracking of an unknown time-varying number of speakers by adding two additional steps that increase the number of clusters automatically and eliminate clusters containing incorrect direction estimations. Experiments conducted on the SIG-2 humanoid robot show that this method outperforms the conventional SSL method; it reduces localization errors by 18.1° on average and by over 37° in the side directions. It also tracks multiple speakers in real time with tracking errors below 4.35°.

KW - Binaural sound localization

KW - Human-robot interaction

KW - Multisource sound tracking

UR - http://www.scopus.com/inward/record.url?scp=84886095594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84886095594&partnerID=8YFLogxK

U2 - 10.1080/01691864.2013.812177

DO - 10.1080/01691864.2013.812177

M3 - Article

AN - SCOPUS:84886095594

VL - 27

SP - 1161

EP - 1173

JO - Advanced Robotics

JF - Advanced Robotics

SN - 0169-1864

IS - 15

ER -