Target speech detection and separation for communication with humanoid robots in noisy home environments

Hyun Don Kim, Jinsung Kim, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

People usually talk face to face when communicating with a partner, so in robot audition the recognition of the frontal talker is critical for smooth interaction. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front, even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise ratio (SNR) beamformer. The CSCC-based VAD can classify speech signals arriving in the frontal region of the two microphones embedded on the robot. The system works in real time, without filter coefficients trained in advance, even in a noisy environment (SNR > 0 dB), and it can cope with speech noise from televisions and audio devices that does not originate from the center. Experiments using the humanoid robot SIG2, equipped with two microphones, showed that our system improved the SNR of the extracted target speech by more than 12 dB and raised the automatic speech recognition success rate for Japanese words by about 17 points.
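The core idea of the abstract is gating by direction: a talker directly in front reaches both microphones at (almost) the same time, while an off-center source such as a television arrives with a measurable inter-channel delay. As a loose illustration of that principle only (this is a generic GCC-PHAT delay estimator, not the paper's CSCC-based VAD or maximum-SNR beamformer), a per-frame delay estimate can be sketched as:

```python
# Illustrative sketch: estimate the inter-channel time delay of one audio
# frame with GCC-PHAT. A frontal source gives |lag| ~ 0 samples; a source
# off to one side gives a larger lag and can be rejected. This is NOT the
# paper's CSCC method, only the general direction-gating idea.
import numpy as np

def gcc_phat_delay(left_frame, right_frame):
    """Return the inter-channel lag in samples (near 0 for a frontal source)."""
    n = 2 * len(left_frame)                  # zero-pad to avoid circular wrap-around
    L = np.fft.rfft(left_frame, n)
    R = np.fft.rfft(right_frame, n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase, drop magnitude
    corr = np.fft.irfft(cross, n)
    half = len(left_frame)
    corr = np.concatenate((corr[-half + 1:], corr[:half]))  # lags -(half-1)..(half-1)
    return int(np.argmax(corr)) - (half - 1)
```

A frame would then pass a "frontal" gate when the lag is within a few samples, e.g. `abs(gcc_phat_delay(l, r)) <= 2` at 16 kHz (about 0.125 ms, roughly a 4 cm path difference); the threshold and sampling rate here are illustrative assumptions.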

Original language: English
Pages (from-to): 2093-2111
Number of pages: 19
Journal: Advanced Robotics
Volume: 23
Issue number: 15
DOI: 10.1163/016918609X12529300552105
Publication status: Published - 1 Oct 2009
Externally published: Yes


Keywords

  • Human-robot interaction
  • Robot audition
  • Sound source localization
  • Sound source separation
  • Voice activity detection

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Human-Computer Interaction
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Target speech detection and separation for communication with humanoid robots in noisy home environments. / Kim, Hyun Don; Kim, Jinsung; Komatani, Kazunori; Ogata, Tetsuya; Okuno, Hiroshi G.

In: Advanced Robotics, Vol. 23, No. 15, 01.10.2009, p. 2093-2111.

Research output: Contribution to journal › Article

@article{21c8acc7de6e45df9409ad978d7b4e0e,
title = "Target speech detection and separation for communication with humanoid robots in noisy home environments",
abstract = "People usually talk face to face when they communicate with their partner. Therefore, in robot audition, the recognition of the front talker is critical for smooth interactions. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise ratio (SNR) beamformer. This VAD based on CSCC can classify speech signals that are retrieved at the frontal region of two microphones embedded on the robot. The system works in real-time without needing training filter coefficients given in advance even in a noisy environment (SNR > 0 dB). It can cope with speech noise generated from televisions and audio devices that does not originate from the center. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced extracted target speech signals more than 12 dB (SNR) and the success rate of automatic speech recognition for Japanese words was increased by about 17 points.",
keywords = "Human-robot interaction, Robot audition, Sound source localization, Sound source separation, Voice activity detection",
author = "Kim, {Hyun Don} and Jinsung Kim and Kazunori Komatani and Tetsuya Ogata and Okuno, {Hiroshi G.}",
year = "2009",
month = "10",
day = "1",
doi = "10.1163/016918609X12529300552105",
language = "English",
volume = "23",
pages = "2093--2111",
journal = "Advanced Robotics",
issn = "0169-1864",
publisher = "Taylor and Francis Ltd.",
number = "15",

}

TY - JOUR

T1 - Target speech detection and separation for communication with humanoid robots in noisy home environments

AU - Kim, Hyun Don

AU - Kim, Jinsung

AU - Komatani, Kazunori

AU - Ogata, Tetsuya

AU - Okuno, Hiroshi G.

PY - 2009/10/1

Y1 - 2009/10/1

N2 - People usually talk face to face when they communicate with their partner. Therefore, in robot audition, the recognition of the front talker is critical for smooth interactions. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise ratio (SNR) beamformer. This VAD based on CSCC can classify speech signals that are retrieved at the frontal region of two microphones embedded on the robot. The system works in real-time without needing training filter coefficients given in advance even in a noisy environment (SNR > 0 dB). It can cope with speech noise generated from televisions and audio devices that does not originate from the center. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced extracted target speech signals more than 12 dB (SNR) and the success rate of automatic speech recognition for Japanese words was increased by about 17 points.

AB - People usually talk face to face when they communicate with their partner. Therefore, in robot audition, the recognition of the front talker is critical for smooth interactions. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise ratio (SNR) beamformer. This VAD based on CSCC can classify speech signals that are retrieved at the frontal region of two microphones embedded on the robot. The system works in real-time without needing training filter coefficients given in advance even in a noisy environment (SNR > 0 dB). It can cope with speech noise generated from televisions and audio devices that does not originate from the center. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced extracted target speech signals more than 12 dB (SNR) and the success rate of automatic speech recognition for Japanese words was increased by about 17 points.

KW - Human-robot interaction

KW - Robot audition

KW - Sound source localization

KW - Sound source separation

KW - Voice activity detection

UR - http://www.scopus.com/inward/record.url?scp=70449597778&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70449597778&partnerID=8YFLogxK

U2 - 10.1163/016918609X12529300552105

DO - 10.1163/016918609X12529300552105

M3 - Article

AN - SCOPUS:70449597778

VL - 23

SP - 2093

EP - 2111

JO - Advanced Robotics

JF - Advanced Robotics

SN - 0169-1864

IS - 15

ER -