Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments

Hyun Don Kim, Jinsung Kim, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Citations (Scopus)

Abstract

In normal human communication, people face the speaker when listening and usually pay attention to the speaker's face. Therefore, in robot audition, recognizing the front talker is critical for smooth interaction. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise-ratio (Max-SNR) beamformer. The CSCC-based VAD can classify speech signals arriving at the frontal region of the two microphones embedded on the robot. The system works in real time, even in a noisy environment (SNR > 0 dB), without filter coefficients trained in advance, and it can cope with off-center speech noise generated by televisions and audio devices. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced the extracted target speech signals by more than 12 dB (SNR) and increased the success rate of automatic speech recognition for Japanese words by about 17 points.
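The Max-SNR beamformer named in the abstract admits a compact sketch: the weight vector that maximizes output SNR is the principal generalized eigenvector of the target and noise spatial covariance matrices. Below is a minimal two-microphone illustration, not the paper's implementation; the synthetic signals, the 0.9 interferer gain, and the oracle covariances are assumptions made purely for the demo:

```python
import numpy as np
from scipy.linalg import eigh

def max_snr_weights(R_s, R_n):
    # The w maximizing (w^T R_s w) / (w^T R_n w) is the generalized
    # eigenvector of (R_s, R_n) with the largest eigenvalue.
    vals, vecs = eigh(R_s, R_n)  # eigenvalues returned in ascending order
    return vecs[:, -1]

rng = np.random.default_rng(0)
n = 20000
s = rng.standard_normal(n)             # target talker (front: in phase at both mics)
v = rng.standard_normal(n)             # interferer (off-center: out of phase here)
e = 0.1 * rng.standard_normal((2, n))  # independent sensor noise per microphone

target = np.stack([s, s])                  # target image at the two mics
noise = np.stack([0.9 * v, -0.9 * v]) + e  # interferer + sensor noise
x = target + noise                         # observed two-channel mixture

# Diagonal loading keeps the (rank-1) target covariance numerically safe.
w = max_snr_weights(np.cov(target) + 1e-6 * np.eye(2), np.cov(noise))

def snr_db(sig, nse):
    return 10 * np.log10(np.var(sig) / np.var(nse))

in_snr = snr_db(target[0], noise[0])     # SNR at microphone 1, before beamforming
out_snr = snr_db(w @ target, w @ noise)  # SNR of the beamformed output
```

With the frontal target in phase and the interferer out of phase across the pair, the learned weights effectively sum the channels, cancelling the interferer while leaving only residual sensor noise, so `out_snr` comfortably exceeds `in_snr`.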

Original language: English
Title of host publication: 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
Pages: 1705-1711
Number of pages: 7
DOIs: 10.1109/IROS.2008.4650977
Publication status: Published - 2008
Externally published: Yes
Event: 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS - Nice
Duration: 2008 Sep 22 - 2008 Sep 26



ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Control and Systems Engineering
  • Electrical and Electronic Engineering

Cite this

Kim, H. D., Kim, J., Komatani, K., Ogata, T., & Okuno, H. G. (2008). Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS (pp. 1705-1711). [4650977] https://doi.org/10.1109/IROS.2008.4650977

@inproceedings{dfcff74819274930a26059187d44768b,
title = "Target speech detection and separation for humanoid robots in sparse dialogue with noisy home environments",
abstract = "In normal human communication, people face the speaker when listening and usually pay attention to the speaker's face. Therefore, in robot audition, recognizing the front talker is critical for smooth interaction. This paper presents an enhanced speech detection method for a humanoid robot that can separate and recognize speech signals originating from the front even in noisy home environments. The robot audition system consists of a new type of voice activity detection (VAD) based on the complex spectrum circle centroid (CSCC) method and a maximum signal-to-noise-ratio (Max-SNR) beamformer. The CSCC-based VAD can classify speech signals arriving at the frontal region of the two microphones embedded on the robot. The system works in real time, even in a noisy environment (SNR > 0 dB), without filter coefficients trained in advance, and it can cope with off-center speech noise generated by televisions and audio devices. Experiments using a humanoid robot, SIG2, with two microphones showed that our system enhanced the extracted target speech signals by more than 12 dB (SNR) and increased the success rate of automatic speech recognition for Japanese words by about 17 points.",
author = "Kim, {Hyun Don} and Jinsung Kim and Kazunori Komatani and Tetsuya Ogata and Okuno, {Hiroshi G.}",
year = "2008",
doi = "10.1109/IROS.2008.4650977",
language = "English",
isbn = "9781424420582",
pages = "1705--1711",
booktitle = "2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS",

}