Enhanced robot speech recognition based on microphone array source separation and missing feature theory

Shun'ichi Yamamoto, Jean Marc Valin, Kazuhiro Nakadai, Jean Rouat, François Michaud, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

49 Citations (Scopus)

Abstract

A humanoid robot operating in a real-world environment usually hears mixtures of sounds, so three capabilities are essential for robot audition: sound source localization, separation, and recognition of the separated sounds. While the first two are frequently addressed, the third has received less attention. We present a system that gives a humanoid robot the ability to localize, separate, and recognize simultaneous sound sources. A microphone array is used along with a dedicated real-time implementation of Geometric Source Separation (GSS) and a multi-channel post-filter that further reduces interference from other sources. An automatic speech recognizer (ASR) based on Missing Feature Theory (MFT) recognizes the separated sounds in real time, using missing feature masks generated automatically from the post-filtering step. The main advantage of this approach for humanoid robots is that an ASR with a clean acoustic model can adapt to the distortion of the separated sound by consulting the post-filter feature masks. Recognition rates are presented for three simultaneous speakers located 2 m from the robot. Using both the post-filter and the missing feature mask yields an average relative reduction in error rate of 42%.
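The mask idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the energy-ratio rule, the threshold value, and the function names `reliability_mask` and `masked_log_likelihood` are all assumptions made for illustration. The sketch marks a spectral feature as reliable when the post-filter left most of its energy intact, then scores a clean diagonal-Gaussian model on the reliable features only.

```python
import math


def reliability_mask(input_energy, output_energy, threshold=0.3):
    """Return 1 for features judged reliable, 0 for unreliable.

    Assumption: a feature is reliable when the post-filter kept at least
    `threshold` of its input energy. Both the ratio rule and the default
    threshold are hypothetical, not taken from the paper.
    """
    mask = []
    for e_in, e_out in zip(input_energy, output_energy):
        ratio = e_out / e_in if e_in > 0 else 0.0
        mask.append(1 if ratio >= threshold else 0)
    return mask


def masked_log_likelihood(features, mask, means, variances):
    """Diagonal-Gaussian log-likelihood over reliable features only.

    With a diagonal covariance, skipping a dimension is equivalent to
    marginalising it out, so unreliable features never penalise a model
    trained on clean speech.
    """
    total = 0.0
    for x, m, mu, var in zip(features, mask, means, variances):
        if m:
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


if __name__ == "__main__":
    # Hypothetical per-band energies before and after the post-filter.
    mask = reliability_mask([1.0, 2.0, 4.0], [0.9, 0.2, 2.0])
    score = masked_log_likelihood([0.1, 5.0, -0.2], mask,
                                  [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
    print(mask, score)
```

In this toy run the second band loses most of its energy in the post-filter, so its badly distorted feature value is ignored when the clean model is scored. A soft mask with values between 0 and 1 would be a natural variation, but that too would be an assumption beyond what the abstract states.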

Original language: English
Title of host publication: Proceedings - IEEE International Conference on Robotics and Automation
Pages: 1477-1482
Number of pages: 6
Volume: 2005
DOIs: https://doi.org/10.1109/ROBOT.2005.1570323
Publication status: Published - 2005
Externally published: Yes
Event: 2005 IEEE International Conference on Robotics and Automation - Barcelona
Duration: 2005 Apr 18 → 2005 Apr 22

Other

Other: 2005 IEEE International Conference on Robotics and Automation
City: Barcelona
Period: 2005 Apr 18 → 2005 Apr 22

Fingerprint

Source separation
Microphones
Speech recognition
Acoustic waves
Robots
Masks
Audition
Acoustics

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering

Cite this

Yamamoto, S., Valin, J. M., Nakadai, K., Rouat, J., Michaud, F., Ogata, T., & Okuno, H. G. (2005). Enhanced robot speech recognition based on microphone array source separation and missing feature theory. In Proceedings - IEEE International Conference on Robotics and Automation (Vol. 2005, pp. 1477-1482). [1570323] https://doi.org/10.1109/ROBOT.2005.1570323

@inproceedings{041b85a81d494e9e949fc0358c5fbe97,
title = "Enhanced robot speech recognition based on microphone array source separation and missing feature theory",
author = "Shun'ichi Yamamoto and Valin, {Jean Marc} and Kazuhiro Nakadai and Jean Rouat and Fran{\c{c}}ois Michaud and Tetsuya Ogata and Okuno, {Hiroshi G.}",
year = "2005",
doi = "10.1109/ROBOT.2005.1570323",
language = "English",
isbn = "078038914X",
volume = "2005",
pages = "1477--1482",
booktitle = "Proceedings - IEEE International Conference on Robotics and Automation",

}

TY - GEN
T1 - Enhanced robot speech recognition based on microphone array source separation and missing feature theory
AU - Yamamoto, Shun'ichi
AU - Valin, Jean Marc
AU - Nakadai, Kazuhiro
AU - Rouat, Jean
AU - Michaud, François
AU - Ogata, Tetsuya
AU - Okuno, Hiroshi G.
PY - 2005
Y1 - 2005
UR - http://www.scopus.com/inward/record.url?scp=33846170539&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846170539&partnerID=8YFLogxK
U2 - 10.1109/ROBOT.2005.1570323
DO - 10.1109/ROBOT.2005.1570323
M3 - Conference contribution
AN - SCOPUS:33846170539
SN - 078038914X
SN - 9780780389144
VL - 2005
SP - 1477
EP - 1482
BT - Proceedings - IEEE International Conference on Robotics and Automation
ER -