Enhanced robot speech recognition based on microphone array source separation and missing feature theory

Shun'ichi Yamamoto, Jean Marc Valin, Kazuhiro Nakadai, Jean Rouat, François Michaud, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Conference contribution

49 Citations (Scopus)

Abstract

A humanoid robot in real-world environments usually hears mixtures of sounds, and thus three capabilities are essential for robot audition: sound source localization, separation, and recognition of separated sounds. While the first two are frequently addressed, the last one has received much less study. We present a system that gives a humanoid robot the ability to localize, separate and recognize simultaneous sound sources. A microphone array is used along with a dedicated real-time implementation of Geometric Source Separation (GSS) and a multi-channel post-filter that further reduces interference from other sources. An automatic speech recognizer (ASR) based on the Missing Feature Theory (MFT) recognizes separated sounds in real time by generating missing feature masks automatically from the post-filtering step. The main advantage of this approach for humanoid robots is that the ASR, using a clean acoustic model, can adapt to the distortion of the separated sound by consulting the post-filter feature masks. Recognition rates are presented for three simultaneous speakers located 2 m from the robot. Using both the post-filter and the missing feature mask yields an average relative reduction in error rate of 42%.
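The abstract describes masks derived from the post-filter that tell the recognizer which features to trust. The sketch below illustrates that general idea only and is not the authors' implementation: the energy-ratio rule, the `threshold` value, and the function names `missing_feature_mask` and `masked_log_likelihood` are all assumptions made for illustration. It marks a feature as reliable when the post-filter kept most of its energy, then scores a frame against a diagonal Gaussian while skipping unreliable features.

```python
import math


def missing_feature_mask(energy_in, energy_out, threshold=0.25):
    """Return 1 for features judged reliable, 0 otherwise.

    energy_in / energy_out are hypothetical per-feature energies before and
    after the post-filter; threshold is an assumed value, not from the paper.
    """
    mask = []
    for e_in, e_out in zip(energy_in, energy_out):
        ratio = e_out / e_in if e_in > 0 else 0.0
        mask.append(1 if ratio >= threshold else 0)
    return mask


def masked_log_likelihood(features, mask, means, variances):
    """Diagonal-Gaussian log-likelihood summed over reliable features only."""
    total = 0.0
    for x, m, mu, var in zip(features, mask, means, variances):
        if m:  # unreliable features contribute nothing to the score
            total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total


# Illustrative numbers only: the second feature lost 90% of its energy
# in the post-filter, so it is treated as unreliable.
mask = missing_feature_mask([4.0, 2.0, 1.0], [3.6, 0.2, 0.9])
print(mask)
```

With a mask like this, a badly distorted feature cannot drag down the score of the correct model, which is why a recognizer trained on clean speech can remain usable on separated sound.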

Original language: English
Title of host publication: Proceedings - IEEE International Conference on Robotics and Automation
Pages: 1477-1482
Number of pages: 6
Volume: 2005
DOI: https://doi.org/10.1109/ROBOT.2005.1570323
Publication status: Published - 2005
Externally published: Yes
Event: 2005 IEEE International Conference on Robotics and Automation - Barcelona
Duration: 18 Apr 2005 to 22 Apr 2005

Other

Other: 2005 IEEE International Conference on Robotics and Automation
City: Barcelona
Period: 18 Apr 2005 to 22 Apr 2005

Fingerprint

Source separation
Microphones
Speech recognition
Acoustic waves
Robots
Masks
Audition
Acoustics

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering

Cite this

Yamamoto, S., Valin, J. M., Nakadai, K., Rouat, J., Michaud, F., Ogata, T., & Okuno, H. G. (2005). Enhanced robot speech recognition based on microphone array source separation and missing feature theory. In Proceedings - IEEE International Conference on Robotics and Automation (Vol. 2005, pp. 1477-1482). [1570323] https://doi.org/10.1109/ROBOT.2005.1570323

@inproceedings{041b85a81d494e9e949fc0358c5fbe97,
title = "Enhanced robot speech recognition based on microphone array source separation and missing feature theory",
author = "Shun'ichi Yamamoto and Valin, {Jean Marc} and Kazuhiro Nakadai and Jean Rouat and Fran{\c{c}}ois Michaud and Tetsuya Ogata and Okuno, {Hiroshi G.}",
year = "2005",
doi = "10.1109/ROBOT.2005.1570323",
language = "English",
isbn = "078038914X",
volume = "2005",
pages = "1477--1482",
booktitle = "Proceedings - IEEE International Conference on Robotics and Automation",

}

TY - GEN

T1 - Enhanced robot speech recognition based on microphone array source separation and missing feature theory

AU - Yamamoto, Shun'ichi

AU - Valin, Jean Marc

AU - Nakadai, Kazuhiro

AU - Rouat, Jean

AU - Michaud, François

AU - Ogata, Tetsuya

AU - Okuno, Hiroshi G.

PY - 2005

Y1 - 2005

UR - http://www.scopus.com/inward/record.url?scp=33846170539&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33846170539&partnerID=8YFLogxK

U2 - 10.1109/ROBOT.2005.1570323

DO - 10.1109/ROBOT.2005.1570323

M3 - Conference contribution

AN - SCOPUS:33846170539

SN - 078038914X

SN - 9780780389144

VL - 2005

SP - 1477

EP - 1482

BT - Proceedings - IEEE International Conference on Robotics and Automation

ER -