Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots

Kazuhiro Nakadai, Daisuke Matsuura, Hiroshi G. Okuno, Hiroshi Tsujino

Research output: Contribution to journal › Article

44 Citations (Scopus)

Abstract

This paper presents a method to improve recognition of three simultaneous speech signals by a humanoid robot equipped with a pair of microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech signals are difficult, because the signal-to-noise ratio is quite low (around -3 dB) and the noise is non-stationary due to the interfering voices. To improve recognition of the three simultaneous speech signals, two key ideas are introduced. One is two-layered audio-visual integration of both name (ID) and location, that is, speech and face recognition, and speech and face localization. The other is acoustical modeling of the humanoid head by scattering theory. Sound sources are separated in real time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction by using the interaural phase and intensity differences estimated by scattering theory. Since the features of the separated sounds vary with sound direction, multiple direction- and speaker-dependent acoustic models are used. The system integrates the ASR results by using the sound direction and the speaker information provided by face recognition, together with confidence measures of the ASR results, to select the best one. The resulting system improves recognition of three simultaneous speech signals by about 10% on average, where the three speakers were located around the humanoid on a half circle of 1 m radius, one in front of the robot (0°) and the other two at symmetrical positions (±θ), with θ varied in 10° steps from 0° to 90°.
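As a rough illustration of the direction-pass idea described above, the sketch below masks STFT bins whose interaural phase difference (IPD) matches the value predicted for a target direction. It is not the paper's implementation: a free-field plane-wave IPD model stands in for the scattering-theory head model, and the microphone spacing, tolerance, and function names (expected_ipd, adpf_extract) are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
MIC_DISTANCE = 0.18     # m; assumed inter-microphone baseline (not given here)

def expected_ipd(theta_deg, freqs):
    """IPD predicted for a source at theta_deg, per frequency bin.

    The paper derives this from a scattering-theory model of the
    humanoid head; a free-field plane-wave model stands in for it here.
    """
    tau = MIC_DISTANCE * np.sin(np.deg2rad(theta_deg)) / SPEED_OF_SOUND
    return 2.0 * np.pi * freqs * tau

def adpf_extract(spec_l, spec_r, freqs, theta_deg, tol_rad=0.3):
    """Direction-pass filtering of one STFT frame (sketch).

    A frequency bin is kept when the observed IPD falls within
    tol_rad of the IPD predicted for the target direction.
    """
    ipd_obs = np.angle(spec_l * np.conj(spec_r))
    diff = np.angle(np.exp(1j * (ipd_obs - expected_ipd(theta_deg, freqs))))
    mask = np.abs(diff) < tol_rad  # the "pass range" of the filter
    return np.where(mask, spec_l, 0.0)

# Example: extract the talker at +30 deg from one 512-sample frame
# sampled at 16 kHz (frame_l, frame_r are the two microphone signals).
# freqs = np.fft.rfftfreq(512, d=1.0 / 16000)
# sep = adpf_extract(np.fft.rfft(frame_l), np.fft.rfft(frame_r), freqs, 30.0)
```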

Original language: English
Pages (from-to): 97-112
Number of pages: 16
Journal: Speech Communication
ISSN: 0167-6393
Publisher: Elsevier
Volume: 44
Issue number: 1-4 SPEC. ISS.
DOI: 10.1016/j.specom.2004.10.010
Publication status: Published - Oct 2004
Externally published: Yes

Keywords

  • Active audition
  • Audio-visual integration
  • Robot audition
  • Scattering theory
  • Sound source localization
  • Sound source separation
  • Speech recognition

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Experimental and Cognitive Psychology
  • Linguistics and Language

Cite this

Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots. / Nakadai, Kazuhiro; Matsuura, Daisuke; Okuno, Hiroshi G.; Tsujino, Hiroshi.

In: Speech Communication, Vol. 44, No. 1-4 SPEC. ISS., October 2004, pp. 97-112.
