Effects of increasing modalities in recognizing three simultaneous speeches

Hiroshi G. Okuno, Kazuhiro Nakadai, Hiroaki Kitano

Research output: Contribution to journal, Article

Abstract

One of the essential problems of auditory processing in noisy real-world environments is that the number of sound sources is greater than the number of microphones. To model this situation, we try to separate three simultaneous speeches with two microphones. This problem is difficult because well-known microphone-array techniques, such as nullforming, beamforming, and independent component analysis (ICA), require in practice three or more microphones. This paper reports the effects of increasing modalities in recognizing three simultaneous speeches with two microphones. We investigate four cases: monaural (one microphone), binaural (a pair of microphones embedded in a dummy head), binaural with ICA, and binaural with vision (two dummy-head microphones and two cameras). The fourth method, called the "Direction-Pass Filter" (DPF), separates sound sources originating from a specific direction given by auditory and/or visual processing. The direction of each auditory frequency component is determined using the Head-Related Transfer Function (HRTF) of the dummy head, so the DPF is independent of the number of sound sources, i.e., it does not assume the number of sound sources. With 200 benchmarks of three simultaneous utterances of Japanese words, the quality of each separated speech is evaluated by an automatic speech recognition system. Word recognition performance for three simultaneous speeches improves as modalities are added, that is, from monaural to binaural, to binaural with ICA, to binaural with vision. The average 1-best and 10-best recognition rates of separated speeches attained by the Direction-Pass Filter are 60% and 81%, respectively.
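The direction-pass idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the measured HRTF of the dummy head for a simple free-field delay model, uses only the interaural phase difference, and processes one block of samples rather than a stream. The function names, the 0.18 m microphone spacing, and the 0.3 rad tolerance are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
MIC_DISTANCE = 0.18     # m; hypothetical spacing between the two microphones


def expected_ipd(freqs, azimuth_deg):
    """Phase difference (left minus right) predicted by a free-field delay model.

    Assumption: positive azimuth means the sound reaches the left
    microphone first, so the right channel is a delayed copy of the left.
    """
    delay = MIC_DISTANCE * np.sin(np.deg2rad(azimuth_deg)) / SPEED_OF_SOUND
    return 2.0 * np.pi * freqs * delay


def direction_pass_filter(left, right, sample_rate, azimuth_deg, tol_rad=0.3):
    """Keep only the frequency bins whose phase difference matches azimuth_deg.

    Returns the filtered left-channel signal and the boolean pass mask.
    """
    spec_l = np.fft.rfft(left)
    spec_r = np.fft.rfft(right)
    freqs = np.fft.rfftfreq(len(left), d=1.0 / sample_rate)

    # Observed phase difference per frequency bin.
    observed = np.angle(spec_l * np.conj(spec_r))

    # Wrap the mismatch into (-pi, pi] before comparing with the tolerance.
    mismatch = np.angle(np.exp(1j * (observed - expected_ipd(freqs, azimuth_deg))))
    mask = np.abs(mismatch) < tol_rad

    # Zero the rejected bins and return to the time domain.
    filtered = np.fft.irfft(spec_l * mask, n=len(left))
    return filtered, mask
```

Because each frequency bin is accepted or rejected on its own, the sketch never needs to know how many sources are present, which mirrors the property the abstract attributes to the DPF. A phase-only test like this becomes ambiguous at higher frequencies, where the phase difference wraps around, so it should be read as an illustration for low-frequency content only.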

Original language: English
Pages (from-to): 347-359
Number of pages: 13
Journal: Speech Communication
Volume: 43
Issue number: 4 SPEC. ISS.
DOI: 10.1016/j.specom.2004.03.008
Publication status: Published - 2004 Sep
Externally published: Yes

Fingerprint

Microphones
Modality
Independent Component Analysis
Head
Filter
Acoustic waves
Microphone Array
Automatic Speech Recognition
Beamforming
Benchmarking
Transfer Function
Camera
Speech
Benchmark
Processing
Speech recognition
Direction compound
Sound
Transfer functions

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Experimental and Cognitive Psychology
  • Linguistics and Language

Cite this

Effects of increasing modalities in recognizing three simultaneous speeches. / Okuno, Hiroshi G.; Nakadai, Kazuhiro; Kitano, Hiroaki.

In: Speech Communication, Vol. 43, No. 4 SPEC. ISS., 09.2004, p. 347-359.

@article{f4cc7a56cccd43b8898126e823ddff62,
title = "Effects of increasing modalities in recognizing three simultaneous speeches",
author = "Okuno, {Hiroshi G.} and Kazuhiro Nakadai and Hiroaki Kitano",
year = "2004",
month = "9",
doi = "10.1016/j.specom.2004.03.008",
language = "English",
volume = "43",
pages = "347--359",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier",
number = "4 SPEC. ISS.",

}

TY - JOUR

T1 - Effects of increasing modalities in recognizing three simultaneous speeches

AU - Okuno, Hiroshi G.

AU - Nakadai, Kazuhiro

AU - Kitano, Hiroaki

PY - 2004/9

Y1 - 2004/9

UR - http://www.scopus.com/inward/record.url?scp=4644247616&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4644247616&partnerID=8YFLogxK

U2 - 10.1016/j.specom.2004.03.008

DO - 10.1016/j.specom.2004.03.008

M3 - Article

AN - SCOPUS:4644247616

VL - 43

SP - 347

EP - 359

JO - Speech Communication

JF - Speech Communication

SN - 0167-6393

IS - 4 SPEC. ISS.

ER -