Interfacing sound stream segregation to automatic speech recognition - preliminary results on listening to several sounds simultaneously

Hiroshi G. Okuno, Tomohiro Nakatani, Takeshi Kawabata

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

This paper reports the preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping these extracted harmonic fragments, and substituting some sounds for non-harmonic parts of groups. This system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is how to improve the recognition performance which is degraded by spectral distortion of segregated sounds caused mainly by the binaural input, grouping, and residue substitution. Our solution is to re-train the parameters of the HMM with training data binauralized for four directions, to group harmonic fragments according to their directions, and to substitute the residue of harmonic fragments for non-harmonic parts of each group. Experiments with 500 mixtures of two women's utterances of a word showed that the cumulative accuracy of word recognition up to the 10th candidate of each woman's utterance is, on average, 75%.

Original language: English
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Editors: Anon
Place of publication: Menlo Park, CA, United States
Publisher: AAAI
Pages: 1082-1089
Number of pages: 8
Volume: 2
Publication status: Published - 1996
Externally published: Yes
Event: Proceedings of the 1996 13th National Conference on Artificial Intelligence. Part 2 (of 2) - Portland, OR, USA
Duration: 1996 Aug 4 - 1996 Aug 8



ASJC Scopus subject areas

  • Software

Cite this

Okuno, H. G., Nakatani, T., & Kawabata, T. (1996). Interfacing sound stream segregation to automatic speech recognition - preliminary results on listening to several sounds simultaneously. In Anon (Ed.), Proceedings of the National Conference on Artificial Intelligence (Vol. 2, pp. 1082-1089). Menlo Park, CA, United States: AAAI.

@inproceedings{2f5555c9d3004640a46bd1a9736465e9,
title = "Interfacing sound stream segregation to automatic speech recognition - preliminary results on listening to several sounds simultaneously",
abstract = "This paper reports the preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping these extracted harmonic fragments, and substituting some sounds for non-harmonic parts of groups. This system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is how to improve the recognition performance which is degraded by spectral distortion of segregated sounds caused mainly by the binaural input, grouping, and residue substitution. Our solution is to re-train the parameters of the HMM with training data binauralized for four directions, to group harmonic fragments according to their directions, and to substitute the residue of harmonic fragments for non-harmonic parts of each group. Experiments with 500 mixtures of two women's utterances of a word showed that the cumulative accuracy of word recognition up to the 10th candidate of each woman's utterance is, on average, 75{\%}.",
author = "Okuno, {Hiroshi G.} and Tomohiro Nakatani and Takeshi Kawabata",
year = "1996",
language = "English",
volume = "2",
pages = "1082--1089",
editor = "Anon",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",
publisher = "AAAI",
address = "Menlo Park, CA, United States",
}

TY - GEN

T1 - Interfacing sound stream segregation to automatic speech recognition - preliminary results on listening to several sounds simultaneously

AU - Okuno, Hiroshi G.

AU - Nakatani, Tomohiro

AU - Kawabata, Takeshi

PY - 1996

Y1 - 1996

N2 - This paper reports the preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping these extracted harmonic fragments, and substituting some sounds for non-harmonic parts of groups. This system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is how to improve the recognition performance which is degraded by spectral distortion of segregated sounds caused mainly by the binaural input, grouping, and residue substitution. Our solution is to re-train the parameters of the HMM with training data binauralized for four directions, to group harmonic fragments according to their directions, and to substitute the residue of harmonic fragments for non-harmonic parts of each group. Experiments with 500 mixtures of two women's utterances of a word showed that the cumulative accuracy of word recognition up to the 10th candidate of each woman's utterance is, on average, 75%.

AB - This paper reports the preliminary results of experiments on listening to several sounds at once. Two issues are addressed: segregating speech streams from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition (ASR). Speech stream segregation (SSS) is modeled as a process of extracting harmonic fragments, grouping these extracted harmonic fragments, and substituting some sounds for non-harmonic parts of groups. This system is implemented by extending the harmonic-based stream segregation system reported at AAAI-94 and IJCAI-95. The main problem in interfacing SSS with HMM-based ASR is how to improve the recognition performance which is degraded by spectral distortion of segregated sounds caused mainly by the binaural input, grouping, and residue substitution. Our solution is to re-train the parameters of the HMM with training data binauralized for four directions, to group harmonic fragments according to their directions, and to substitute the residue of harmonic fragments for non-harmonic parts of each group. Experiments with 500 mixtures of two women's utterances of a word showed that the cumulative accuracy of word recognition up to the 10th candidate of each woman's utterance is, on average, 75%.

UR - http://www.scopus.com/inward/record.url?scp=0030352647&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0030352647&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:0030352647

VL - 2

SP - 1082

EP - 1089

BT - Proceedings of the National Conference on Artificial Intelligence

A2 - Anon

PB - AAAI

CY - Menlo Park, CA, United States

ER -