Using vision to improve sound source separation

Yukiko Nakagawa, Hiroshi G. Okuno, Hiroaki Kitano

Research output: Chapter in Book/Report/Conference proceeding › Chapter

15 Citations (Scopus)

Abstract

We present a method of improving sound source separation using vision. Sound source separation is an essential function for auditory scene understanding: it separates the streams of sound generated by multiple sound sources. Once the streams are separated, a recognition process such as speech recognition can work on a single stream rather than on the mixed sound of several speakers. Separation performance is known to improve with stereo/binaural microphones and microphone arrays, which provide spatial information; however, these methods still leave more than 20 degrees of positional ambiguity. In this paper, we further add visual information to provide more specific and accurate position information. As a result, separation capability is drastically improved. In addition, we found that the use of approximate direction information drastically improves the object-tracking accuracy of a simple vision system, which in turn improves the performance of the auditory system. We claim that integrating vision and auditory inputs improves the performance of tasks in each perceptual modality, such as sound source separation and object tracking, by bootstrapping.
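The fusion idea in the abstract can be sketched in a few lines: a coarse auditory azimuth estimate (ambiguous to roughly 20 degrees) is snapped to the nearest visually detected object inside that window. This is an illustrative sketch only, not the authors' implementation; the function name, the 20-degree window, and the nearest-neighbor association rule are assumptions for illustration.

```python
# Illustrative sketch (not the paper's actual method): resolving a coarse
# auditory direction estimate with precise visual detections.
AMBIGUITY_DEG = 20.0  # assumed positional ambiguity of the audio-only estimate

def refine_direction(audio_deg, visual_degs):
    """Snap a coarse auditory azimuth (degrees) to the nearest visually
    detected object within the ambiguity window; if no visual detection
    falls inside the window, keep the audio estimate unchanged."""
    candidates = [v for v in visual_degs if abs(v - audio_deg) <= AMBIGUITY_DEG]
    if not candidates:
        return audio_deg
    return min(candidates, key=lambda v: abs(v - audio_deg))

# Two speakers 15 degrees apart: audio alone cannot distinguish them,
# but vision pins each stream to a distinct direction.
refined = [refine_direction(a, [30.0, 45.0]) for a in (25.0, 50.0)]
```

Under these assumptions, the coarse estimates 25° and 50° resolve to the visually detected directions 30° and 45°, which could then steer a separation stage (e.g., a beamformer) at each speaker.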

Original language: English
Title of host publication: Proceedings of the National Conference on Artificial Intelligence
Place of publication: Menlo Park, CA, United States
Publisher: AAAI
Pages: 768-775
Number of pages: 8
ISBN (Print): 0262511061
Publication status: Published - 1999
Externally published: Yes
Event: Proceedings of the 1999 16th National Conference on Artificial Intelligence (AAAI-99), 11th Innovative Applications of Artificial Intelligence Conference (IAAI-99) - Orlando, FL, USA
Duration: 1999 Jul 18 - 1999 Jul 22

Other

Other: Proceedings of the 1999 16th National Conference on Artificial Intelligence (AAAI-99), 11th Innovative Applications of Artificial Intelligence Conference (IAAI-99)
City: Orlando, FL, USA
Period: 99/7/18 - 99/7/22

Fingerprint

  • Source separation
  • Acoustic waves
  • Microphones
  • Speech recognition

ASJC Scopus subject areas

  • Software

Cite this

Nakagawa, Y., Okuno, H. G., & Kitano, H. (1999). Using vision to improve sound source separation. In Proceedings of the National Conference on Artificial Intelligence (pp. 768-775). Menlo Park, CA, United States: AAAI.

Using vision to improve sound source separation. / Nakagawa, Yukiko; Okuno, Hiroshi G.; Kitano, Hiroaki.

Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA, United States : AAAI, 1999. p. 768-775.

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Nakagawa, Y, Okuno, HG & Kitano, H 1999, Using vision to improve sound source separation. in Proceedings of the National Conference on Artificial Intelligence. AAAI, Menlo Park, CA, United States, pp. 768-775, Proceedings of the 1999 16th National Conference on Artificial Intelligence (AAAI-99), 11th Innovative Applications of Artificial Intelligence Conference (IAAI-99), Orlando, FL, USA, 99/7/18.

Nakagawa Y, Okuno HG, Kitano H. Using vision to improve sound source separation. In Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA, United States: AAAI. 1999. p. 768-775

Nakagawa, Yukiko ; Okuno, Hiroshi G. ; Kitano, Hiroaki. / Using vision to improve sound source separation. Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA, United States : AAAI, 1999. pp. 768-775
@inbook{a4edf8a9db8e4c7a8da6d45bf4b32526,
title = "Using vision to improve sound source separation",
abstract = "We present a method of improving sound source separation using vision. The sound source separation is an essential function to accomplish auditory scene understanding by separating stream of sounds generated from multiple sound sources. By separating a stream of sounds, recognition process, such as speech recognition, can simply work on a single stream, not mixed sound of several speakers. The performance is known to be improved by using stereo/binaural microphone and microphone array which provides spatial information for separation. However, these methods still have more than 20 degree of positional ambiguities. In this paper, we further added visual information to provide more specific and accurate position information. As a result, separation capability was drastically improved. In addition, we found that the use of approximate direction information drastically improve object tracking accuracy of a simple vision system, which in turn improves performance of the auditory system. We claim that the integration of vision and auditory inputs improves performance of tasks in each perception, such as sound source separation and object tracking, by bootstrapping.",
author = "Yukiko Nakagawa and Okuno, {Hiroshi G.} and Hiroaki Kitano",
year = "1999",
language = "English",
isbn = "0262511061",
pages = "768--775",
booktitle = "Proceedings of the National Conference on Artificial Intelligence",
publisher = "AAAI",

}

TY - CHAP

T1 - Using vision to improve sound source separation

AU - Nakagawa, Yukiko

AU - Okuno, Hiroshi G.

AU - Kitano, Hiroaki

PY - 1999

Y1 - 1999

AB - We present a method of improving sound source separation using vision. The sound source separation is an essential function to accomplish auditory scene understanding by separating stream of sounds generated from multiple sound sources. By separating a stream of sounds, recognition process, such as speech recognition, can simply work on a single stream, not mixed sound of several speakers. The performance is known to be improved by using stereo/binaural microphone and microphone array which provides spatial information for separation. However, these methods still have more than 20 degree of positional ambiguities. In this paper, we further added visual information to provide more specific and accurate position information. As a result, separation capability was drastically improved. In addition, we found that the use of approximate direction information drastically improve object tracking accuracy of a simple vision system, which in turn improves performance of the auditory system. We claim that the integration of vision and auditory inputs improves performance of tasks in each perception, such as sound source separation and object tracking, by bootstrapping.

UR - http://www.scopus.com/inward/record.url?scp=0032596389&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032596389&partnerID=8YFLogxK

M3 - Chapter

SN - 0262511061

SP - 768

EP - 775

BT - Proceedings of the National Conference on Artificial Intelligence

PB - AAAI

CY - Menlo Park, CA, United States

ER -