Two-layered audio-visual speech recognition for robots in noisy environments

Takami Yoshida, Kazuhiro Nakadai, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Audio-visual (AV) integration is a key approach to improving perception in noisy real-world environments. This paper describes automatic speech recognition (ASR) based on AV integration to improve human-robot interaction. We developed an AV-integrated ASR system with two AV integration layers: voice activity detection (VAD) and ASR. However, the system faces three difficulties: 1) VAD and ASR have been studied separately, although the two processes are mutually dependent; 2) both assume that high-resolution images are available, although this assumption rarely holds in the real world; and 3) the weight between the audio and visual streams was fixed, although their reliabilities change with the environment. To solve these problems, we propose a new VAD algorithm that takes ASR characteristics into account, and a linear-regression-based method for estimating the optimal stream weight. We evaluate the algorithm on acoustically and/or visually contaminated data. Preliminary results show that VAD robustness improves even when image resolution is low, and that AVSR using the estimated stream weight demonstrates the effectiveness of AV integration.
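The stream-weighted integration described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's actual model: the log-likelihood combination rule is the standard multi-stream form, while the reliability features (`snr_db`, `face_size_px`) and regression coefficients are hypothetical placeholders.

```python
import numpy as np

def combine_stream_loglik(loglik_audio, loglik_visual, lam):
    """Multi-stream combination: audio weighted by lam, visual by (1 - lam)."""
    return lam * loglik_audio + (1.0 - lam) * loglik_visual

def estimate_weight(snr_db, face_size_px, coeffs=(0.5, 0.02, 0.001)):
    """Hypothetical linear regression mapping reliability features to a
    stream weight. coeffs = (intercept, slope_snr, slope_face) are
    illustrative values, not taken from the paper."""
    b0, b1, b2 = coeffs
    lam = b0 + b1 * snr_db + b2 * face_size_px
    return float(np.clip(lam, 0.0, 1.0))  # keep the weight in [0, 1]

# Example: moderate SNR and a small face region favor the audio stream slightly.
lam = estimate_weight(snr_db=10.0, face_size_px=32)
score = combine_stream_loglik(loglik_audio=-120.0, loglik_visual=-150.0, lam=lam)
```

The point of the regression step is that the weight adapts per utterance: as noise rises (lower SNR) or the face shrinks (lower image resolution), the estimated weight shifts toward the more reliable stream instead of staying fixed.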

Original language: English
Title of host publication: IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings
Pages: 988-993
Number of pages: 6
DOI: 10.1109/IROS.2010.5651205
ISBN: 9781424466757
Publication status: Published - 2010
Externally published: Yes
Event: 23rd IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Taipei
Duration: 2010 Oct 18 - 2010 Oct 22

Fingerprint

  • Speech recognition
  • Robots
  • Human robot interaction
  • Image resolution
  • Linear regression

ASJC Scopus subject areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Control and Systems Engineering

Cite this

Yoshida, T., Nakadai, K., & Okuno, H. G. (2010). Two-layered audio-visual speech recognition for robots in noisy environments. In IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems, IROS 2010 - Conference Proceedings (pp. 988-993). [5651205] https://doi.org/10.1109/IROS.2010.5651205

