Missing-feature-theory-based robust simultaneous speech recognition system with non-clean speech acoustic model

Toru Takahashi, Kazuhiro Nakadai, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

A humanoid robot must recognize a target speech signal while people around it chat in the real world. To recognize the target speech signal, the robot has to separate it from the other speech signals and then recognize the separated signal. Because the separated signal includes distortion, automatic speech recognition (ASR) performance degrades. To avoid this degradation, we train an acoustic model on non-clean speech signals so that it matches the acoustic features of the distorted signal, and we add white noise to the separated speech signal before extracting acoustic features. The issues are (1) determining the optimal noise level to add to the training speech signals, and (2) determining the optimal noise level to add to the separated signal. In this paper, we investigate how much noise should be added to clean speech data for training and how speech recognition performance improves for different positions of three talkers with soft masking. Experimental results show that the best performance is obtained by adding white noise at 30 dB. ASR with this acoustic model outperforms ASR with the clean acoustic model by 4 points.
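The noise-addition step the abstract describes (mixing white noise into the clean training speech and the separated signal at a chosen level) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name is hypothetical, and the 30 dB default simply reflects the best setting reported in the abstract:

```python
import numpy as np

def add_white_noise(speech, snr_db=30.0, rng=None):
    """Return `speech` plus white Gaussian noise at a target SNR in dB.

    Illustrative sketch of SNR-controlled noise addition; the default
    30 dB follows the setting reported in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    signal_power = np.mean(speech ** 2)                # mean power of the input
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)  # power for the target SNR
    noise = rng.normal(0.0, np.sqrt(noise_power), size=speech.shape)
    return speech + noise
```

The same routine would be applied both when building the non-clean training set and to each separated utterance before feature extraction, so that training and test conditions match.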

Original language: English
Title of host publication: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009
Pages: 2730-2735
Number of pages: 6
ISBN (Print): 9781424438044
DOI: 10.1109/IROS.2009.5354201
Publication status: Published - 2009 Dec 11
Externally published: Yes
Event: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009 - St. Louis, MO
Duration: 2009 Oct 11 - 2009 Oct 15


Fingerprint

  • Speech recognition
  • Acoustics
  • Robots
  • White noise
  • Signal distortion
  • Degradation

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Control and Systems Engineering

Cite this

Takahashi, T., Nakadai, K., Komatani, K., Ogata, T., & Okuno, H. G. (2009). Missing-feature-theory-based robust simultaneous speech recognition system with non-clean speech acoustic model. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009 (pp. 2730-2735). [5354201] https://doi.org/10.1109/IROS.2009.5354201

