TY - GEN
T1 - Missing-feature-theory-based robust simultaneous speech recognition system with non-clean speech acoustic model
AU - Takahashi, Toru
AU - Nakadai, Kazuhiro
AU - Komatani, Kazunori
AU - Ogata, Tetsuya
AU - Okuno, Hiroshi G.
PY - 2009/12/11
Y1 - 2009/12/11
N2 - A humanoid robot must recognize a target speech signal while people around it chat in real-world environments. To do so, the robot has to separate the target speech signal from the other speech signals and then recognize the separated signal. Because the separated signal contains distortion, automatic speech recognition (ASR) performance degrades. To avoid this degradation, we train an acoustic model on non-clean speech signals so that it matches the acoustic features of distorted signals, and we add white noise to the separated speech signal before extracting acoustic features. Two issues arise: (1) determining the optimal noise level to add to the training speech signals, and (2) determining the optimal noise level to add to the separated signal. In this paper, we investigate how much noise should be added to clean speech data for training, and how recognition performance improves, with soft masking, for different positions of three talkers. Experimental results show that the best performance is obtained by adding white noise at 30 dB. ASR with the resulting acoustic model outperforms ASR with a clean acoustic model by 4 points.
AB - A humanoid robot must recognize a target speech signal while people around it chat in real-world environments. To do so, the robot has to separate the target speech signal from the other speech signals and then recognize the separated signal. Because the separated signal contains distortion, automatic speech recognition (ASR) performance degrades. To avoid this degradation, we train an acoustic model on non-clean speech signals so that it matches the acoustic features of distorted signals, and we add white noise to the separated speech signal before extracting acoustic features. Two issues arise: (1) determining the optimal noise level to add to the training speech signals, and (2) determining the optimal noise level to add to the separated signal. In this paper, we investigate how much noise should be added to clean speech data for training, and how recognition performance improves, with soft masking, for different positions of three talkers. Experimental results show that the best performance is obtained by adding white noise at 30 dB. ASR with the resulting acoustic model outperforms ASR with a clean acoustic model by 4 points.
UR - http://www.scopus.com/inward/record.url?scp=76249127411&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=76249127411&partnerID=8YFLogxK
U2 - 10.1109/IROS.2009.5354201
DO - 10.1109/IROS.2009.5354201
M3 - Conference contribution
AN - SCOPUS:76249127411
SN - 9781424438044
T3 - 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009
SP - 2730
EP - 2735
BT - 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009
T2 - 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009
Y2 - 11 October 2009 through 15 October 2009
ER -