TY - GEN
T1 - Acoustic Modeling for Distant Multi-talker Speech Recognition with Single- and Multi-channel Branches
AU - Kanda, Naoyuki
AU - Fujita, Yusuke
AU - Horiguchi, Shota
AU - Ikeshita, Rintaro
AU - Nagamatsu, Kenji
AU - Watanabe, Shinji
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - This paper presents a novel heterogeneous-input multi-channel acoustic model (AM) that has both single-channel and multi-channel input branches. In our proposed training pipeline, a single-channel AM is trained first; a multi-channel AM is then trained, starting from the single-channel AM, with a randomly initialized multi-channel input branch. Our model uniquely uses the power of a complemental speech enhancement (SE) module while exploiting the power of a jointly trained AM and SE architecture. Our method was the foundation of the Hitachi/JHU CHiME-5 system that achieved the second-best result in the CHiME-5 competition, and this paper details various investigation results that we were not able to present during the competition period. We also evaluated our method on the AMI Meeting Corpus and reconfirmed its effectiveness. Our AM achieved a 30.12% word error rate (WER) on the AMI development set and a 32.33% WER on the evaluation set, both of which are, to the best of our knowledge, the best results reported to date.
KW - Acoustic model
KW - deep learning
KW - speech enhancement
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85068969438&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068969438&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682273
DO - 10.1109/ICASSP.2019.8682273
M3 - Conference contribution
AN - SCOPUS:85068969438
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6630
EP - 6634
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -