Acoustic Modeling for Distant Multi-talker Speech Recognition with Single- and Multi-channel Branches

Naoyuki Kanda, Yusuke Fujita, Shota Horiguchi, Rintaro Ikeshita, Kenji Nagamatsu, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

This paper presents a novel heterogeneous-input multi-channel acoustic model (AM) that has both single-channel and multi-channel input branches. In our proposed training pipeline, a single-channel AM is trained first, then a multi-channel AM is trained starting from the single-channel AM with a randomly initialized multi-channel input branch. Our model uniquely uses the power of a complemental speech enhancement (SE) module while exploiting the power of jointly trained AM and SE architecture. Our method was the foundation for the Hitachi/JHU CHiME-5 system that achieved the second-best result in the CHiME-5 competition, and this paper details various investigation results that we were not able to present during the competition period. We also evaluated and reconfirmed our method's effectiveness with the AMI Meeting Corpus. Our AM achieved a 30.12% word error rate (WER) for the development set and a 32.33% WER for the evaluation set for the AMI Corpus, both of which are the best results ever reported to the best of our knowledge.
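The two-stage pipeline in the abstract — train a single-channel AM, then attach a freshly initialized multi-channel input branch while warm-starting the rest from the single-channel model — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the layer sizes, the ReLU/linear layers, the use of channel 0 as the reference channel, and the additive branch combination are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 4 microphones,
# 40-dim features, 256-dim hidden layer, 100 output classes.
N_CH, N_FEAT, N_HID, N_OUT = 4, 40, 256, 100

# Stage 1: single-channel AM = single-channel input branch + shared trunk.
W_single = rng.standard_normal((N_FEAT, N_HID)) * 0.01  # input branch
W_trunk = rng.standard_normal((N_HID, N_OUT)) * 0.01    # shared upper layers

def single_channel_am(x_1ch):
    """x_1ch: (frames, N_FEAT) features from one microphone."""
    h = np.maximum(x_1ch @ W_single, 0.0)
    return h @ W_trunk

# Stage 2: add a randomly initialized multi-channel input branch;
# W_single and W_trunk are carried over from the stage-1 model.
W_multi = rng.standard_normal((N_CH * N_FEAT, N_HID)) * 0.01

def multi_channel_am(x_mch):
    """x_mch: (frames, N_CH, N_FEAT) features from all microphones."""
    h_multi = np.maximum(x_mch.reshape(len(x_mch), -1) @ W_multi, 0.0)
    h_single = np.maximum(x_mch[:, 0, :] @ W_single, 0.0)  # reference channel
    return (h_multi + h_single) @ W_trunk  # trunk is shared across branches

frames = rng.standard_normal((10, N_CH, N_FEAT))
print(multi_channel_am(frames).shape)  # (10, 100)
```

Both stages share the trunk, so stage 2 only has to learn the new multi-channel branch from scratch; in the actual paper the branches and trunk are deep networks trained with the usual AM objectives.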

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6630-6634
Number of pages: 5
ISBN (Electronic): 9781479981311
DOIs: 10.1109/ICASSP.2019.8682273
Publication status: Published - 2019 May 1
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 2019 May 12 - 2019 May 17

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 19/5/12 - 19/5/17

Keywords

  • Acoustic model
  • deep learning
  • speech enhancement
  • speech recognition

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Kanda, N., Fujita, Y., Horiguchi, S., Ikeshita, R., Nagamatsu, K., & Watanabe, S. (2019). Acoustic Modeling for Distant Multi-talker Speech Recognition with Single- and Multi-channel Branches. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 6630-6634). [8682273] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8682273
