Multichannel end-to-end speech recognition

Tsubasa Ochiai, Shinji Watanabe, Takaaki Hori, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.
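The beamforming front-end described above can be pictured as a filter-and-sum operation in the short-time Fourier transform (STFT) domain: each microphone channel is multiplied by a per-frequency complex filter and the channels are summed, producing one enhanced signal for the encoder. The sketch below is a generic illustration of that operation, not the paper's exact formulation (the paper estimates the filter coefficients with a neural network trained end-to-end); all names and shapes here are illustrative assumptions.

```python
import numpy as np

def filter_and_sum(stft, weights):
    """Time-invariant filter-and-sum beamforming.

    stft:    complex array, shape (channels, frames, freq_bins)
    weights: complex array, shape (channels, freq_bins); in the
             multichannel end-to-end model these would be produced
             by a trainable network rather than fixed by hand.
    Returns the enhanced single-channel STFT, shape (frames, freq_bins).
    """
    # Multiply each channel by the conjugate of its per-frequency
    # filter, then sum over the channel axis.
    return np.einsum('cf,ctf->tf', weights.conj(), stft)

# Toy usage: 4 channels, 50 frames, 257 frequency bins.
rng = np.random.default_rng(0)
C, T, F = 4, 50, 257
stft = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))

# Uniform weights reduce to averaging the channels (delay-and-sum
# with zero steering delay); a learned beamformer replaces these.
weights = np.full((C, F), 1.0 / C, dtype=complex)
enhanced = filter_and_sum(stft, weights)
print(enhanced.shape)  # (50, 257)
```

Because the whole operation is a differentiable linear map, gradients from the recognition loss can flow back through it into whatever network produces `weights`, which is what makes joint optimization with the recognizer possible.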

Original language: English
Title of host publication: 34th International Conference on Machine Learning, ICML 2017
Publisher: International Machine Learning Society (IMLS)
Pages: 4033-4042
Number of pages: 10
Volume: 6
ISBN (Electronic): 9781510855144
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 34th International Conference on Machine Learning, ICML 2017 - Sydney, Australia
Duration: 2017 Aug 6 - 2017 Aug 11

Other

Other: 34th International Conference on Machine Learning, ICML 2017
Country: Australia
City: Sydney
Period: 17/8/6 - 17/8/11

Fingerprint

Speech recognition
Acoustics
Speech enhancement
Hidden Markov models
Microphones
Beamforming
Acoustic noise
Signal processing
Neural networks
Experiments

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Human-Computer Interaction
  • Software

Cite this

Ochiai, T., Watanabe, S., Hori, T., & Hershey, J. R. (2017). Multichannel end-to-end speech recognition. In 34th International Conference on Machine Learning, ICML 2017 (Vol. 6, pp. 4033-4042). International Machine Learning Society (IMLS).

@inproceedings{0e98a09666904b429ef32625619acba5,
title = "Multichannel end-to-end speech recognition",
abstract = "The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.",
author = "Tsubasa Ochiai and Shinji Watanabe and Takaaki Hori and Hershey, {John R.}",
year = "2017",
month = "1",
day = "1",
language = "English",
volume = "6",
pages = "4033--4042",
booktitle = "34th International Conference on Machine Learning, ICML 2017",
publisher = "International Machine Learning Society (IMLS)",

}

TY - GEN

T1 - Multichannel end-to-end speech recognition

AU - Ochiai, Tsubasa

AU - Watanabe, Shinji

AU - Hori, Takaaki

AU - Hershey, John R.

PY - 2017/1/1

Y1 - 2017/1/1

N2 - The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.

AB - The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.

UR - http://www.scopus.com/inward/record.url?scp=85041785137&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041785137&partnerID=8YFLogxK

M3 - Conference contribution

VL - 6

SP - 4033

EP - 4042

BT - 34th International Conference on Machine Learning, ICML 2017

PB - International Machine Learning Society (IMLS)

ER -