Stream Attention-based Multi-array End-to-end Speech Recognition

Xiaofei Wang, Ruizhi Li, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in far-field robustness. Taking advantage of the information that each array shares and contributes is crucial to this task. Motivated by advances in the joint Connectionist Temporal Classification (CTC)/attention mechanism for End-to-End (E2E) ASR, a stream attention-based multi-array framework is proposed in this work. The microphone arrays, acting as information streams, are processed by separate encoders and decoded under the guidance of both CTC and attention networks. For attention, a hierarchical structure is adopted: on top of the regular attention networks, stream attention is introduced to steer the decoder toward the most informative encoders. Experiments were conducted on the AMI and DIRHA multi-array corpora using the encoder-decoder architecture. Compared with the best single-array results, the proposed framework achieved relative Word Error Rate (WER) reductions of 3.7% and 9.7% on the two datasets, respectively, also outperforming conventional strategies.
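
To make the hierarchical attention described above concrete, below is a minimal PyTorch sketch, not the authors' implementation: each array's encoder output gets its own frame-level attention, and a second, stream-level attention then weights the resulting per-stream context vectors so the decoder can favor the most informative encoders. All module names, dimensions, and the choice of additive (Bahdanau-style) attention are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over a sequence of vectors."""
    def __init__(self, enc_dim: int, dec_dim: int, att_dim: int):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, att_dim, bias=False)
        self.score = nn.Linear(att_dim, 1, bias=False)

    def forward(self, enc: torch.Tensor, dec_state: torch.Tensor) -> torch.Tensor:
        # enc: (batch, length, enc_dim); dec_state: (batch, dec_dim)
        energy = self.score(torch.tanh(self.w_enc(enc) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(energy, dim=1)       # attend over the length axis
        return (weights * enc).sum(dim=1)            # (batch, enc_dim) context vector

class HierarchicalStreamAttention(nn.Module):
    """Frame-level attention per stream, then stream-level attention over
    the per-stream context vectors, as sketched in the abstract."""
    def __init__(self, enc_dim: int, dec_dim: int, att_dim: int, num_streams: int):
        super().__init__()
        # One frame-level attention per microphone-array encoder.
        self.frame_att = nn.ModuleList(
            [AdditiveAttention(enc_dim, dec_dim, att_dim) for _ in range(num_streams)]
        )
        # Stream-level attention scores the per-stream context vectors.
        self.stream_att = AdditiveAttention(enc_dim, dec_dim, att_dim)

    def forward(self, enc_outs: list, dec_state: torch.Tensor) -> torch.Tensor:
        # enc_outs: one (batch, time_i, enc_dim) tensor per microphone array.
        contexts = [att(h, dec_state) for att, h in zip(self.frame_att, enc_outs)]
        stacked = torch.stack(contexts, dim=1)       # (batch, num_streams, enc_dim)
        return self.stream_att(stacked, dec_state)   # fused (batch, enc_dim) context

# Example: two arrays with different numbers of frames.
fusion = HierarchicalStreamAttention(enc_dim=320, dec_dim=320, att_dim=256, num_streams=2)
enc1 = torch.randn(4, 120, 320)          # array 1 encoder output
enc2 = torch.randn(4, 95, 320)           # array 2 encoder output
dec_state = torch.randn(4, 320)          # current decoder hidden state
fused = fusion([enc1, enc2], dec_state)  # (4, 320)

In the paper's joint CTC/attention setting, such a fused context vector would feed a shared decoder at every output step while per-stream CTC branches score the encoders; the sketch covers only the attention fusion.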

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7105-7109
Number of pages: 5
ISBN (Electronic): 9781479981311
DOIs: https://doi.org/10.1109/ICASSP.2019.8682650
Publication status: Published - 2019 May 1
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 2019 May 12 → 2019 May 17

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 19/5/12 → 19/5/17

Keywords

  • End-to-End Speech Recognition
  • Joint CTC/Attention
  • Multiple Microphone Array
  • Stream Attention

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Wang, X., Li, R., Mallidi, S. H., Hori, T., Watanabe, S., & Hermansky, H. (2019). Stream Attention-based Multi-array End-to-end Speech Recognition. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 7105-7109). [8682650] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8682650
