Multi-head decoder for end-to-end speech recognition

Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Kazuya Takeda

Research output: Contribution to journal › Conference article

Abstract

This paper presents a new network architecture, called a multi-head decoder, for end-to-end speech recognition as an extension of the multi-head attention model. In the multi-head attention model, multiple attention distributions are computed and then integrated into a single attention. In contrast, instead of integrating at the attention level, our proposed method uses a separate decoder for each attention head and integrates the decoder outputs to generate the final output. Furthermore, to make each head capture different modalities, a different attention function is used for each head, which improves recognition performance through an ensemble effect. To evaluate the effectiveness of the proposed method, we conduct an experimental evaluation using the Corpus of Spontaneous Japanese. Experimental results demonstrate that the proposed method outperforms conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.
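
To make the idea concrete, below is a minimal sketch of a multi-head decoder step, written in PyTorch. It is not the authors' implementation: the choice of attention functions (dot-product and additive), the integration of head outputs by averaging their logits, and all class, module, and parameter names are illustrative assumptions made for this example only.

import torch
import torch.nn as nn


class DotAttention(nn.Module):
    # Simple dot-product attention over encoder states (illustrative choice).
    def forward(self, query, enc_states):
        # query: (B, D) decoder state, enc_states: (B, T, D) encoder outputs
        scores = torch.bmm(enc_states, query.unsqueeze(-1)).squeeze(-1)   # (B, T)
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)     # (B, D)


class AdditiveAttention(nn.Module):
    # Bahdanau-style additive attention, used here as a second attention type.
    def __init__(self, dim):
        super().__init__()
        self.w_query = nn.Linear(dim, dim)
        self.w_enc = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, query, enc_states):
        energy = torch.tanh(self.w_enc(enc_states) + self.w_query(query).unsqueeze(1))
        weights = torch.softmax(self.v(energy).squeeze(-1), dim=-1)       # (B, T)
        return torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)     # (B, D)


class MultiHeadDecoder(nn.Module):
    # One decoder (GRU cell) per attention head; head outputs are integrated
    # at the output level rather than at the attention level.
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # A different attention function per head, as described in the abstract.
        self.attentions = nn.ModuleList([DotAttention(), AdditiveAttention(dim)])
        self.decoders = nn.ModuleList([nn.GRUCell(2 * dim, dim) for _ in self.attentions])
        self.outputs = nn.ModuleList([nn.Linear(dim, vocab_size) for _ in self.attentions])

    def forward(self, prev_token, states, enc_states):
        # prev_token: (B,) previous output token ids
        # states: list of per-head decoder states, each (B, D)
        # enc_states: (B, T, D) encoder outputs
        emb = self.embed(prev_token)
        logits, new_states = [], []
        for att, dec, out, h in zip(self.attentions, self.decoders, self.outputs, states):
            context = att(h, enc_states)                       # head-specific attention
            h_new = dec(torch.cat([emb, context], dim=-1), h)  # head-specific decoder
            logits.append(out(h_new))
            new_states.append(h_new)
        # Integrate decoder outputs; averaging the logits is one possible choice.
        return torch.stack(logits).mean(dim=0), new_states


if __name__ == "__main__":
    B, T, D, V = 4, 50, 256, 100                    # batch, frames, hidden dim, vocab
    decoder = MultiHeadDecoder(D, V)
    enc_states = torch.randn(B, T, D)               # stand-in for encoder outputs
    states = [torch.zeros(B, D) for _ in decoder.decoders]
    logits, states = decoder(torch.zeros(B, dtype=torch.long), states, enc_states)
    print(logits.shape)                             # torch.Size([4, 100])

The key difference from multi-head attention is visible in the loop: each head keeps its own decoder state and produces its own logits, and integration happens only at the output level.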

Original language: English
Pages (from-to): 801-805
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1655
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 - 2018 Sep 6

Keywords

  • Attention
  • Dynamical neural network
  • End-to-end
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Multi-head decoder for end-to-end speech recognition. / Hayashi, Tomoki; Watanabe, Shinji; Toda, Tomoki; Takeda, Kazuya.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 801-805.

Research output: Contribution to journal › Conference article

@article{31c2d254e3e942c28aaece2a059eff2a,
title = "Multi-head decoder for end-to-end speech recognition",
abstract = "This paper presents a new network architecture called multi-head decoder for end-to-end speech recognition as an extension of a multi-head attention model. In the multi-head attention model, multiple attentions are calculated, and then, they are integrated into a single attention. On the other hand, instead of the integration in the attention level, our proposed method uses multiple decoders for each attention and integrates their outputs to generate a final output. Furthermore, in order to make each head to capture the different modalities, different attention functions are used for each head, leading to the improvement of the recognition performance with an ensemble effect. To evaluate the effectiveness of our proposed method, we conduct an experimental evaluation using Corpus of Spontaneous Japanese. Experimental results demonstrate that our proposed method outperforms the conventional methods such as location-based and multi-head attention models, and that it can capture different speech/linguistic contexts within the attention-based encoder-decoder framework.",
keywords = "Attention, Dynamical neural network, End-to-end, Speech recognition",
author = "Tomoki Hayashi and Shinji Watanabe and Tomoki Toda and Kazuya Takeda",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1655",
language = "English",
volume = "2018-September",
pages = "801--805",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}
