Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

Research output: Contribution to journal › Conference article

55 Citations (Scopus)

Abstract

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
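The abstract compresses two technical ideas: a multi-task training objective that interpolates the CTC loss with the attention decoder's cross-entropy, and a joint beam search that ranks hypotheses by a weighted combination of the CTC score, the attention-decoder score, and an LSTM language model score. A minimal PyTorch sketch of both follows; the function names, tensor shapes, and weight values (ctc_weight=0.3, lm_weight=0.5) are illustrative assumptions, not the authors' implementation.

import torch.nn as nn
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, att_logits, targets,
                             input_lengths, target_lengths,
                             ctc_weight=0.3, blank_id=0, pad_id=-100):
    """Multi-task objective: ctc_weight * L_CTC + (1 - ctc_weight) * L_att."""
    # CTC branch: ctc_log_probs is (T, B, V) log-softmax output of the CTC head.
    # Padded target positions are clamped to 0; CTCLoss reads only the first
    # target_lengths[b] labels of each row, so the padding value is irrelevant.
    ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)(
        ctc_log_probs, targets.clamp(min=0), input_lengths, target_lengths)
    # Attention branch: teacher-forced character-level cross-entropy over the
    # decoder logits (B, L, V), transposed to (B, V, L) for cross_entropy.
    att_loss = F.cross_entropy(att_logits.transpose(1, 2), targets,
                               ignore_index=pad_id)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

def joint_decode_score(log_p_ctc, log_p_att, log_p_lm,
                       ctc_weight=0.3, lm_weight=0.5):
    """Rank a beam-search hypothesis by interpolating the CTC score, the
    attention-decoder score, and a separately trained LSTM-LM score."""
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)

During decoding, joint_decode_score would be applied to each partial hypothesis in the beam; the interpolation weights here are tuning knobs, not values reported in this record.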

Original language: English
Pages (from-to): 949-953
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOI: 10.21437/Interspeech.2017-1296
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 - 2017 Aug 24


Keywords

  • Attention model
  • Connectionist temporal classification
  • Encoder-decoder
  • End-to-end speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. / Hori, Takaaki; Watanabe, Shinji; Zhang, Yu; Chan, William.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2017-August, 01.01.2017, p. 949-953.

Research output: Contribution to journal › Conference article

@article{3132409069354fa087182c81827e780d,
title = "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM",
abstract = "We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10{\%} error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.",
keywords = "Attention model, Connectionist temporal classification, Encoder-decoder, End-to-end speech recognition",
author = "Takaaki Hori and Shinji Watanabe and Yu Zhang and William Chan",
year = "2017",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2017-1296",
language = "English",
volume = "2017-August",
pages = "949--953",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY  - JOUR
T1  - Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM
AU  - Hori, Takaaki
AU  - Watanabe, Shinji
AU  - Zhang, Yu
AU  - Chan, William
PY  - 2017/1/1
Y1  - 2017/1/1
N2  - We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
AB  - We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
KW  - Attention model
KW  - Connectionist temporal classification
KW  - Encoder-decoder
KW  - End-to-end speech recognition
UR  - http://www.scopus.com/inward/record.url?scp=85039169903&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85039169903&partnerID=8YFLogxK
U2  - 10.21437/Interspeech.2017-1296
DO  - 10.21437/Interspeech.2017-1296
M3  - Conference article
AN  - SCOPUS:85039169903
VL  - 2017-August
SP  - 949
EP  - 953
JO  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN  - 2308-457X
ER  -