Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

Research output: Contribution to journal › Conference article

68 Citations (Scopus)

Abstract

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
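The decoding scheme described in the abstract combines three scores for each beam-search hypothesis: the CTC prediction, the attention-based decoder prediction, and a separately trained LSTM language model. A minimal sketch of that log-linear combination is below; the function name and the weight values are illustrative assumptions, not the paper's tuned hyperparameters:

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm,
                ctc_weight=0.3, lm_weight=0.1):
    """Score one beam-search hypothesis by interpolating log-probabilities.

    The CTC and attention scores are mixed with a single interpolation
    weight, and the language-model score is added with its own weight
    (hypothetical values; the paper tunes these on a dev set).
    """
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)

# During beam search, hypotheses would be ranked by this combined score:
# e.g. joint_score(-1.0, -2.0, -3.0) == 0.3*(-1.0) + 0.7*(-2.0) + 0.1*(-3.0)
```

In practice the CTC score acts as a regularizer that discourages alignments the attention decoder might hallucinate, while the LM score rescopes the search toward fluent character sequences.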

Original language: English
Pages (from-to): 949-953
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOIs
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 to 2017 Aug 24

Keywords

  • Attention model
  • Connectionist temporal classification
  • Encoder-decoder
  • End-to-end speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
