Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

Research output: peer-reviewed

121 citations (Scopus)

Abstract

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During beam search, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems.
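
For orientation, the joint training and decoding scheme described in the abstract can be summarized by two equations. This is a sketch in our own notation, not quoted from the paper: X is the input acoustic feature sequence, C a character sequence, and λ and γ are the tunable CTC and language-model interpolation weights.

    \mathcal{L}_{\mathrm{MTL}} = \lambda \log p_{\mathrm{ctc}}(C \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(C \mid X)

    \hat{C} = \operatorname*{arg\,max}_{C} \Big[ \lambda \log p_{\mathrm{ctc}}(C \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(C \mid X) + \gamma \log p_{\mathrm{lm}}(C) \Big]

The first line is the multi-task training objective coupling the CTC and attention branches on top of the shared CNN encoder; the second is the beam-search score, which additionally interpolates the separately trained LSTM language model.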

Original language: English
Pages (from-to): 949-953
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOI
Publication status: Published - 2017
Externally published: Yes
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 20 Aug 2017 - 24 Aug 2017

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
