Joint CTC/attention decoding for end-to-end speech recognition

Takaaki Hori, Shinji Watanabe, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

28 Citations (Scopus)

Abstract

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
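
The abstract does not spell out how the two models are combined during decoding. A minimal sketch, assuming the combination is a log-linear interpolation of the CTC and attention log-probabilities used to rescore a list of candidate transcripts, might look like the following. The function names, the `ctc_weight` parameter, and the toy scores are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from typing import List, Tuple

# Each hypothesis: (text, log p_ctc(text | audio), log p_att(text | audio)).
Hypothesis = Tuple[str, float, float]


def joint_score(log_p_ctc: float, log_p_att: float, ctc_weight: float) -> float:
    """Interpolate the two log-probabilities (assumed combination rule).

    ctc_weight = 1.0 uses only the CTC score, 0.0 only the attention score.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att


def rescore(hypotheses: List[Hypothesis], ctc_weight: float) -> List[Tuple[str, float]]:
    """Return (text, joint score) pairs sorted from best to worst."""
    scored = [
        (text, joint_score(log_p_ctc, log_p_att, ctc_weight))
        for text, log_p_ctc, log_p_att in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy scores, invented for illustration only.
HYPOTHESES: List[Hypothesis] = [
    ("a b c", -2.0, -1.5),
    ("a b", -1.0, -4.0),      # favoured by CTC alone
    ("a b c d", -6.0, -1.0),  # favoured by attention alone
]

if __name__ == "__main__":
    for weight in (0.0, 0.5, 1.0):
        best_text, best_score = rescore(HYPOTHESES, weight)[0]
        print(f"ctc_weight={weight}: best={best_text!r} score={best_score}")
```

In this toy setup, each model on its own prefers a different transcript, while an intermediate weight selects the hypothesis that both models rate reasonably well. That is the intuition behind using both scores in decoding.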

Original language: English
Title of host publication: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 518-529
Number of pages: 12
Volume: 1
ISBN (Electronic): 9781945626753
DOIs: https://doi.org/10.18653/v1/P17-1048
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: 2017 Jul 30 - 2017 Aug 4

Other

Other: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Country: Canada
City: Vancouver
Period: 17/7/30 - 17/8/4

ASJC Scopus subject areas

  • Language and Linguistics
  • Artificial Intelligence
  • Software
  • Linguistics and Language

Cite this

Hori, T., Watanabe, S., & Hershey, J. R. (2017). Joint CTC/attention decoding for end-to-end speech recognition. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 1, pp. 518-529). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-1048