Joint CTC/attention decoding for end-to-end speech recognition

Takaaki Hori, Shinji Watanabe, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Citations (Scopus)

Abstract

End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as a pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively exploits the advantages of both in decoding. We applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and show performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
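The joint decoding described in the abstract combines the CTC and attention model scores when ranking hypotheses. Below is a minimal illustrative sketch of that score combination, not the authors' implementation: it assumes per-hypothesis log-probabilities from both models are already available, and the interpolation weight lam and the rescoring-style usage shown here are assumptions made purely for illustration (the paper also applies the same weighted combination inside one-pass beam search).

    # Illustrative sketch (not the authors' code): joint CTC/attention scoring.
    # Assumes per-hypothesis log-probabilities from both models are given;
    # `lam` is a hypothetical interpolation weight.
    import math

    def joint_score(log_p_ctc, log_p_att, lam=0.3):
        """Combine CTC and attention log-probabilities for one hypothesis."""
        return lam * log_p_ctc + (1.0 - lam) * log_p_att

    def rescore(hypotheses, lam=0.3):
        """Sort beam hypotheses by the joint score.

        `hypotheses` is a list of (token_sequence, log_p_ctc, log_p_att) tuples.
        """
        return sorted(hypotheses,
                      key=lambda h: joint_score(h[1], h[2], lam),
                      reverse=True)

    # Toy usage: two competing hypotheses for the same utterance.
    hyps = [("hello world", math.log(0.20), math.log(0.35)),
            ("hello word",  math.log(0.25), math.log(0.30))]
    best = rescore(hyps)[0]
    print(best[0])  # "hello world" wins under the combined score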

Original language: English
Title of host publication: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 518-529
Number of pages: 12
Volume: 1
ISBN (Electronic): 9781945626753
DOI: 10.18653/v1/P17-1048
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: 2017 Jul 30 - 2017 Aug 4


ASJC Scopus subject areas

  • Language and Linguistics
  • Artificial Intelligence
  • Software
  • Linguistics and Language

Cite this

Hori, T., Watanabe, S., & Hershey, J. R. (2017). Joint CTC/attention decoding for end-to-end speech recognition. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 1, pp. 518-529). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-1048
