Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, Tomoki Hayashi

Research output: Contribution to journal › Article

26 Citations (Scopus)

Abstract

Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.
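
The core idea in the abstract is a simple interpolation: during training the CTC and attention losses are combined in a multiobjective framework, and during decoding the two models' scores are combined the same way inside a one-pass beam search. A minimal sketch of that interpolation follows; the weight name `lam` follows the paper's λ convention, but the function names and the flat-score interface are illustrative, not the paper's actual implementation:

```python
def multiobjective_loss(loss_ctc: float, loss_att: float, lam: float = 0.3) -> float:
    """Training objective: L = lam * L_ctc + (1 - lam) * L_att, with 0 <= lam <= 1."""
    assert 0.0 <= lam <= 1.0
    return lam * loss_ctc + (1.0 - lam) * loss_att


def joint_decoding_score(logp_ctc: float, logp_att: float, lam: float = 0.3) -> float:
    """Decoding: combined log-probability used to rank hypotheses in the beam search."""
    return lam * logp_ctc + (1.0 - lam) * logp_att
```

Setting `lam = 0` recovers a pure attention model and `lam = 1` a pure CTC model; intermediate values give the hybrid behavior the abstract describes.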

Original language: English
Article number: 8068205
Pages (from-to): 1240-1253
Number of pages: 14
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 11
Issue number: 8
DOI: 10.1109/JSTSP.2017.2763455
Publication status: Published - 1 Dec 2017
Externally published: Yes

Keywords

  • attention mechanism
  • Automatic speech recognition
  • connectionist temporal classification
  • end-to-end
  • hybrid CTC/attention

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. / Watanabe, Shinji; Hori, Takaaki; Kim, Suyoun; Hershey, John R.; Hayashi, Tomoki.

In: IEEE Journal on Selected Topics in Signal Processing, Vol. 11, No. 8, 8068205, 01.12.2017, p. 1240-1253.

Watanabe, Shinji; Hori, Takaaki; Kim, Suyoun; Hershey, John R.; Hayashi, Tomoki. Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. In: IEEE Journal on Selected Topics in Signal Processing. 2017; Vol. 11, No. 8, pp. 1240-1253.
@article{e9edd3c982e9446ead3d53ab9ac76004,
title = "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition",
abstract = "Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.",
keywords = "attention mechanism, Automatic speech recognition, connectionist temporal classification, end-to-end, hybrid CTC/attention",
author = "Shinji Watanabe and Takaaki Hori and Suyoun Kim and Hershey, {John R.} and Tomoki Hayashi",
year = "2017",
month = "12",
day = "1",
doi = "10.1109/JSTSP.2017.2763455",
language = "English",
volume = "11",
pages = "1240--1253",
journal = "IEEE Journal on Selected Topics in Signal Processing",
issn = "1932-4553",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "8",
}

TY - JOUR

T1 - Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

AU - Watanabe, Shinji

AU - Hori, Takaaki

AU - Kim, Suyoun

AU - Hershey, John R.

AU - Hayashi, Tomoki

PY - 2017/12/1

Y1 - 2017/12/1

N2 - Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative to greatly simplify the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR; attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, and connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ the multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and exhibits performance that is comparable to conventional DNN/HMM ASR systems based on the advantages of both multiobjective learning and joint decoding without linguistic resources.

KW - attention mechanism

KW - Automatic speech recognition

KW - connectionist temporal classification

KW - end-to-end

KW - hybrid CTC/attention

UR - http://www.scopus.com/inward/record.url?scp=85041777531&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041777531&partnerID=8YFLogxK

U2 - 10.1109/JSTSP.2017.2763455

DO - 10.1109/JSTSP.2017.2763455

M3 - Article

VL - 11

SP - 1240

EP - 1253

JO - IEEE Journal on Selected Topics in Signal Processing

JF - IEEE Journal on Selected Topics in Signal Processing

SN - 1932-4553

IS - 8

M1 - 8068205

ER -