Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition

Takaaki Hori, Shinji Watanabe, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture can provide for open vocabulary recognition, character-based LMs generally under-perform relative to word LMs for languages such as English with a small alphabet, because of the difficulty of modeling linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides scores for out-of-vocabulary words. In a standard Wall Street Journal (WSJ) task, we achieved 5.6% WER for the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.
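The multi-level scoring strategy described in the abstract can be illustrated with a minimal sketch. The toy LMs below (`VOCAB`, `CHAR_PROB`, and the two scoring functions) are hypothetical stand-ins for the paper's trained character- and word-level RNN-LMs; this is not the authors' implementation, only the scoring logic at word boundaries: accumulate character-LM scores within a word, then replace them with the word-LM score when the word is known, otherwise keep the character-LM score as the out-of-vocabulary fallback.

```python
import math

# Hypothetical toy models standing in for trained RNN-LMs.
VOCAB = {"the": 0.5, "cat": 0.3, "sat": 0.2}  # word -> P(word)
CHAR_PROB = 1.0 / 27                          # uniform over a-z + space

def char_lm_logp(c):
    """Log-probability of one character under the (toy) character LM."""
    return math.log(CHAR_PROB)

def word_lm_logp(w):
    """Log-probability of a known word under the (toy) word LM."""
    return math.log(VOCAB[w])

def multilevel_score(hypothesis):
    """Score a character sequence with multi-level LM decoding:
    accumulate character-LM scores; at each word boundary, replace the
    accumulated in-word character score with the word-LM score when the
    word is known, otherwise keep the character score (OOV fallback)."""
    total = 0.0
    word_chars, word_char_score = [], 0.0
    for c in hypothesis + " ":  # trailing space flushes the last word
        if c == " ":
            word = "".join(word_chars)
            if word in VOCAB:
                total += word_lm_logp(word)  # re-score known word
            else:
                total += word_char_score     # OOV: keep char-LM score
            word_chars, word_char_score = [], 0.0
        else:
            word_chars.append(c)
            word_char_score += char_lm_logp(c)  # provisional char score
    return total
```

In beam-search decoding this substitution happens incrementally per hypothesis rather than over a complete string, but the boundary-triggered rescoring is the same idea.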

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 287-293
Number of pages: 7
Volume: 2018-January
ISBN (Electronic): 9781509047888
DOI: 10.1109/ASRU.2017.8268948
Publication status: Published - 24 Jan 2018
Externally published: Yes
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 16 Dec 2017 - 20 Dec 2017



Keywords

  • attention decoder
  • connectionist temporal classification
  • decoding
  • end-to-end speech recognition
  • language modeling

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction

Cite this

Hori, T., Watanabe, S., & Hershey, J. R. (2018). Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (Vol. 2018-January, pp. 287-293). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8268948
