Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition

Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, Najim Dehak

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this paper, we explore several new schemes for training a seq2seq model to integrate a pre-trained language model (LM). The proposed fusion methods focus on the memory cell state and the hidden state of the long short-term memory (LSTM) in the seq2seq decoder; unlike in prior studies, the memory cell state is updated by the LM. This means that the memory retained by the main seq2seq model is adjusted by the external LM. The fusion methods have several variants, depending on the architecture of this memory cell update and on the use of the memory cell and hidden states, which directly affects the final label inference. We performed experiments to show the effectiveness of the proposed methods in a monolingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resource language. On Librispeech, our best model improved WER by 3.7% on test-clean and 2.4% on test-other, relative to the shallow fusion baseline, with multilevel decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on the eval set by 9.9% in CER and 9.8% in WER, relative to the 2-stage transfer baseline.
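
The shallow fusion baseline named in the abstract is commonly described as a log-linear interpolation of the seq2seq and LM scores at decoding time. The following is a minimal sketch of that baseline only, not of the proposed memory-control methods. The function names, the LM weight, and the toy probabilities are illustrative assumptions, not values from the paper. The last helper shows how a relative error-rate improvement, such as the percentages quoted above, is usually computed.

```python
import math


def shallow_fusion_score(log_p_s2s: float, log_p_lm: float, lm_weight: float = 0.3) -> float:
    """Log-linear interpolation: seq2seq log-prob plus a weighted LM log-prob.

    lm_weight is a tunable hyperparameter; 0.3 is an arbitrary placeholder.
    """
    return log_p_s2s + lm_weight * log_p_lm


def pick_next_token(s2s_probs: dict, lm_probs: dict, lm_weight: float = 0.3) -> str:
    """Greedy single-step choice of the token with the highest fused score."""
    scores = {
        tok: shallow_fusion_score(math.log(p), math.log(lm_probs[tok]), lm_weight)
        for tok, p in s2s_probs.items()
    }
    return max(scores, key=scores.get)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative error-rate reduction in percent: 100 * (baseline - system) / baseline."""
    return 100.0 * (baseline - system) / baseline


# Toy distributions over a three-token vocabulary (made-up numbers).
s2s = {"their": 0.40, "there": 0.45, "they're": 0.15}
lm = {"their": 0.70, "there": 0.20, "they're": 0.10}

# The acoustic model alone slightly prefers "there"; the LM shifts the choice.
best_without_lm = max(s2s, key=s2s.get)
best_with_lm = pick_next_token(s2s, lm, lm_weight=0.5)
```

In this toy case the seq2seq model alone picks "there", while the fused score picks "their". In shallow fusion the LM only rescores the output; by contrast, the methods described in the abstract let the LM adjust the decoder's memory cell state during training and inference.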

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6191-6195
Number of pages: 5
ISBN (Electronic): 9781479981311
DOIs: https://doi.org/10.1109/ICASSP.2019.8683380
Publication status: Published - 2019 May 1
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 2019 May 12 to 2019 May 17

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 19/5/12 to 19/5/17

Keywords

  • Automatic speech recognition (ASR)
  • cold fusion
  • deep fusion
  • language model
  • sequence to sequence
  • shallow fusion

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Cho, J., Watanabe, S., Hori, T., Baskar, M. K., Inaguma, H., Villalba, J., & Dehak, N. (2019). Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 6191-6195). [8683380] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8683380

@inproceedings{11cfdd74e9ed414cbf678856cf732e01,
title = "Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition",
abstract = "In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained language model (LM). Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and the use of memory cell and hidden states which directly affects the final label inference. We performed the experiments to show the effectiveness of the proposed methods in a mono-lingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resourced language. In Librispeech, our best model improved WER by 3.7{\%}, 2.4{\%} for test clean, test other relatively to the shallow fusion baseline, with multilevel decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on eval set by 9.9{\%}, 9.8{\%} in CER, WER relatively to the 2-stage transfer baseline.",
keywords = "Automatic speech recognition (ASR), cold fusion, deep fusion, language model, sequence to sequence, shallow fusion",
author = "Jaejin Cho and Shinji Watanabe and Takaaki Hori and Baskar, {Murali Karthick} and Hirofumi Inaguma and Jesus Villalba and Najim Dehak",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/ICASSP.2019.8683380",
language = "English",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "6191--6195",
booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",

}

TY - GEN

T1 - Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition

AU - Cho, Jaejin

AU - Watanabe, Shinji

AU - Hori, Takaaki

AU - Baskar, Murali Karthick

AU - Inaguma, Hirofumi

AU - Villalba, Jesus

AU - Dehak, Najim

PY - 2019/5/1

Y1 - 2019/5/1

AB - In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained language model (LM). Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and the use of memory cell and hidden states which directly affects the final label inference. We performed the experiments to show the effectiveness of the proposed methods in a mono-lingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resourced language. In Librispeech, our best model improved WER by 3.7%, 2.4% for test clean, test other relatively to the shallow fusion baseline, with multilevel decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on eval set by 9.9%, 9.8% in CER, WER relatively to the 2-stage transfer baseline.

KW - Automatic speech recognition (ASR)

KW - cold fusion

KW - deep fusion

KW - language model

KW - sequence to sequence

KW - shallow fusion

UR - http://www.scopus.com/inward/record.url?scp=85069000387&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069000387&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2019.8683380

DO - 10.1109/ICASSP.2019.8683380

M3 - Conference contribution

AN - SCOPUS:85069000387

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 6191

EP - 6195

BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -