Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling

Jaejin Cho, Murali Karthick Baskar, Ruizhi Li, Matthew Wiesner, Sri Harish Mallidi, Nelson Yalta, Martin Karafiat, Shinji Watanabe, Takaaki Hori

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. It benefits from training the model without a lexicon or alignments. However, this poses a new problem: such models require more data than conventional DNN-HMM systems. In this work, we use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it to 4 other BABEL languages using a transfer learning approach. We also explore different architectures for improving the prior multilingual seq2seq model. The paper further discusses the effect of integrating a recurrent neural network language model (RNNLM) with the seq2seq model during decoding. Experimental results show that transfer learning from the multilingual model yields substantial gains over monolingual models across all 4 BABEL languages. Incorporating an RNNLM also brings significant improvements in %WER, achieving recognition performance comparable to models trained with twice as much training data.
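The abstract mentions integrating an RNNLM with the seq2seq model during decoding. One common realization of this idea is shallow fusion, where the RNNLM log-probability is added to the seq2seq score with an interpolation weight during beam search. The sketch below is only an illustrative assumption, not the authors' implementation: the scorer callables, the weight value of 0.3, and the toy uniform scorers in the usage example are all placeholders.

import math
from typing import Callable, List, Tuple

def shallow_fusion_beam_search(
    s2s_logprob: Callable[[List[str], str], float],   # log P_s2s(token | prefix, audio)
    lm_logprob: Callable[[List[str], str], float],    # log P_lm(token | prefix)
    vocab: List[str],
    lm_weight: float = 0.3,      # interpolation weight (lambda); illustrative value
    beam_size: int = 4,
    max_len: int = 10,
    eos: str = "<eos>",
) -> List[str]:
    """Return the best hypothesis under score = log P_s2s + lambda * log P_lm."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates: List[Tuple[List[str], float]] = []
        for prefix, score in beams:
            for tok in vocab + [eos]:
                # Log-linear combination of the two models' scores (shallow fusion).
                new_score = score + s2s_logprob(prefix, tok) + lm_weight * lm_logprob(prefix, tok)
                candidates.append((prefix + [tok], new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[: beam_size * 2]:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # hypothesis ended; set aside
            else:
                beams.append((prefix, score))      # keep expanding this hypothesis
            if len(beams) == beam_size:
                break
        if not beams:
            break
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]

# Toy usage with uniform stand-in scorers (illustration only).
if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    uniform = lambda prefix, tok: math.log(1.0 / (len(vocab) + 1))
    print(shallow_fusion_beam_search(uniform, uniform, vocab))

In a real system the two scorer callables would wrap the trained seq2seq decoder (conditioned on the encoded audio) and the character-level RNNLM, and the interpolation weight would be tuned on a development set.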

Original language: English
Title of host publication: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 521-527
Number of pages: 7
ISBN (Electronic): 9781538643341
DOI: 10.1109/SLT.2018.8639655
Publication status: Published - 2019 Feb 11
Externally published: Yes
Event: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Athens, Greece
Duration: 2018 Dec 18 - 2018 Dec 21

Publication series

Name: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

Conference

Conference: 2018 IEEE Spoken Language Technology Workshop, SLT 2018
Country: Greece
City: Athens
Period: 18/12/18 - 18/12/21

Keywords

  • Automatic speech recognition (ASR)
  • language modeling
  • multilingual setup
  • sequence to sequence
  • transfer learning

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Linguistics and Language

Cite this

Cho, J., Baskar, M. K., Li, R., Wiesner, M., Mallidi, S. H., Yalta, N., ... Hori, T. (2019). Multilingual Sequence-to-Sequence Speech Recognition: Architecture, Transfer Learning, and Language Modeling. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings (pp. 521-527). [8639655] (2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SLT.2018.8639655
