A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies

Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In this study, we present recent developments of models trained with the RNN-T loss in ESPnet. These developments include the use of various architectures such as the recently proposed Conformer, multi-task learning with different auxiliary criteria, and multiple decoding strategies, including our own proposal. Through experiments and benchmarks, we show that our proposed systems can be competitive with other state-of-the-art systems on well-known datasets such as LibriSpeech and AISHELL-1. Additionally, we demonstrate that these models are promising compared with other systems already implemented in ESPnet with regard to both performance and decoding speed, enabling the possibility of powerful systems for a streaming task. With these additions, we hope to expand the usefulness of the ESPnet toolkit for the research community and also give the ASR industry tools to deploy our systems in realistic and production environments.

Original language: English
Title of host publication: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 16-23
Number of pages: 8
ISBN (Electronic): 9781665437394
Publication status: Published - 2021
Externally published: Yes
Event: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Cartagena, Colombia
Duration: 2021 Dec 13 – 2021 Dec 17

Publication series

Name: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings

Conference

Conference: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Country/Territory: Colombia
City: Cartagena
Period: 21/12/13 – 21/12/17

Keywords

  • auxiliary task
  • decoding strategies
  • end-to-end speech recognition
  • RNN-T loss

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Linguistics and Language
