Espresso: A Fast End-To-End Neural Speech Recognition Toolkit

Yiming Wang, Sanjeev Khudanpur, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit FAIRSEQ. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPNET).

Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 136-143
Number of pages: 8
ISBN (Electronic): 9781728103068
Publication status: Published - 2019 Dec
Externally published: Yes
Event: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 2019 Dec 15 to 2019 Dec 18

Publication series

Name: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country: Singapore
City: Singapore
Period: 19/12/15 to 19/12/18

Keywords

  • automatic speech recognition
  • end-to-end
  • language model fusion
  • parallel decoding

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
  • Linguistics and Language
  • Communication
