A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Citations (Scopus)

Abstract

Non-autoregressive (NAR) models generate multiple outputs in a sequence simultaneously, which significantly reduces inference time at the cost of an accuracy drop compared to autoregressive (AR) baselines. Because NAR models show great potential for real-time applications, a growing number of them have been explored in different fields to close the performance gap with AR models. In this work, we conduct a comparative study of various NAR modeling methods for end-to-end automatic speech recognition (ASR). Experiments are performed in a state-of-the-art setting using ESPnet. The results on various tasks yield findings that help in understanding NAR ASR, such as the accuracy-speed trade-off and robustness against long-form utterances. We also show that the techniques can be combined for further improvement and applied to NAR end-to-end speech translation. All the implementations are publicly available to encourage further research in NAR speech processing.

Original language: English
Title of host publication: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 47-54
Number of pages: 8
ISBN (Electronic): 9781665437394
DOIs
Publication status: Published - 2021
Externally published: Yes
Event: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Cartagena, Colombia
Duration: 2021 Dec 13 to 2021 Dec 17

Publication series

Name: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings

Conference

Conference: 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Country/Territory: Colombia
City: Cartagena
Period: 21/12/13 to 21/12/17

Keywords

  • end-to-end speech recognition
  • end-to-end speech translation
  • non-autoregressive sequence generation

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Linguistics and Language
