Streaming end-to-end ASR based on blockwise non-autoregressive models

Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Non-autoregressive (NAR) modeling has gained more and more attention in speech processing. With recent state-of-the-art attention-based automatic speech recognition (ASR) structure, NAR can realize promising real-time factor (RTF) improvement with only small degradation of accuracy compared to the autoregressive (AR) models. However, the recognition inference needs to wait for the completion of a full speech utterance, which limits their applications on low latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system by combining blockwiseattention and connectionist temporal classification with maskpredict (Mask-CTC) NAR. During inference, the input audio is separated into small blocks and then processed in a blockwise streaming way. To address the insertion and deletion error at the edge of the output of each block, we apply an overlapping decoding strategy with a dynamic mapping trick that can produce more coherent sentences. Experimental results show that the proposed method improves online ASR recognition in low latency conditions compared to vanilla Mask-CTC. Moreover, it can achieve a much faster inference speed compared to the AR attention-based models. All of our codes will be publicly available at https://github.com/espnet/espnet.

Original languageEnglish
Title of host publication22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PublisherInternational Speech Communication Association
Pages1421-1425
Number of pages5
ISBN (Electronic)9781713836902
DOIs
Publication statusPublished - 2021
Externally publishedYes
Event22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
Duration: 2021 Aug 302021 Sep 3

Publication series

NameProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume2
ISSN (Print)2308-457X
ISSN (Electronic)1990-9772

Conference

Conference22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Country/TerritoryCzech Republic
CityBrno
Period21/8/3021/9/3

Keywords

  • Blockwise-attention
  • Mask-CTC
  • Non-autoregressive speech recognition
  • Streaming ASR

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Fingerprint

Dive into the research topics of 'Streaming end-to-end ASR based on blockwise non-autoregressive models'. Together they form a unique fingerprint.

Cite this