Transformer ASR with Contextual Block Processing

Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

26 Citations (Scopus)

Abstract

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. In this paper, we propose a new block processing method for the Transformer encoder by introducing a context-aware inheritance mechanism. An additional context embedding vector handed over from the previously processed block helps to encode not only local acoustic information but also global linguistic, channel, and speaker attributes. We introduce a novel mask technique to implement the context inheritance and train the model efficiently. Evaluations on the Wall Street Journal (WSJ), LibriSpeech, VoxForge Italian, and AISHELL-1 Mandarin speech recognition datasets show that our proposed contextual block processing method consistently outperforms naive block processing. Furthermore, the attention weight tendency of each layer is analyzed to clarify how the added contextual inheritance mechanism models the global information.
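The core idea in the abstract — processing the input in fixed-size blocks while handing a context embedding from one block to the next — can be sketched as follows. This is a minimal illustrative toy, not the paper's architecture: the function names, the averaging "encoder" that stands in for self-attention over the block plus context, and all dimensions are assumptions for demonstration only.

```python
import numpy as np

def split_into_blocks(frames, block_size):
    """Split a (T, D) feature sequence into consecutive blocks
    (the last block may be shorter than block_size)."""
    return [frames[i:i + block_size] for i in range(0, len(frames), block_size)]

def contextual_block_encode(frames, block_size):
    """Toy sketch of contextual block processing: each block is encoded
    together with a context vector inherited from the previous block.
    A real model would run self-attention over [block; context]; here a
    simple additive mix and mean-pooling stand in for that computation."""
    context = np.zeros(frames.shape[1])   # initial context embedding
    outputs = []
    for block in split_into_blocks(frames, block_size):
        encoded = block + context          # mix inherited context into the block
        context = encoded.mean(axis=0)     # summarize block -> context for next block
        outputs.append(encoded)
    return np.concatenate(outputs, axis=0)

# Each block sees only local frames plus the single inherited context vector,
# so the encoder never needs the entire utterance at once.
frames = np.ones((10, 4))
out = contextual_block_encode(frames, block_size=4)
```

In the paper, this inheritance is implemented efficiently during training via an attention mask rather than a sequential loop, so all blocks can still be computed in parallel.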

Original language: English
Title of host publication: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 427-433
Number of pages: 7
ISBN (Electronic): 9781728103068
DOIs
Publication status: Published - 2019 Dec
Externally published: Yes
Event: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Singapore, Singapore
Duration: 2019 Dec 15 - 2019 Dec 18

Publication series

Name: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019 - Proceedings

Conference

Conference: 2019 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2019
Country/Territory: Singapore
City: Singapore
Period: 19/12/15 - 19/12/18

Keywords

  • Block processing
  • End-to-end
  • Self-attention network
  • Speech recognition
  • Transformer

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Signal Processing
  • Linguistics and Language
  • Communication
