Auxiliary feature based adaptation of end-to-end ASR systems

Marc Delcroix, Shinji Watanabe, Atsunori Ogawa, Shigeki Karita, Tomohiro Nakatani

Research output: Contribution to journal › Conference article

2 Citations (Scopus)

Abstract

Acoustic model adaptation has been widely used to adapt models to speakers or environments. For example, appending auxiliary features representing speakers, such as i-vectors, to the input of a deep neural network (DNN) is an effective way to realize unsupervised adaptation of DNN-hybrid automatic speech recognition (ASR) systems. Recently, end-to-end (E2E) models have been proposed as an alternative to conventional DNN-hybrid ASR systems. E2E models map a speech signal to a sequence of characters or words using a single neural network, which greatly simplifies the ASR pipeline. However, adaptation of E2E models has received little attention so far. In this paper, we investigate auxiliary feature based adaptation for encoder-decoder E2E models. We employ a recently proposed sequence summary network to compute auxiliary features instead of i-vectors, as it can be easily integrated into E2E models while keeping the ASR pipeline simple. Indeed, the sequence summary network allows the auxiliary feature extraction module to be part of the computational graph of the E2E model. We demonstrate that the proposed adaptation scheme consistently improves recognition performance on three publicly available recognition tasks.
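
As a rough illustration of the approach described above, the Python sketch below (not taken from the paper; the hidden size, the mean pooling over time, and the additive combination with the input features are assumptions) shows how an utterance-level summary vector can be computed by a small auxiliary network and injected into the encoder input, so that auxiliary feature extraction stays inside the same computational graph and is trained jointly with the E2E model.

# Minimal sketch of a sequence-summary-style adaptation layer (illustrative only;
# the hidden size, mean pooling, and additive combination are assumptions, not the
# paper's exact configuration).
import torch
import torch.nn as nn


class SequenceSummaryAdapter(nn.Module):
    """Computes an utterance-level auxiliary vector and injects it into the features."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Small frame-level network whose outputs are pooled into a summary vector.
        self.frame_net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) acoustic features for a batch of utterances.
        summary = self.frame_net(feats).mean(dim=1, keepdim=True)  # (batch, 1, feat_dim)
        # Adding the utterance-level summary to every frame keeps the auxiliary
        # feature extraction inside the same computational graph as the E2E model.
        return feats + summary


# Usage: the adapted features are fed to the E2E encoder and trained end to end.
adapter = SequenceSummaryAdapter(feat_dim=80)
x = torch.randn(4, 200, 80)      # dummy batch: 4 utterances, 200 frames, 80-dim features
encoder_input = adapter(x)       # same shape as x, now adapted by the summary vector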

Original language: English
Pages (from-to): 2444-2448
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
ISSN: 2308-457X
DOI: 10.21437/Interspeech.2018-1438
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 - 2018 Sep 6

Keywords

  • Adaptation
  • Auxiliary feature
  • End-to-end
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Delcroix, M., Watanabe, S., Ogawa, A., Karita, S., & Nakatani, T. (2018). Auxiliary feature based adaptation of end-to-end ASR systems. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018-September, 2444-2448. https://doi.org/10.21437/Interspeech.2018-1438