Semi-supervised end-to-end speech recognition

Shigeki Karita, Shinji Watanabe, Tomoharu Iwata, Atsunori Ogawa, Marc Delcroix

Research output: Contribution to journal › Conference article

6 Citations (Scopus)

Abstract

We propose a novel semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create than paired speech-to-text datasets. Our semi-supervised method targets the extraction of an intermediate representation between speech and text data using a shared encoder network. Autoencoding of text data with this shared encoder improves the feature extraction of text data as well as that of speech data when the intermediate representations of speech and text are similar to each other as an inter-domain feature. In other words, by combining speech-to-text and text-to-text mappings through the shared network, we can improve the speech-to-text mapping by learning to reconstruct unpaired text data in a semi-supervised end-to-end manner. We investigate how to design a suitable inter-domain loss that minimizes the dissimilarity between encoded speech and text sequences, which originally belong to quite different domains. On the Wall Street Journal dataset, our proposed semi-supervised training achieves a larger character error rate reduction (from 15.8% to 14.4%) than conventional language model integration.
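The abstract describes a shared encoder trained with three signals: a supervised speech-to-text loss on paired data, a text autoencoding loss on unpaired text, and an inter-domain loss that pulls the encoded speech and text representations together. The sketch below illustrates that combined objective in PyTorch; the layer sizes, the frame-synchronous toy targets, and the mean-distance inter-domain penalty are illustrative assumptions, not the paper's exact setup (the keywords indicate an attention-based encoder-decoder and adversarial training variants).

```python
# A minimal sketch of the combined semi-supervised objective, assuming a
# PyTorch implementation; shapes, layer sizes, and the mean-distance
# inter-domain loss are illustrative choices, not the authors' exact design.
import torch
import torch.nn as nn

class SharedEncoderASR(nn.Module):
    def __init__(self, n_mels=80, n_chars=50, hidden=256):
        super().__init__()
        self.speech_frontend = nn.Linear(n_mels, hidden)  # speech features -> shared space
        self.char_embed = nn.Embedding(n_chars, hidden)   # characters -> shared space
        self.shared_encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_chars)         # shared space -> characters

    def encode_speech(self, feats):             # feats: (batch, frames, n_mels)
        out, _ = self.shared_encoder(self.speech_frontend(feats))
        return out

    def encode_text(self, chars):               # chars: (batch, length) int ids
        out, _ = self.shared_encoder(self.char_embed(chars))
        return out

def inter_domain_loss(h_speech, h_text):
    """Dissimilarity between pooled speech and text encodings: a simple
    mean-squared-distance proxy here; the paper compares several choices,
    including adversarial ones."""
    return (h_speech.mean(dim=1) - h_text.mean(dim=1)).pow(2).mean()

model = SharedEncoderASR()
ce = nn.CrossEntropyLoss()

# Paired batch: speech with frame-synchronous toy character targets
# (the real system decodes with attention, so lengths need not match).
speech = torch.randn(4, 100, 80)
paired_chars = torch.randint(0, 50, (4, 100))
# Unpaired batch: text only, passed through the same shared encoder.
unpaired_chars = torch.randint(0, 50, (4, 30))

h_speech = model.encode_speech(speech)
h_text = model.encode_text(unpaired_chars)

loss_asr = ce(model.decoder(h_speech).transpose(1, 2), paired_chars)  # supervised speech-to-text
loss_ae = ce(model.decoder(h_text).transpose(1, 2), unpaired_chars)   # text-to-text autoencoding
loss_dom = inter_domain_loss(h_speech, h_text)                        # encoded-domain matching
(loss_asr + loss_ae + loss_dom).backward()
```

In practice the three terms would be weighted against each other and the weights tuned on held-out data; they are left at 1 here for brevity.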

Original language: English
Pages (from-to): 2-6
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1746
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 - 2018 Sep 6

Keywords

  • Adversarial training
  • Encoder-decoder
  • Semi-supervised learning
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Karita, Shigeki; Watanabe, Shinji; Iwata, Tomoharu; Ogawa, Atsunori; Delcroix, Marc. Semi-supervised end-to-end speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 2018, pp. 2-6.

@article{c8015a7380d541e881245331ef18a540,
title = "Semi-supervised end-to-end speech recognition",
abstract = "We propose a novel semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create than paired speech-to-text datasets. Our semi-supervised method targets the extraction of an intermediate representation between speech and text data using a shared encoder network. Autoencoding of text data with this shared encoder improves the feature extraction of text data as well as that of speech data when the intermediate representations of speech and text are similar to each other as an inter-domain feature. In other words, by combining speech-to-text and text-to-text mappings through the shared network, we can improve the speech-to-text mapping by learning to reconstruct unpaired text data in a semi-supervised end-to-end manner. We investigate how to design a suitable inter-domain loss that minimizes the dissimilarity between encoded speech and text sequences, which originally belong to quite different domains. On the Wall Street Journal dataset, our proposed semi-supervised training achieves a larger character error rate reduction (from 15.8{\%} to 14.4{\%}) than conventional language model integration.",
keywords = "Adversarial training, Encoder-decoder, Semi-supervised learning, Speech recognition",
author = "Shigeki Karita and Shinji Watanabe and Tomoharu Iwata and Atsunori Ogawa and Marc Delcroix",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1746",
language = "English",
volume = "2018-September",
pages = "2--6",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}
