Dual supervised learning for non-native speech recognition

Kacper Pawel Radzikowski, Robert Nowak, Le Wang, Osamu Yoshie

Research output: Contribution to journal › Article

Abstract

Current automatic speech recognition (ASR) systems achieve over 90–95% accuracy, depending on the methodology applied and the datasets used. However, accuracy decreases significantly when the same ASR system is used by a non-native speaker of the language to be recognized. At the same time, labeled datasets of non-native speech are extremely limited, both in size and in the number of languages covered. This makes it difficult to train sufficiently accurate ASR systems targeted at non-native speakers and calls for a different approach that can make use of large unlabeled datasets. In this paper, we address this issue by employing dual supervised learning (DSL) and reinforcement learning with policy gradient methodology. We tested DSL in a warm-start approach, with both models trained beforehand, and in a semi warm-start approach, with only one of the two models pre-trained. The experiments were conducted on English spoken by Japanese and Polish speakers. The results show that ASR systems built with DSL can achieve accuracy comparable to traditional methods while also making use of unlabeled data, which is much cheaper to obtain and available in larger quantities.

Original language: English
Article number: 3
Journal: Eurasip Journal on Audio, Speech, and Music Processing
Volume: 2019
Issue number: 1
DOI: 10.1186/s13636-018-0146-4
Publication status: Published - 2019 Dec 1

Keywords

  • Artificial intelligence
  • Deep learning
  • Dual supervised learning
  • Machine learning
  • Non-native speaker
  • Policy gradients
  • Reinforcement learning
  • Speech recognition

ASJC Scopus subject areas

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering

Cite this

Dual supervised learning for non-native speech recognition. / Radzikowski, Kacper Pawel; Nowak, Robert; Wang, Le; Yoshie, Osamu.

In: Eurasip Journal on Audio, Speech, and Music Processing, Vol. 2019, No. 1, 3, 01.12.2019.

@article{3478629dc72d4136bb900dcc528783d2,
title = "Dual supervised learning for non-native speech recognition",
abstract = "Current automatic speech recognition (ASR) systems achieve over 90–95{\%} accuracy, depending on the methodology applied and datasets used. However, the level of accuracy decreases significantly when the same ASR system is used by a non-native speaker of the language to be recognized. At the same time, the volume of labeled datasets of non-native speech samples is extremely limited both in size and in the number of existing languages. This problem makes it difficult to train or build sufficiently accurate ASR systems targeted at non-native speakers, which, consequently, calls for a different approach that would make use of vast amounts of large unlabeled datasets. In this paper, we address this issue by employing dual supervised learning (DSL) and reinforcement learning with policy gradient methodology. We tested DSL in a warm-start approach, with two models trained beforehand, and in a semi warm-start approach with only one of the two models pre-trained. The experiments were conducted on English language pronounced by Japanese and Polish speakers. The results of our experiments show that creating ASR systems with DSL can achieve an accuracy comparable to traditional methods, while simultaneously making use of unlabeled data, which obviously is much cheaper to obtain and comes in larger sizes.",
keywords = "Artificial intelligence, Deep learning, Dual supervised learning, Machine learning, Non-native speaker, Policy gradients, Reinforcement learning, Speech recognition",
author = "Radzikowski, {Kacper Pawel} and Robert Nowak and Le Wang and Osamu Yoshie",
year = "2019",
month = "12",
day = "1",
doi = "10.1186/s13636-018-0146-4",
language = "English",
volume = "2019",
journal = "Eurasip Journal on Audio, Speech, and Music Processing",
issn = "1687-4714",
publisher = "SpringerOpen",
number = "1",

}

TY - JOUR

T1 - Dual supervised learning for non-native speech recognition

AU - Radzikowski, Kacper Pawel

AU - Nowak, Robert

AU - Wang, Le

AU - Yoshie, Osamu

PY - 2019/12/1

Y1 - 2019/12/1

AB - Current automatic speech recognition (ASR) systems achieve over 90–95% accuracy, depending on the methodology applied and datasets used. However, the level of accuracy decreases significantly when the same ASR system is used by a non-native speaker of the language to be recognized. At the same time, the volume of labeled datasets of non-native speech samples is extremely limited both in size and in the number of existing languages. This problem makes it difficult to train or build sufficiently accurate ASR systems targeted at non-native speakers, which, consequently, calls for a different approach that would make use of vast amounts of large unlabeled datasets. In this paper, we address this issue by employing dual supervised learning (DSL) and reinforcement learning with policy gradient methodology. We tested DSL in a warm-start approach, with two models trained beforehand, and in a semi warm-start approach with only one of the two models pre-trained. The experiments were conducted on English language pronounced by Japanese and Polish speakers. The results of our experiments show that creating ASR systems with DSL can achieve an accuracy comparable to traditional methods, while simultaneously making use of unlabeled data, which obviously is much cheaper to obtain and comes in larger sizes.

KW - Artificial intelligence

KW - Deep learning

KW - Dual supervised learning

KW - Machine learning

KW - Non-native speaker

KW - Policy gradients

KW - Reinforcement learning

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85060134244&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85060134244&partnerID=8YFLogxK

U2 - 10.1186/s13636-018-0146-4

DO - 10.1186/s13636-018-0146-4

M3 - Article

VL - 2019

JO - Eurasip Journal on Audio, Speech, and Music Processing

JF - Eurasip Journal on Audio, Speech, and Music Processing

SN - 1687-4714

IS - 1

M1 - 3

ER -