Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text

Takahiro Shinozaki, Shinji Watanabe, Daichi Mochihashi, Graham Neubig

Research output: Contribution to journal › Conference article

2 Citations (Scopus)

Abstract

While the performance of automatic speech recognition systems has recently approached human levels in some tasks, the application is still limited to specific domains. This is because system development relies on extensive supervised training and expert tuning in the target domain. To solve this problem, systems must become more self-sufficient, having the ability to learn directly from speech and adapt to new tasks. One open question in this area is how to learn a pronunciation dictionary containing the appropriate vocabulary. Humans can recognize words, even ones they have never heard before, by reading text and understanding the context in which a word is used. However, this ability is missing in current speech recognition systems. In this work, we propose a new framework that automatically expands an initial pronunciation dictionary using independently sampled acoustic and textual data. While the task is very challenging and in its initial stage, we demonstrate that a model based on Bayesian learning of Dirichlet processes can acquire word pronunciations from phone transcripts and text of the WSJ data set.
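The abstract refers to Bayesian learning with Dirichlet processes for acquiring word pronunciations. As a rough, hedged illustration of that machinery (not the paper's actual model — all names here are illustrative), a Dirichlet process can be sampled sequentially via the Chinese restaurant process, where each new observation either joins an existing cluster or opens a new one:

```python
import random
from collections import Counter

def crp_partition(n_items, alpha, rng):
    """Sample a partition of n_items via the Chinese restaurant process,
    the sequential view of a Dirichlet process: item i joins an existing
    cluster with probability proportional to its size, or opens a new
    cluster with probability proportional to the concentration alpha."""
    assignments = []
    counts = Counter()
    for i in range(n_items):
        # Total unnormalized weight: i items seated so far, plus alpha.
        r = rng.random() * (i + alpha)
        acc = 0.0
        chosen = None
        for cluster, c in counts.items():
            acc += c
            if r < acc:
                chosen = cluster
                break
        if chosen is None:
            # Open a new cluster; labels are assigned contiguously.
            chosen = len(counts)
        assignments.append(chosen)
        counts[chosen] += 1
    return assignments

rng = random.Random(0)
labels = crp_partition(20, alpha=1.0, rng=rng)
```

In a pronunciation-learning setting one can think of each cluster as a candidate pronunciation hypothesis for a word; the nonparametric prior lets the number of distinct pronunciations grow with the data rather than being fixed in advance.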

Original language: English
Pages (from-to): 2546-2550
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOI: 10.21437/Interspeech.2017-1081
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 - 2017 Aug 24

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text. / Shinozaki, Takahiro; Watanabe, Shinji; Mochihashi, Daichi; Neubig, Graham.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2017-August, 01.01.2017, p. 2546-2550.

Research output: Contribution to journal › Conference article

@article{ae7e806ac1ca4be48448143018e68980,
title = "Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text",
abstract = "While the performance of automatic speech recognition systems has recently approached human levels in some tasks, the application is still limited to specific domains. This is because system development relies on extensive supervised training and expert tuning in the target domain. To solve this problem, systems must become more self-sufficient, having the ability to learn directly from speech and adapt to new tasks. One open question in this area is how to learn a pronunciation dictionary containing the appropriate vocabulary. Humans can recognize words, even ones they have never heard before, by reading text and understanding the context in which a word is used. However, this ability is missing in current speech recognition systems. In this work, we propose a new framework that automatically expands an initial pronunciation dictionary using independently sampled acoustic and textual data. While the task is very challenging and in its initial stage, we demonstrate that a model based on Bayesian learning of Dirichlet processes can acquire word pronunciations from phone transcripts and text of the WSJ data set.",
author = "Takahiro Shinozaki and Shinji Watanabe and Daichi Mochihashi and Graham Neubig",
year = "2017",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2017-1081",
language = "English",
volume = "2017-August",
pages = "2546--2550",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR

T1 - Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text

AU - Shinozaki, Takahiro

AU - Watanabe, Shinji

AU - Mochihashi, Daichi

AU - Neubig, Graham

PY - 2017/1/1

Y1 - 2017/1/1

N2 - While the performance of automatic speech recognition systems has recently approached human levels in some tasks, the application is still limited to specific domains. This is because system development relies on extensive supervised training and expert tuning in the target domain. To solve this problem, systems must become more self-sufficient, having the ability to learn directly from speech and adapt to new tasks. One open question in this area is how to learn a pronunciation dictionary containing the appropriate vocabulary. Humans can recognize words, even ones they have never heard before, by reading text and understanding the context in which a word is used. However, this ability is missing in current speech recognition systems. In this work, we propose a new framework that automatically expands an initial pronunciation dictionary using independently sampled acoustic and textual data. While the task is very challenging and in its initial stage, we demonstrate that a model based on Bayesian learning of Dirichlet processes can acquire word pronunciations from phone transcripts and text of the WSJ data set.

AB - While the performance of automatic speech recognition systems has recently approached human levels in some tasks, the application is still limited to specific domains. This is because system development relies on extensive supervised training and expert tuning in the target domain. To solve this problem, systems must become more self-sufficient, having the ability to learn directly from speech and adapt to new tasks. One open question in this area is how to learn a pronunciation dictionary containing the appropriate vocabulary. Humans can recognize words, even ones they have never heard before, by reading text and understanding the context in which a word is used. However, this ability is missing in current speech recognition systems. In this work, we propose a new framework that automatically expands an initial pronunciation dictionary using independently sampled acoustic and textual data. While the task is very challenging and in its initial stage, we demonstrate that a model based on Bayesian learning of Dirichlet processes can acquire word pronunciations from phone transcripts and text of the WSJ data set.

UR - http://www.scopus.com/inward/record.url?scp=85039163402&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039163402&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2017-1081

DO - 10.21437/Interspeech.2017-1081

M3 - Conference article

AN - SCOPUS:85039163402

VL - 2017-August

SP - 2546

EP - 2550

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -