Semi-supervised learning of a pronunciation dictionary from disjoint phonemic transcripts and text

Takahiro Shinozaki, Shinji Watanabe, Daichi Mochihashi, Graham Neubig

Research output: Contribution to journal (Conference article)

2 Citations (Scopus)

Abstract

While the performance of automatic speech recognition systems has recently approached human levels in some tasks, the application is still limited to specific domains. This is because system development relies on extensive supervised training and expert tuning in the target domain. To solve this problem, systems must become more self-sufficient, having the ability to learn directly from speech and adapt to new tasks. One open question in this area is how to learn a pronunciation dictionary containing the appropriate vocabulary. Humans can recognize words, even ones they have never heard before, by reading text and understanding the context in which a word is used. However, this ability is missing in current speech recognition systems. In this work, we propose a new framework that automatically expands an initial pronunciation dictionary using independently sampled acoustic and textual data. While the task is very challenging and in its initial stage, we demonstrate that a model based on Bayesian learning of Dirichlet processes can acquire word pronunciations from phone transcripts and text of the WSJ data set.

Original language: English
Pages (from-to): 2546-2550
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
Publication status: Published - 2017 Jan 1
Externally published: Yes
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 to 2017 Aug 24

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
