Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models

Naoki Hirayama, Koichiro Yoshino, Katsutoshi Itoyama, Shinsuke Mori, Hiroshi G. Okuno

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

This paper presents an automatic speech recognition (ASR) system that accepts a mixture of various dialects. The system recognizes dialect utterances on the basis of statistical simulation of vocabulary transformation and combinations of several dialect models. Previous dialect ASR systems were based on handcrafted dictionaries for several dialects, which involved costly processes. The proposed system statistically trains transformation rules between a common language and dialects, and simulates a dialect corpus for ASR on the basis of a machine translation technique. The rules are trained on small parallel corpora to make up for the lack of linguistic resources on dialects. The proposed system also accepts mixed dialect utterances that contain a variety of vocabularies. In fact, spoken language is not a single dialect but a mixed dialect affected by the circumstances of speakers' backgrounds (e.g., the native dialects of their parents or where they live). We examined two methods for combining several dialects appropriately for each speaker. The first was recognition with mixed-dialect language models whose mixing weights were automatically estimated to maximize the recognition likelihood. This method performed best, but its computation was very expensive because it conducted a grid search over combinations of dialect mixing proportions. The second was integration of the recognition results from each single-dialect language model. Its improvements were slightly smaller than those of the first method, but its computation was inexpensive and it ran in real time on general workstations. Both methods achieved higher recognition accuracies for all speakers than the single-dialect models and the common language model, letting a suitable model be chosen for ASR in consideration of computational cost and recognition accuracy.
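The first method in the abstract — linearly mixing per-dialect language models and grid-searching the mixing proportions that maximize the likelihood of an utterance — can be illustrated with a minimal sketch. This is not the authors' implementation: the dialect names and toy unigram probabilities below are hypothetical stand-ins for the paper's full ASR language models, but the interpolation and the exhaustive (hence expensive) weight search follow the idea the abstract describes.

```python
# Sketch: linear interpolation of per-dialect unigram LMs, with a grid search
# over mixing proportions (summing to 1) that maximizes utterance likelihood.
# Toy probabilities below are hypothetical, not from the paper.
import itertools
import math

DIALECT_LMS = {
    "common":   {"hello": 0.5, "thanks": 0.5},
    "dialectA": {"hello": 0.2, "howdy": 0.8},
    "dialectB": {"hello": 0.7, "ta": 0.3},
}

def interp_logprob(utterance, weights, floor=1e-6):
    """Log-likelihood of an utterance under the weight-mixed unigram LM."""
    total = 0.0
    for word in utterance:
        p = sum(w * lm.get(word, 0.0)
                for w, lm in zip(weights, DIALECT_LMS.values()))
        total += math.log(max(p, floor))  # floor guards log(0) for unseen words
    return total

def grid_search_weights(utterance, step=0.1):
    """Exhaustively try all weight combinations on a grid (the costly step)."""
    ticks = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    best_w, best_ll = None, -math.inf
    # Enumerate the first n-1 weights; the last one is fixed by the sum-to-1 constraint.
    for combo in itertools.product(ticks, repeat=len(DIALECT_LMS) - 1):
        last = 1.0 - sum(combo)
        if last < -1e-9:
            continue  # infeasible: weights already exceed 1
        weights = (*combo, max(last, 0.0))
        ll = interp_logprob(utterance, weights)
        if ll > best_ll:
            best_w, best_ll = weights, ll
    return best_w, best_ll

weights, ll = grid_search_weights(["hello", "howdy"])
print(weights, ll)
```

Even in this toy setting the search enumerates O((1/step)^(n-1)) weight combinations for n dialect models, which illustrates why the paper's second method (combining the outputs of single-dialect recognizers) is the cheaper, real-time alternative.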

Original language: English
Article number: 7001195
Pages (from-to): 373-382
Number of pages: 10
Journal: IEEE/ACM Transactions on Speech and Language Processing
Volume: 23
Issue number: 2
DOIs: 10.1109/TASLP.2014.2387414
Publication status: Published - 2015 Feb 1
Externally published: Yes

Keywords

  • Corpus simulation
  • mixture of dialects
  • Speech recognition

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Media Technology
  • Acoustics and Ultrasonics
  • Instrumentation
  • Linguistics and Language
  • Speech and Hearing

Cite this

Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models. / Hirayama, Naoki; Yoshino, Koichiro; Itoyama, Katsutoshi; Mori, Shinsuke; Okuno, Hiroshi G.

In: IEEE/ACM Transactions on Speech and Language Processing, Vol. 23, No. 2, 7001195, 01.02.2015, p. 373-382.

@article{aedca2a3433e46c584cd4587df5f363b,
title = "Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models",
keywords = "Corpus simulation, mixture of dialects, Speech recognition",
author = "Naoki Hirayama and Koichiro Yoshino and Katsutoshi Itoyama and Shinsuke Mori and Okuno, {Hiroshi G.}",
year = "2015",
month = "2",
day = "1",
doi = "10.1109/TASLP.2014.2387414",
language = "English",
volume = "23",
pages = "373--382",
journal = "IEEE/ACM Transactions on Speech and Language Processing",
issn = "2329-9290",
publisher = "IEEE Advancing Technology for Humanity",
number = "2",

}

TY - JOUR

T1 - Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models

AU - Hirayama, Naoki

AU - Yoshino, Koichiro

AU - Itoyama, Katsutoshi

AU - Mori, Shinsuke

AU - Okuno, Hiroshi G.

PY - 2015/2/1

Y1 - 2015/2/1

KW - Corpus simulation

KW - mixture of dialects

KW - Speech recognition

UR - http://www.scopus.com/inward/record.url?scp=84956608524&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84956608524&partnerID=8YFLogxK

U2 - 10.1109/TASLP.2014.2387414

DO - 10.1109/TASLP.2014.2387414

M3 - Article

AN - SCOPUS:84956608524

VL - 23

SP - 373

EP - 382

JO - IEEE/ACM Transactions on Speech and Language Processing

JF - IEEE/ACM Transactions on Speech and Language Processing

SN - 2329-9290

IS - 2

M1 - 7001195

ER -