Multi-class composite N-gram language model using multiple word clusters and word successions

Shuntaro Isogai, Katsuhiko Shirai, Hirofumi Yamamoto, Yoshinori Sagisaka

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, a new language model, the Multi-Class Composite N-gram, is proposed to avoid the data sparseness problem that arises with a small amount of training data. The Multi-Class Composite N-gram maintains an accurate word prediction capability and reliability for sparse data with a compact model size, based on multiple word clusters, the so-called Multi-Classes. In the Multi-Class, the statistical connectivity at each position of the N-gram is regarded as a word attribute, and one word cluster is created for each position to represent these positional attributes. Furthermore, by introducing higher-order word N-grams through the grouping of frequent word successions, Multi-Class N-grams are extended to Multi-Class Composite N-grams. In experiments, the Multi-Class Composite N-grams result in 9.5% lower perplexity and a 16% lower word error rate in speech recognition with a 40% smaller parameter size than conventional word 3-grams.
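
As a rough illustration of the clustering idea described in the abstract, the sketch below shows a class-mediated bigram in the spirit of the Multi-Class model: each word carries one cluster label for its behaviour as a predicted word and another for its behaviour as context, and the word probability factors through those clusters. This is a minimal sketch with assumed toy cluster assignments (TARGET_CLASS and CONTEXT_CLASS are illustrative, not from the paper); the paper additionally uses position-dependent classes for the trigram case and groups frequent word successions into composite word units, both of which are omitted here.

```python
from collections import defaultdict

# Hypothetical toy cluster assignments; in the paper, clusters are induced from
# each word's statistical connectivity at a given N-gram position.
TARGET_CLASS = {"the": "DET_t", "a": "DET_t", "cat": "NOUN_t", "dog": "NOUN_t", "sat": "VERB_t"}
CONTEXT_CLASS = {"the": "DET_c", "a": "DET_c", "cat": "NOUN_c", "dog": "NOUN_c", "sat": "VERB_c"}

def train(corpus):
    """Counts for P(w | prev) ~= P(C_t(w) | C_c(prev)) * P(w | C_t(w))."""
    class_bigram = defaultdict(lambda: defaultdict(int))   # C_c(prev) -> C_t(w) -> count
    word_in_class = defaultdict(lambda: defaultdict(int))  # C_t(w)    -> w      -> count
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            class_bigram[CONTEXT_CLASS[prev]][TARGET_CLASS[cur]] += 1
            word_in_class[TARGET_CLASS[cur]][cur] += 1
    return class_bigram, word_in_class

def prob(cur, prev, class_bigram, word_in_class):
    """Class-mediated bigram probability (unsmoothed, for illustration only)."""
    cc, ct = CONTEXT_CLASS[prev], TARGET_CLASS[cur]
    p_class = class_bigram[cc][ct] / max(1, sum(class_bigram[cc].values()))
    p_word = word_in_class[ct][cur] / max(1, sum(word_in_class[ct].values()))
    return p_class * p_word

corpus = [["the", "cat", "sat"], ["a", "dog", "sat"]]
cb, wc = train(corpus)
print(prob("dog", "the", cb, wc))  # 0.5: non-zero although "the dog" never occurs in the corpus
```

Because the probability is routed through the class level, the toy model assigns a non-zero probability to the unseen word pair "the dog", which is the kind of robustness to sparse data that the abstract describes.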

Original language: English
Title of host publication: EUROSPEECH 2001 - SCANDINAVIA - 7th European Conference on Speech Communication and Technology
Publisher: International Speech Communication Association
Pages: 25-28
Number of pages: 4
ISBN (Electronic): 8790834100, 9788790834104
Publication status: Published - 2001
Externally published: Yes
Event: 7th European Conference on Speech Communication and Technology - Scandinavia, EUROSPEECH 2001 - Aalborg, Denmark
Duration: 2001 Sep 3 - 2001 Sep 7

Other

Other: 7th European Conference on Speech Communication and Technology - Scandinavia, EUROSPEECH 2001
Country: Denmark
City: Aalborg
Period: 01/9/3 - 01/9/7


ASJC Scopus subject areas

  • Communication
  • Linguistics and Language
  • Computer Science Applications
  • Software

Cite this

Isogai, S., Shirai, K., Yamamoto, H., & Sagisaka, Y. (2001). Multi-class composite N-gram language model using multiple word clusters and word successions. In EUROSPEECH 2001 - SCANDINAVIA - 7th European Conference on Speech Communication and Technology (pp. 25-28). International Speech Communication Association.
