Statistical language modeling with a class-based n-multigram model

Sabine Deligne, Yoshinori Sagisaka

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

In this paper, we present a stochastic language-modeling tool which aims at retrieving variable-length phrases (multigrams), assuming n-gram dependencies between them, hence the name of the model: n-multigram. The estimation of the probability distribution of the phrases is intermixed with a phrase-clustering procedure in a way which jointly optimizes the likelihood of the data. As a result, the language data are iteratively structured at both a paradigmatic and a syntagmatic level in a fully integrated way. We evaluate the 2-multigram model as a statistical language model on ATIS, a task-oriented database consisting of air travel reservations. Experiments show that the 2-multigram model allows a reduction of 10% of the word error rate on ATIS with respect to the usual trigram model, with 25% fewer parameters than in the trigram model. In addition, we illustrate the ability of this model to merge semantically related phrases of different lengths into a common class.
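As a rough illustration of what "n-gram dependencies between variable-length phrases" can mean for n = 2, the toy sketch below scores a sentence by summing over all of its segmentations into phrases, with each phrase conditioned on the preceding phrase. It is not the authors' implementation: the function name, the `<s>` start marker, the `max_len` limit, and the dictionary of phrase-bigram probabilities are all assumptions made for this example, and the phrase-clustering (class-based) part of the model is left out entirely.

```python
from functools import lru_cache


def sentence_likelihood(words, bigram_prob, max_len=3):
    """Toy 2-multigram likelihood (illustrative assumption, not the paper's code).

    Sums, over every segmentation of `words` into phrases of at most
    `max_len` words, the product of P(phrase | previous phrase).
    `bigram_prob` maps (previous_phrase, phrase) tuples to probabilities;
    missing pairs are treated as probability 0. No end-of-sentence term
    is included, to keep the sketch short.
    """
    n = len(words)

    @lru_cache(maxsize=None)
    def forward(end, start):
        # Likelihood of words[:end], restricted to segmentations whose
        # last phrase is words[start:end].
        phrase = tuple(words[start:end])
        if start == 0:
            # First phrase of the sentence: condition on the start marker.
            return bigram_prob.get((("<s>",), phrase), 0.0)
        total = 0.0
        # Try every admissible previous phrase words[k:start].
        for k in range(max(0, start - max_len), start):
            prev = tuple(words[k:start])
            total += forward(start, k) * bigram_prob.get((prev, phrase), 0.0)
        return total

    # The last phrase may start at any position within max_len of the end.
    return sum(forward(n, s) for s in range(max(0, n - max_len), n))
```

For a two-word sentence there are only two segmentations, [a][b] and [a b], so the result is P(a | <s>) · P(b | a) + P(a b | <s>). Replacing the sum with a maximum would give the single most likely segmentation score instead.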

Original language: English
Pages (from-to): 261-279
Number of pages: 19
Journal: Computer Speech and Language
Volume: 14
Issue number: 3
DOI: 10.1006/csla.2000.0146
Publication status: Published - Jul 2000
Externally published: Yes

Fingerprint

Language Modeling
Statistical Modeling
Language
Air Travel
Aptitude
Statistical Models
Names
Cluster Analysis
Model
Databases
N-gram
Stochastic Modeling
Language Model
Reservation
Statistical Model
Error Rate
Class
Likelihood
Probability Distribution

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
  • Experimental and Cognitive Psychology
  • Linguistics and Language

Cite this

Statistical language modeling with a class-based n-multigram model. / Deligne, Sabine; Sagisaka, Yoshinori.

In: Computer Speech and Language, Vol. 14, No. 3, 07.2000, p. 261-279.

@article{5f3b6d97192a436fae0bbedc481a28f0,
title = "Statistical language modeling with a class-based n-multigram model",
abstract = "In this paper, we present a stochastic language-modeling tool which aims at retrieving variable-length phrases (multigrams), assuming n-gram dependencies between them, hence the name of the model: n-multigram. The estimation of the probability distribution of the phrases is intermixed with a phrase-clustering procedure in a way which jointly optimizes the likelihood of the data. As a result, the language data are iteratively structured at both a paradigmatic and a syntagmatic level in a fully integrated way. We evaluate the 2-multigram model as a statistical language model on ATIS, a task-oriented database consisting of air travel reservations. Experiments show that the 2-multigram model allows a reduction of 10{\%} of the word error rate on ATIS with respect to the usual trigram model, with 25{\%} fewer parameters than in the trigram model. In addition, we illustrate the ability of this model to merge semantically related phrases of different lengths into a common class.",
author = "Sabine Deligne and Yoshinori Sagisaka",
year = "2000",
month = "7",
doi = "10.1006/csla.2000.0146",
language = "English",
volume = "14",
pages = "261--279",
journal = "Computer Speech and Language",
issn = "0885-2308",
publisher = "Academic Press Inc.",
number = "3"
}

TY - JOUR

T1 - Statistical language modeling with a class-based n-multigram model

AU - Deligne, Sabine

AU - Sagisaka, Yoshinori

PY - 2000/7

Y1 - 2000/7

N2 - In this paper, we present a stochastic language-modeling tool which aims at retrieving variable-length phrases (multigrams), assuming n-gram dependencies between them, hence the name of the model: n-multigram. The estimation of the probability distribution of the phrases is intermixed with a phrase-clustering procedure in a way which jointly optimizes the likelihood of the data. As a result, the language data are iteratively structured at both a paradigmatic and a syntagmatic level in a fully integrated way. We evaluate the 2-multigram model as a statistical language model on ATIS, a task-oriented database consisting of air travel reservations. Experiments show that the 2-multigram model allows a reduction of 10% of the word error rate on ATIS with respect to the usual trigram model, with 25% fewer parameters than in the trigram model. In addition, we illustrate the ability of this model to merge semantically related phrases of different lengths into a common class.

AB - In this paper, we present a stochastic language-modeling tool which aims at retrieving variable-length phrases (multigrams), assuming n-gram dependencies between them, hence the name of the model: n-multigram. The estimation of the probability distribution of the phrases is intermixed with a phrase-clustering procedure in a way which jointly optimizes the likelihood of the data. As a result, the language data are iteratively structured at both a paradigmatic and a syntagmatic level in a fully integrated way. We evaluate the 2-multigram model as a statistical language model on ATIS, a task-oriented database consisting of air travel reservations. Experiments show that the 2-multigram model allows a reduction of 10% of the word error rate on ATIS with respect to the usual trigram model, with 25% fewer parameters than in the trigram model. In addition, we illustrate the ability of this model to merge semantically related phrases of different lengths into a common class.

UR - http://www.scopus.com/inward/record.url?scp=0034230088&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034230088&partnerID=8YFLogxK

U2 - 10.1006/csla.2000.0146

DO - 10.1006/csla.2000.0146

M3 - Article

AN - SCOPUS:0034230088

VL - 14

SP - 261

EP - 279

JO - Computer Speech and Language

JF - Computer Speech and Language

SN - 0885-2308

IS - 3

ER -