The structure of unseen trigrams and its application to language models: A first investigation

Yves Lepage, Julien Gosme, Adrien Lardilleux

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
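
The abstract does not spell out how the analogies are solved, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simplified, word-level notion of proportional analogy A : B :: C : D that is solved independently at each of the three positions; the function names and the toy trigrams are hypothetical.

```python
from itertools import product


def solve_position(a, b, c):
    """Solve a : b :: c : x for single words under the simplified rule.

    A solution exists only if a == b (then x = c) or a == c (then x = b).
    """
    if a == b:
        return c
    if a == c:
        return b
    return None


def solve_trigram(A, B, C):
    """Solve A : B :: C : D position by position; return D or None."""
    D = tuple(solve_position(a, b, c) for a, b, c in zip(A, B, C))
    return None if None in D else D


def reconstructible(target, rare_trigrams):
    """Return True if some triple (A, B, C) of rare trigrams yields target.

    Brute force over all triples, which is cubic in the number of
    trigrams and only suitable for toy inputs.
    """
    for A, B, C in product(rare_trigrams, repeat=3):
        if solve_trigram(A, B, C) == target:
            return True
    return False


# Toy low-frequency trigrams (invented for illustration).
rare = [
    ("the", "red", "car"),
    ("the", "red", "house"),
    ("a", "blue", "car"),
]

print(reconstructible(("a", "blue", "house"), rare))
print(reconstructible(("a", "green", "house"), rare))
```

Here ("a", "blue", "house") is never observed, yet it solves the analogy ("the", "red", "car") : ("the", "red", "house") :: ("a", "blue", "car") : D. A trigram containing a word that occurs in none of the rare trigrams cannot be reconstructed under this simplified rule.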

Original language: English
Title of host publication: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Pages: 273-280
Number of pages: 8
DOIs: https://doi.org/10.1109/IUCS.2010.5666011
Publication status: Published - 2010
Event: 2010 4th International Universal Communication Symposium, IUCS 2010 - Beijing
Duration: 2010 Oct 18 → 2010 Oct 19

Other

Other: 2010 4th International Universal Communication Symposium, IUCS 2010
City: Beijing
Period: 10/10/18 → 10/10/19

Keywords

  • Europarl
  • Structure of unseen trigrams
  • Trigram language models

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Communication

Cite this

Lepage, Y., Gosme, J., & Lardilleux, A. (2010). The structure of unseen trigrams and its application to language models: A first investigation. In 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings (pp. 273-280). [5666011] https://doi.org/10.1109/IUCS.2010.5666011

The structure of unseen trigrams and its application to language models: A first investigation. / Lepage, Yves; Gosme, Julien; Lardilleux, Adrien.

2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings. 2010. p. 273-280 5666011.

Lepage, Y, Gosme, J & Lardilleux, A 2010, The structure of unseen trigrams and its application to language models: A first investigation. in 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings, 5666011, pp. 273-280, 2010 4th International Universal Communication Symposium, IUCS 2010, Beijing, 10/10/18. https://doi.org/10.1109/IUCS.2010.5666011
Lepage Y, Gosme J, Lardilleux A. The structure of unseen trigrams and its application to language models: A first investigation. In 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings. 2010. p. 273-280. 5666011 https://doi.org/10.1109/IUCS.2010.5666011
Lepage, Yves; Gosme, Julien; Lardilleux, Adrien. / The structure of unseen trigrams and its application to language models: A first investigation. 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings. 2010. pp. 273-280
@inproceedings{9be6eecab9fc43f9828039584689fdb2,
title = "The structure of unseen trigrams and its application to language models: A first investigation",
abstract = "In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.",
keywords = "Europarl, Structure of unseen trigrams, Trigram language models",
author = "Yves Lepage and Julien Gosme and Adrien Lardilleux",
year = "2010",
doi = "10.1109/IUCS.2010.5666011",
language = "English",
isbn = "9781424478200",
pages = "273--280",
booktitle = "2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings"
}

TY  - GEN
T1  - The structure of unseen trigrams and its application to language models
T2  - A first investigation
AU  - Lepage, Yves
AU  - Gosme, Julien
AU  - Lardilleux, Adrien
PY  - 2010
Y1  - 2010
N2  - In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
AB  - In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
KW  - Europarl
KW  - Structure of unseen trigrams
KW  - Trigram language models
UR  - http://www.scopus.com/inward/record.url?scp=78651459726&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=78651459726&partnerID=8YFLogxK
U2  - 10.1109/IUCS.2010.5666011
DO  - 10.1109/IUCS.2010.5666011
M3  - Conference contribution
AN  - SCOPUS:78651459726
SN  - 9781424478200
SP  - 273
EP  - 280
BT  - 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
ER  -