The structure of unseen trigrams and its application to language models: A first investigation

Yves Lepage, Julien Gosme, Adrien Lardilleux

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms the Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages of the common multilingual part of the Europarl corpus except Finnish.
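For orientation, the Good-Turing baseline named in the abstract can be sketched as below. This is only an illustration of the standard textbook estimate (adjusted count r* = (r + 1) N_{r+1} / N_r, unseen mass N_1 / N), not the analogy-based scheme proposed in the paper, whose details are not given in this record. The function names and the fallback for missing counts are assumptions made for this sketch.

```python
from collections import Counter


def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the total probability mass of unseen trigrams: N_1 / N."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = sum(trigrams.values())
    if total == 0:
        return 0.0
    # N_1: number of distinct trigrams observed exactly once.
    singletons = sum(1 for count in trigrams.values() if count == 1)
    return singletons / total


def adjusted_count(r, count_of_counts):
    """Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r.

    Assumption for this sketch: fall back to the raw count r when
    N_r or N_{r+1} is zero, instead of smoothing the count-of-counts.
    """
    n_r = count_of_counts.get(r, 0)
    n_r_plus_1 = count_of_counts.get(r + 1, 0)
    if n_r == 0 or n_r_plus_1 == 0:
        return float(r)
    return (r + 1) * n_r_plus_1 / n_r
```

The low-frequency trigrams that drive this estimate (those counted in N_1) are the same ones the abstract reports as the basis for reconstructing unseen trigrams by analogy.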

Original language: English
Title of host publication: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Pages: 273-280
Number of pages: 8
DOIs: https://doi.org/10.1109/IUCS.2010.5666011
Publication status: Published - 2010 Dec 1
Event: 2010 4th International Universal Communication Symposium, IUCS 2010 - Beijing, China
Duration: 2010 Oct 18 → 2010 Oct 19

Publication series

Name: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings

Conference

Conference: 2010 4th International Universal Communication Symposium, IUCS 2010
Country: China
City: Beijing
Period: 10/10/18 → 10/10/19

Keywords

  • Europarl
  • Structure of unseen trigrams
  • Trigram language models

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Communication

  • Cite this

    Lepage, Y., Gosme, J., & Lardilleux, A. (2010). The structure of unseen trigrams and its application to language models: A first investigation. In 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings (pp. 273-280). [5666011] (2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings). https://doi.org/10.1109/IUCS.2010.5666011