The structure of unseen trigrams and its application to language models: A first investigation

Yves Lepage, Julien Gosme, Adrien Lardilleux

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

1 Citation (Scopus)

Abstract

In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms the Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages of the common multilingual part of the Europarl corpus, except Finnish.
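The abstract does not spell out how the reconstruction is performed, so the following is only a minimal sketch of one possible reading, not the authors' algorithm. It assumes a word-level, position-by-position notion of proportional analogy (A : B :: C : D holds when, at each of the three positions, either A and B agree and C and D agree, or A and C agree and B and D agree). The function names `solves_analogy` and `reconstructible` and the toy trigrams are hypothetical.

```python
from itertools import product


def solves_analogy(a, b, c, d):
    """Return True if a : b :: c : d holds at every position.

    Assumed definition (not taken from the paper): at each position,
    either a == b and c == d, or a == c and b == d.
    """
    return all(
        (x == y and z == w) or (x == z and y == w)
        for x, y, z, w in zip(a, b, c, d)
    )


def reconstructible(unseen, seen):
    """Return a triple (a, b, c) of seen trigrams such that
    a : b :: c : unseen, or None if no such triple exists.

    Brute force over all triples, so only suitable for tiny examples.
    """
    for a, b, c in product(seen, repeat=3):
        if solves_analogy(a, b, c, unseen):
            return a, b, c
    return None


# Toy data (hypothetical). In the setting of the abstract, `seen` would
# be restricted to the trigrams with the lowest frequencies.
seen = [
    ("the", "cat", "sleeps"),
    ("the", "dog", "sleeps"),
    ("a", "cat", "sleeps"),
]
unseen = ("a", "dog", "sleeps")
witness = reconstructible(unseen, seen)
```

Here `witness` is a triple of seen trigrams that forms an analogy with the unseen one, whereas a trigram containing a word absent from `seen` at some position, such as `("a", "bird", "flies")`, yields `None`.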

Original language: English
Title of host publication: 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Pages: 273-280
Number of pages: 8
DOIs: https://doi.org/10.1109/IUCS.2010.5666011
Publication status: Published - 2010
Event: 2010 4th International Universal Communication Symposium, IUCS 2010 - Beijing
Duration: 2010 Oct 18 to 2010 Oct 19

Keywords

  • Europarl
  • Structure of unseen trigrams
  • Trigram language models

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Communication

Cite this

Lepage, Y., Gosme, J., & Lardilleux, A. (2010). The structure of unseen trigrams and its application to language models: A first investigation. In 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings (pp. 273-280). [5666011] https://doi.org/10.1109/IUCS.2010.5666011