TY - GEN
T1 - The structure of unseen trigrams and its application to language models
T2 - 2010 4th International Universal Communication Symposium, IUCS 2010
AU - Lepage, Yves
AU - Gosme, Julien
AU - Lardilleux, Adrien
PY - 2010
Y1 - 2010
N2 - In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
AB - In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish.
KW - Europarl
KW - Structure of unseen trigrams
KW - Trigram language models
UR - http://www.scopus.com/inward/record.url?scp=78651459726&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78651459726&partnerID=8YFLogxK
U2 - 10.1109/IUCS.2010.5666011
DO - 10.1109/IUCS.2010.5666011
M3 - Conference contribution
AN - SCOPUS:78651459726
SN - 9781424478200
T3 - 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
SP - 273
EP - 280
BT - 2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings
Y2 - 18 October 2010 through 19 October 2010
ER -