Hapax legomena: Their contribution in number and efficiency to word alignment

Adrien Lardilleux, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

2 Citations (Scopus)

Abstract

Current techniques in word alignment disregard words with a low frequency because they would not be useful. Against this belief, this paper shows that, in particular, the notion of hapax legomena may contribute to word alignment to a large extent. In an experiment, we show that pairs of corpus hapaxes contribute to the majority of the best word alignments. In addition, we show that the notion of sentence hapax justifies a practical and common simplification of standard alignment methods.
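To make the notion of a corpus hapax concrete, here is a minimal Python sketch. It is not the authors' alignment method: the function names `corpus_hapaxes` and `hapax_pairs`, the whitespace tokenization, and the toy English-French bitext are all assumptions made for illustration. It collects the words that occur exactly once on each side of a sentence-aligned corpus and pairs them when a sentence pair contains exactly one hapax on each side.

```python
from collections import Counter


def corpus_hapaxes(sentences):
    """Return the set of words that occur exactly once in the whole corpus."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, n in counts.items() if n == 1}


def hapax_pairs(bitext):
    """Pair source and target corpus hapaxes found in the same sentence pair.

    Hypothetical helper: it keeps only the unambiguous case where a
    sentence pair holds exactly one hapax on each side.
    """
    src_hapaxes = corpus_hapaxes(src for src, _ in bitext)
    tgt_hapaxes = corpus_hapaxes(tgt for _, tgt in bitext)
    pairs = []
    for src, tgt in bitext:
        src_h = [w for w in src.split() if w in src_hapaxes]
        tgt_h = [w for w in tgt.split() if w in tgt_hapaxes]
        if len(src_h) == 1 and len(tgt_h) == 1:
            pairs.append((src_h[0], tgt_h[0]))
    return pairs


# Toy bitext, invented for illustration only.
bitext = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
]
print(hapax_pairs(bitext))  # [('cat', 'chat'), ('dog', 'chien')]
```

In this toy corpus the frequent words ("the", "sleeps", "le", "dort") occur twice and are ignored, while each hapax on the source side shares its sentence pair with exactly one hapax on the target side, which gives a candidate alignment without any statistical estimation.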

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 440-450
Number of pages: 11
Volume: 5603 LNAI
DOI: https://doi.org/10.1007/978-3-642-04235-5_38
Publication status: Published - 2009
Externally published: Yes
Event: 3rd Language and Technology Conference, LTC 2007 - Poznan
Duration: 2007 Oct 5 - 2007 Oct 7

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5603 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 3rd Language and Technology Conference, LTC 2007
City: Poznan
Period: 07/10/5 - 07/10/7

Keywords

  • Hapax
  • Low frequency term
  • Word alignment

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Lardilleux, A., & Lepage, Y. (2009). Hapax legomena: Their contribution in number and efficiency to word alignment. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5603 LNAI, pp. 440-450). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5603 LNAI). https://doi.org/10.1007/978-3-642-04235-5_38