Generalizing sampling-based multilingual alignment

Adrien Lardilleux*, François Yvon, Yves Lepage

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)


Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on the use of various heuristics, which make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.
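The principle named in the abstract, comparing occurrence profiles in sampled sub-corpora, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an occurrence profile is the set of sentences of a sub-corpus in which a word appears, and that words with identical profiles across all languages are proposed as a candidate translation unit. The function names, the whitespace tokenization, and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict


def occurrence_profiles(subcorpus):
    """Map each (language index, word) to the set of sentence ids where it occurs."""
    profiles = defaultdict(set)
    for sent_id, sentence_tuple in enumerate(subcorpus):
        for lang, sentence in enumerate(sentence_tuple):
            for word in sentence.split():  # assumption: whitespace tokenization
                profiles[(lang, word)].add(sent_id)
    return profiles


def group_by_profile(subcorpus):
    """Group words sharing one profile; keep groups that cover every language."""
    groups = defaultdict(set)
    for key, sent_ids in occurrence_profiles(subcorpus).items():
        groups[frozenset(sent_ids)].add(key)
    all_langs = set(range(len(subcorpus[0])))
    return [g for g in groups.values() if {lang for lang, _ in g} == all_langs]


def sample_alignments(corpus, sample_size, n_samples, seed=0):
    """Count how often each candidate group is found over random sub-corpora."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_samples):
        subcorpus = rng.sample(corpus, sample_size)
        for group in group_by_profile(subcorpus):
            counts[frozenset(group)] += 1
    return counts


# Hypothetical toy corpus: each entry is one sentence in (English, French).
corpus = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
]
counts = sample_alignments(corpus, sample_size=3, n_samples=5)
```

In this toy run every sample is the whole corpus, so "cat" and "chat" share a profile in every sample and are counted together each time. With smaller samples drawn from a real corpus, such counts could serve as raw material for scoring candidate units. That scoring step is left out because the abstract does not describe it.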

Original language: English
Pages (from-to): 1-23
Number of pages: 23
Journal: Machine Translation
Issue number: 1
Publication status: Published - 2013 Mar


Keywords

  • Association measures
  • Phrase-based machine translation
  • Sub-sentential alignment

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence

