Generalizing sampling-based multilingual alignment

Adrien Lardilleux, François Yvon, Yves Lepage

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.
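The principle named in the abstract, comparing occurrence profiles in sampled sub-corpora, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, the whitespace tokenization, and the rule "group words whose sets of sentence indices are identical" are all assumptions made for illustration. Grouped words are treated as unordered sets, so contiguity and word order are ignored.

```python
import random
from collections import Counter, defaultdict


def occurrence_profiles(subcorpus):
    """Map each (language, word) to the set of sub-corpus sentence indices containing it."""
    profiles = defaultdict(set)
    for idx, (src, tgt) in enumerate(subcorpus):
        for word in src.split():
            profiles[("src", word)].add(idx)
        for word in tgt.split():
            profiles[("tgt", word)].add(idx)
    return profiles


def candidate_units(subcorpus):
    """Group words sharing an identical profile; keep groups covering both languages."""
    groups = defaultdict(lambda: {"src": set(), "tgt": set()})
    for (lang, word), lines in occurrence_profiles(subcorpus).items():
        groups[frozenset(lines)][lang].add(word)
    return {
        (frozenset(g["src"]), frozenset(g["tgt"]))
        for g in groups.values()
        if g["src"] and g["tgt"]
    }


def sample_units(corpus, n_samples=100, sample_size=2, seed=0):
    """Draw random sub-corpora and count how often each candidate unit is found."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        sub = rng.sample(corpus, min(sample_size, len(corpus)))
        counts.update(candidate_units(sub))
    return counts


# Hypothetical three-sentence parallel corpus (English, French), for illustration only.
toy_corpus = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a cat", "un chat"),
]

for (src, tgt), n in sample_units(toy_corpus).most_common(5):
    print(sorted(src), sorted(tgt), n)
```

In this sketch, small samples make it more likely that several words share a profile, so multi-word groups appear alongside single-word pairs; the counts accumulated over many samples could then feed an association score of the reader's choosing.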

Original language: English
Pages (from-to): 1-23
Number of pages: 23
Journal: Machine Translation
Volume: 27
Issue number: 1
DOI: 10.1007/s10590-012-9126-0
Publication status: Published - 2013 Mar

Fingerprint

Sampling
Statistical Models
Alignment
Group
Translation Units
Heuristics
Machine Translation System
Statistical Machine Translation
Machine Translation
Parallel Texts

Keywords

  • Association measures
  • Phrase-based machine translation
  • Sub-sentential alignment

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Language and Linguistics
  • Linguistics and Language

Cite this

Generalizing sampling-based multilingual alignment. / Lardilleux, Adrien; Yvon, François; Lepage, Yves.

In: Machine Translation, Vol. 27, No. 1, 03.2013, p. 1-23.

@article{2b4862cb5e614fdb852aaa20b515ec93,
title = "Generalizing sampling-based multilingual alignment",
abstract = "Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.",
keywords = "Association measures, Phrase-based machine translation, Sub-sentential alignment",
author = "Adrien Lardilleux and Fran{\c{c}}ois Yvon and Yves Lepage",
year = "2013",
month = "3",
doi = "10.1007/s10590-012-9126-0",
language = "English",
volume = "27",
pages = "1--23",
journal = "Machine Translation",
issn = "0922-6567",
publisher = "Springer Netherlands",
number = "1",

}

TY - JOUR

T1 - Generalizing sampling-based multilingual alignment

AU - Lardilleux, Adrien

AU - Yvon, François

AU - Lepage, Yves

PY - 2013/3

Y1 - 2013/3

AB - Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, when training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on various heuristics that make it possible to align first isolated words and then, by extension, groups of words. In this paper, we explore an alternative approach that relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks.

KW - Association measures

KW - Phrase-based machine translation

KW - Sub-sentential alignment

UR - http://www.scopus.com/inward/record.url?scp=84875743779&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84875743779&partnerID=8YFLogxK

U2 - 10.1007/s10590-012-9126-0

DO - 10.1007/s10590-012-9126-0

M3 - Article

AN - SCOPUS:84875743779

VL - 27

SP - 1

EP - 23

JO - Machine Translation

JF - Machine Translation

SN - 0922-6567

IS - 1

ER -