Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, Mikel L. Forcada


Abstract
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities took part in this task.
Anthology ID: W18-6453
Volume: Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month: October
Year: 2018
Address: Brussels, Belgium
Venues: EMNLP | WMT | WS
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 726–739
URL: https://www.aclweb.org/anthology/W18-6453
DOI: 10.18653/v1/W18-6453
PDF: https://www.aclweb.org/anthology/W18-6453.pdf