WMT 2020 Shared Task Parallel Corpus Filtering and Alignment for Low-Resource Conditions

Event Notification Type: Call for Participation
Location: EMNLP 2020
Date: Wednesday, 11 November 2020
Country: Dominican Republic
City: Punta Cana
Contact: Philipp Koehn, Paco Guzman
Submission Deadline: Wednesday, 1 July 2020

WMT 2020 Shared Task
Parallel Corpus Filtering and Alignment for Low-Resource Conditions
http://www.statmt.org/wmt20/parallel-corpus-filtering.html

We announce and call for participation in the WMT 2020 shared task on assessing the quality of sentence pairs in a parallel corpus.

In the WMT18 shared task on parallel corpus filtering, we posed the challenge of a noisy web-crawled parallel corpus for German-English and asked participants to score each sentence pair. These quality scores were used to select subsets of the corpus consisting of the highest-scoring sentence pairs, to train statistical and neural machine translation systems on those subsets, and to evaluate the systems on several test sets.

In the WMT19 shared task on parallel corpus filtering for low-resource conditions, we followed the same protocol, but this time for Nepali-English and Sinhala-English. For low-resource language pairs like these, both the existing clean parallel corpora and the to-be-scored noisy web-crawled data come in smaller amounts and lower quality.

This year, we pose two different language pairs, Khmer-English and Pashto-English. In addition to the task of computing quality scores for the purpose of filtering, we also allow for the re-alignment of sentence pairs from document pairs.

DETAILS
We provide a very noisy 58.3 million-word (English token count) Khmer-English corpus and an 11.6 million-word Pashto-English corpus. These corpora were partly crawled from the web as part of the Paracrawl project, and partly extracted from the CommonCrawl data set. We ask participants to provide a score for each sentence pair in each of the noisy parallel sets. The scores will be used to subsample sentence pairs that amount to 5 million English words. The quality of the resulting subsets is determined by the quality of a neural machine translation system (fairseq) trained on this data. The quality of the machine translation system is measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia translations for Khmer-English and Pashto-English.
We also provide clean parallel and monolingual training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance.
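
The score-based subsampling described above can be sketched as follows. This is a minimal illustration, not the official selection script: the function name subsample_by_score is hypothetical, English words are counted by whitespace splitting, and selection stops at the first pair that would exceed the budget. All three are assumptions.

```python
from typing import List, Sequence, Tuple


def subsample_by_score(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    max_english_words: int = 5_000_000,
) -> List[int]:
    """Return indices of the highest-scoring pairs within an English word budget.

    Each pair is (source_sentence, english_sentence). Word counting by
    whitespace splitting is an assumption made for this sketch.
    """
    if len(pairs) != len(scores):
        raise ValueError("need exactly one score per sentence pair")

    # Rank pairs from highest to lowest score (ties keep corpus order).
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    selected: List[int] = []
    total_words = 0
    for i in ranked:
        n_words = len(pairs[i][1].split())
        # Assumed stopping rule: stop once the next pair would exceed the budget.
        if total_words + n_words > max_english_words:
            break
        selected.append(i)
        total_words += n_words
    return selected
```

Because only the ranking induced by the scores affects which pairs are kept in this sketch, the absolute scale of the scores does not matter here.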

Note that the task addresses the challenge of data quality and not domain-relatedness of the data for a particular use case. While we provide a development set and a development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.

The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language in source and target, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
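
As an illustration of how some of the noise types listed above might be flagged, here is a rough rule-based sketch. It is not a method prescribed by the task: the names script_fraction and looks_noisy are hypothetical, the thresholds are arbitrary, and the Unicode ranges (the Khmer block for Khmer, the basic Arabic block for Pashto) are a simplifying assumption. It only catches empty sides, untranslated copies, and text in the wrong script.

```python
from typing import Tuple

# Unicode code point ranges used as a crude proxy for language identity.
KHMER_BLOCK = (0x1780, 0x17FF)   # Khmer
ARABIC_BLOCK = (0x0600, 0x06FF)  # Arabic script, used to write Pashto


def script_fraction(text: str, block: Tuple[int, int]) -> float:
    """Fraction of non-whitespace characters whose code point lies in block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    lo, hi = block
    return sum(lo <= ord(c) <= hi for c in chars) / len(chars)


def ascii_letter_fraction(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isascii() and c.isalpha() for c in chars) / len(chars)


def looks_noisy(
    source: str,
    english: str,
    source_block: Tuple[int, int],
    min_fraction: float = 0.5,  # arbitrary threshold chosen for this sketch
) -> bool:
    """Flag pairs with an empty side, a copied side, or the wrong script."""
    if not source.strip() or not english.strip():
        return True
    if source.strip() == english.strip():
        return True  # untranslated copy
    if script_fraction(source, source_block) < min_fraction:
        return True  # source side is mostly not in the expected script
    if ascii_letter_fraction(english) < min_fraction:
        return True  # English side is mostly not Latin letters
    return False
```

Rules like these cannot tell whether two well-formed sentences are translations of each other, so they address only part of the noise described above.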

This year, we also provide the document pairs from which the sentence pairs were extracted (using Hunalign and LASER). You may align sentences yourself from these document pairs, thus producing your own set of sentence pairs. If you opt to do this, you have to submit all aligned sentence pairs and their quality scores.

IMPORTANT DATES
Release of raw parallel data: March 28, 2020
Submission deadline for subsampled sets: July 1, 2020
System descriptions due: July 15, 2020
Announcement of results: June 29, 2020
Paper notification: August 17, 2020
Camera-ready for system descriptions

ORGANIZERS
Philipp Koehn, Johns Hopkins University
Francisco (Paco) Guzmán, Facebook
Vishrav Chaudhary, Facebook
Ahmed Kishky, Facebook
Naman Goyal, Facebook
Peng-Jen Chen, Facebook