Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, Francisco Guzmán


Abstract
Following two preceding WMT Shared Task on Parallel Corpus Filtering (Koehn et al., 2018, 2019), we posed again the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data to be used to train ma-chine translation systems. This year, the task tackled the low resource condition of Pashto–English and Khmer–English and also included the challenge of sentence alignment from document pairs.
Anthology ID:
2020.wmt-1.78
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
726–742
Language:
URL:
https://aclanthology.org/2020.wmt-1.78
DOI:
Bibkey:
Cite (ACL):
Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky, Naman Goyal, Peng-Jen Chen, and Francisco Guzmán. 2020. Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment. In Proceedings of the Fifth Conference on Machine Translation, pages 726–742, Online. Association for Computational Linguistics.
Cite (Informal):
Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment (Koehn et al., WMT 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.wmt-1.78.pdf
Video:
 https://slideslive.com/38939678
Data
CCAlignedParaCrawl