Exploiting Sentence Order in Document Alignment

Brian Thompson, Philipp Koehn


Abstract
We present a simple document alignment method that incorporates sentence order information in both candidate generation and candidate re-scoring. Our method results in a 61% relative reduction in error compared to the best previously published result on the WMT16 document alignment shared task. Our method improves downstream MT performance on web-scraped Sinhala–English documents from ParaCrawl, outperforming the document alignment method used in the most recent ParaCrawl release. It also outperforms a comparable corpora method which uses the same multilingual embeddings, demonstrating that exploiting sentence order is beneficial even if the end goal is sentence-level bitext.
Anthology ID:
2020.emnlp-main.483
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5997–6007
URL:
https://aclanthology.org/2020.emnlp-main.483
DOI:
10.18653/v1/2020.emnlp-main.483
Cite (ACL):
Brian Thompson and Philipp Koehn. 2020. Exploiting Sentence Order in Document Alignment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5997–6007, Online. Association for Computational Linguistics.
Cite (Informal):
Exploiting Sentence Order in Document Alignment (Thompson & Koehn, EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.483.pdf
Video:
https://slideslive.com/38938745
Code:
thompsonb/vecalign