SpanAlign: Sentence Alignment Method based on Cross-Language Span Prediction and ILP

Katsuki Chousa, Masaaki Nagata, Masaaki Nishino


Abstract
We propose a novel method for automatic sentence alignment of noisy parallel documents. We first formalize sentence alignment as a set of independent predictions, each mapping a sentence in the source document to a span in the target document. We then introduce a global optimization step using integer linear programming (ILP) to prevent overlapping spans and to obtain non-monotonic alignments. We implement cross-language span prediction by fine-tuning pre-trained multilingual language models based on the BERT architecture, and we train them on pseudo-labeled data obtained from an unsupervised sentence alignment method. While the baseline methods use sentence embeddings and assume monotonic alignment, our method can capture token-to-token interactions between the source and target texts and can handle non-monotonic alignments. In sentence alignment experiments on English-Japanese, our method achieved an F1 score of 70.3, which is 8.0 points higher than the baseline method. In particular, our method improved the F1 score for extracting non-parallel sentences by 53.9 points. When the extracted bilingual sentences were used to fine-tune a pre-trained Japanese-to-English translation model, our method improved downstream machine translation accuracy by 4.1 BLEU points.
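The overlap-prevention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the candidate tuples, the scores, and the function name select_spans are hypothetical, and exhaustive enumeration of the binary variables stands in for a real ILP solver, so it only suits toy inputs. Each candidate gets a 0/1 decision, and the constraints allow each source sentence and each target sentence to be used at most once.

```python
from itertools import product

def select_spans(candidates):
    """Pick the highest-scoring set of non-overlapping span predictions.

    Each candidate is (src_idx, tgt_start, tgt_end, score), with tgt_end
    inclusive. One binary variable per candidate is enumerated exhaustively;
    an ILP solver would replace this loop on realistic inputs.
    """
    best_score, best = 0.0, []
    for mask in product((0, 1), repeat=len(candidates)):
        chosen = [c for c, x in zip(candidates, mask) if x]
        # Constraint 1: each source sentence is aligned at most once.
        sources = [c[0] for c in chosen]
        if len(sources) != len(set(sources)):
            continue
        # Constraint 2: no target sentence is covered by two spans.
        targets = [t for c in chosen for t in range(c[1], c[2] + 1)]
        if len(targets) != len(set(targets)):
            continue
        total = sum(c[3] for c in chosen)
        if total > best_score:
            best_score, best = total, chosen
    return best

# Hypothetical span predictions with made-up scores.
candidates = [
    (0, 2, 2, 0.9),  # source 0 -> target sentence 2
    (1, 0, 1, 0.8),  # source 1 -> target sentences 0..1
    (2, 1, 2, 0.7),  # source 2 -> target sentences 1..2 (overlaps both above)
    (2, 3, 3, 0.4),  # source 2 -> target sentence 3
]
alignment = select_spans(candidates)
print(alignment)
```

In this toy input the overlapping candidate for source 2 is dropped in favor of its lower-scoring but compatible alternative, and the result is non-monotonic: source 0 maps to a later target position than source 1.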
Anthology ID:
2020.coling-main.418
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4750–4761
URL:
https://aclanthology.org/2020.coling-main.418
DOI:
10.18653/v1/2020.coling-main.418
Cite (ACL):
Katsuki Chousa, Masaaki Nagata, and Masaaki Nishino. 2020. SpanAlign: Sentence Alignment Method based on Cross-Language Span Prediction and ILP. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4750–4761, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
SpanAlign: Sentence Alignment Method based on Cross-Language Span Prediction and ILP (Chousa et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.418.pdf
Code:
nttcslab-nlp/spanalign