Creating a Corpus for Russian Data-to-Text Generation Using Neural Machine Translation and Post-Editing

Anastasia Shimorina, Elena Khasanova, Claire Gardent


Abstract
In this paper, we propose an approach for semi-automatically creating a data-to-text (D2T) corpus for Russian that can be used to learn a D2T natural language generation model. An error analysis of the output of an English-to-Russian neural machine translation system shows that 80% of the automatically translated sentences contain an error and that 53% of all translation errors bear on named entities (NE). We therefore focus on named entities and introduce two post-editing techniques for correcting wrongly translated NEs.
Anthology ID:
W19-3706
Volume:
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Tomaž Erjavec, Michał Marcińczuk, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
Venue:
BSNLP
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
44–49
URL:
https://aclanthology.org/W19-3706
DOI:
10.18653/v1/W19-3706
Cite (ACL):
Anastasia Shimorina, Elena Khasanova, and Claire Gardent. 2019. Creating a Corpus for Russian Data-to-Text Generation Using Neural Machine Translation and Post-Editing. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 44–49, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Creating a Corpus for Russian Data-to-Text Generation Using Neural Machine Translation and Post-Editing (Shimorina et al., BSNLP 2019)
PDF:
https://aclanthology.org/W19-3706.pdf
Code:
shimorina/bsnlp-2019
Data:
WebNLG