PROMT Systems for WMT 2019 Shared Translation Task

This paper describes the PROMT submissions for the WMT 2019 Shared News Translation Task. This year we participated in two language pairs and in three directions: English-Russian, English-German and German-English. All our submissions are Marian-based neural systems. We use significantly more data compared to the last year. We also present our improved data filtering pipeline.


Introduction
This paper provides an overview of the PROMT submissions for the WMT 2019 Shared News Translation Task.This year we participate with neural MT systems for the second time.We participate in two language pairs and in three directions (English-Russian, English-German and German-English).We describe our data preparation pipelines, models training setups and present the results on the newstest sets.
The paper is organized as follows: Section 2 is a brief overview of the submitted systems.Section 3 describes the data preparation, preprocessing and statistics in detail.Section 4 provides a detailed description of the systems.In Section 5 we present and discuss the results.Section 6 concludes the paper.

Data
We use all data provided by the WMT organizers, private in-house parallel data and other publicly available data, mainly from the OPUS website (Tiedemann, 2012).
The Tatoeba sets as our validation sets and the newstest2018 is our test set.The reason why we choose the Tatoeba corpus for validation is that we aim at building general-domain (and not just news-domain) models.Besides, the Tatoeba corpus is available for many language pairs beyond the scope of the WMT Translation Task.
We select a small subset from training data and mix it with monolingual news with its backtranslations for fine-tuning.This will be described in detail in Section 3.4 below.

Data filtering
There are several stages in our data filtering pipeline.The statistics for the final training data are shown in Table 1 (English-Russian) and Table 2 (English-German).

Basic filtering
This includes some simple length-based and source-target length ratio-based heuristics, removing tags, lines with low amount of alphabetic symbols etc.We also remove lines which appear to be emails or web-addresses.In addition, we remove lines with rare words from the Bookshop and the OpenSubtitles corpora (using frequency lists built on large monolingual corpora including all monolingual data from WMT, private data and Wikipedia dumps).

Deduplication
We remove duplicate translations and keep only the most frequent translation for the source sentence if it repeats more than two times.This procedure is applied to some corpora, e.g.OpenSubtitles and MultiUN which contain a lot of various (and often incorrect) translations for common phrases.For example, the English phrase 'No.' is encountered almost 100k times in the source side of the English-Russian OpenSubtitles corpus.It has more than 78k unique translations, second most popular among which is 'Да.' ('Yes.' in Russian).

Language detection
The algorithm is a fairly simple ensemble of three tools: pycld21 , langid (Lui and Baldwin, 2012), langdetect2 .

Parallel segments filtering
We apply this step to low-quality data (basically, OpenSubtitles, CommonCrawl, ParaCrawl, Bookshop).We use Hunalign (Varga et al., 2005) to obtain basic sentence pair scores.We also extract about 30 additional features from sentence pairs and apply inhouse classifier to discard unparallel sentence pairs.It is a simple SVM classifier, and the features include source and target lengths in tokens, average token length in symbols, number of punctuation symbols in source and target etc.We do not use any categorical features.

Data filtering using language models
As last year, we use the modified bilingual Moore-Lewis data selection algorithm (Axelrod et al., 2011).However, this time we apply it all training corpora.We use the English and Russian news 2018 corpora from statmt.org as the indomain corpora.The idea is that the news corpora can be seen as high quality general-domain data.So using them in this scenario allows to remove some noisy outlying data.
We also substitute numbers and alphanumeric sequences with placeholders and sort the data according to language models scores.We use Levenshtein distance (set to a rather low threshold) to remove similar sentence pairs with similar scores.We regard such sentence pairs as useless (or even harmful) duplicates which can prevent our translation models from better and faster converging.We remove up to 15% of data using this procedure.

BPE
We use byte pair encoding (BPE) (Sennrich et al., 2016b) to encode our data to subword units.This year we use a different preprocessing scheme compared to the last year's systems.We noticed that the BPE algorithm from the OpenNMT toolkit (Klein et al., 2017) gives better results compared to the default script learn_bpe.pyfrom the MarianNMT toolkit.We see two reasons for that: 1) the BPE merge operations are learnt to distinguish subword units at the beginning, in the middle and at the end of the word and 2) the BPE merge operations can be learnt in case-insensitive mode (OpenNMT architecture supports features, so a feature can be used to handle case).Caseinsensitive BPE model is very useful when dealing with a lot of different and sometimes noisy data (like, for example, OpenSubtitles where uppercase is often used to communicate emphasis).This is also crucial when dealing with legal and financial data where specific terms are written in title case or uppercase.News headlines are also often written in title case or uppercase.As MarianNMT does not support features yet, we decided to perform a 'trick' similar to the one described in (Tamchyna et al., 2017): instead of using a feature we insert special tokens <C> and <U> after sequences in title case or uppercase.For example, a source sentence World Championships 2017: Neil Black praises Scottish members of Team GB is converted to world <C> championships <C> 2017 : neil <C> black <C> pra@@ ises scottish <C> members of team <C> gb <U> We do not use truecaser in our pipeline as it is redundant.All data is tokenized using the Moses toolkit (Koehn et al., 2007) tokenizer with aggressive tokenization, then the OpenNMT BPEsplitter is applied, after that we convert the case feature to separate tokens.

English-Russian system
Same as last year, we train the model with separate vocabularies due to the Cyrillic nature of Russian alphabet.Therefore we use separate BPE models for source and target with 35k and 45k merge operations respectively.We experimented with shared vocabulary following the procedure for the English-Russian pair described in (Sennrich et al., 2016b) but did not get improvements.This year, however, we train much smaller BPE models as we noticed that our NMT systems do not handle large vocabularies (70-90k) well and generate many OOVs in the output.

English-German and German-English systems
We train a joint BPE model for the English-German pair with 40k merge operations.We use a shared vocabulary and tie all embeddings of the translation models.The human parallel data for the German-English system is exactly the same as for the English-German system, the two systems only have different synthetic back-translated data.

Synthetic data
There are two types of additional synthetic training data described in detail below.The final size of the training data for the submitted systems is roughly 4 times the total size of the filtered data in Tables and 2. Back-translated data Back-translations (Sennrich et al., 2016a) are a common way to improve NMT models quality.As we aim at building general-domain models, we use data from Wikipedia dumps and news from statmt.org.We shuffle the Wikipedia data and randomly select a subset of appropriate size.The selected Wikipedia subset and the news subset are roughly equal in size.The size of the whole corpus used for back-translation is approximately equivalent to the size of human training data.
For the English-Russian pair we train a baseline Russian-English transformer model using the data prepared for the last year's WMT news task (Molchanov, 2018).For the German-English we also trained a transformer model using some data from OPUS as is: Europarl, DGT, JRC-Acquis, EMEA, ECB, NewsCommentary, TED2013, GlobalVoices.We use the Tatoeba corpus as our validation set in both cases.We use our final English-German model to obtain back-translations for the German-English model.
The trained systems were used to back-translate the 2017, 2018 news corpora from statmt.org and data selected from Wikipedia in Russian, German and English respectively.

Replicated data with unknown words
We apply the technique described in (Pinnis et al., 2017) to create a synthetic parallel corpus.The procedure includes the following steps: first, we perform word-alignment of our initial parallel training corpus using the fast-align tool (Dyer et al., 2013).Then, we randomly replace from one to three unambiguously (one-to-one) aligned tokens in both source and target parallel sentences with the special <UNK> placeholder.The same pipeline is applied to both the initial and backtranslated data.We train our models to reproduce the <UNK> placeholder in various contexts and use this feature for handling named entities described in Section 4.2 below.

Data for fine-tuning
We again apply the modified bilingual Moore-Lewis data selection algorithm.We use the news 2018 corpora as our in-domain data.We select 1M sentences from the human training data (excluding MultiUN and OpenSubtitles).We also randomly select 1M sentences from the news 2018 corpus with their back-translations.The same procedure is applied to both English-Russian and English-German pairs.

Systems architecture
This section describes the trained systems in detail.We train transformer (Vaswani et al., 2017) models for all submitted systems.We use the recipe available at the MarianNMT website3 .The system configuration, hyperparameters and training steps follow those in the recipe.There are two minor differences: we check the validation translation less frequently and set a higher earlystopping threshold to allow the model iterate over the training data a bit longer; 2) we do not use shared vocabulary for the English-Russian system because of the different alphabets in English and Russian as we mentioned earlier.For this reason we do not tie all embeddings and only tie the target embeddings to the output layer.
We trained two models -Model1 and Model2for the English-Russian pair with different seeds for almost five epochs each.The training data for the two models is slightly different: 1) we did not use the deduplication scheme described in Section 3.1 above for Model1; 2) we found about 350k English sentences in the Russian news 2018 corpus.These were removed from the synthetic data only before training Model2.
We trained single models for the English-German and German-English.Both models were trained for two epochs.

Back-off to RBMT
We fall back to our rule-based system (RBMT) in several cases:  if the NMT model output's language is other than expected.For example, we noticed that the English-Russian model sometimes generates English text (less than 1% of the test set sentences).The reasons for this were the 350k English sentences in Russian news 2018 corpus that we used for back-translation.We did not apply language filtering to the newscrawl corpora because they had been filtered by the WMT organisers until 2018.The English output is handled by the inhouse language detection tool.
 If the output contains recurring words or n-grams.
 If the output is much shorter or longer compared to the input sentence.We use simple rules based on source-translation length ratio to detect such cases.
 We also fall back to RBMT to translate very short strings (one or two words).

Handling named entities
We preserve several types of named entities (NEs): numbers, emails, alphanumeric sequencies etc. in the following way.First, we produce the baseline NMT translation without any processing.
Then we validate the translation of NEs by comparing the system's output to the source sentence.The validation is simple: we search for the corresponding strings (numbers, emails etc.) in the system's output.If some of the NEs are not translated or are translated incorrectly, we replace the entities with the <UNK> placeholder in the source sentence and translate the sentence again allowing the decoder to generate unknown words in the output.Finally, we substitute the <UNK> placeholders in the output with their initial value.
If the number of the <UNK> placeholders in the NMT system's output is not equal to the number of the placeholders in the source sentence, we fall back to the baseline NMT translation without NEs processing.We do not do any specific processing for proper names this time as they are handled much better by our current systems compared to our last year's submissions.

Models configuration
We use an ensemble of two fine-tuned models as our final translation system for the English-Russian pair.
We use a single fine-tuned model for the English-German system; the German-English system is a single baseline model.
We use the beam of size 12 and the -normalize parameter is set to 1.

Results and discussion
In this section we present the BLEU (Papineni et al., 2002) scores for our systems on two test sets and the analysis of the results.
The scores are presented in Table 4. Calculation is done using the multi-bleu-detok.perlscript from the Moses toolkit.
We significantly outperform the baseline for the English-Russian pair -our last year's submission for the News Task, an ensemble of 4 models.The results for Model1 and Model2 show us that better data filtering leads to better translation quality.
Fine-tuning does not give us significant improvements in terms of BLEU.We should probably try new approaches to data selection for domain adaptation.
We should also note the lower quality of the German-English model compared to our models and other participants.We think this must be connected with the fact that the data used for training the German-English model was in fact filtered for training the English-German model (thus, we paid less attention to the English side of the data).

Conclusions and Future work
In this paper we have described our submissions for the WMT 2019 Shared News Translation Task.Overall we have made three submissions: English-Russian, English-German and German-English.
We have documented the methodology used to prepare the training data, system training set-ups, the pipelines for handling NEs and using RBMT.
We show competitive results in two out of three language pairs.We plan our future research in several directions.First of all, data filtering improvement (especially when training models in both directions).Second, handling proper names translation into Russian.Finally, exploring other language pairs including the Chinese and Kazakh languages.

Table 2 :
Statistics for the filtered parallel English-German data in millions of sentences (#sent) and tokens.

Table 1 :
Statistics for the filtered parallel English-Russian data in millions of sentences (#sent) and tokens.

Table 4 :
Results for different systems.The submitted systems are marked in bold.Model2018 stands for our last year's submitted system which we consider the baseline.Model1 and Model2 are described in Section 4 above.