The Universitat d’Alacant Submissions to the English-to-Kazakh News Translation Task at WMT 2019

This paper describes the two submissions of Universitat d’Alacant to the English-to-Kazakh news translation task at WMT 2019. Our submissions take advantage of monolingual data and parallel data from other language pairs by means of iterative backtranslation, pivot backtranslation and transfer learning. They also use linguistic information in two ways: morphological segmentation of Kazakh text, and integration of the output of a rule-based machine translation system. Our systems were ranked second in terms of chrF++ despite being built from an ensemble of only 2 independent training runs.


Introduction
This paper describes the Universitat d'Alacant submissions to the WMT 2019 news translation task. Our two submissions address the low-resource English-to-Kazakh language pair, for which only a few thousand in-domain parallel sentences are available.
In order to build competitive neural machine translation (NMT) systems, we generated synthetic training data. We took advantage of the available English-Russian (en-ru) and Kazakh-Russian (kk-ru) parallel data by means of pivot backtranslation and transfer learning, and integrated monolingual data by means of iterative backtranslation.
In addition, we used linguistic information in two different ways: we morphologically segmented the Kazakh text to make the system generalize better from the training data; and we built a hybrid system combining NMT and the Apertium English-to-Kazakh rule-based machine translation (RBMT) system (Forcada et al., 2011; Sundetova et al., 2015).
The rest of the paper is organized as follows. Section 2 describes how corpora were filtered and preprocessed, and the steps followed to train NMT systems from them. Section 3 outlines the process followed to obtain synthetic training data. Sections 4 and 5 describe respectively morphological segmentation and hybridization with Apertium. The model ensembles we submitted are then presented in Section 6. The paper ends with some concluding remarks.

Data preparation and training details
In our submissions, we only used the corpora allowed in the constrained task.
Parallel corpora were cleaned with the script clean-corpus-n.perl shipped with Moses (Koehn et al., 2007), which removes unbalanced sentence pairs and those with at least one side longer than 80 tokens. Additional filtering steps, described below, were applied to the web-crawled corpora. Tables 1 and 2 show the number of segments in the parallel and monolingual corpora used, and their sizes after cleaning.
The English-Kazakh web-crawled corpus allowed in the constrained task contained a high proportion of parallel segments that were not translations of each other. We filtered it with Bicleaner, applying the hardrules and the detection of misaligned sentences described by Sánchez-Cartagena et al. The Kazakh-Russian crawled corpus was cleaned in a shallower way: we just removed those sentence pairs that contained less than 50% of alphabetic characters on either side, as we did not consider them fluent enough to be useful for NMT training. The same filtering was applied to the monolingual Kazakh Common Crawl corpus. In addition, inspired by Iranzo-Sánchez et al. (2018), we ranked its sentences by perplexity computed by a character-based 7-gram language model and discarded the half of the corpus with the highest perplexity. The language model was trained² on the high-quality Kazakh monolingual News Commentary corpus.
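The two shallow filters described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the submissions: the function names are ours, whitespace is counted in the denominator of the alphabetic ratio (the text does not say whether it should be), and `perplexity` is a hypothetical callable standing in for the character-based 7-gram language model.

```python
from typing import Callable, List, Sequence, Tuple


def alphabetic_ratio(text: str) -> float:
    """Fraction of characters in `text` that are alphabetic (Unicode-aware)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def filter_pairs(pairs: Sequence[Tuple[str, str]],
                 min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Drop sentence pairs in which either side is below `min_ratio`."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if alphabetic_ratio(src) >= min_ratio and alphabetic_ratio(tgt) >= min_ratio
    ]


def keep_lowest_perplexity_half(sentences: Sequence[str],
                                perplexity: Callable[[str], float]) -> List[str]:
    """Rank sentences by perplexity and keep the half with the lowest values.

    `perplexity` is an assumed scoring function; with an odd number of
    sentences this sketch rounds the kept half down.
    """
    ranked = sorted(sentences, key=perplexity)
    return ranked[: len(ranked) // 2]
```

For the monolingual corpus, the same `alphabetic_ratio` threshold would be applied to single sentences before the perplexity ranking.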
Training corpora were tokenized and truecased with the Moses scripts. Truecaser models were learned independently for each trained system from the very same training parallel corpus. Unless otherwise specified, for each trained system, words were split with 50 000 byte pair encoding (BPE; Sennrich et al., 2016c) operations learned from the concatenation of the source-language (SL) and target-language (TL) training corpora.
As described in Section 6, our submissions were ensembles of Transformer (Vaswani et al., 2017) and recurrent neural network (RNN; Bahdanau et al., 2015) NMT models trained with the Marian toolkit. We used the Transformer hyperparameters described by Sennrich et al. (2017) and the RNN hyperparameters described by Sennrich et al. (2016a). Early stopping was based on perplexity, with patience set to 5. We selected the checkpoint that obtained the highest BLEU (Papineni et al., 2002) score on the development set.
¹ [...] Wikititles parallel corpus and extracted the positive and negative training examples from News Commentary. We kept those sentences with a classifier score above 0.6.
² The language model was trained with KenLM (Heafield, 2011) with modified Kneser-Ney smoothing (Ney et al., 1994).
Since the only evaluation corpus made available was newsdev2019, we split it into two halves and used them as development and test sets, respectively, in all the training runs prior to the submission (those reported in all sections but Section 6). Throughout the paper, we report BLEU and chrF++ (Popović, 2017) scores. The latter is known to correlate better than BLEU with human judgements when the TL is highly inflected (Bojar et al., 2017), as is the case here. Where reported, we assess whether differences between systems' outputs are statistically significant for p < 0.05 with 1 000 iterations of paired bootstrap resampling (Koehn, 2004).
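For readers unfamiliar with paired bootstrap resampling, the following is a minimal sketch of the test. It assumes a corpus-level `metric(hypotheses, references)` callable; `exact_match` is a toy stand-in for BLEU or chrF++, and all names are illustrative rather than taken from any evaluation toolkit.

```python
import random
from typing import Callable, List, Sequence


def exact_match(hyps: Sequence[str], refs: Sequence[str]) -> float:
    """Toy corpus-level metric: share of hypotheses identical to the reference."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap(sys_a: Sequence[str],
                     sys_b: Sequence[str],
                     refs: Sequence[str],
                     metric: Callable[[List[str], List[str]], float],
                     iterations: int = 1000,
                     seed: int = 0) -> float:
    """Return the fraction of resampled test sets on which A scores above B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(iterations):
        # Sample sentence indices with replacement; the same indices are used
        # for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([sys_a[i] for i in idx], sample_refs)
        score_b = metric([sys_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / iterations
```

Under this scheme, system A would be considered significantly better than system B for p < 0.05 when the returned fraction is at least 0.95.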

Data augmentation
This section describes the process followed to select the best strategy to take advantage of parallel corpora from other language pairs (Section 3.1) and monolingual corpora (Section 3.2).

Data from other language pairs
In order to take advantage of the parallel corpora listed in Table 1 for other language pairs, we applied the transfer learning approach proposed by Kocmi and Bojar (2018). We experimented with the parent models listed next (models trained on other high-resource language pairs) and used the concatenation of the genuine English-Kazakh parallel data as the child corpus (corpus of a low-resource language pair used to continue training a parent model):
• A Russian-to-Kazakh model trained on the crawled parallel corpus depicted in Table 1.
• An English-to-Russian model trained on all the available parallel data for the English-Russian language pair in this year's news translation task (depicted in Table 1).
• A multilingual system (Johnson et al., 2017) trained on the concatenation of the corpora of the two previous models. This strategy aims at making the most of the data available for related language pairs.
We also explored pivot backtranslation (Huck and Ney, 2012): we translated the Russian side of the crawled Kazakh-Russian parallel corpus with a Russian-to-English NMT system to produce a synthetic English-Kazakh parallel corpus. The NMT system was a Transformer trained on the English-Russian parallel data depicted in Table 1. We trained a system on the concatenation of the pivot-backtranslated corpus and the genuine English-Kazakh parallel data, and fine-tuned it only on the latter.
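The construction of the pivot-backtranslated corpus can be sketched in a few lines. Here `ru_to_en` is a hypothetical callable wrapping the Russian-to-English system, and pairs are assumed to be stored as (Kazakh, Russian) tuples; neither detail comes from our actual pipeline.

```python
from typing import Callable, List, Sequence, Tuple


def pivot_backtranslate(kk_ru_pairs: Sequence[Tuple[str, str]],
                        ru_to_en: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Build synthetic (English, Kazakh) pairs from (Kazakh, Russian) pairs.

    The Kazakh side stays genuine; only the English source side is synthetic.
    """
    return [(ru_to_en(ru), kk) for kk, ru in kk_ru_pairs]


def build_training_corpus(genuine_en_kk: Sequence[Tuple[str, str]],
                          pivot_en_kk: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Concatenate the synthetic corpus and the genuine data for the main run."""
    return list(pivot_en_kk) + list(genuine_en_kk)
```

Fine-tuning would then continue training on `genuine_en_kk` alone.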
The results of the evaluation of these strategies, reported in the upper part of Table 3, show that the multilingual/transfer learning strategy outperforms the pure transfer learning approaches, probably because it takes advantage of more resources. Moreover, it performs similarly to pivot backtranslation, which we chose for our submission. All the strategies evaluated clearly outperformed the system trained only on the genuine parallel data.
As a Kazakh-to-English MT system is needed to backtranslate the Kazakh monolingual data (see Section 3.2), we also explored the best strategy for taking advantage of data from other language pairs for that direction. We experimented only with transfer learning and discarded pivot backtranslation since we wanted to avoid training a system on a parallel corpus with a synthetic TL side.
We evaluated the same parent-child configurations as in the English-to-Kazakh experiments, but we inverted their direction to ensure that either the SL of the parent corpora is Kazakh or the TL is English. Results are reported in the lower part of Table 3 and show that, as in the opposite direction, transfer learning brings a clear improvement over training only on the genuine parallel data, and the best parent model is the multilingual one.

Monolingual data: iterative backtranslation
Backtranslation (Sennrich et al., 2016b) is a widespread method for integrating TL monolingual corpora into NMT systems. In order to integrate the available Kazakh monolingual data into our submission, we need a Kazakh-to-English MT system that is as competitive as possible, since the quality of a system trained on backtranslated data is usually correlated with the quality of the system that performs the backtranslation (Hoang et al., 2018, Sec. 3). We followed the iterative backtranslation algorithm outlined below with the aim of obtaining strong English-to-Kazakh and Kazakh-to-English systems using monolingual English and monolingual Kazakh corpora:
1. The best strategies from Section 3.1 were applied to build systems in both directions without backtranslated monolingual data.
2. English and Kazakh monolingual data were backtranslated with the previous systems.
3. Systems in both directions were trained on the combination of the backtranslated data and the parallel data.
4. Steps 2-3 were re-executed 2 more times. Backtranslation in step 2 was always carried out with the systems built in the most recent execution of step 3.
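The four steps above can be summarized as the following loop. It is a schematic sketch only: `train` and `translate` are hypothetical callables standing in for NMT training and decoding, models are treated as opaque objects, and the sketch trains both directions in every round and does not model the growth of the English subset across iterations.

```python
from typing import Any, Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def iterative_backtranslation(parallel_en_kk: Sequence[Pair],
                              mono_en: Sequence[str],
                              mono_kk: Sequence[str],
                              train: Callable[[List[Pair]], Any],
                              translate: Callable[[Any, str], str],
                              rounds: int = 3) -> Tuple[Any, Any]:
    """Alternate backtranslation and retraining in both directions."""
    en_kk = list(parallel_en_kk)
    kk_en = [(kk, en) for en, kk in parallel_en_kk]

    # Step 1: initial systems trained without backtranslated data.
    en2kk = train(en_kk)
    kk2en = train(kk_en)

    for _ in range(rounds):
        # Step 2: backtranslate with the most recent systems.
        synthetic_en = [translate(kk2en, kk) for kk in mono_kk]
        synthetic_kk = [translate(en2kk, en) for en in mono_en]

        # Step 3: retrain on genuine data plus synthetic-source pairs.
        # The target side of every synthetic pair is genuine monolingual text.
        en2kk = train(en_kk + list(zip(synthetic_en, mono_kk)))
        kk2en = train(kk_en + list(zip(synthetic_kk, mono_en)))

    return en2kk, kk2en
```

With `rounds=3`, steps 2-3 are executed once and then re-executed 2 more times, as in step 4.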
The Kazakh monolingual corpus used was the concatenation of the corpora listed in Table 2, while the English monolingual corpus was a subset of the News Crawl corpus in the same table. The size of the subset started at 5 million sentences in the first backtranslation and was doubled after each one. The objective of the first 2 executions of steps 2-3 (from now on, iterations) was building a strong Kazakh-to-English system. The remainder of this section explains how MT systems were trained in these 2 iterations. The objective of the 3rd iteration, in which only English-to-Kazakh systems were trained, was building the submissions, and the corresponding details are described in Section 6.
We explored different ways of training NMT systems with backtranslated data. First, we carried out transfer learning from the multilingual models described in Section 3.1. In this case, the child model was trained on a parallel corpus built from the concatenation of the genuine parallel data and the backtranslated data. The genuine parallel data was oversampled to match the size of the backtranslated data (Chu et al., 2017).
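As an illustration of the oversampling step, the sketch below repeats the genuine corpus until it has as many sentence pairs as the backtranslated one. Repeating whole copies and truncating the last one is our assumption for the example; the function name is likewise illustrative.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def oversample_to_match(genuine: Sequence[Pair],
                        backtranslated: Sequence[Pair]) -> List[Pair]:
    """Concatenate both corpora after repeating `genuine` to match sizes."""
    if not genuine:
        return list(backtranslated)
    target_size = max(len(genuine), len(backtranslated))
    repeats = math.ceil(target_size / len(genuine))
    # Repeat the genuine data and cut the last copy so both parts are equal in size.
    oversampled = (list(genuine) * repeats)[:target_size]
    return oversampled + list(backtranslated)
```

When the genuine corpus is already the larger one, it is kept once and left unchanged.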
As an alternative to transfer learning, we experimented with corpus concatenation and fine-tuning. For the English-to-Kazakh direction, we concatenated the backtranslated data to the pivot-backtranslated corpus and the genuine parallel corpora, trained a model from scratch, and fine-tuned it only on the genuine parallel data. For the opposite direction, we trained a system only on the concatenation of the backtranslated and the genuine parallel data, and fine-tuned it on the latter (note that in this set-up we dispensed with parallel data from other language pairs).
Table 4 shows the automatic evaluation scores obtained in the 1st iteration by the strategies being evaluated. Only the best performing strategies in the 1st iteration were used in the subsequent ones; the scores obtained in the 2nd iteration are also reported. The results show the positive impact of the introduction of backtranslated data in both directions. Concatenation plus fine-tuning outperformed transfer learning in both directions. This result is surprising for Kazakh-to-English, where the transfer learning strategy makes use of more resources. One possible explanation could be that, with concatenation plus fine-tuning, the system is trained mostly on data from the news domain, as the English monolingual data is extracted only from News Crawl. Finally, the repetition of steps 2-3 helped to further improve translation quality.

Morphological segmentation
Morphological segmentation is a strategy for segmenting words into sub-word units that consists in splitting them into a stem, which carries the meaning of the word, and a suffix or sequence of suffixes that contains morphological and syntactic information. When this strategy is used to segment the training corpus of an NMT system, it has been reported to outperform BPE for highly inflected languages such as Finnish (Sánchez-Cartagena and Toral, 2016), German (Huck et al., 2017) or Basque (Sánchez-Cartagena, 2018). In our submissions, we morphologically segmented the Kazakh text with the Apertium Kazakh morphological analyzer. For each word, the analyzer provides a set of candidate analyses made of a lemma and morphological information. Those analyses in which the lemma is a prefix of the word are considered valid analyses for segmentation and imply that the word can be morphologically segmented into the lemma and the remainder of the word. When there are multiple valid analyses for a word, they are disambiguated as explained below. When a word has no valid analyses for segmentation, we generate as many segmentation candidates as known suffixes match the word (plus the empty suffix, since one possible option is not segmenting at all). Known suffixes are extracted in advance from those words with a single valid analysis.
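The candidate generation procedure just described can be sketched as follows. The analyzer output is simplified to a list of lemmas per word, the function names are ours, and the Kazakh word forms in the usage note are illustrative examples rather than entries taken from our data.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Segmentation = Tuple[str, str]  # (stem, suffix); the suffix may be empty


def valid_lemmas(word: str, lemmas: Iterable[str]) -> List[str]:
    """Lemmas that are a prefix of the word, without duplicates."""
    return list(dict.fromkeys(l for l in lemmas if l and word.startswith(l)))


def collect_known_suffixes(analyses: Dict[str, Sequence[str]]) -> Set[str]:
    """Extract suffixes from words that have exactly one valid analysis."""
    suffixes = set()
    for word, lemmas in analyses.items():
        valid = valid_lemmas(word, lemmas)
        if len(valid) == 1 and len(valid[0]) < len(word):
            suffixes.add(word[len(valid[0]):])
    return suffixes


def segmentation_candidates(word: str,
                            lemmas: Sequence[str],
                            known_suffixes: Set[str]) -> List[Segmentation]:
    """Candidates from valid analyses, or from suffix matching as a fallback."""
    valid = valid_lemmas(word, lemmas)
    if valid:
        return [(lemma, word[len(lemma):]) for lemma in valid]
    # No valid analysis: the empty suffix (no segmentation) is always an option.
    candidates = [(word, "")]
    for suffix in sorted(known_suffixes):
        if suffix and len(suffix) < len(word) and word.endswith(suffix):
            candidates.append((word[: -len(suffix)], suffix))
    return candidates
```

For instance, if университеттің has the single valid lemma университет, the suffix тің becomes known and can later be matched against a word without valid analyses, such as мектептің. Whenever more than one candidate is returned, a disambiguation step is still required.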
Multiple segmentation candidates (either coming from multiple valid analyses or from suffix matching) are disambiguated by means of a strategy that relies on the semi-supervised morphology learning method Morfessor (Virpioja et al., 2013). We trained the Morfessor model on all the available Kazakh corpora listed in Tables 1 and 2. Finally, as suggested by Huck et al. (2017), we applied BPE splitting with a model learned on the concatenation of all training corpora after performing the morphological segmentation.
Table 5 depicts some examples of Kazakh words, their analyses and their morphological segmentation. The first word is the genitive form of университет (university). The morphological segmentation allows the NMT system to generalize to other inflected forms of the same word, while BPE does not split it because it is a rather frequent term in the corpus. The second word is an inflected form of the verb жаса (to do), although it is also analyzed as an inflected form of жасал due to an error in the analyzer. The Morfessor model preferred the wrong analysis, but the plain BPE segmentation made translation even more difficult for the MT system by choosing the prefix жас, which means young. BPE thus introduced more ambiguity, as the token жас can encode both the verb to do and the adjective young.
Table 6: Results obtained by the different strategies evaluated for integrating the Apertium English-to-Kazakh rule-based machine translation system into an NMT system. Scores of hybrid systems are shown in bold if they outperform the corresponding pure NMT system by a statistically significant margin.

Hybridization with rule-based machine translation
The Apertium platform contains an English-to-Kazakh RBMT system (Sundetova et al., 2015) that may encode knowledge that is not present in the corpora available in the constrained task. In order to take advantage of that knowledge, we built a hybrid system by means of multi-source machine translation (Zoph and Knight, 2016). Our hybrid system is a multi-source NMT system with two inputs: the English sentence to be translated, and its translation into Kazakh provided by Apertium. This very same set-up has been successfully followed in the WMT automated post-editing task. In order to assess the viability of this approach, we trained and automatically evaluated multi-source and single-source English-to-Kazakh systems on the concatenation of the genuine English-Kazakh parallel corpora and the backtranslation of the Kazakh monolingual corpora News Crawl and Wiki dumps.⁹
⁹ We backtranslated with the best system from Section 3.1.
Results, depicted in Table 6, show that the multi-source system is able to outperform the single-source one only with the RNN architecture (the difference is statistically significant for chrF++). Apertium output seems to be of very low quality according to the scores reported in the table. Despite that, the multi-source RNN is able to extract useful information from it. The poor performance of the multi-source Transformer architecture could be related to the low quality of the Apertium output. To prevent the errors in the Apertium translation from being propagated to the output, the decoder should focus mostly on the SL input. However, according to the analysis of attention carried out by Libovický et al. (2018), in the serial multi-source architecture of Marian the output seems to be built with information from all inputs. We plan to explore more multi-source architectures in the future. Due to the poor performance of the Transformer multi-source architecture, we used only the multi-source RNN in our submission, as explained in the next section.

Final submissions
We submitted a constrained and an unconstrained ensemble for the English-to-Kazakh direction. This section describes how the individual models of the ensembles were trained and selected, and presents the results of an automatic evaluation.
Training details. All the ensembled models were trained on the genuine parallel corpora, the pivot-backtranslated corpus, and the backtranslated corpus obtained in the 3rd iteration, in a similar way to what has been described in Section 3.2. Preprocessing steps and training parameters were those described in Section 2, with the following exceptions: we applied morphological segmentation to the Kazakh text as described in Section 4, we used the full newsdev2019 as the development corpus, and we oversampled the News Commentary parallel corpus for fine-tuning to match the size of the concatenation of all the other genuine English-Kazakh parallel corpora.
Ensemble building. Our constrained submission was an ensemble of 2 Transformer models and 2 RNN models. For each architecture, the 2 models were checkpoints from the same training run; thus, our submission only contained models from 2 independent training runs. In both cases, the first model in the ensemble was the last saved checkpoint of the main training run (which was carried out on the concatenation of all the corpora), after being fine-tuned on the genuine parallel corpora. The second model in the ensemble was the checkpoint of the main training run which, after being fine-tuned on the genuine parallel corpora and ensembled with the first model, maximized chrF++ on the development set. We gave the Transformer and RNN models different weights in the final ensemble, which were also optimized on the development set. Our unconstrained submission was created in a similar way, but the two RNN models were multi-source models such as those described in Section 5. Additionally, we built an ensemble of 5 independently trained Transformer models that could not be submitted due to time constraints.
Automatic evaluation. Table 7 shows the values of the BLEU and chrF++ automatic evaluation metrics obtained by our systems on the newstest2019 test set. In order to assess the impact of the enhancements applied, we also show scores for single models, and for alternatives without morphological segmentation and without the additional RBMT input. We can observe that morphological segmentation slightly improves the results. In line with the results in Section 5, adding the additional Apertium input to a single model also brings an improvement according to both evaluation metrics. However, that gain vanishes when we compare the ensembles, probably because the scores obtained by the RNN models are far below those obtained by the Transformer models. Moreover, the ensemble of 5 independently trained Transformers outperforms our submitted systems, which were ensembles of only 2 independent training runs.
Comparison with other teams. Table 7 also depicts the scores obtained by the 3 constrained systems submitted by other teams with the highest chrF++. In comparison with them, our constrained submission is ranked in 2nd position in terms of chrF++ and 3rd in terms of BLEU. Our ensemble of 5 Transformer models, built after the submission deadline, reaches the 1st position in terms of chrF++. There are no statistically significant differences for any of the evaluation metrics between our 5-Transformer ensemble and the best performing contestant.
Table 7: Results obtained by our submissions, single-model alternatives, and systems submitted by other teams, computed on newstest2019. There are no statistically significant differences for any of the evaluation metrics between our 5-Transformer ensemble and the NEU submission.

Concluding remarks
We have presented the Universitat d'Alacant submissions to the WMT 2019 news translation shared task for the English-to-Kazakh language pair. As it is a low-resource pair, we took advantage of parallel corpora from other language pairs via pivot backtranslation and transfer learning. We also iteratively backtranslated monolingual data and made the most of the noisy, crawled corpora after filtering them with automatic classifiers and language models. We morphologically segmented Kazakh text to improve the generalization capacity of the NMT system and successfully used multi-source machine translation to build a hybrid system that integrates the Apertium English-to-Kazakh RBMT engine. Our constrained submission was ranked 2nd in terms of chrF++. We plan to continue exploring the hybridization of NMT and RBMT. More multi-source Transformer architectures need to be evaluated to better fit the nature of the RBMT input. Another research line involves using RBMT to generate synthetic training data.