Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data

This paper presents the systems submitted by the University of Groningen to the English– Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsupervised and rule-based), given the agglutinative nature of Kazakh, (ii) data from two additional languages (Turkish and Russian), given the scarcity of English–Kazakh data and (iii) synthetic data, both for the source and for the target language. Our best submissions ranked second for Kazakh→English and third for English→Kazakh in terms of the BLEU automatic evaluation metric.


Introduction
This paper presents the neural machine translation (NMT) systems submitted by the University of Groningen to the WMT 2019 news translation task. 1 We participated in the English↔Kazakh (henceforth referred to as EN↔KK) constrained tasks.
Because of the inherent characteristics of this language pair and the current state-of-the-art of related techniques, we focused on two main research questions (RQs): • RQ1. Does morphological segmentation help? Recent research in NMT for agglutinative languages found that morphological segmentation outperforms the most widely used segmentation technique, byte-pair encoding (BPE, using character sequence frequencies) (Sennrich et al., 2016). Rule-based segmentation improved English-to-Finnish translation (Sánchez-Cartagena and Toral, 1 http://www.statmt.org/wmt19/ translation-task.html 2016) and unsupervised segmentation improved Turkish-to-English translation (Ataman et al., 2017). Because Kazakh belongs to the same language family as Turkish, the work by Ataman et al. (2017) is particularly relevant. Their training data had fewer than 300,000 sentence pairs and they trained an NMT system under the recurrent sequenceto-sequence with attention paradigm (Bahdanau et al., 2015).
Our training data is considerably bigger and we use a nonrecurrent attention-based system (Vaswani et al., 2017). Does the advantage of morphological segmentation over BPE also hold in our experimental setup?
• RQ2. Does the use of additional languages improve outcomes? Due to the scarcity of parallel data for EN-KK, we investigate if using data from two additional languages is useful, Russian (RU) and Turkish (TR). Even though RU is not related to either EN or KK, it seems a sensible choice due to the availability of large amounts of EN-RU and RU-KK parallel data. TR is related to KK and there are limited amounts of EN-TR data available. Does this additional data improve the performance, and is more data from an unrelated language (RU) better than less data from a related language (TR)?
The rest of the paper is organized as follows. Section 2 describes the datasets and tools used. Then Section 3 details our experiments. Finally, Section 4 outlines our conclusions and plans for future work.

Datasets and Tools
We preprocessed all the corpora used (training, validation and test sets) with scripts from the Moses toolkit (Koehn et al., 2007). The following operations were performed sequentially: punctuation normalisation, tokenisation, 2 truecasing and escaping of problematic characters. The truecaser was lexicon-based and it was trained on all the monolingual data available for each language. In addition, we removed sentence pairs where either side was empty or longer than 80 tokens from the parallel corpora . Tables 1 to 4 show the parallel datasets used for training for each translation direction after preprocessing. The corpora Kazakhtv (EN-KK) and crawl (KK-RU) were provided with sentence-level scores; we sorted their files according to these scores and a native KK speaker proficient in both EN and RU identified a threshold where alignments were roughly 90% correct. This led to discarding the bottom 27% of the data for EN-KK's Kazakhtv and the bottom 3% for KK-RU's crawl.

Words (M) Corpus
Sentences (   All our NMT systems are trained with Marian (Junczys-Dowmunt et al., 2018). 3 We used the transformer model type (Vaswani et al., 2017)   in all experiments, except for a few experiments where the training data was very limited, where we used the s2s model type (Bahdanau et al., 2015). During development, we evaluated our systems on the development sets provided. We used two automatic evaluation metrics: BLEU (Papineni et al., 2002) and CHRF (Popović, 2015). CHRF is our primary evaluation metric for EN→KK, due to the fact that this metric has been shown to correlate better than BLEU with human evaluation when the target language is agglutinative (Stanojević et al., 2015). BLEU is our primary evaluation metric for KK→EN systems, as the correlations with human evaluation of BLEU and CHRF are roughly on par for EN as the target language. Prior to evaluation the MT output is detruecased and detokenized with Moses' scripts.

Cyrilization and Turkish
Since KK is a low-resourced language, multilingual NMT (Johnson et al., 2017) was used. Following Neubig and Hu (2018), we have chosen TR as a helper source language, because it is related to KK (both belong to the same language family) and TR is higher-resourced than KK. However, TR uses a Latin-script alphabet, while KK uses a Cyrillic-script alphabet, which means their vocabularies do not match as they are. For this reason, we decided to transliterate TR into Cyrillic (cyrillization). However, some characters in KK's alphabet are not present in existing transliterators. Therefore, we created a cyrillizer that matches KK's alphabet exactly.
We trained a {KK, TR}→EN system in two steps. First, we use as training data the concatenation of the EN-KK and EN-TR corpora (Tables 1 and 4) and when the model converged, we resume training using only the EN-KK dataset. We compared models that used the original TR versus cyrillized. These models were trained with the s2s architecture using 32,000 joining operations in BPE and dropout of 0.05.
Training data BLEU EN-KK 6.61 + EN-TR 11.15 + cyrillizer 10.34 As it is shown in Table 5, the addition of EN-TR data proves very beneficial (absolute improvement of 4.5 BLEU points), which is not surprising since the amount of training data more than doubles (cf. Tables 1 and 4). However, cyrillising TR decreases the BLEU score by 0.8 points.

Backtranslation and Russian
Given the small amount of EN-KK parallel data (see Table 1) and the large amount of EN-RU and KK-RU datasets, we introduced RU as a pivot language, using backtranslation (Sennrich et al., 2015) to derive bigger datasets where the source side is synthetic. For KK→EN, we trained a RU→KK auxiliary system on the available KK-RU data (Table 3), and used this to translate the RU portion of the EN-RU (Table 2) data into KK, creating a synthetic EN-KK' dataset. This was then used, along the original EN-KK data (Table 1) to train the KK→EN model.
For EN→KK, we trained a RU→EN auxiliary model on the available EN-RU data, and used this model to translate the RU portion of the KK-RU data into EN, creating a synthetic EN'-KK dataset. This synthetic dataset, alongside the original EN-KK data, was then used to train the EN→KK model. Table 6 shows the results for EN→KK and KK→EN without and with the backtranslated data. The addition of backtranslated data results in massive improvements: +17.9 CHRF points for EN→KK and +14.2 BLEU points for KK→EN. This is expected given the very small size of EN-KK data and the much larger EN-RU and KK-RU datasets. The improvements are considerably larger than those obtained with additional EN-TR data (see Table 5).

Corpus Filtering and Target Synthetic Data
Since most of our training data is crawled, we applied corpus filtering to remove noisy sentence pairs. Following Artetxe and Schwenk (2018a), we removed sentences shorter than 3 words and longer than 80 words, and sentence pairs where either sentence is classified as another language using the FastText language identifier (Joulin et al., 2016a,b). 4 We also removed sentence pairs with a token overlap of 50% or higher.
We identify and remove misaligned sentence pairs (where the meanings of the source and target sentences do not match), using the LASER system, a 93-language BiLSTM encoder (Artetxe and Schwenk, 2018b). 5 This encodes the sentences in each side, and uses the cosine similarity between the embeddings of the two sentences as a filtering threshold (where sentences below the threshold are removed).
This filtering is applied after backtranslation (see Section 3.2). For KK→EN, we filter the EN-KK' data, i.e. the EN-RU corpora whose RU side had been translated into KK. The thresholds (determined manually, as previously mentioned in Section 2) and number of sentence pairs kept are shown in Table 7.

Corpus
Threshold  We quantify the impact on translation performance of each filtering step, cumulatively, in Table  8. Each filtering step improves the BLEU score, corroborating previous research, e.g. (Koehn et al., 2018), that has shown that noisy sentence pairs indeed cause a drop in translation performance.  In the opposite direction, EN→KK, we filter the EN'-KK data, i.e. the RU-KK corpora whose RU side has been translated into EN. The threshold and number of sentence pairs kept are shown in Table 9.

Corpus Threshold Pairs left (k)
Crawl 0.1463 4494.10 By manual inspection, we noticed that the biggest dataset used for EN→KK (the KK-RU crawl corpus, see Table 3) is domain-specific and rather unrelated to the domain of the test set (news). Due to this, we decided to experiment with target synthetic data by translating the EN-RU corpora, which are not domain-specific, into KK and adding a subset of the resulting EN-KK' data to our EN→KK system. We experimented with two similarity thresholds: a more conservative one (0.8) and a less conservative one (0.75). The thresholds and number of sentence pairs kept are shown in Table 10.   Table 11 shows the impact of adding target synthetic data on translation performance. Adding a small amount using a conservative threshold (0.8) results in an absolute improvement of 1.15 CHRF points. Adding more data using a less conservative threshold (0.75) results in a bigger improvement of 1.6 points. An even lower threshold was not tested due to time constraints.

Segmentation
Data is segmented with BPE (Sennrich et al., 2016) on all the languages involved in our experiments (EN, KK, RU and TR). In addition, we perform two types of morphological segmentation on KK: unsupervised and rule-based. Unsupervised morphological segmentation is performed with LMVR (Ataman et al., 2017), 6 a variant of Morfessor (Virpioja et al., 2013) that allows a fixed vocabulary size to be defined. LMVR was trained on the KK side of the RU-KK parallel data as well as on the monolingual KK data. We experimented using vocabulary sizes of 8, 16, 24, and 32 thousand. The trained LMVR models are used to segment the KK portion of the RU-KK data and the synthetic KK derived from the EN-RU data created with a RU-KK system (see Section 3.2).
For rule-based segmentation, Apertium-kaz (Washington et al., 2014) was used. 7 A transducer that provides multiple segmentation variants was set up four our purpose, 8 from these variants we decided to pick the one that segments into the smallest units, because this one, as observed by manual inspection, tends to be correct more often. Some segmentations do not correspond to the original word when joined, which we attribute to the fact that Apertium is not doing pure segmentation but also analysis. We do not pick these variants. We also observed that some words were out-of-vocabulary (OOV), i.e. not found in Apertium's transducer, so those were left unsegmented.
As can be seen in Table 12, Apertium segmenter leads to lower automatic metric scores, while BPE and LVMR are on par. This could be attributed to the morphological ambiguity issues described above and to the fact that some words were not segmented (OOV). Besides these quantitative results, we also performed qualitative analyses of the segmentations. Table 13 shows examples of words that result in ambiguous segmentations with Apertium. Table 14 shows a KK sentence segmented with BPE and LVMR. Morphological segmentation results in a better segmentation, which has a direct impact on the quality of the resulting EN translation.

Final Submissions
We took the best performing systems from previous experiments and carried out fine-tuning by resuming training after convergence using solely the EN-KK data (i.e. without any data whose source or target is synthetic). Finally, we ran ensembles of the best performing systems (with and without fine-tuning) and chose those that perform best on the development set. Those constitute our submissions to the shared task.
For KK→EN, we consider systems segmented with BPE and with LVMR since their BLEU scores are roughly on par: 22.26 and 22.36, respectively. The fine-tuned KK→EN system with BPE segmentation reaches 23.11. We built an ensemble on four BPE-based models, the two top performing ones without fine tuning (21.9 and 22.26) and the two top performing ones with fine tuning (22.99 and 23.11). The ensemble attains 23.37. We then tried different length-penalty values for the decoder (parameter normalize in Marian), using 0.9 (instead of the default 0.6) we reach 23.47.
The fine-tuned KK→EN with LVMR reaches a BLEU score of 23.26, thus slightly outperforming the fine-tuned system with BPE (23.11). We also performed fine-tuning including the synthetic data but including the non-synthetic data four times (i.e. synthetic to non-synthetic ratio of 1:4). This system reaches 22.65. We built an ensemble of the two fine-tuned models. This ensemble achieves a BLEU score of 23.71, which using a length normalisation penalty of 0.9 increases to 23.84.
For EN→KK we submitted systems based on BPE segmentation only. Our best of these systems achieves 47.27 CHRF while the best LVMR-based system yields 45.27. 9 We build an ensemble made of five models: the two top performing ones using target synthetic data with threshold 0.8 (CHRF scores 46.48 and 46.79), the two top performing ones using target synthetic data with threshold 0.75 (CHRF scores 47.07 and 47.27), and the top performing fine-tuned model with threshold 0.75 (CHRF score 47.57). The ensemble attains a CHRF score of 48.43.

Conclusions
This paper has reported on the systems submitted by the University of Groningen to the English↔Kazakh translation directions of the news shared task at WMT 2019.
Our results show quantitative evidence that, for an agglutinative language such as Kazakh, morphological segmentation is on par with segmentation based on the frequency of character sequences (in terms of automatic evaluation metrics) and qualitative evidence that it can result in better translations due to segmenting at the right morpheme boundaries. In addition, we show that the addition of data from an additional language, be it related or not, improves the performance notably, corroborating previous results. Finally, the use of synthetic data (both for the source and target languages), filtered with a state-of-the-art system based on language-independent similarity, improved the performance of our systems further.
As for future work, we plan to work along three lines. First, related to morphological segmentation, we note that Kazakh uses vowel harmony, which should be useful to model as part of the segmentation. Second, we would like to explore the contribution of synthetic target data in further detail. Third, given the unexpected negative results of cyrillization, we plan to analyse cyrillization's effects in detail.  In addition, the regional administration and the Kyzylorda State Universum named after the Fund named after the President of the Republic of Kazakhstan are ready to provide assistance in the prevention of the threat.
According to the Governor's Office of the region and the leadership of the Kyzylorda State University named after the Foundation of the First President of Kazakhstan, such devices are ready to help in the prevention of the threat. English reference Regional Akimat and Management of Kyzylorda State University named after Korkyt ata proposed to fabricate such safety devices assisting in prevention of danger in large quantities. Table 14: Segmentation examples of BPE and unsupervised morphological segmentation (LVMR) systems for KK→EN. Arrows represent boundaries between the morphs in which a word is split. Note that the word "óíèâåðñèòåòiíi" is segmented differently in both systems. The MT system with LVMR segmentation translates it correctly as "University", while the MT system with BPE segmentation produces "Universum" because of incorrect segmentation. This word, its segmentations and its translations are underlined.