Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences

This paper presents the systems submitted by the Abu-MaTran project to the English-to-Finnish language pair at the WMT 2016 news translation task. We applied morphological segmentation and deep learning in order to address (i) the data scarcity problem caused by the lack of in-domain parallel data in the constrained task and (ii) the complex morphology of Finnish. We submitted a neural machine translation system, a statistical machine translation system reranked with a neural language model and the combination of their outputs tuned on character sequences. The combination and the neural system were ranked first and second respectively according to automatic evaluation metrics and tied for the first place in the human evaluation.


Introduction
This paper presents the machine translation (MT) systems submitted by the Abu-MaTran project to the WMT 2016 news translation task. We participated in the English-to-Finnish constrained task.
English-to-Finnish is a particularly challenging language pair for corpus-based MT because of the lack of in-domain parallel data (the only available parallel corpus in the shared task is Europarl) and the complex morphology of Finnish. The fact that the same root can be inflected in many different ways and that nouns can be joined together to build compound words exacerbates this lack of parallel data.
As in our last year's submission (Rubino et al., 2015), we used morphological segmentation (Pirinen, 2015) on the Finnish side in order to deal with data scarcity and reduce the size of the Finnish vocabulary. We also used character-level evaluation metrics during the development of our systems, which correlate better than word-based ones with human judgements according to the results of last year's metrics shared task (Stanojević et al., 2015) for English-to-Finnish.
When a Finnish sentence is morphologically segmented, it becomes much longer (in number of tokens) than its English counterpart. As a result, the distance between Finnish tokens that depend on each other to produce a correct translation also increases. We addressed this potential issue by introducing deep learning in our systems: we submitted a neural MT (NMT) system and a phrase-based statistical MT (SMT) system enhanced with a neural language model (LM). In the latter, we reduced the length of the Finnish segmented sentences by joining the most frequent sequences of morphs. We also submitted a system that combines the outputs of our best NMT and SMT systems and is tuned on character sequences.
The paper is organised as follows: the data and tools used are described in Section 2, while our NMT, SMT and combined submissions are presented in Sections 3, 4 and 5, respectively. The paper ends with some concluding remarks.

Datasets and Tools
We preprocessed the training corpora with scripts included in the Moses toolkit (Koehn et al., 2007). We performed the following operations: punctuation normalisation, tokenisation, true-casing and escaping of problematic characters. The truecaser is lexicon-based and it was trained on all the monolingual data. In addition, we removed sentence pairs from the parallel corpora where either side is longer than 80 tokens.
Table 2: English monolingual data, after preprocessing, used to train the LM of the Finnish-to-English SMT system we used to backtranslate the Finnish News Crawl monolingual corpora into English (see Section 3).
Since the Common Crawl Finnish monolingual corpus was obtained by crawling websites, we applied a set of additional preprocessing steps in order to remove as much noisy data as possible: (i) detecting sentences with an incorrect character encoding and re-encoding them with the right one; (ii) replacing XML entities with the characters they represent; (iii) removing sentences with a low proportion of alphabetic characters (less than 50%); (iv) removing short sentences (fewer than 3 alphabetic tokens); and (v) removing sentences whose first 18 tokens are equal to those in another sentence. The last filtering is necessary because it is relatively common in the corpus to find the same sentence with some segment missing at the end. If these lines were kept, the n-gram counts from which LM probabilities are estimated would be less reliable. As a result of these preprocessing steps, around 43 million sentences were removed. Table 1 shows the Finnish monolingual corpora we used together with their size and Table 3 shows the same information for the parallel corpora. We used an additional synthetic parallel corpus to train our NMT system, which was obtained by backtranslating the Finnish News Crawl corpora into English with an SMT system (see Section 3). The monolingual corpora used for training its LM are listed in Table 2.
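The filtering steps (ii) to (v) described above can be sketched as follows. This is only an illustrative sketch, not the scripts actually used: the function name clean_corpus and its parameters are hypothetical, the proportion of alphabetic characters is assumed to be computed over non-whitespace characters, an "alphabetic token" is assumed to be a token made up only of letters, and the first occurrence of a repeated 18-token prefix is assumed to be the one that is kept. Step (i), encoding repair, is omitted.

```python
import html


def clean_corpus(lines, min_alpha_ratio=0.5, min_alpha_tokens=3, prefix_len=18):
    """Yield the sentences that survive filtering steps (ii)-(v).

    All names and defaults are illustrative assumptions.
    """
    seen_prefixes = set()
    for line in lines:
        # (ii) replace XML/HTML entities with the characters they represent
        text = html.unescape(line.strip())
        tokens = text.split()
        chars = [c for c in text if not c.isspace()]
        if not chars:
            continue
        # (iii) drop sentences with a low proportion of alphabetic characters
        alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
        if alpha_ratio < min_alpha_ratio:
            continue
        # (iv) drop sentences with too few alphabetic tokens
        if sum(t.isalpha() for t in tokens) < min_alpha_tokens:
            continue
        # (v) drop sentences whose first `prefix_len` tokens were already seen
        # (sentences shorter than the prefix are compared as a whole)
        prefix = tuple(tokens[:prefix_len])
        if prefix in seen_prefixes:
            continue
        seen_prefixes.add(prefix)
        yield text
```

Because the function is a generator, it can be applied to a file object line by line without loading the whole corpus into memory; only the set of prefixes seen so far is kept.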
Throughout the paper we evaluate the systems we build in terms of three automatic evaluation metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and chrF1 (Popović, 2015). As the performance obtained in the development (newsdev2015) and validation (newstest2015) sets guides our decisions, we believe it is sensible to use three metrics with different underlying methodologies and that work on different elements (words and characters). Statistical significance of the difference between systems is computed with paired bootstrap resampling (Koehn, 2004) (p ≤ 0.05, 1 000 iterations).
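As an illustration of the significance test mentioned above, a minimal sketch of paired bootstrap resampling is given below. It is not the implementation used in our experiments: the function names are hypothetical, the metric is assumed to be a corpus-level function for which higher is better (an error rate such as TER would have to be negated), and the toy exact-match metric only stands in for BLEU, TER or chrF1.

```python
import random


def paired_bootstrap(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B.

    `metric(hyps, refs)` is any corpus-level score where higher is better.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; the SAME indices are used
        # for both systems, which is what makes the test "paired".
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples


def exact_match_rate(hyps, refs):
    """Toy corpus-level metric used only for illustration."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)
```

Under these assumptions, system A would be considered significantly better than system B at p ≤ 0.05 when the returned fraction is at least 0.95.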

Neural Machine Translation
NMT systems have been reported to outperform SMT systems for different language pairs (Sennrich et al., 2015a; Luong et al., 2015; Costa-Jussà and Fonollosa, 2016; Chung et al., 2016a). Unlike SMT, in which different models are trained independently and their weights are tuned jointly, in NMT all the components are jointly trained to maximise translation quality. NMT systems have a strong generalisation power because they encode words as real-valued vectors (similar words are close to each other in that vector space) and they are able to model long-distance phenomena thanks to the use of LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Chung et al., 2014) units. We followed the encoder-decoder architecture with attention proposed by Bahdanau et al. (2015).
NMT models are trained only from a parallel corpus, that is, they are not designed to make use of additional target-language (TL) monolingual corpora. Given the lack of in-domain parallel corpora available for English-Finnish, we trained our system on the concatenation of Europarl and a synthetic corpus obtained by backtranslating the in-domain monolingual Finnish corpora (News Crawl) from Finnish to English. Backtranslation has been reported to be a successful way of integrating TL monolingual corpora into an NMT system (Sennrich et al., 2015a). It was performed by means of a Finnish-to-English SMT system that followed the set-up of the rule-based morphologically segmented system from our last year's constrained submission (Rubino et al., 2015). It was trained on Europarl and the concatenation of the English monolingual corpora listed in Table 2.
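The construction of the training corpus described above can be illustrated with the short sketch below. The function build_training_corpus and the backtranslate callable are hypothetical placeholders: in our set-up the backtranslation was produced by a Finnish-to-English SMT system, which is replaced here by a toy word-for-word lookup.

```python
def build_training_corpus(parallel_pairs, mono_target, backtranslate):
    """Concatenate authentic pairs with synthetic (backtranslated) pairs.

    parallel_pairs: list of (english, finnish) sentence pairs
    mono_target:    list of monolingual Finnish sentences
    backtranslate:  callable mapping a Finnish sentence to English
                    (placeholder for a Finnish-to-English MT system)
    """
    # The synthetic source side is machine-translated, while the target
    # side is always authentic Finnish text.
    synthetic = [(backtranslate(fi), fi) for fi in mono_target]
    return list(parallel_pairs) + synthetic


def toy_backtranslate(sentence, lexicon=None):
    """Word-for-word stand-in for a real MT system (illustration only)."""
    lexicon = lexicon or {"talo": "house", "on": "is", "suuri": "big"}
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())
```

The point of the design is that noise introduced by the backtranslation system only affects the source side, so the decoder is still trained on fluent in-domain Finnish.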
Most of the NMT architectures in the literature can only operate with a fixed TL vocabulary (that ranges from 30 000 to 80 000 words, according to Jean et al. (2015)), since training and decoding computational complexity grows with its size. Although Jean et al. (2015) proposed an approach to reduce that complexity and hence use larger vocabularies, Sennrich et al. (2015b) showed that segmenting words into smaller units can also reduce complexity, increase effective vocabulary size and even improve translation quality. We followed the latter strategy. The evaluation of character-based NMT approaches (Ling et al., 2015; Costa-Jussà and Fonollosa, 2016; Chung et al., 2016b) was left as future work.
In the remainder of this section, we present the segmentation approach we followed together with the alternatives we evaluated and we describe the training and decoding set-up of our NMT system, including the strategy followed to translate out-of-vocabulary words (OOVs).

Word Segmentation
Existing word segmentation approaches for NMT (Sennrich et al., 2015b) rely on frequencies of sequences of characters in the training corpus. We studied whether using linguistic information to segment the training corpus allows the neural network to generalise better: we applied the rule-based morphological segmentation provided by Omorfi (Pirinen, 2015) for Finnish. It splits words into morphs, that is, minimal segments carrying semantic or syntactic meaning.
We evaluated the segmentation schemes listed below. Table 4 depicts an example of the effect they produce on a Finnish sentence.
• No segmentation at all.
• Byte pair encoding (BPE) on both the source language (SL) and the TL. This is one of the best performing strategies proposed by Sennrich et al. (2015b). It consists of initially segmenting each word into characters, and iteratively joining the most frequent pair of segments in the training corpus. We applied it independently to the SL and TL sides of the parallel corpus. We performed 60 000 join operations on each language.
• BPE only on the TL side of the parallel corpus, since Finnish is morphologically more complex than English.
• Morphological segmentation with Omorfi on the TL.
• BPE on the TL using the morphs produced by Omorfi as the starting point. We evaluated the effect of performing 1 000, 10 000, 25 000 and 50 000 join operations. Morphological segmentation produces an average sentence length significantly higher than that of the English side of the parallel corpus. After performing 1 000 operations, average sentence lengths are similar: we reduce vocabulary size without significantly increasing sentence length. As the number of operations increases, average sentence length is closer to that of the unsegmented approach.
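The join operations used in the BPE-based schemes above can be sketched as follows. This is a simplified illustration rather than the implementation by Sennrich et al. (2015b): it omits the end-of-word marker, the function names are hypothetical, and the word frequencies in the usage example are invented. The same function covers both variants, since only the initial segmentation changes: characters for plain BPE, Omorfi morphs for the combined scheme.

```python
from collections import Counter


def merge_pair(segments, pair):
    """Join every adjacent occurrence of `pair` in a tuple of segments."""
    out, i = [], 0
    while i < len(segments):
        if i + 1 < len(segments) and (segments[i], segments[i + 1]) == pair:
            out.append(segments[i] + segments[i + 1])
            i += 2
        else:
            out.append(segments[i])
            i += 1
    return tuple(out)


def learn_bpe(segmented_words, num_merges):
    """Learn up to `num_merges` join operations.

    segmented_words: dict mapping a tuple of initial segments
                     (characters or morphs) to the word frequency.
    Returns the list of learned joins and the re-segmented vocabulary.
    """
    vocab = dict(segmented_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for segments, freq in vocab.items():
            for left, right in zip(segments, segments[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically so that
        # the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for segments, freq in vocab.items():
            merged = merge_pair(segments, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges, vocab
```

For example, starting from the morphs ("perus", "asia", "n") with an invented frequency of 5 and ("asia", "n") with an invented frequency of 3, the first join learned is ("asia", "n"), so the case marker is attached to the second noun while the compound boundary is preserved.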
For each of these segmentation schemes, we trained an NMT system on Europarl for 5 days (a model was saved every 3 hours of training), chose the model that achieved the highest translation quality on newsdev2015 and evaluated it on newstest2015. The remainder of the training and decoding parameters were the same ones we used in our submission (described in Section 3.2). Table 5 depicts the results of the evaluation together with the vocabulary size of the NMT system and the proportion of tokens in the training corpus that belong to the vocabulary. Results show that, despite the fact that the BPE-based systems have full coverage of the training corpus, their performance is below that of the unsegmented alternative. These results are probably related to the fact that the domains of the training and testing corpora do not match, and words in the test set that do not contain subsegments observed in the training corpus are segmented into very long sequences. The Omorfi-based approach, which is domain agnostic, is close to the unsegmented alternative in terms of BLEU and TER (there is no statistically significant difference between them) and clearly outperforms it in terms of the character-level metric chrF1. This shows the effect of segmentation: the system is probably producing a better translation for some parts of compound words and/or producing lemmas that can be found in the reference, but inflected in a different way. Finally, the combination of BPE with morphological segmentation does not bring a clear improvement. In view of the results, we decided to segment the TL side of the training corpus with Omorfi in our submission.
Table 4: Example of the application of the different segmentation schemes described in Section 3.1 to a Finnish sentence. Arrows represent boundaries between the morphs in which a word is split. Note how the compound word perusasian is segmented by the different schemes: Omorfi splits it into perus ("basic"), asia ("thing, affair") and the case marker -n, while the application of BPE over it joins the marker to the second noun. The pure BPE scheme, however, fails to segment perusasian correctly.
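The proportion of training tokens covered by the NMT vocabulary, mentioned in the evaluation above, can be computed as in the following sketch. The function name is hypothetical and whitespace tokenisation of already segmented text is assumed.

```python
from collections import Counter


def vocabulary_coverage(sentences, vocab_size):
    """Return the `vocab_size` most frequent tokens and the proportion of
    running tokens in `sentences` that belong to that vocabulary."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    total = sum(counts.values())
    if total == 0:
        return vocab, 0.0
    covered = sum(c for tok, c in counts.items() if tok in vocab)
    return vocab, covered / total
```

With a segmentation that yields fewer distinct token types, the same vocabulary size covers a larger share of the running tokens, which is the trade-off the different schemes explore.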

Training and Decoding Details
We generally followed the training set-up by Sennrich et al. (2015b). We defined a hidden layer size of 1 000 and an embedding layer size of 620. We used Adadelta (Zeiler, 2012) with a minibatch size of 80, and reshuffled the training set between epochs. We applied gradient clipping (Pascanu et al., 2013) with a cutoff of 1.0. The vocabulary contained the 50 000 most frequent SL tokens and the 50 000 most frequent TL tokens in the training corpus. We trained our system for 8 days (a model was saved every 3 hours). 7 We chose the 4 models that produced the highest BLEU score on newsdev2015. The training of these 4 models continued for 12 hours without changing the values of the embedding layers. After that, we translated the test set with an ensemble of these 4 models. 8
7 Training was performed on an NVIDIA Tesla K20 GPU.
8 We used a beam size of 12 for beam search and normalised the probability by sentence length.
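Three of the decoding-time choices described above (selecting the best checkpoints on the development set, ensembling, and normalising hypothesis scores by length) can be illustrated with the sketch below. All function names are hypothetical, and averaging the output distributions with an arithmetic mean is an assumption made for the example: the text above does not specify how the ensemble combines the models.

```python
def select_checkpoints(dev_bleu, k=4):
    """Return the names of the k checkpoints with the highest dev BLEU."""
    return sorted(dev_bleu, key=dev_bleu.get, reverse=True)[:k]


def ensemble_step(distributions):
    """Average the next-token distributions predicted by several models.

    Each distribution is a dict token -> probability over the same tokens.
    Arithmetic averaging is an assumption of this sketch.
    """
    tokens = distributions[0].keys()
    return {
        tok: sum(d[tok] for d in distributions) / len(distributions)
        for tok in tokens
    }


def normalised_score(token_log_probs):
    """Length-normalised log-probability of a hypothesis."""
    return sum(token_log_probs) / len(token_log_probs)
```

Length normalisation matters because an unnormalised sum of log-probabilities penalises every additional token: a 2-token hypothesis with log-probabilities of -1.0 per token would beat a 6-token hypothesis with -0.5 per token, although the latter is more probable per token.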

Dealing with Unknown Words
In order to translate OOVs, 9 we followed an enhanced version of the approach by Jean et al. (2015, Sec. 3.3). OOVs in the training corpus were replaced with the special token UNK, as were those in the SL sentences to be translated by the NMT system. As a result, the output contained some UNK tokens.
In order to replace the UNK tokens generated by the model, we identified the most likely SL word to which the unknown TL word was aligned. If the SL word started with an uppercase letter, we copied it to the output. Otherwise, we replaced the UNK token with its translation according to a bilingual dictionary obtained from the parallel corpus with fast_align (Dyer et al., 2013).
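The replacement step just described can be sketched as follows. This is an illustration under several assumptions: the function name and data layout are hypothetical, the aligned SL word is simply the one with the highest attention weight, and copying the SL word when it is missing from the dictionary is a fallback chosen for the example.

```python
def replace_unk(target_tokens, source_tokens, attention, dictionary, unk="UNK"):
    """Replace UNK tokens in the output using attention-based alignment.

    attention[t][j] is the attention weight of source position j when
    target position t was generated; `dictionary` maps SL words to
    TL translations.
    """
    output = []
    for t, token in enumerate(target_tokens):
        if token != unk:
            output.append(token)
            continue
        weights = attention[t]
        # SL word most strongly aligned with this target position
        src_idx = max(range(len(source_tokens)), key=lambda j: weights[j])
        src_word = source_tokens[src_idx]
        if src_word[:1].isupper():
            # Likely a named entity: copy it unchanged
            output.append(src_word)
        else:
            # Dictionary lookup; copying when the word is missing is an
            # assumption of this sketch
            output.append(dictionary.get(src_word, src_word))
    return output
```

The capitalisation test is only meaningful after truecasing, since otherwise every sentence-initial word would be treated as a named entity.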
For each UNK token, Jean et al. (2015) selected the SL word with the highest alignment probability according to the attention mechanism, while our enhanced approach combines the attention mechanism and a heuristic that aims at preserving the named entities in the SL sentence. We considered the top 5 SL words with the highest attention alignment probability for each UNK token, 10 and, for each sentence, we chose the set of SL words that ensured that the maximum number of words that start with an uppercase letter in the SL sentence were included in the translation. 11 Table 6 shows the results of the automatic evaluation of our submitted NMT system (in bold; as described in the previous section, it is an ensemble of 4 models) on newstest2016. We also evaluated the simpler OOV translation strategy by Jean et al. (2015), and the best NMT individual model according to BLEU on the development set. Our enhanced strategy for OOV translation resulted in a statistically significant improvement in terms of BLEU and chrF1. Note also the huge impact of model ensembling.
9 We define OOVs as those words either not present in the training corpus or present but not frequent enough to be part of the NMT system vocabulary.
10 We ignored those SL words whose probability was 4 times lower than that of the most probable SL word.
11 We relied on the capitalisation of the first character to detect a named entity. We carried out a small study in order to test the accuracy of this approach: from 100 capitalized words (after truecasing) randomly chosen from the English side of newstest2016, 76 were named entities that do not need to be translated into Finnish (person names, place names, etc.) and 24 needed to be translated (days of the week, country names, demonyms, etc.). However, when we analyzed only those capitalized SL words that were not part of the vocabulary of the NMT system (and hence they were likely to produce an UNK symbol), the accuracy increased: 23 out of 24 words were named entities that do not need to be translated.
Table 5: Results of the evaluation of different word segmentation schemes on an NMT system trained on Europarl. The vocabulary size of the NMT system is depicted, as well as the proportion of tokens covered in the training corpus. Scores displayed correspond to the evaluation on newstest2015. The best score for each metric is shown in bold. An arrow pointing upwards (↑) means that the corresponding system outperforms the system without segmentation by a statistically significant margin, while an arrow pointing downwards (↓) means the opposite: the system without segmentation wins.
Table 6: Results of the evaluation on newstest2016 of our NMT submission (in bold), the simpler strategy for translating unknown words by Jean et al. (2015, Sec. 3.3) (labelled as most probable SL word) and our best individual NMT model. The best score for each metric is shown in bold. An arrow pointing upwards (↑) means that the corresponding system outperforms the system in the previous row by a statistically significant margin.

Statistical Machine Translation
Our work on SMT systems built upon our last year's best constrained individual system (Rubino et al., 2015). This was a phrase-based SMT system where the Finnish data was segmented to morphs with Omorfi (Pirinen, 2015). It also used two additional models: an Operation Sequence Model (Durrani et al., 2011) and a Bilingual Neural Language Model (Devlin et al., 2014), as well as three reordering models: word- and phrase-based and hierarchical (Koehn et al., 2005; Galley and Manning, 2008).
This year's SMT systems used the same models and datasets, except for the LMs, which this time were log-linearly interpolated and used the additional corpus available (Common Crawl, cf. Table 1). We built three SMT systems, which share the same models and data, with the only difference being the segmentation used in the Finnish data:
• No segmentation.
• Segmentation on morphs (Omorfi).
• Segmentation on morphs followed by joining the most frequent sequences (Omorfi + BPE).
In the latter we joined the most frequent sequences (1 000 operations) so that the length of the Finnish side (measured in number of tokens) becomes similar to that of the English side. As previously mentioned in Section 3.1, this is a trade-off to avoid both having a big vocabulary (as is the case without segmentation), and having to deal with long-distance phenomena (as is the case with Omorfi). Table 7 shows the results of these three SMT systems. They corroborate the results we obtained last year, i.e. morphological segmentation outperforms the unsegmented system by a statistically significant margin across all the automatic metrics. We also observe that joining the most frequent morphs results in a further improvement on BLEU (2.3% relative), and small changes in TER (−0.5%) and chrF1 (−0.3%).
Table 7: Results of the evaluation on newstest2016 of the SMT systems built. The best score for each metric is shown in bold. An arrow pointing upwards (↑) means that the corresponding system outperforms the system without segmentation by a statistically significant margin.

Reranking
We reranked the n-best list (top 500 distinct translations) produced by our best SMT system (Omorfi + BPE) using two neural LMs: left-to-right (i.e. trained in the same direction as the LMs included in the SMT system) and right-to-left (i.e. reverse direction). We hypothesise that the latter LM might bring a higher improvement as the sequences this LM is trained on have not been used by the SMT decoder. 12 Both neural LMs were trained on in-domain data (a subset of 4 million sentences 13 randomly selected from News Crawl) with the rwthlm toolkit (Sundermeyer et al., 2014). The main parameters we used are as follows: vocabulary limited to the 50 000 most frequent tokens, 2 layers (linear and LSTM), both of size 200 and 1 000 word classes, generated with mkcls. Table 8 shows the results of reranking using left-to-right and right-to-left neural LMs on their own and jointly (row bidirectional). Reranking with left-to-right or the right-to-left LMs on their own does not result in a substantial improvement. However, when both LMs are used jointly we observe better scores for all the metrics: 1.7% relative improvement for BLEU, −0.5% for TER and 0.1% for chrF1.
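The bidirectional reranking described above can be sketched as a weighted combination of the decoder score and the two LM scores. The sketch does not reproduce the rwthlm toolkit: the LMs are represented by hypothetical callables that return a log-probability for a token sequence, and the feature weights are illustrative values that would in practice be tuned on a development set.

```python
def rerank(nbest, lm_left_to_right, lm_right_to_left, weights):
    """Return the best (tokens, decoder_score) entry of an n-best list.

    nbest: list of (tokens, decoder_score) pairs
    lm_*:  callables returning the log-probability of a token list
           (placeholders for trained neural LMs)
    weights: dict with keys "decoder", "l2r" and "r2l"
    """
    def combined(item):
        tokens, decoder_score = item
        # The right-to-left LM scores the sentence in reverse order
        reversed_tokens = list(reversed(tokens))
        return (
            weights["decoder"] * decoder_score
            + weights["l2r"] * lm_left_to_right(tokens)
            + weights["r2l"] * lm_right_to_left(reversed_tokens)
        )

    return max(nbest, key=combined)
```

Setting the weight of one LM to zero gives the unidirectional variants, so the three reranking configurations compared in the text correspond to three weight settings of the same function.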

System Combination
As we have seen in the previous two sections, our best NMT system outperforms our best SMT system by a wide margin. These two systems are typologically different, and thus, despite the gap in performance, we might expect them to have complementary strengths. We therefore explored combining both systems in order to answer the following question: can SMT, despite the gap in performance, still be useful, when used jointly with NMT, to improve upon NMT on its own? We combined the outputs produced by the best NMT and SMT systems with MEMT (Heafield and Lavie, 2010). We used default settings, except for radius (5), following empirical results obtained on newsdev2015. The LM used in the combination was built on the concatenation of all the Finnish monolingual corpora available, cf. Table 1.
12 Because of the way SMT decoders work they can use left-to-right LMs but not reverse LMs.
13 Due to time constraints.
Table 8: Results of the different reranking strategies applied to the best SMT system (Omorfi + BPE) on newstest2016. The best score for each metric is shown in bold, as is the system submitted. An arrow pointing upwards (↑) means that the corresponding system outperforms the system without reranking by a statistically significant margin.
As the systems combined use different segmentations (Omorfi in NMT and Omorfi followed by BPE in SMT), we joined the morphs before combining them. Therefore the tuning of the system combination was performed without segmentation. Since chrF1 was found to correlate well with human evaluation for Finnish last year (Stanojević et al., 2015), we explored tuning on this metric, alongside tuning on BLEU.
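To make the tuning objective concrete, a simplified sentence-level character n-gram F-score in the spirit of chrF (Popović, 2015) is sketched below. It is not the implementation used for tuning: removing whitespace before extracting n-grams, skipping n-gram orders longer than either string, and the function names are assumptions of this sketch.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Simplified character n-gram F-score (beta=1 gives chrF1)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # order n is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

A word-level metric gives no credit when the hypothesis contains perusasia and the reference contains perusasian, whereas the character-level score remains high because most character n-grams match. This is the behaviour that makes such a metric attractive for a morphologically rich TL.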
Finally, we reranked the n-best list of the system combination (top 500 translations) with the same procedure used to rerank the best SMT system (cf. Section 4.1). While the best SMT system was reranked on segmented data (Omorfi + BPE), the output of the system combination is not segmented. Therefore, similarly to what we did for system combination, we explored tuning the reranking on chrF1. Table 9 shows the results of system combination and its rerankings. In system combination, we observe that tuning on character sequences results in considerably better scores compared to tuning on BLEU. That said, the output produced by the best system combination system without reranking (i.e. tuned on chrF1) is still worse than the one produced by the NMT system alone according to the automatic metrics (−3.4% relative on BLEU and −0.1% on chrF1) except for TER (2.3% relative improvement).
Overall, reranking the system combination 14 yields better scores, tuning both on BLEU and chrF1, with the latter leading to the best results across all metrics (except TER). This system outperforms the NMT system in terms of TER and chrF1 and it is the system combination output that we submitted.
14 We reranked the system combination that performed best, i.e. the one tuned on chrF1.
Table 9: Results of the system combination experiments on newstest2016. The best score for each metric is shown in bold, as is the system submitted. An arrow pointing upwards (↑) means that the corresponding system outperforms the best NMT system by a statistically significant margin.

Conclusions
Our participation in the WMT 2016 news translation shared task focused on tackling data scarcity in English-to-Finnish translation with the help of morphological segmentation and deep learning. Our experiments showed that rule-based morphological segmentation improves translation quality when applied to both NMT and SMT. In the latter, we had to adapt the segmentation strategy to avoid generating a training corpus with very different SL and TL sentence lengths. By contrast, the difference in sentence length was not a relevant factor in NMT.
The use of deep learning approaches to MT allowed us to obtain a remarkable improvement over SMT. Our best NMT system outperforms our best SMT system by a huge margin and their combination is only slightly better than the NMT system according to automatic evaluation. Our best SMT system also includes a neural LM but our results suggest that pure neural MT approaches constitute an important breakthrough.
Tuning on character sequences (chrF1 metric), 15 used for system combination, resulted in better performance than tuning on the de facto standard BLEU, corroborating the results seen in human evaluation, i.e. better correlation.
Our combined and NMT submissions were ranked first and second respectively (both in terms of BLEU and TER) in the English-to-Finnish news translation task automatic evaluation and they tied for the first place in the human evaluation.
15 The code has been made available as part of Joshua and can be found at https://github.com/apache/incubator-joshua/pull/27