Morphological Word Embeddings for Arabic Neural Machine Translation in Low-Resource Settings

Neural machine translation has achieved impressive results in the last few years, but its success has been limited to settings with large amounts of parallel data. One way to improve NMT for lower-resource settings is to initialize a word-based NMT model with pretrained word embeddings. However, rare words still suffer from lower-quality word embeddings when trained with standard word-level objectives. We introduce word embeddings that utilize morphological resources, and compare them to purely unsupervised alternatives. We work with Arabic, a morphologically rich language with available linguistic resources, and perform Arabic-to-English MT experiments on a small corpus of TED subtitles. We find that word embeddings utilizing subword information consistently outperform standard word embeddings on a word similarity task and as initialization of the source word embeddings in a low-resource NMT system.


Introduction
Neural machine translation (Bahdanau et al., 2014; Sutskever et al., 2014) has recently become the dominant approach to machine translation. However, the standard encoder-decoder models with attention have been shown to perform poorly in low-resource settings (Koehn and Knowles, 2017), a problem which can be alleviated by initialization of parameters from an NMT system trained on higher-resource languages (Zoph et al., 2016). An alternative way to initialize parameters in a low-resource NMT setup is to use pretrained monolingual word embeddings, which are quick to train and readily available for many languages.
There is a large body of work on word embeddings.
Popular approaches include word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). These have been shown to perform well in word similarity tasks and a variety of downstream tasks, but they have been evaluated primarily on English, and the learned representations for rare words are of low quality due to sparsity. For morphologically rich languages, we may want word embeddings that also consider morphological information, to reduce sparsity in word embedding training. Previous work on morphological word embeddings has shown improvements on word similarity tasks, but has not been evaluated on downstream NMT tasks. Our contribution is two-fold:
1. We adapt word2vec to utilize lemmas from a morphological analyzer, and show improvements on a word similarity task over a state-of-the-art unsupervised approach to incorporating morphological information based on character n-grams (Bojanowski et al., 2017).
2. We experiment with Arabic-to-English NMT on the TED Talks corpus. Our results demonstrate that incorporating some form of morphological word embeddings into NMT improves BLEU scores and outperforms the conventional approaches of using standard word embeddings, random initialization, or byte-pair encoding (BPE).

Neural Machine Translation
We follow recent work in neural machine translation, using a standard bi-directional LSTM encoder-decoder model with the attention mechanism from Luong et al. (2015). Below we describe other work in NMT that addresses the same issues: settings with limited parallel data, translation of morphologically complex languages, and Arabic NMT.

Low-Resource Settings
Some success has been achieved applying neural machine translation to low-resource settings. Zoph et al. (2016) use transfer learning to improve NMT from low-resource languages into English. They initialize parameters in the low-resource setting with parameters from an NMT model trained on a high-resource language. Nguyen and Chiang (2017) extend this by exploiting vocabulary overlap in related languages. Similarly, Firat et al. (2016) share parameters between high- and low-resource languages via multi-way, multilingual NMT.
Other work aims to exploit monolingual data via back-translation (Sennrich et al., 2016a). Imankulova et al. (2017) aim to improve this technique for low-resource settings by filtering generated back-translations with quality estimation. Meanwhile, He et al. (2016) use a reinforcement learning approach to learn from monolingual data.
Our approach is similar to those utilizing transfer learning, but we initialize on the source side with monolingual word embeddings, which is relatively simple to implement and low-cost to train. Di Gangi and Federico (2017) experiment with monolingual word embeddings as we do, but they merge external monolingual word embeddings with the embeddings learned by an NMT system. We simply use word embeddings as initialization, and we instead focus on exploring how morphological word embeddings can help in this setup.
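Concretely, this initialization amounts to copying pretrained vectors into the encoder's embedding table. Below is a minimal PyTorch sketch of the idea, with a toy vocabulary; the helper `load_pretrained_embeddings` and the data structures are illustrative, not OpenNMT-py's actual interface.

```python
import torch
import torch.nn as nn

def load_pretrained_embeddings(vocab, vectors, dim=300):
    """Build an embedding matrix for `vocab` (word -> index), copying in
    pretrained vectors where available and leaving a small random
    initialization for words the embedding training never saw."""
    weight = torch.randn(len(vocab), dim) * 0.01
    for word, idx in vocab.items():
        if word in vectors:
            weight[idx] = torch.tensor(vectors[word])
    return weight

# Toy example: freeze=True corresponds to the "fixed" setting described
# later, freeze=False to "unfixed" (backpropagation through embeddings).
vocab = {"<unk>": 0, "<pad>": 1, "ktAb": 2}
vectors = {"ktAb": [0.1] * 300}
emb = nn.Embedding.from_pretrained(
    load_pretrained_embeddings(vocab, vectors),
    freeze=False, padding_idx=1)
```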

Incorporating Morphology
Some research has aimed to incorporate morphological information into NMT systems. Byte-Pair Encoding (BPE) segments words into pieces by merging character sequences based on frequency (Sennrich et al., 2016b), and these sequences of word pieces are then translated. BPE has become standard practice. However, it is unclear how much data is necessary for it to be beneficial. In our experiments, BPE performs worse than initializing with any of the word embeddings for our dataset.
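The merge-learning loop at the heart of BPE is short enough to sketch; the following closely follows the reference algorithm from Sennrich et al. (2016b), on a toy vocabulary.

```python
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

# Words start as sequences of characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
for _ in range(10):            # our experiments use 30,000 merges
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
```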
Character-level NMT has recently become popular as well (Ling et al., 2015b; Costa-jussà and Fonollosa, 2016; Lee et al., 2017). These approaches aim to implicitly learn morphology by building neural network architectures over characters. We also compare to a character-level NMT system in our experiments.
Other work adds morphological information into the decoder, following findings that the encoder already learns more morphological information than the decoder. Our work differs in that we focus on incorporating morphological information into the source side. Moreover, that work uses higher-resource datasets; in lower-resource settings, it may still be helpful to incorporate morphological information into the encoder.

Arabic NMT
Almahairi et al. (2016) produce the first results of neural machine translation on Arabic. They find that the preprocessing of Arabic used in statistical machine translation is helpful: they normalize the text, removing diacritics and normalizing inconsistently typed characters, and they tokenize according to the Penn Arabic Treebank (ATB) scheme (Maamouri et al., 2004), separating all clitics except for definite articles. We normalize in the same way, but do not use ATB tokenization, instead using the default tokenization in Moses (Koehn et al., 2007). We do this to focus on embeddings for words and to facilitate generalization to other languages. Other work explores alternatives to language-specific segmentation in Arabic, finding that BPE performs best in that scenario.

Note that unlike the previously described work, we use a dataset of only 2.9 million tokens for training. This is to assess the use of morphological word embeddings in settings with limited parallel data.

Morphological Word Embeddings
Morphological word embeddings help improve the quality of pretrained word embeddings for less frequent morphological variants, which is important for morphologically rich and low-resource languages. We outline related work in this section and describe an additional approach of our own.
Some related work has used morphological resources to guide word embeddings. Cotterell and Schütze (2015) use a multi-task objective to encourage word embeddings to reflect morphological tags, working within the log-bilinear model of Mnih and Hinton (2007). Cotterell et al. (2016) use a latent-variable model to adapt existing word embeddings to morphemes. Our additional approach is similar to this vein of work in that it uses morphological resources, but it works within the popular word2vec skipgram objective (Mikolov et al., 2013a), adding a simple modification to consider a lemma in addition to a word form.

[Figure 1: Modified skipgram objective for training morph embeddings. Here, w(t) is the current word, l(t) is its lemma, and w(t-2), w(t-1), w(t+1), w(t+2) are neighboring words.]
Other work uses purely unsupervised techniques. Luong et al. (2013) segment words using Morfessor (Creutz and Lagus, 2007), and use recursive neural networks to build word embeddings from morph embeddings. Instead of explicit segmentation, fastText (Bojanowski et al., 2017) incorporates subword information into the skipgram model by treating a word as a bag of character n-grams. They represent each n-gram of sizes 3-6 with a vector, and each word as the sum of its n-gram vectors. While fastText is not explicitly learning morphology, it can be viewed as potentially incorporating morpheme-like subwords.
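For illustration, the subword inventory that fastText sums over can be sketched in a few lines; this is a simplification of the actual implementation, which additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams fastText sums over for `word`.
    Angle brackets mark word boundaries, and the full bracketed word
    is included as its own feature, as in Bojanowski et al. (2017)."""
    wrapped = '<' + word + '>'
    grams = {wrapped}
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

# Buckwalter-transliterated broken plural "keys": none of these n-grams
# matches the singular mftAH, since the plural is non-concatenative.
print(sorted(char_ngrams('mfAtyH')))
```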
For simplicity and efficiency, we consider only embeddings in the skipgram family: fastText, word2vec skipgram, and our modification of the word2vec skipgram objective, described in Section 3.1. There is a large literature on exploiting characters, morphology, and composition for embedding models (Chen et al., 2015; Ling et al., 2015a; Qiu et al., 2014; Wieting et al., 2016; Lazaridou et al., 2013), and a comparison with these different models may be interesting future work.
The usefulness of word embeddings in downstream applications is a question that often needs to be revisited. Many types of morphological or character-level embedding models have been evaluated under various extrinsic metrics, in applications such as language modeling (Kim et al., 2016; Botha and Blunsom, 2014; Sperr et al., 2013), parsing (Ballesteros et al., 2015), part-of-speech tagging (dos Santos and Zadrozny, 2014), and named-entity recognition (dos Santos and Guimarães, 2015; Cotterell and Duh, 2017). Besides an Arabic word similarity dataset, we focus here on evaluating embeddings on the source side of a machine translation task.

Modified Skipgram Objective
We assume the availability of a morphological analyzer or lemmatizer that outputs a lemma for each word token in a text. We modify the skipgram objective (Mikolov et al., 2013b) to use both word and lemma to predict context words, as illustrated in Figure 1. We learn word vectors and lemma vectors, using their concatenation in the dot product with a context vector in the skipgram objective. The modified objective that we approximate with negative sampling is thus

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t, l_t), \qquad p(w_{t+j} \mid w_t, l_t) \propto \exp\!\left( [\mathbf{v}_{w_t} ; \mathbf{v}_{l_t}]^{\top} \mathbf{v}'_{w_{t+j}} \right),$$

where $w_t$ is the current word, $l_t$ its lemma, $\mathbf{v}$ and $\mathbf{v}'$ are input and context vectors, and $[\cdot\,;\cdot]$ denotes concatenation. Without the lemma part, this objective corresponds to word2vec.
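A minimal PyTorch sketch of one training step under this objective, assuming words, lemmas, contexts, and negative samples have already been mapped to index tensors; this is an illustration of the objective above, not our released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphSkipgram(nn.Module):
    """Skipgram with negative sampling where the input representation is
    the concatenation of a word vector and a lemma vector."""

    def __init__(self, n_words, n_lemmas, dim=300):
        super().__init__()
        self.word_in = nn.Embedding(n_words, dim)        # word part
        self.lemma_in = nn.Embedding(n_lemmas, dim)      # lemma part
        self.ctx_out = nn.Embedding(n_words, 2 * dim)    # context vectors

    def forward(self, word, lemma, context, negatives):
        # [v_w ; v_l]: concatenated input representation, shape (B, 2d)
        v = torch.cat([self.word_in(word), self.lemma_in(lemma)], dim=-1)
        pos = torch.einsum('bd,bd->b', v, self.ctx_out(context))
        neg = torch.einsum('bd,bnd->bn', v, self.ctx_out(negatives))
        # Negative sampling: push true contexts up, sampled words down.
        return -(F.logsigmoid(pos).mean() + F.logsigmoid(-neg).mean())
```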
Because there may be multiple lemmas associated with a word type, we use a weighted average over lemma vectors in the final vector:

$$\mathbf{v}_{l(w)} = \sum_{l} \frac{c(w, l)}{c(w)} \, \mathbf{v}_l,$$

where c(·) is the count of a word or word-lemma pair. When the morphological analyzer cannot produce a lemma, we use the word form itself. We also output the vectors associated with individual lemmas, which can be used to handle OOV words.
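A sketch of this averaging step, assuming word-lemma pair counts collected by running the lemmatizer over the embedding training corpus; all names here are illustrative.

```python
def lemma_part(word, word_lemma_counts, lemma_vecs, dim=300):
    """Weighted average of lemma vectors for one word type, weighted by
    how often each (word, lemma) pair was observed. Falls back to the
    word form itself when the analyzer produced no lemma, mirroring the
    training-time behavior described above."""
    pairs = word_lemma_counts.get(word)        # {lemma: count} or None
    if not pairs:                              # analyzer failed
        return lemma_vecs.get(word, [0.0] * dim)
    total = sum(pairs.values())
    avg = [0.0] * dim
    for lemma, count in pairs.items():
        vec = lemma_vecs[lemma]
        for k in range(dim):
            avg[k] += (count / total) * vec[k]
    return avg

# word_lemma_counts: {word: {lemma: count}}, built from the lemmatized
# embedding training corpus.
```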
The lemma simplifies a word, removing clitics and some inflectional morphology. While this reduces the sparsity of infrequent stems, it also removes potentially useful information. The hope is that by using both word and lemma, we maintain enough of the benefits of morphology for frequent words while also reducing sparsity for infrequent words. In preliminary experiments, we also tried using just the lemma to predict context words, but this performed worse, possibly because we lose too much morphological information.
In future work, we could also modify what is predicted (i.e., instead of predicting context words, predict their lemmas, or both words and lemmas).

Arabic Morphology and Resources
We describe here the morphological analyzer we use, as well as prominent features of Arabic morphology that we consider in our analysis.

Morphological Analyzer
We use a morphological analyzer for Arabic called MADAMIRA (Pasha et al., 2014). MADAMIRA performs rule-based morphological analysis on the form of the word and then uses supervised learning techniques to disambiguate in context. It provides several types of morphological analysis for Arabic. In this work we only use the lemma, though future work could consider utilizing the other morphological information provided.

Arabic Morphology
One prominent feature of Arabic morphology is that it is rich with clitics, morphemes that syntactically function as words but phonologically function as affixes. Arabic proclitics (prefixes) include articles, conjunctions, and prepositions. Arabic enclitics (suffixes) include object or possessive pronouns. There are also inflectional affixes for number (singular, plural, and dual) and gender (masculine, feminine), as well as grammatical case endings, though only certain indefinite accusative case endings are visible without diacritics.
Semitic languages such as Arabic also have a substantial amount of non-concatenative morphology. Most stems are formed from a 3-consonant root inserted into a vowelled template, called "templatic morphology." When we are only considering inflectional morphology, as we are in the case of lemmas, we see this most in "broken plurals," which are especially productive in Arabic (as compared to other Semitic languages). A broken plural changes the internal vowelled pattern from the singular, rather than attaching a suffix.
An example of this is the word for "key," mftAH, and its plural, mfAtyH, where the root is f-t-H, the singular pattern is mCCAC, and the plural pattern is mCACyC (we use Buckwalter transliteration (Buckwalter, 2002) throughout). In this case, MADAMIRA would produce the same lemma for both forms. We hypothesize that the embeddings informed by MADAMIRA will have an advantage on these words, where the morphemes involved cannot be captured by character n-grams.
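The template-filling itself is mechanical, and a toy sketch makes the example concrete; `fill_pattern` is our illustrative helper, not a real analyzer.

```python
def fill_pattern(root, pattern):
    """Insert the consonants of a triliteral root into a template whose
    'C' slots stand for root consonants (Buckwalter transliteration)."""
    consonants = iter(root)
    return ''.join(next(consonants) if ch == 'C' else ch for ch in pattern)

root = 'ftH'                          # f-t-H, the root of "key"
print(fill_pattern(root, 'mCCAC'))    # mftAH  (singular, "key")
print(fill_pattern(root, 'mCACyC'))   # mfAtyH (broken plural, "keys")
```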

Experiments
We compare three types of embeddings:
• word2vec: standard skip-gram word embeddings that use only word information.
• fastText: skip-gram embeddings that are sums of vectors representing character ngrams, implicitly incorporating some form of morphological information.
• morph: the modified skip-gram word embeddings described in Section 3.1, which rely on a morphological analyzer and lemma embeddings.
The word embeddings inserted into the NMT system are always of dimension 300; in the word similarity experiments, we experiment with different dimensions. All word embeddings are trained with negative sampling (5 samples), a window size of 5, a 10^-4 rejection threshold for subsampling, and 5 iterations. Additional fastText parameters are left at their defaults. We use OpenNMT-py (Klein et al., 2017) for all NMT experiments, with a maximum sentence length of 80, and use word-level prediction accuracy for model selection. For the BPE baseline, the number of BPE merge operations is 30,000. The hidden layer size is 1024; models are trained for 20 epochs with batch size 80, using Adadelta (Zeiler, 2012) with a learning rate of 1.0 and a dropout rate of 0.2.
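For reference, the embedding hyperparameters above map roughly onto the following gensim calls; this mapping is ours, shown only to make the settings concrete (our morph embeddings require the modified objective from Section 3.1 and are not available in gensim).

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus; in our setup this would be Arabic Wikipedia
# plus the source side of the TED training data.
sentences = [["ktAb", "jdyd"], ["mftAH", "AlbAb"]]

common = dict(vector_size=300, sg=1, negative=5, window=5,
              sample=1e-4, epochs=5)

w2v = Word2Vec(sentences, **common)                    # word2vec skipgram
ft = FastText(sentences, min_n=3, max_n=6, **common)   # fastText, 3-6 grams
```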
When initializing the encoder with word embeddings, we experiment both with locking the word embeddings throughout training ("fixed") and allowing backpropagation through the word embeddings ("unfixed"). At test time, words not seen in the MT training data are also initialized with word embeddings, if they were seen in the word embedding training data. Words unseen in either corpus are mapped to the embedding of an <unk> token.

The bitext we use for NMT is a collection of TED subtitles obtained from WIT3 (Cettolo et al., 2012). This is a collection of monologue speeches from TED talks, covering a wide range of topics such as technology, design, and social science. We downloaded the latest XML files (version 2016-04-08) for Arabic and performed subtitle extraction and sentence merging using the WIT3 scripts. The data is then randomly split at the granularity of talks, with 1,939 talks for training, 30 talks for development, and 30 talks for testing. The corresponding sentence/token statistics are shown in Table 2. In this data, 9% of word types and 3% of tokens in the test data were not seen in training.

Corpus       Sentences   Tokens    Types
Wikipedia    1,751k      79,793k   1,263k
TED, train   175k        2,855k    152k
TED, dev     2k          30k       8k
TED, test    2k          29k       8k

Table 2: Size of corpora; the number of tokens for the MT data refers to the source side.
The monolingual corpus we use for word embeddings is cleaned and tokenized Arabic Wikipedia, consisting of about 80 million tokens, with a vocabulary of around 350k words. The word embeddings are trained on both the monolingual corpus and the source side of the TED training data. The number of lemma types in the monolingual corpus is 672k, and in TED training data is 42k.

Word Similarity Results
Before running NMT, we first experiment on a word similarity dataset to test the effectiveness of morphology in word embeddings. We compare word2vec, fastText, and the variants of our morphological skip-gram from Section 3.1. We experiment with normalizing only diacritics, as well as additionally normalizing inconsistently typed characters as in Almahairi et al. (2016), referred to here as "full normalization." We normalize the word similarity dataset accordingly. To ensure that dimensionality is not a major factor, we experiment with various dimensions. We also experiment with using just lemmas to predict context words, which performs slightly worse than using both word and lemma and taking the lemma part of the vector, though still better than word2vec and fastText. We evaluate on an Arabic dataset developed by Hassan and Mihalcea (2009) based on the classic WordSim353 (Finkelstein et al., 2001), the dataset also evaluated on by Bojanowski et al. (2017). We re-run word2vec and fastText and obtain similar, though not identical, results to Bojanowski et al. (2017); we suspect the differences are due to differences in cleaning and tokenizing Arabic Wikipedia. As is standard for these evaluations, we report the Spearman rank coefficient in Table 1.

[Table 1: Spearman coefficient for the Arabic word similarity dataset built off of WS353. We list the dimension of the word embedding, and in the case of morph, the dimensions of the word part and the lemma part. In the morph system, "lemma" refers to using just the lemma part of the vector to compare similarity, "word" to using just the word part, and "word+lemma" to using the whole vector.]

There are 3 OOV words when normalizing diacritics, and 1 with full normalization, out of 353 word pairs. We report results both using zero vectors for OOVs and with an attempt to handle OOVs when possible, as done by Bojanowski et al. (2017).
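The evaluation protocol is standard; the following scipy-based sketch scores word pairs by cosine similarity and reports the Spearman coefficient, with one common convention assumed for OOVs: zero vectors, and similarity 0 for any pair involving one.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_word_similarity(pairs, vecs, dim=300):
    """pairs: list of (word1, word2, human_score) triples.
    vecs: word -> np.ndarray. OOV words get zero vectors ("null OOV")."""
    zero = np.zeros(dim)
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        u, v = vecs.get(w1, zero), vecs.get(w2, zero)
        sim = cosine(u, v) if u.any() and v.any() else 0.0
        model_scores.append(sim)
        human_scores.append(score)
    return spearmanr(model_scores, human_scores).correlation
```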
To handle OOVs, we run MADAMIRA on the unknown form alone (without the benefit of a context sentence) to get a lemma, and use the lemma vector learned for the corresponding lemma, if it was seen in training, with zeros for the word part. (Note that in the case where we only normalize diacritics, we can only recover a lemma vector for 1 of the 3 OOVs, while fastText uses n-grams to recover something for all 3. With full normalization, both are able to recover a vector.) We see that across normalization schemes and dimensions, fastText performs 1-3 points better than word2vec in the null-OOV setting and 2-4 points better when handling OOVs. Using both word and lemma to predict context words performs about the same as fastText. However, when we take just the part of the vector corresponding to the weighted average of lemma vectors, it performs 4-6 points better than fastText; 2-4 points of this gain can be achieved by just using the lemma to predict context words.
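A sketch of this back-off, where `lemmatize` stands in for running MADAMIRA on the isolated form; the word-part/lemma-part split matches the concatenated morph vectors described in Section 3.1.

```python
import numpy as np

def oov_vector(word, lemmatize, lemma_vecs, dim=300):
    """Back off to the lemma vector for a word unseen in embedding
    training: lemmatize the bare form, look up its lemma vector if one
    was learned, and use zeros for the word part of the concatenation."""
    lemma = lemmatize(word)          # None if the analyzer fails
    if lemma is not None and lemma in lemma_vecs:
        return np.concatenate([np.zeros(dim), lemma_vecs[lemma]])
    return np.zeros(2 * dim)         # nothing recoverable
```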
Interestingly, the word part of the morph vector performs poorly on word similarity, but still provides some benefit in training. We found that using just the lemma to predict in training performed slightly worse than the lemma part of the vector when using both. It is possible that complementary features are learned in the word part and lemma part of the vector, and that the lemma part corresponds much more closely to semantic similarity.

Neural Machine Translation Results
We run 3 replicates of experiments with random initializations (re-training the word embeddings on each run as well). Results for corpus-level BLEU, calculated using the multi-bleu script from Moses, are provided in Table 3. BPE outperforms using full words by 1.3 BLEU points (27.80 vs. 26.55). Initializing with word2vec results in a 1.8 BLEU point gain over randomly initialized word embeddings. morph results in a 0.4 BLEU point gain over word2vec, and fastText a 0.7 BLEU point gain. Fixing the embeddings consistently performs worse than allowing backpropagation, though this gap narrows as the BLEU scores of both improve. We also compare to an NMT system with a CNN over character embeddings in the encoder, from Costa-jussà and Fonollosa (2016), which results in a BLEU score of 26.46. We also perform statistical significance testing via bootstrap resampling, using the multeval tool (Clark et al., 2011). The best BLEU scores are 28.76 for morph and 29.10 for fastText. Both morph and fastText improve upon word2vec (28.38) with p-values < 0.01; the difference between fastText and morph is not statistically significant.
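Paired bootstrap resampling (Koehn, 2004) is easy to sketch; this stand-in uses sacrebleu rather than the Moses script for self-containedness, so its numbers would differ slightly from what multeval reports.

```python
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n=1000, seed=0):
    """Estimate how often system A beats system B in corpus BLEU over
    resampled test sets; returns an approximate p-value for A > B."""
    rng = random.Random(seed)
    idx = list(range(len(refs)))
    wins = 0
    for _ in range(n):
        sample = [rng.choice(idx) for _ in idx]   # resample w/ replacement
        a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                  [[refs[i] for i in sample]]).score
        b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                  [[refs[i] for i in sample]]).score
        wins += a > b
    return 1.0 - wins / n
```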
To see whether trends in BLEU are stronger for sentences containing rarer words with more frequent lemmas, we try filtering test sentences by the ratio of word count to lemma count in the source side of the MT training data. We take sentences with at least one word whose lemma is more than 50 times as frequent as the word itself in the training data. Comparing just the unfixed, normalized, word-based versions, we show BLEU on the filtered sentences in Table 4.
This heuristic for rare morphological variants identifies 1,376 rare morphological variants out of the 7,345 words in the intersection of the train and test source data, and selects 1,038 of the 1,982 test sentences for evaluation. On these sentences, morph results in a 0.6 BLEU point gain over word2vec, and fastText a 0.88 BLEU point gain.
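A sketch of the filtering heuristic, assuming token counts and a word-to-lemma map built from the source side of the MT training data; all names are illustrative.

```python
def rare_variant_sentences(sentences, word_counts, lemma_counts,
                           word_to_lemma, ratio=50):
    """Keep tokenized test sentences containing at least one training
    word whose lemma is more than `ratio` times as frequent as the
    word itself in the MT training source."""
    def is_rare_variant(w):
        lemma = word_to_lemma.get(w)
        return (w in word_counts
                and lemma is not None
                and lemma_counts.get(lemma, 0) > ratio * word_counts[w])
    return [s for s in sentences if any(is_rare_variant(w) for w in s)]
```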
Because corpus-level BLEU is limited in characterizing translation quality with respect to morphological variants at the word level, we also perform a manual analysis of the sentences from each system, to inspect improvements that may be due to the various word embeddings. We use multeval (Clark et al., 2011) to find the sentences with the biggest sentence-level BLEU improvement over standard word2vec for the morph and fastText systems, and check whether there are notable trends. We display the median system's translation in this analysis, as recommended by Clark et al. (2011), though the sentences selected here exhibited the described phenomena consistently across multiple runs. Example sentences are shown in Table 5.
In several cases, both the morph and fastText systems consistently succeed in translating rare or unseen words whose morphological variants are seen more commonly in the word embedding training data, while the word2vec system does not. For instance, in example 1, the word lltdxlAt ("of interventions") is never seen in the MT training data, and only rarely, 24 times, in the word embedding training data. However, the word stripped of the definite article and the clitic corresponding to "of," i.e. the character n-gram tdxlAt, is seen 657 times in the word embedding training data, and the lemma, which is shared between singular and plural, occurs 6,887 times. In some cases, the morph system is consistently the only system that successfully translates a rare morphological variant. For instance, in example 2, the morph system translates the word AbEAdA ("dimensions") correctly, while the other systems do not. It occurs here in the accusative case, which does not appear explicitly in many settings in Arabic. This word form occurs 7 times in the MT training data and 101 times in the monolingual corpus, while its lemma occurs 214,297 times in the word embedding data. This is much more frequent than we would expect for variants of the word "dimension," because the lemma is also associated with the very frequent word for "after"; nevertheless, the model seems to learn a good representation despite this. It is unclear exactly why fastText does not learn a good representation in any of the three runs, though with character n-grams there may be conflict with other unrelated words. Note that because the plural is non-concatenative, none of the character n-grams in this word corresponds to the singular.
In other cases, the morphological analyzer cannot provide an analysis for a word, and a rare morphological variant is only translated correctly by fastText. In example 3, while sentence-level BLEU is best for the word2vec version in this case, we see a word that is translated best by fastText and fails to be translated in the other two systems. The word AbtlAE ("swallowing") is only seen as a word itself twice in the MT training data and 171 times in the monolingual corpus. However, the 6-gram corresponding to the word is seen 444 times in the word embedding training data as part of other words, while the morphological analyzer provides no analysis. fastText translates it as "swallow" rather than "swallowing," but it is still better here than morph, which consistently fails to translate the word at all.

Discussion
[Table 5 fragment, example outputs: src (gloss): "The sword is a tradition of ancient India."; morph: "The sword of the sword is a traditional Indian tradition."; fastText: "Swallow the ball is the old Indian habits."]

Overall, morphologically aware word embeddings (morph and fastText) can help reduce sparsity and improve results on both a word similarity task and a low-resource NMT system when used as initialization. The improvements over standard word embeddings are consistent, implying that morphology is a useful signal to incorporate. It is interesting that the word embeddings that perform best on the word similarity task (morph) do not line up with what performs best in the NMT system (fastText). This reinforces the argument that word similarity tasks alone are not enough to evaluate word embeddings (Faruqui et al., 2016), and that which embeddings we prefer may depend on the downstream task and the dataset. We briefly discuss here the potential strengths and weaknesses of each approach to morphological word embeddings, though more conclusive analysis is left to future work.
One possible reason for the difference in best embeddings between the two tasks is how in-domain the morphological analyzer is for each task. In the word similarity task, 434 of the 444 unique words receive lemmas (about 98%). On the other hand, in the MT test data, 7,266 of the 8,309 unique words receive lemmas (only about 87%).
It is also possible that function words matter more in the MT task, and that their translation does not improve as much with embeddings informed by lemmas. fastText may help more with these words, especially when function words in English correspond to pieces of a word in Arabic.
From these experiments, it appears that if one is more concerned with semantic similarity, or has a dataset that lines up well with the morphological analyzer used to produce lemmas, word embeddings exploiting morphological resources might be best. On the other hand, for a downstream task such as MT, and when a substantial number of words are not covered by the analyzer, a method considering character n-grams may be better.
In both cases, word embeddings considering subword information consistently perform better than standard word embeddings on a morphologically rich language such as Arabic. It is possible that future gains could be made by combining the strengths of both models.

Conclusion
We extend the skipgram model for word embeddings to incorporate lemmas from a morphological resource in a simple way, maintaining the efficiency of word2vec, and release the code publicly. We show that this model outperforms word2vec and fastText on a word similarity task in Arabic.
We also conduct experiments with these word embeddings as initialization for a low-resource neural machine translation system. We find that the word embeddings utilizing subword information consistently outperform standard word embeddings at this task, and that any of the word embeddings we tried outperformed random initialization or BPE. fastText does best at this task, with a 0.7 BLEU gain over standard word embeddings and a 2.5 BLEU gain over random initialization.
Future work will attempt to combine the strengths of these multiple approaches to incorporating morphological information in word embeddings, as well as to explore other sources of information such as part-of-speech or syntax.