Word Representation Models for Morphologically Rich Languages in Neural Machine Translation

Out-of-vocabulary words present a great challenge for machine translation. Recently, various character-level compositional models have been proposed to address this issue. In this work, we incorporate the two most popular neural architectures, LSTM and CNN, into hard- and soft-attentional models of translation for character-level representation of the source. We propose semantic and morphological intrinsic evaluations of encoder-level representations. Our analysis of the learned representations reveals that the character-based LSTM seems better at capturing morphological aspects than the character-based CNN. We also show that the hard-attentional model provides better character-level representations than the vanilla soft-attentional one.


Introduction
Models of end-to-end machine translation based on neural networks have been shown to produce excellent translations, rivalling or surpassing traditional statistical machine translation systems (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015). A central challenge in neural MT is handling rare and uncommon words. Conventional neural MT models use a fixed, modest-size vocabulary, such that the identity of rare words is lost, which makes their translation exceedingly difficult. Accordingly, sentences containing rare words tend to be translated much more poorly than those containing only common words (Sutskever et al., 2014; Bahdanau et al., 2015). The rare word problem is particularly exacerbated when translating from morphologically rich languages, where the many morphological variants of words result in a huge vocabulary with a heavy-tailed distribution. For example, in Russian there are at least 70 word forms for dog, encoding case, gender, age, number, sentiment and other semantic connotations. Many of these words share a common lemma and contain regular morphological affixation; consequently, much of the information required for translation is present, but not in a form accessible to models of neural MT.
In this paper, we propose a solution to this problem by constructing word representations compositionally from smaller sub-word units, which occur more frequently than the words themselves. We show that these representations are effective in handling rare words, and increase the generalisation capabilities of neural MT beyond the vocabulary observed in the training set. We propose several neural architectures for compositional word representations, and systematically compare these methods integrated into a novel neural MT model.
More specifically, we make use of character sequences or morpheme sequences in building word representations. These sub-word units are combined using recurrent neural networks (RNNs), convolutional neural networks (CNNs), or a simple bag-of-units. This work was inspired by research into compositional word approaches proposed for language modelling (e.g., Botha and Blunsom (2014), Kim et al. (2016)); aside from a few notable exceptions (Ling et al., 2015b; Sennrich et al., 2015; Costa-jussà and Fonollosa, 2016), these approaches have not been applied to the more challenging problem of translation. We integrate these word representations into a novel neural MT model to build robust word representations for the source language.
Our novel neural MT model is based on the operation sequence model (OSM; Durrani et al. (2011), Feng and Cohn (2013)), which considers translation as a sequential decision process. The decisions involved in generating each target word are decomposed into separate translation and alignment factors, where each factor is modelled separately and conditioned on a rich history of recent translation decisions. Our OSM can be considered as a form of the attentional encoder-decoder of Bahdanau et al. (2015) with hard attention, in which each decision is contextualised by at most one source word, contrasting with the soft attention of Bahdanau et al. (2015).
Integrating the word models into our neural OSM, we provide, for the first time, a comprehensive and systematic evaluation of the resulting word representations when translating into English from several morphologically rich languages: Russian, Estonian, and Romanian. Our evaluation includes both intrinsic and extrinsic metrics, where we compare these approaches based on their translation performance as well as their ability to recover synonyms for rare words. We show that morpheme and character representations of words lead to much better held-out perplexity, although the improvement in translation BLEU scores is more modest. Intrinsic analysis shows that the recurrent encoder tends to capture more morphosyntactic information about words, whereas the convolutional network better encodes the lemma. Both factors provide different strengths as part of a translation model, which might use lemmas to generalise over words sharing translations, and morphosyntax to guide reordering and contextualise subsequent translation decisions. These factors are also likely to be important in other language processing applications.

Related Work
Most neural models for NLP rely on words as their basic units, and consequently face the problem of how to handle tokens in the test set that are out-of-vocabulary (OOV), i.e., did not appear in the training set (or are considered too rare in the training set to be worth including in the vocabulary). Often these words are assigned a special UNK token, which allows the model to be applied to any data, but comes at the expense of modelling accuracy, especially in structured problems like language modelling and translation, where the identity of the word is paramount to making the next decision.
One solution to the OOV problem is modelling sub-word units, composing a word's representation from its constituent morphemes. Luong et al. (2013) proposed a recursive combination of morphs using an affine transformation; however, this is unable to differentiate between the compositional and non-compositional cases. Botha and Blunsom (2014) aim to address this problem by forming word representations as the sum of each word's morpheme embeddings added to its word embedding. Morpheme-based methods rely on good morphological analysers, which are only available for a limited set of languages. Unsupervised analysers (Creutz and Lagus, 2007) are prone to segmentation errors, particularly on fusional or polysynthetic languages. In these settings, character-level word representations may be more appropriate. Several authors have proposed convolutional neural networks over character sequences, as part of models of part-of-speech tagging (Santos and Zadrozny, 2014), language modelling (Kim et al., 2015), and machine translation (Costa-jussà and Fonollosa, 2016). These models are able to capture not just orthographic similarity, but also some semantics. Another strand of research has looked at recurrent architectures, using long short-term memory (LSTM) units (Ling et al., 2015a; Ballesteros et al., 2015), which can capture long orthographic patterns in the character sequence, as well as non-compositionality.
All of the aforementioned models were shown to consistently outperform standard word-embedding approaches. However, there has been no systematic investigation of the various modelling architectures, or comparison of characters versus morphemes as the atomic units of word composition. In this work we consider both morpheme and character levels and study 1) whether character-based approaches can outperform morpheme-based ones, and, importantly, 2) which linguistic lexical aspects are best encoded in each type of architecture, and their efficacy as part of a machine translation model when translating from morphologically rich languages.

Operation sequence model
The first contribution of this paper is a neural network variant of the operation sequence model (OSM) (Durrani et al., 2011; Feng and Cohn, 2013). In the OSM, translation is modelled as a sequential decision process. The words of the target sentence are generated one at a time in a left-to-right order, similar to the decoding strategy in traditional phrase-based SMT. The decisions involved in generating each target word are decomposed into a number of separate factors, where each factor is modelled separately and conditioned on a rich history of recent translation decisions.
In previous work (Durrani et al., 2011; Feng and Cohn, 2013), the sequence of operations is modelled as a Markov chain with a bounded history, where each translation decision is conditioned on a finite history of past decisions. Using deep neural architectures, we model the sequence of translation decisions as a non-Markovian chain, i.e. with unbounded history. Our approach is therefore able to capture long-range dependencies, which are commonplace in translation and missed by previous approaches.
More specifically, the operations are (i) generation of a target word, (ii) jumps over the source sentence to capture re-ordering (to allow different word orders in the target vs. source language), (iii) aligning to NULL to capture gappy phrases, and (iv) finishing the translation process. The probability of a sequence of operations generating a target translation t for a given source sentence s is

P(t, a | s) = ∏_{j=1}^{|t|+1} P(τ_j | t_{<j}, τ_{<j}, s) P(t_j | t_{<j}, τ_{≤j}, s)

where τ_j is a jump action moving over the source sentence (to align a target word to a source word or NULL), or finishing the translation process, τ_{|t|+1} = FINISH. It is worth noting that the sequence of operations generating a target translation (in a left-to-right order) has a 1-to-1 correspondence with an alignment a, hence the use of P(t, a | s) on the left-hand side.
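As a rough sketch of this decision process, the toy function below reads off a jump/generate operation sequence from a given word alignment. It is purely illustrative: the operation inventory of the actual OSM papers is richer, and the function name, labels, and relative-jump encoding are our own simplifications.

```python
def operations(target, alignment, null=-1):
    """Read off a toy operation sequence from a target sentence and its
    word alignment: alignment[j] is the index of the source word aligned
    to target word j, or `null` for a NULL alignment."""
    ops, prev = [], 0
    for j, word in enumerate(target):
        if alignment[j] == null:
            ops.append(("JUMP", "NULL"))          # gappy phrase: align to NULL
        else:
            ops.append(("JUMP", alignment[j] - prev))  # relative re-ordering move
            prev = alignment[j]
        ops.append(("GEN", word))                 # generate the target word
    ops.append(("FINISH",))                       # terminate the process
    return ops

# Target "the dog barks" with source positions 0, 2, 1 (one re-ordering):
ops = operations(["the", "dog", "barks"], [0, 2, 1])
```

Each target word thus contributes one alignment factor and one translation factor, mirroring the two terms conditioned on the operation history.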
Our model generates the target sentence and the sequence of operations with a recurrent neural network (Figure 1). At each stage, the RNN state is a function of the previous state, the previously generated target word, and the aligned source word, computed with a single-layer perceptron (MLP) which applies an affine transformation to the concatenated input vectors followed by a tanh activation function,

h_j = tanh(affine([h_{j−1}; R^{(t)}_{t_{j−1}}; r^{(s)}_{i_j}]))

where R^{(t)} ∈ R^{V_T×E_T} and R^{(s)} ∈ R^{V_S×E_S} are word embedding matrices, with V_S the size of the source vocabulary, V_T the size of the target vocabulary, and E_T and E_S the word embedding sizes for the target and source languages, respectively.
The model then generates the target word t_i and the index of the source word to be translated next, where affine performs an affine transformation of its input, and F is the dimensionality of the feature vector Φ(·) representing the induced alignment structure (explained in the next paragraph). The matrix encoding of the source sentence, r^{(s)} ∈ R^{(|s|+2)×E_S}, includes the embeddings of the source sentence words together with the NULL and FINISH actions.
The feature matrix Φ(|s|, i_{≤j}, t_{≤j}) ∈ R^{(|s|+2)×F} captures the relevant relationship between a candidate position for the next alignment and the current alignment position; this is reminiscent of the features captured in the HMM alignment model. The feature vector in each row is composed of two parts: (i) a one-hot vector activating the appropriate feature depending on whether i_{j+1} − i_j is in {0, 1, ≥ 2, ≤ −1}, or whether the action is NULL or FINISH; and (ii) two real-valued features based on the jump distance i_{j+1} − i_j.

Note that the neural OSM can be considered a hard-attentional model, as opposed to the soft-attentional neural translation model of Bahdanau et al. (2015). In their soft-attentional model, a dynamic summary of the source sentence is used as context for each translation decision, formulated as a weighted average of the encodings of all source positions. In the hard-attentional model, this context comes from the encoding of a single fixed source position. This has the benefit of allowing external information to be included in the model, here the predicted alignments from high-quality word alignment tools, which have complementary strengths compared to neural network translation models.

Word Representation Models
Now we turn to the problem of learning word representations. As outlined above, when translating morphologically rich languages, treating word types as unique discrete atoms is highly naive and compromises translation quality. For better accuracy, we need to characterise words by their sub-word units, in order to capture the lemma and morphological affixes, thereby allowing better generalisation between similar word forms.
In order to test this hypothesis, we consider both morpheme- and character-level encoding methods, which we compare to the baseline word embedding approach. For each type of sub-word encoder we learn two word representations: one estimated from the sub-units, and the word embedding itself. We then run max pooling over both embeddings to obtain the word representation, r_w = max(m_w, e_w), where m_w is the embedding of word w, e_w is its sub-word encoding, and the max is taken element-wise. The max pooling operation captures non-compositionality in the semantic meaning of a word relative to its sub-parts. We expect the model to favour unit-based embeddings for rare words and word-based embeddings for more common ones.
Let U be the vocabulary of sub-word units, i.e., morphemes or characters, E_u the dimensionality of unit embeddings, and M ∈ R^{E_u×|U|} the matrix of unit embeddings. Suppose that a word w from the source dictionary is made up of a sequence of units U_w := [u_1, ..., u_{|w|}], where |w| denotes the number of constituent units in the word. We combine the representations of sub-word units using an LSTM recurrent neural network (RNN), a convolutional neural network (CNN), or a simple bag-of-units (described below). The resulting word representations are then fed to our neural OSM in eqn (2) as the source word embeddings.

Bag of Sub-word Units
This method is inspired by Botha and Blunsom (2014), in which the embeddings of sub-word units are simply added together: e_w = Σ_{u∈U_w} m_u, where m_u is the embedding of sub-word unit u.
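A minimal sketch of this encoder, together with the element-wise max pooling that combines the unit-level encoding with the word embedding (as described in the previous section). The morph inventory and all embedding values below are toy numbers, not trained parameters.

```python
def bag_of_units(unit_embeddings, units):
    """Sum the embeddings of a word's sub-word units (morphs or characters)."""
    dim = len(next(iter(unit_embeddings.values())))
    e_w = [0.0] * dim
    for u in units:
        for i, x in enumerate(unit_embeddings[u]):
            e_w[i] += x
    return e_w

def combine_max(word_embedding, unit_encoding):
    """Element-wise max pooling of the word-level and unit-level embeddings."""
    return [max(a, b) for a, b in zip(word_embedding, unit_encoding)]

# Toy morph embeddings for the Estonian example word täppi-de-ga.
M = {"täppi": [0.5, -0.2], "de": [0.1, 0.4], "ga": [-0.3, 0.3]}
e_w = bag_of_units(M, ["täppi", "de", "ga"])   # ≈ [0.3, 0.5]
r_w = combine_max([0.2, 0.9], e_w)             # ≈ [0.3, 0.9]
```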

Bidirectional LSTM Encoder
The encoding of the word is formulated using a pair of LSTMs (denoted bi-LSTM), one operating left-to-right over the input sequence and the other right-to-left, with h→_j and h←_j the respective LSTM hidden states. The source word is then represented as a pair of hidden states, the left-most and right-most states of the two LSTMs. These are fed into a multilayer perceptron (MLP) with a single hidden layer and a tanh activation function to form the word representation, e_w = MLP([h→_{|U_w|}; h←_1]).
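To keep the example short, a plain Elman-style recurrent cell stands in for the LSTM in the sketch below; the structure, running the sequence in both directions and keeping the two final states for a downstream MLP, is what matters. All weights are toy scalars, not trained values.

```python
import math

def rnn_final_state(embeddings, w_in, w_rec):
    """Run a 1-dimensional recurrent cell over a sequence of scalar
    'embeddings' and return the final hidden state."""
    h = 0.0
    for x in embeddings:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def bi_encode(embeddings, w_in=0.7, w_rec=0.3):
    """Final states of the left-to-right and right-to-left passes; in the
    full model these would be concatenated and fed to an MLP to form e_w."""
    h_fwd = rnn_final_state(embeddings, w_in, w_rec)
    h_bwd = rnn_final_state(list(reversed(embeddings)), w_in, w_rec)
    return h_fwd, h_bwd

fwd, bwd = bi_encode([0.2, -0.1, 0.5])  # the two directions differ in general
```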

Convolutional Encoder
The last word encoder we consider is a convolutional neural network, inspired by a similar approach in language modelling (Kim et al., 2016). Let U_w ∈ R^{E_u×|w|} denote the unit-level representation of w, where the jth column corresponds to the unit embedding of u_j. The idea of the unit-level CNN is to apply a kernel Q_l ∈ R^{E_u×k_l} of width k_l to U_w to obtain a feature map f_l ∈ R^{|w|−k_l+1}. More formally, the jth element of the feature map is f_l(j) = tanh(⟨U_{w,j}, Q_l⟩ + b), where U_{w,j} ∈ R^{E_u×k_l} is the slice of U_w spanning the representations of the jth unit and its preceding k_l − 1 units, and ⟨A, B⟩ = Σ_{i,j} A_ij B_ij = Tr(AB^T) denotes the Frobenius inner product. For example, suppose the input has size [4 × 9] and a kernel has size [4 × 3] with stride 1; we then obtain a [1 × 7] feature map. This process implements a character n-gram detector, where n equals the width of the filter. The word representation is then derived by max pooling the feature maps of the kernels: ∀l : r_w(l) = max_j f_l(j). In order to capture interactions between the character n-grams obtained by the filters, a highway network (Srivastava et al., 2015) is applied after the max pooling layer, e_w = t ⊙ MLP(r_w) + (1 − t) ⊙ r_w, where t = MLP_σ(r_w) is a sigmoid gating function which modulates between a tanh MLP transformation of the input (left component) and preserving the input as is (right component).
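The convolution and pooling steps can be sketched directly from the definitions above: slide a width-k kernel over the unit-embedding matrix, take the Frobenius inner product per window plus a bias, apply tanh, then max-pool each feature map. The highway layer is omitted, and all matrices below are toy values.

```python
import math

def conv_feature_map(U, Q, b):
    """U: E_u x n unit-embedding matrix (list of rows); Q: E_u x k kernel.
    Returns the length n - k + 1 feature map f_l."""
    e_u, n, k = len(U), len(U[0]), len(Q[0])
    fmap = []
    for j in range(n - k + 1):
        # Frobenius inner product of the window U[:, j:j+k] with the kernel.
        s = sum(U[r][j + c] * Q[r][c] for r in range(e_u) for c in range(k))
        fmap.append(math.tanh(s + b))
    return fmap

def max_pool(fmap):
    """One entry of r_w: the maximum activation of this kernel's feature map."""
    return max(fmap)

# 2-dimensional embeddings for a 4-character word, one width-2 kernel.
U = [[0.1, 0.4, -0.2, 0.3],
     [0.0, 0.2,  0.5, -0.1]]
Q = [[1.0, 0.0],
     [0.0, 1.0]]
fmap = conv_feature_map(U, Q, b=0.0)  # n - k + 1 = 3 windows
r_w_l = max_pool(fmap)
```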

Experiments
The Setup. We compare the different word representation models on three morphologically rich languages, using both extrinsic and intrinsic evaluations. For extrinsic evaluation, we investigate their effects in translating into English from Estonian, Romanian, and Russian using our neural OSM. For intrinsic evaluation, we investigate how accurately the models recover semantically/syntactically related words for a set of given words.
Datasets. We use parallel bilingual data from Europarl for Estonian-English and Romanian-English (Koehn, 2005), and web-crawled parallel data for Russian-English (Antonova and Misyurev, 2011). For preprocessing, we tokenise, lower-case, and filter out sentences longer than 30 words. Furthermore, we apply a frequency threshold of 5, and replace all low-frequency words with a special UNK token. We split the corpora into three partitions: training (100K), development (10K), and test (10K); Table 1 provides the dataset statistics.
Morfessor Training. We use Morfessor CAT-MAP (Creutz and Lagus, 2007) to perform the morphological analysis needed for the morph-based neural models.
Morfessor does not rely on any linguistic knowledge; instead it relies on the minimum description length principle to construct a set of stems, affixes and paradigms that explain the data. Each word form is then represented as (prefix)*(stem)+(suffix)*.
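The (prefix)*(stem)+(suffix)* shape can be illustrated with a small validity check over morph labels. This is purely illustrative; the label names are ours and Morfessor's actual output format differs.

```python
def valid_segmentation(labels):
    """Check that a sequence of morph labels matches (prefix)*(stem)+(suffix)*:
    any number of prefixes, then at least one stem, then any number of suffixes."""
    i = 0
    while i < len(labels) and labels[i] == "prefix":
        i += 1
    stems = 0
    while i < len(labels) and labels[i] == "stem":
        i += 1
        stems += 1
    while i < len(labels) and labels[i] == "suffix":
        i += 1
    # Valid iff everything was consumed and we saw at least one stem.
    return i == len(labels) and stems >= 1

valid_segmentation(["prefix", "stem", "suffix", "suffix"])  # True
valid_segmentation(["suffix", "stem"])                      # False
```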
We ran Morfessor on the entire initial datasets, i.e. before filtering out long sentences. The word perplexity is the only Morfessor parameter that has to be adjusted. This parameter depends on the vocabulary size: a larger vocabulary requires a higher perplexity value, while setting the perplexity threshold too low results in over-splitting. We experimented with various thresholds and tuned them to yield the most reasonable morpheme inventories.

Table 1 presents the percentage of unknown words in the test set for each source language. For reconstruction we considered words from the native alphabet only. The recovery rate depends on the model. With characters, all words can be trivially rebuilt. For the morpheme-based approach, the quality mainly depends on the Morfessor output and the level of word segmentation. In terms of morphemes, Estonian presents the highest reconstruction rate, so we expect it to benefit the most from the morpheme-based models. Romanian, on the other hand, presents the lowest unknown-word rate, being the most morphologically simple of the three languages. Morfessor quality for Russian was the worst, so we expect Russian to benefit mainly from the character-based models.

Extrinsic Evaluation: MT
Training. We annotate the training sentence-pairs with their sequence of operations in order to train the neural OSM model. We first run a word aligner (fast_align; https://github.com/clab/fast_align) to align each target word to a source word. We then read off the sequence of operations by scanning the target words in a left-to-right order. The training objective is thus to maximise the joint probability of target words and their alignments, eqn 1, which is performed by stochastic gradient descent (SGD). Training stops when the likelihood objective on the development set starts decreasing.
For the re-ranker, we use the standard features generated by Moses as the underlying phrase-based MT system, plus two additional features coming from the neural MT model. The neural features are based on the generated alignment and the translation probabilities, which correspond to the first and second terms in eqn 1, respectively. We train the re-ranker using MERT (Och, 2003) with 100 restarts.
Translation Metrics. We use BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) to measure translation quality against the reference. BLEU is purely based on the exact match of n-grams in the generated and reference translations, whereas METEOR also takes into account matches based on stems, synonyms, and paraphrases. This is particularly suitable for our morphology representation learning methods, since they may result in using the translation of paraphrases. We train the paraphrase table of METEOR on the entire initial bilingual corpora, using pivoting (Bannard and Callison-Burch, 2005).
Results. Table 3 shows the translation and alignment perplexities on the development sets after training. The CNN char model leads to lower word and alignment perplexities in almost all cases. This is interesting, and shows the power of this model in fitting morphologically complex languages using only their characters. Table 2 presents BLEU and METEOR scores, where the re-ranker is optimised for METEOR or BLEU when reporting the corresponding score. Re-ranking based on the neural models' scores outperforms the phrase-based baseline. Furthermore, the BILSTM morph model outperforms the others for Romanian and Estonian, whereas the CNN char model outperforms the others for Russian, which is consistent with our expectations. We expect that replacing Morfessor with a real morphological analyser for each language would improve the performance of the morpheme-based models, but leave this for future research. However, the translation quality of the neural models is not significantly different, which may be due to the convoluted contributions of high- and low-frequency words to BLEU and METEOR. Therefore, we investigate our representation learning models intrinsically in the next section.

Intrinsic Evaluation
We now take a closer look at the embeddings learned by the models, based on how well they capture semantic and morphological information in the nearest-neighbour words. Learning representations for low-frequency words is harder than for high-frequency words, since the model cannot capitalise as reliably on their contexts. Therefore, we split the test lexicon into 6 subsets according to frequency in the training set: [0-4], [5-9], [10-14], [15-19], [20-50], and 50+. Since we set our word frequency threshold to 5 for the training set, all words in the frequency band [0-4] are in fact OOVs for the test set. For each word in the test set, we take its top-20 nearest neighbours from the whole training lexicon (without thresholding) using the cosine metric.
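The retrieval step is standard cosine nearest-neighbour search; a self-contained sketch follows, with toy 2-dimensional embeddings standing in for the learned representations.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, lexicon, k=20):
    """Top-k words from the lexicon, ranked by cosine similarity to the query."""
    return sorted(lexicon, key=lambda w: -cosine(query, lexicon[w]))[:k]

# Toy embeddings: two dog-related Russian words and one unrelated word.
lex = {"пёс": [1.0, 0.1], "гончая": [0.9, 0.4], "стол": [-0.2, 1.0]}
top = nearest([1.0, 0.2], lex, k=2)  # ["пёс", "гончая"]
```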
Semantic Evaluation. We investigate how well the nearest neighbours are interchangeable with a query word in the translation process. We therefore formalise the notion of semantics of the source words based on their translations in the target language.
We use pivoting to define the probability of a candidate word e′ being a synonym of the query word e: p(e′|e) = Σ_f p(f|e) p(e′|f), where f is a target language word, and the translation probabilities inside the summation are estimated using a word-based translation model trained on the entire bilingual corpora (i.e. before splitting into train/dev/test sets). We then take the top-5 most probable words as the gold synonyms for each query word in the test set. We measure the quality of the predicted nearest neighbours using the multi-label accuracy

acc = (1/|S|) Σ_{w∈S} 1[G(w) ∩ N(w) ≠ ∅]

where G(w) and N(w) are the sets of gold-standard synonyms and nearest neighbours for w, respectively; the function 1[C] is one if the condition C is true, and zero otherwise. In other words, it is the fraction of words in S whose nearest neighbours and gold-standard synonyms have non-empty overlap.
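Both the pivoting step and the multi-label accuracy can be sketched in a few lines. The translation tables below are invented toy distributions, not estimates from the paper's corpora.

```python
def pivot_synonyms(e, p_f_given_e, p_e_given_f, top_k=5):
    """p(e'|e) = sum_f p(f|e) p(e'|f); return the top_k most probable e'."""
    scores = {}
    for f, pf in p_f_given_e[e].items():
        for e2, pe in p_e_given_f[f].items():
            scores[e2] = scores.get(e2, 0.0) + pf * pe
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

def multi_label_accuracy(words, gold, neighbours):
    """Fraction of words whose nearest neighbours overlap the gold synonyms."""
    hits = sum(1 for w in words if set(gold[w]) & set(neighbours[w]))
    return hits / len(words)

# Toy source-to-English and English-to-source translation tables.
p_f_given_e = {"собака": {"dog": 0.8, "hound": 0.2}}
p_e_given_f = {"dog": {"пёс": 0.6, "собака": 0.4},
               "hound": {"пёс": 0.9, "гончая": 0.1}}
syns = pivot_synonyms("собака", p_f_given_e, p_e_given_f)  # "пёс" ranks first
```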
Table 4 presents the semantic evaluation results. On words with frequency ≤ 50, the CNN char model performs best across all three languages. Its superiority is particularly interesting for OOV words (i.e. the frequency band [0-4]), where the model has constructed the representations entirely from characters. For high-frequency words (> 50), the BILSTM word model outperforms the other models.
Morphological Evaluation. We now turn to evaluating the morphological component. For this evaluation we focus on Russian, since it has notoriously hard morphology. We run another morphological analyser, mystem (Segalovich, 2003), to generate linguistically tagged morphological analyses for a word, e.g. POS tag, case, person, plurality, etc. We represent each morphological analysis as a bit vector marking the presence of these grammatical features. Each word is then assigned a set of bit vectors corresponding to the set of its morphological analyses. As the morphology similarity between two words, we take the minimum Hamming similarity between the corresponding two sets of bit vectors. Table 5(a) shows the average morphology similarity between the words and their nearest neighbours across the frequency bands. Likewise, we represent the words based on their lemma features; Table 5(b) shows the average lemma similarity. We can see that both character-based models capture morphology far better than the morpheme-based ones, especially for OOV words. It is also clear that the CNN tends to outperform the bi-LSTM when comparing lemmas, whereas the bi-LSTM seems better at capturing affixes.

We now take a closer look at the character-based models. We manually created a set of non-existing Russian words of three types. Words of the first type consist of a known root and affixes, but their combination is atypical, although one might guess the meaning. The second type corresponds to words with a non-existing (nonsense) root but meaningful affixes, so one might guess the part of speech and some other properties, e.g. gender, plurality, case. Finally, the third type comprises words with known root and morphemes, but whose combination is impossible in the language and whose meaning is hard to guess.
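A toy sketch of this similarity: each analysis is a bit vector of grammatical features, a word carries a set of analyses, and, following the description above, the similarity between two words is the minimum pairwise Hamming similarity over the two sets. The feature positions below are illustrative, not mystem's actual tag set.

```python
def hamming_similarity(a, b):
    """Fraction of positions where two equal-length bit vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def word_similarity(analyses_a, analyses_b):
    """Minimum pairwise Hamming similarity over two sets of analyses."""
    return min(hamming_similarity(a, b)
               for a in analyses_a for b in analyses_b)

# Two analyses for an ambiguous word vs. one analysis for its neighbour;
# positions might stand for, e.g., [noun, plural, genitive].
w1 = [(1, 0, 1), (1, 1, 0)]
w2 = [(1, 1, 0)]
sim = word_similarity(w1, w2)  # min(1/3, 1.0) = 1/3
```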
Table 6 shows that the CNN is strongly biased towards matching the longest substring from the beginning of the word, and it yields better recall in retrieving words sharing the same lemma. The bi-LSTM, on the other hand, mainly matches patterns at both ends of the word, regardless of its middle, resulting in higher recall of words sharing the same grammatical features.

Figure 1 :
Figure 1: Illustration of the neural operation sequence model for an example sentence-pair.

Figure 2 :
Figure 2: Model architecture for the several approaches to learning word representations, showing from left: bag-of-morphs, BiLSTM over morphs, and the character convolution. Note that the BiLSTM is also applied at the character level. The input word, täppi-de-ga, is Estonian for speckled, bearing plural (de) and comitative (ga) suffixes.

Table 1 :
Corpus statistics for parallel data between Russian/Romanian/Estonian and English. The OOV rate is the fraction of word types in the source language that are in the test set but are below the frequency cut-off or unseen in training.

Table 2 :
BLEU and METEOR scores for re-ranking the test sets.

Table 4 :
Semantic evaluation of nearest neighbours using multi-label accuracy on words in different frequency bands.

Table 5 :
Morphology analysis for nearest neighbours based on (a) Grammar tag features, and (b) Lemma features.