Compositional Representation of Morphologically-Rich Input for Neural Machine Translation

Neural machine translation (NMT) models are typically trained with fixed-size input and output vocabularies, which creates an important bottleneck on their accuracy and generalization capability. As a solution, various studies proposed segmenting words into sub-word units and performing translation at the sub-lexical level. However, statistical word segmentation methods have recently been shown to be prone to morphological errors, which can lead to inaccurate translations. In this paper, we propose to overcome this problem by replacing the source-language embedding layer of NMT with a bi-directional recurrent neural network that generates compositional representations of the input at any desired level of granularity. We test our approach in a low-resource setting with five languages from different morphological typologies, and under different composition assumptions. By training NMT to compose word representations from character n-grams, our approach consistently outperforms (by 1.71 to 2.48 BLEU points) NMT learning embeddings of statistically generated sub-word units.


Introduction
An important problem in neural machine translation (NMT) is translating infrequent or unseen words. The reasons are twofold: the necessity of observing many examples of a word until its input representation (embedding) becomes reliable, and the computational requirement of limiting the input and output vocabularies to a few tens of thousands of words. These requirements eventually lead to coverage issues when dealing with low-resource and/or morphologically-rich languages, due to their high lexical sparseness. To cope with this well-known problem, several approaches have been proposed that redefine the model vocabulary in terms of the interior orthographic units that make up words, ranging from character n-grams (Ling et al., 2015b; Costa-jussà and Fonollosa, 2016; Lee et al., 2017; Luong and Manning, 2016) to statistically-learned sub-word units (Sennrich et al., 2016; Wu et al., 2016; Ataman et al., 2017). While the former provide an ideal open-vocabulary solution, they have mostly failed to achieve competitive results. This might be related to the semantic ambiguity caused by relying solely on input representations based on character n-grams, which are generally learned by disregarding any morphological information. In fact, the second approach is now prominent and has established a pre-processing step for constructing a vocabulary of sub-word units before training the NMT model. However, several studies have shown that segmenting words into sub-word units without preserving morpheme boundaries can lead to loss of semantic and syntactic information and, thus, inaccurate translations (Niehues et al., 2016; Ataman et al., 2017; Pinnis et al., 2017; Huck et al., 2017; Tamchyna et al., 2017).
In this paper, we propose to improve the quality of input (source language) representations of rare words in NMT by augmenting its embedding layer with a bi-directional recurrent neural network (bi-RNN), which can learn compositional input representations at different levels of granularity. Compositional word embeddings have recently been applied in language modeling with successful results (Vania and Lopez, 2017). The apparent advantage of our approach is that, by feeding NMT with simple character n-grams, our bi-RNN can potentially learn the morphology necessary to create word-level representations of the input language directly at training time, thus avoiding the burden of a separate and sub-optimal word segmentation step. We compare our approach against conventional embedding-based representations learned from statistical word segmentation in a public evaluation benchmark, which provides low-resource training conditions by pairing English with five morphologically-rich languages: Arabic, Czech, German, Italian and Turkish, where each language represents a distinct morphological typology and language family. The experimental results show that our compositional input representations lead to significantly and consistently better translation quality in all language directions.

Neural Machine Translation
In this paper, we use the NMT model of . The model essentially estimates the conditional probability of translating a source sequence $x = (x_1, x_2, \ldots, x_m)$ into a target sequence $y = (y_1, y_2, \ldots, y_l)$, using the decomposition

$$p(y \mid x) = \prod_{i=1}^{l} p(y_i \mid y_{i-1}, \ldots, y_1, x).$$

The model is trained by maximizing the log-likelihood of a parallel training set via the stochastic gradient descent (Bottou, 2010) and backpropagation through time (Werbos, 1990) algorithms.
The inputs of the network are one-hot vectors, which are binary vectors with a single bit set to 1 to identify a specific word in the vocabulary. Each one-hot vector is then mapped to an embedding, a distributed representation of the word in a lower-dimensional but denser continuous space. From this input, a representation of the whole input sequence is learned using a bi-RNN, the encoder, which maps $x$ into $m$ dense sentence vectors corresponding to its hidden states. Next, another RNN, the decoder, predicts each target token $y_i$ by sampling from a distribution computed from the previous target token $y_{i-1}$, the previous decoder hidden state, and the context vector. The latter is a linear combination of the encoder hidden states, whose weights are dynamically computed by a feed-forward neural network called the attention model. The probability of generating each target word $y_i$ is normalized via a softmax function.
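As a rough illustration of how such a context vector can be computed, the sketch below scores each encoder hidden state against the previous decoder state with a small feed-forward network, normalizes the scores with a softmax, and returns the weighted sum of encoder states. The additive scoring form, the function names, and the parameter names (`W_enc`, `W_dec`, `v`) are assumptions made for illustration and are not taken from the model description above.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(enc_states, dec_state, W_enc, W_dec, v):
    """Assumed additive scoring: score_k = v . tanh(W_enc h_k + W_dec s)."""
    proj_dec = matvec(W_dec, dec_state)
    scores = []
    for h in enc_states:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_enc, h), proj_dec)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context
```

Because the weights are non-negative and sum to one, the context vector always lies within the range spanned by the encoder hidden states.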
Both the source and target vocabulary sizes play an important role in defining the complexity of the model. In a standard architecture, like ours, the source and target embedding matrices account for the vast majority of the network parameters. The vocabulary size also plays an important role when translating from and to low-resource and morphologically-rich languages, due to the sparseness of the lexical distribution. Therefore, it has become conventional to compose both the source and target vocabularies of sub-word units generated through statistical segmentation methods (Sennrich et al., 2016; Wu et al., 2016; Ataman et al., 2017), and to perform NMT by directly learning embeddings of sub-word units. A popular one of these is the Byte-Pair Encoding (BPE) method (Gage, 1994; Sennrich et al., 2016), which finds the optimal description of a corpus vocabulary by iteratively merging the most frequent character sequences. A more recent approach is the Linguistically-Motivated Vocabulary Reduction (LMVR) method (Ataman et al., 2017), which similarly generates a new vocabulary by segmenting words into sub-lexical units based on their likelihood of being morphemes and their morphological categories. A drawback of these methods is that, as pre-processing steps to NMT, they are not optimized for the translation task. Moreover, they can suffer from morphological errors at different levels, which can lead to loss of semantic or syntactic information.
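The iterative merging behind BPE can be sketched in a few lines. The following is a minimal illustration, not the reference implementation: the function name `learn_bpe`, the end-of-word marker `</w>`, and the tie-breaking rule are assumptions chosen to keep the example deterministic.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn merge operations from a {word: frequency} dictionary."""
    # Each word starts as a tuple of characters plus an end-of-word marker
    # (the marker string is an assumption of this sketch).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab
```

Note that the merges depend only on symbol frequencies, which is why the resulting units need not coincide with morpheme boundaries.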

Learning Compositional Input Representations via bi-RNNs
In this paper, we propose to perform NMT from input representations learned by composing smaller symbols, such as character n-grams (Ling et al., 2015a), that can easily fit in the model vocabulary. This composition is essentially a function that establishes a mapping between combinations of orthographic units and lexical meaning, and that is learned using the bilingual context so that it can produce representations optimized for machine translation. In our model (Figure 1), the one-hot vectors, after being fed into the embedding layer, are processed by an additional composition layer, which computes the final input representations passed to the encoder to generate translations. For learning the composition function, we employ a bi-RNN. Hence, by encoding each interior unit inside the word, we hope to capture important cues about their functional role, i.e. semantic or syntactic contribution to the word. We implement the network using gated recurrent units (GRUs), which have shown comparable performance to long short-term memory units (Hochreiter and Schmidhuber, 1997) while providing much faster computation. As a minimal set of input symbols required to cope with contextual ambiguities, we opt to use intersecting sequences of character trigrams, as recently suggested by Vania and Lopez (2017).

Figure 1: Translation of the Italian sentence tornai a casa (I came home) with a word-level representation composed from character trigrams.
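To make the input symbols concrete, the sketch below splits each token into character trigrams. It reads "intersecting" as overlapping windows that advance one character at a time; that reading, the handling of short words, and the function names are assumptions, and the exact segmentation scheme used in the experiments may differ.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word.

    Words of length n or shorter are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def segment_sentence(sentence, n=3):
    """Segment a whitespace-tokenized sentence into per-word n-gram lists."""
    return [char_ngrams(token, n) for token in sentence.split()]

# Example with the sentence from Figure 1.
print(segment_sentence("tornai a casa"))
```

Each inner list is the symbol sequence that the composition layer would read to build one word-level representation.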
Given a bi-RNN with a forward ($f$) and backward ($b$) layer, the input representation $w$ of a token of $t$ characters is computed from the hidden states $h_t^f$ and $h_0^b$, i.e. the final outputs of the forward and backward RNNs, as follows:

$$w = W_f h_t^f + W_b h_0^b + b$$

where $W_f$ and $W_b$ are weight matrices associated with each RNN and $b$ is a bias vector (Ling et al., 2015a). These parameters are learned jointly with the internal parameters of the GRUs and the input token embedding matrix while training the NMT model. For an input of $m$ tokens, our implementation increases the computational complexity of the network by $O(K t_{max} m)$, where $K$ is the bi-RNN cost and $t_{max}$ is the maximum number of symbols per word. However, since the computation of each input representation is independent, a parallelised implementation could cut the overhead down to $O(K t_{max})$.
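The composition step can be illustrated with a small, dependency-free sketch. It runs one GRU over the unit embeddings from left to right and another from right to left, then combines the two final states with the linear map described above. The class and function names, the random initialization, the omission of bias terms inside the GRU, and the particular gate convention are all assumptions of this sketch rather than details of the actual implementation.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(*vs):
    return [sum(xs) for xs in zip(*vs)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Minimal GRU without bias terms (a simplification for illustration)."""

    def __init__(self, in_dim, hid_dim, rng):
        # Parameters for the update gate z, reset gate r, and candidate n.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrn"}
        self.hid_dim = hid_dim

    def step(self, x, h):
        z = [sigmoid(a) for a in vadd(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a) for a in vadd(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a) for a in vadd(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

    def run(self, xs):
        """Return the final hidden state after reading the sequence xs."""
        h = [0.0] * self.hid_dim
        for x in xs:
            h = self.step(x, h)
        return h

def compose_word(unit_embeddings, fwd, bwd, W_f, W_b, b):
    """Compute w = W_f h_f + W_b h_b + b from the two final states."""
    h_f = fwd.run(unit_embeddings)        # forward state after the last unit
    h_b = bwd.run(unit_embeddings[::-1])  # backward state after reaching unit 0
    return vadd(matvec(W_f, h_f), matvec(W_b, h_b), b)
```

Since `compose_word` depends only on the units of a single token, the representations of all tokens in a sentence could be computed independently, which is the property behind the parallelisation remark above.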

Experiments
We test our approach along with open-vocabulary NMT methods based on statistical word segmentation in an evaluation benchmark simulating a low-resource translation setting pairing English (En) with five languages from different language families and morphological typologies: Arabic (Ar), Czech (Cs), German (De), Italian (It) and Turkish (Tr). The characteristics of each language are given in Table 1, whereas Table 2 presents the statistical properties of the training data. We train our NMT models using the TED Talks corpora (Cettolo et al., 2012) and test them on the official data sets of IWSLT (Mauro et al., 2017).

Table 1 (columns: Language, Morphological Typology, Morphological Complexity)

The simple NMT model constitutes the baseline in our study and performs translation directly at the level of sub-word units, which can be of four different types: characters, character trigrams, BPE sub-word units, and LMVR sub-word units.
The compositional model, on the other hand, performs NMT with input representations composed from sub-lexical vocabulary units. In our study, we evaluate representations composed from character trigrams, BPE, and LMVR units. In order to choose the segmentation method to apply on the English side (the output of the NMT decoder), we compare BPE and LMVR sub-word units by carrying out an evaluation on the official data sets of Morpho Challenge 2010 (Kurimo et al., 2010). The results of this evaluation, given in Table 3, suggest that LMVR provides a segmentation that is more consistent with morpheme boundaries, which motivates us to use sub-word tokens generated by LMVR for the target side. This choice helps us evaluate the morphological knowledge contained in input representations in terms of the translation accuracy in NMT.
The compositional bi-RNN layer is implemented in Theano (Team et al., 2016) and integrated into the Nematus NMT toolkit (Sennrich et al., 2017). In our experiments, we use a compositional bi-RNN with 256 hidden units, an NMT model with a one-layer bi-directional GRU encoder and a one-layer GRU decoder of 512 hidden units, and an embedding dimension of 256 for both models. We use a highly restricted dictionary size of 30,000 for both source and target languages, and train the segmentation models (BPE and LMVR) to generate sub-word vocabularies of the same size. We train the NMT models using the Adagrad (Duchi et al., 2011) optimizer with a mini-batch size of 50, a learning rate of 0.01, and a dropout rate of 0.1 (in all layers and embeddings). In order to prevent over-fitting, we stop training if the perplexity on the validation set does not decrease for 5 epochs, and use the best model to translate the test set. The model outputs are evaluated using the (case-sensitive) BLEU (Papineni et al., 2002) metric and the Multeval (Clark et al., 2011)
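The stopping rule described above can be sketched as a simple patience loop. This is an illustrative outline only: `run_epoch` and `validate` are hypothetical callbacks standing in for one training epoch and a validation-perplexity measurement, and they do not correspond to any Nematus function.

```python
def train_with_early_stopping(run_epoch, validate, max_epochs=100, patience=5):
    """Stop once validation perplexity has not improved for `patience` epochs.

    `run_epoch(epoch)` and `validate(epoch)` are hypothetical callbacks.
    Returns the epoch with the lowest validation perplexity and that value.
    """
    best_ppl, best_epoch, stale = float("inf"), None, 0
    for epoch in range(1, max_epochs + 1):
        run_epoch(epoch)
        ppl = validate(epoch)
        if ppl < best_ppl:
            # New best model; in practice a checkpoint would be saved here.
            best_ppl, best_epoch, stale = ppl, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_ppl
```

The returned epoch identifies the model that would be used to translate the test set.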

Results
The performance of NMT models in translating each language using different vocabulary units and encoder input representations can be seen in Table 4. With the simple model, LMVR-based units achieve the best accuracy in translating all languages, with improvements over BPE of 0.85 to 1.09 BLEU points in languages with high morphological complexity (Arabic, Czech and Turkish) and 0.32 to 0.53 BLEU points in languages with low to medium complexity (Italian and German). This confirms our previous results (Ataman and Federico, 2018). Moreover, simple models using character trigrams as vocabulary units reach much higher translation accuracy than models using characters, indicating their superior performance in handling contextual ambiguity. In the Italian to English translation direction, the performance of simple models using character trigrams and BPE sub-word units as input representations is almost comparable, showing that character trigrams can even be sufficient as standalone vocabulary units in languages with low lexical sparseness. These findings suggest that each type of sub-word unit used in the simple model is specifically convenient for a given morphological typology. Using our compositional model improves the quality of input representations for each type of vocabulary unit; nevertheless, the best performance is obtained by using character trigrams as input symbols and words as input representations. The higher quality of these input representations compared to those obtained from sub-word units generated with LMVR suggests that our compositional model can learn morphology better than LMVR, which was found to provide comparable performance to morphological analyzers in Turkish to English NMT (Ataman et al., 2017). Moreover, sample outputs from both models show that the compositional model is also able to better capture syntactic information of input sentences. Table 5 illustrates two example translations from Italian and Turkish.
In Italian, the simple model fails to understand the common subject of different verbs in the sentence due to the repetition of the same inflective suffix after segmentation. In Turkish, the genitive case "yerlerin fotograflarının" (the photographs of places) and the complex predicate "birleştirilmesiyle meydana geldi" (is composed of) are both incorrectly translated by the simple model. On the other hand, the compositional model is able to capture the correct sentence semantics and syntax in both cases. These findings suggest that maintaining translation at the lexical level aids the attention mechanism and provides more semantically and syntactically consistent translations. The overall improvements obtained with this model over the best performing simple model are 1.99 BLEU points in Arabic, 2.32 BLEU points in Czech, 1.91 BLEU points in German, 2.48 BLEU points in Italian and 1.71 BLEU points in Turkish to English translation directions. As evident from the significant and consistent improvements across all languages, our approach provides a more promising and generic solution to the data sparseness problem in NMT.

Input (Simple Model): e comunque, em@@ ig@@ riamo , circol@@ iamo e mescol@@ iamo così tanto che non esiste più l' isolamento necessario affinché avvenga un' evoluzione .
NMT Output (Simple Model): and anyway , we repair, and we mix so much that there 's no longer the isolation that we need to happen to make an evolution .
Input (Compositional Model): e comunque, emigriamo, circoliamo e mescoliamo così tanto che non esiste più l' isolamento necessario affinché avvenga un' evoluzione.
NMT Output (Compositional Model): and anyway , we migrate , circle and mix so much that there 's no longer the isolation necessary to become evolutionary .
Reference: and by the way , we immigrate and circulate and intermix so much that you can 't any longer have the isolation that is necessary for evolution to take place .

Input (Simple Model): ama aslında bu resim tamamen , farklı yerlerin fotograf@@ larının birleştir@@ il@@ mesiyle meydana geldi .
NMT Output (Simple Model): but in fact , this picture came up with a completely different place of photographs .
Input (Compositional Model): ama aslında bu resim tamamen , farklı yerlerin fotograflarının birleştirilmesiyle meydana geldi .
NMT Output (Compositional Model): but in fact , this picture came from collecting pictures of different places .
Reference: but this image is actually entirely composed of photographs from different locations .

Table 5: Example translations with different approaches in Italian (above) and Turkish (below).

Conclusion
In this paper, we have addressed the problem of translating infrequent and unseen words in NMT and proposed to solve it by replacing the conventional (source language) sub-word embeddings with input representations compositionally learned from character n-grams using a bi-RNN. Our approach showed significant and consistent improvements over a variety of languages with different morphological typologies, making it a competitive solution for NMT of low-resource and morphologically-rich languages. In the future, we plan to develop a more efficient implementation of our approach and to test its scalability on larger data sets. Our implementation and evaluation benchmark are available for public use.