CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks

Neural sequence to sequence learning recently became a very promising paradigm in machine translation, achieving competitive results with statistical phrase-based systems. In this system description paper, we attempt to utilize several recently published methods used for neural sequential learning in order to build systems for WMT 2016 shared tasks of Automatic Post-Editing and Multimodal Machine Translation.


Introduction
Neural sequence to sequence models are currently used for a variety of tasks in Natural Language Processing, including machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), text summarization (Rush et al., 2015), natural language generation (Wen et al., 2015), and others. This was enabled by the capability of recurrent neural networks to model temporal structure in data, including long-distance dependencies in the case of gated networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014).
The ability of deep learning models to learn a dense representation of the input in the form of a real-valued vector has recently allowed researchers to combine machine vision and natural language processing in tasks believed to be extremely difficult only a few years ago. The distributed representations of words, sentences and images can be understood as a kind of common data type for language and images within the models. This is then used in tasks like automatic image captioning (Vinyals et al., 2015; Xu et al., 2015), visual question answering (Antol et al., 2015) or in attempts to ground lexical semantics in vision (Kiela and Clark, 2015).
In this system description paper, we summarize the Recurrent Neural Network (RNN)-based system we submitted to the automatic post-editing task and to the multimodal translation task. Section 2 describes the architecture of the networks we have used. Section 3 summarizes related work on the task of automatic post-editing of machine translation output and describes our submission to the Workshop of Machine Translation (WMT) competition. In a similar fashion, Section 4 refers to the task of multimodal translation. Conclusions and ideas for further work are given in Section 5.

Model Description
We use the neural translation model with attention (Bahdanau et al., 2014) and extend it to include multiple encoders; see Figure 1 for an illustration. Each input sentence enters the system simultaneously in several representations $X_i$. An encoder used for the $i$-th representation $X_i = (x_i^1, \ldots, x_i^k)$ of $k$ words, each stored as a one-hot vector $x_i^j$, is a bidirectional RNN implementing a function
$$f(X_i) = H_i = (h_i^1, \ldots, h_i^k)$$
where the states $h_i^j$ are concatenations of the outputs of the forward and backward networks after processing the $j$-th token in the respective order.
The initial state of the decoder is computed as a weighted combination of the encoders' final states.
The decoder is an RNN which receives an embedding of the previously produced word as an input in every time step, together with the hidden state from the previous time step. The RNN's output is then used to compute the attention and the next word distribution.
The attention is computed over each encoder separately as described by Bahdanau et al. (2014). The attention vector $a_i^m$ of the $i$-th encoder in the $m$-th step of the decoder is
$$a_i^m = \sum_{j=1}^{k} \alpha_i^{j,m} h_i^j$$
where the weights $\alpha_i^m = (\alpha_i^{1,m}, \ldots, \alpha_i^{k,m})$ form a distribution estimated as
$$\alpha_i^m = \mathrm{softmax}\left(v^\top \tanh\left(s^m + W_{H_i} H_i\right)\right)$$
with $s^m$ being the hidden state of the decoder at time $m$. Vector $v$ and matrix $W_{H_i}$ are learned parameters for projecting the encoder states. The probability of the decoder emitting the word $y_m$ in the $m$-th step, denoted as $P(y_m \mid H_1, \ldots, H_n, Y_{0..m-1})$, is proportional to
$$\exp\left(W_o s^m + \sum_{i=1}^{n} W_{a_i} a_i^m\right)$$
where $H_i$ are the hidden states from the $i$-th encoder and $Y_{0..m-1}$ is the already decoded target sentence (represented as a matrix with a one-hot vector for each produced word). Matrices $W_o$ and $W_{a_i}$ are learned parameters; $W_o$ determines the recurrent dependence on the decoder's state and $W_{a_i}$ determine the dependence on the (attention-weighted) encoders' states.
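As an illustration of the per-encoder attention and its use in scoring the next word, the following is a minimal pure-Python sketch. It assumes the additive attention form of Bahdanau et al. (2014) with a shared vector `v` and one projection matrix per encoder; the function names, the toy dimensions and the random parameters are hypothetical and are not taken from our implementation.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(H, s, W_H, v):
    """Attention over one encoder.

    H: list of k encoder states (each of length d_enc)
    s: decoder state (length d_dec)
    W_H: d_dec x d_enc projection of encoder states (assumed shape)
    v: scoring vector (length d_dec), assumed shared by all encoders
    """
    scores = []
    for h in H:
        proj = matvec(W_H, h)
        hidden = [math.tanh(si + pi) for si, pi in zip(s, proj)]
        scores.append(sum(vi * ui for vi, ui in zip(v, hidden)))
    alpha = softmax(scores)
    # Attention vector: encoder states weighted by the attention distribution.
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(len(H[0]))]
    return context, alpha

def word_distribution(encoders, s, v, W_H, W_a, W_o):
    """Next-word distribution from the decoder state and all attention vectors."""
    logits = matvec(W_o, s)
    for H, W_Hi, W_ai in zip(encoders, W_H, W_a):
        context, _ = attend(H, s, W_Hi, v)
        logits = [l + c for l, c in zip(logits, matvec(W_ai, context))]
    return softmax(logits)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Toy setup: two encoders with 5 and 7 input tokens.
rng = random.Random(0)
d_enc, d_dec, vocab = 6, 4, 10
encoders = [rand_matrix(5, d_enc, rng), rand_matrix(7, d_enc, rng)]
W_H = [rand_matrix(d_dec, d_enc, rng) for _ in encoders]
W_a = [rand_matrix(vocab, d_enc, rng) for _ in encoders]
W_o = rand_matrix(vocab, d_dec, rng)
v = [rng.uniform(-1, 1) for _ in range(d_dec)]
s = [rng.uniform(-1, 1) for _ in range(d_dec)]
probs = word_distribution(encoders, s, v, W_H, W_a, W_o)
```

Each encoder contributes its own attention vector, and the contributions are simply summed in the output layer, so adding an encoder only adds one projection matrix pair.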
For image captioning, we do not use the attention model because of its high computational demands and rely on the basic model by Vinyals et al. (2015) instead. We use Gated Recurrent Units (Cho et al., 2014) and apply dropout of 0.5 on the inputs and the outputs of the recurrent layers (Zaremba et al., 2014) and L2 regularization of $10^{-8}$ on all parameters. The decoding is done using a beam search of width 10. Both the decoders and encoders have hidden states of 500 neurons; word embeddings have a dimension of 300. The model is optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $10^{-3}$.
We experimented with recently published improvements of neural sequence to sequence learning: scheduled sampling (Bengio et al., 2015), noisy activation function (Gülçehre et al., 2016), and linguistic coverage model (Tu et al., 2016). None of them were able to improve the systems' performance, so we do not include them in our submissions.
Since the target language for both tasks was German, we also applied language-dependent pre- and post-processing of the text. For training, we split the contracted prepositions and articles (am ↔ an dem, zur ↔ zu der, ...) and separated some pronouns from their case endings (keinem ↔ kein -em, unserer ↔ unser -er, ...). We also tried splitting compound nouns into smaller units, but on the relatively small data sets we worked with, it did not bring any improvement.
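The German pre- and post-processing can be pictured with the following minimal sketch. The lookup tables contain only the four examples given in the text; the function names, the token-level interface and the rule that every "an dem" is merged back into "am" are simplifying assumptions for illustration, not a description of our actual rules.

```python
# Illustrative subset only: the examples mentioned in the text.
CONTRACTIONS = {"am": "an dem", "zur": "zu der"}
PRONOUN_STEMS = ("kein", "unser")
CASE_ENDINGS = ("em", "er")
# Reverse table used when merging the split forms back together.
EXPANSIONS = {tuple(v.split()): k for k, v in CONTRACTIONS.items()}

def preprocess(tokens):
    """Split contracted prepositions and pronoun case endings."""
    out = []
    for tok in tokens:
        low = tok.lower()
        if low in CONTRACTIONS:
            expansion = CONTRACTIONS[low].split()
            if tok[0].isupper():
                expansion[0] = expansion[0].capitalize()
            out.extend(expansion)
            continue
        for stem in PRONOUN_STEMS:
            ending = low[len(stem):]
            if low.startswith(stem) and ending in CASE_ENDINGS:
                # The ending becomes a separate token marked with a hyphen.
                out.extend([tok[:len(stem)], "-" + ending])
                break
        else:
            out.append(tok)
    return out

def postprocess(tokens):
    """Re-attach case endings and re-contract preposition + article."""
    out = []
    for tok in tokens:
        if tok.startswith("-") and len(tok) > 1 and out:
            out[-1] += tok[1:]
        elif out and (out[-1].lower(), tok.lower()) in EXPANSIONS:
            merged = EXPANSIONS[(out[-1].lower(), tok.lower())]
            out[-1] = merged.capitalize() if out[-1][0].isupper() else merged
        else:
            out.append(tok)
    return out

sentence = "Er geht zur Schule mit keinem Freund".split()
split_form = preprocess(sentence)
restored = postprocess(split_form)
```

Note that this naive inverse is lossy: an "an dem" that was not contracted in the original text would also be merged into "am".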

Automatic Post-Editing
The task of automatic post-editing (APE) aims at improving the quality of the output of a machine translation system treated as a black box. The input of an APE system is a pair of sentences: the original input sentence in the source language and the translation generated by the machine translation (MT) system. This scheme allows any MT system to be used without any prior knowledge of the system itself. The goal of this task is to perform automatic corrections on the translated sentences and generate a better translation (using the source sentence as an additional source of information).
For the APE task, the organizers provided tokenized data from the IT domain (Turchi et al., 2016). The training data consist of 12,000 triplets of the source sentence, its automatic translation and a reference sentence. The reference sentences are manually post-edited automatic translations. An additional 1,000 sentences were provided for validation, and another 2,000 sentences for final evaluation. Throughout the paper, we report scores on the validation set; reference sentences for final evaluation were not released for obvious reasons.
The performance of the systems is measured using Translation Error Rate (Snover et al., 2006) computed against the manually post-edited sentences. We thus call the score HTER. This means that the goal of the task is to simulate manual post-editing rather than to reconstruct the original unknown reference sentence.
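To give an intuition for the metric, the sketch below computes a simplified, shift-free variant of TER: the word-level edit distance (insertions, deletions, substitutions) divided by the reference length. The real TER of Snover et al. (2006) additionally allows block shifts, so the numbers produced here will generally differ from those of the official scorer; the function names and the toy sentences are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete a hypothesis word
                            curr[j - 1] + 1,          # insert a reference word
                            prev[j - 1] + (h != r)))  # match or substitute
        prev = curr
    return prev[-1]

def shiftless_ter(hyps, refs):
    """Corpus-level edit rate: total edits over total reference length."""
    edits = sum(word_edit_distance(h, r) for h, r in zip(hyps, refs))
    ref_len = sum(len(r) for r in refs)
    return edits / ref_len if ref_len else 0.0

# Toy example: one substitution and one deletion against a 3-word reference.
hyps = ["a b c d".split()]
refs = ["a x c".split()]
score = shiftless_ter(hyps, refs)
```

When the references are post-edits of the hypotheses themselves, as in this task, the score measures how much editing the output still needs.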

Related Work
In the previous year's competition (Bojar et al., 2015), most of the systems were based on phrase-based statistical machine translation (SMT) in a monolingual setting (Simard et al., 2007).
There were also several rule-based post-editing systems benefiting from the fact that errors introduced by statistical and rule-based systems are of a different type (Rosa, 2014; Mohaghegh et al., 2013).
Although the use of neural sequential models is very straightforward in this case, to the best of our knowledge, there have not been experiments with RNNs for this task.

Experiments & Results
The input sentence is fed to our system in the form of multiple input sequences without explicitly telling which sentence is the source one and which one is the MT output. It is up to the network to discover their best use when producing the (single) target sequence. The initial experiments showed that the network struggles to learn that one of the source sequences is almost correct (even if it shares the vocabulary and word embeddings with the expected target sequence). Instead, the network seemed to learn to paraphrase the input.
To make the network focus more on editing the source sentence than on preserving the meaning of the sentences, we represented the target sentence as a minimum-length sequence of edit operations needed to turn the machine-translated sentence into the reference post-edit. We extended the vocabulary by two special tokens, keep and delete, and then encoded the reference as a sequence of keep, delete and insert operations, with the insert operation represented by the inserted word itself. See Figure 2 for an example.
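The edit-operation encoding can be sketched as follows. Because there is no substitution operation, a minimum-length sequence of keep, delete and insert operations keeps exactly the words of a longest common subsequence of the two sentences. The token spellings `<keep>` and `<delete>`, the function names, the tie-breaking order and the toy sentence pair are assumptions made for this illustration.

```python
KEEP, DELETE = "<keep>", "<delete>"

def encode_edits(mt, pe):
    """Shortest keep/delete/insert sequence turning `mt` into `pe`."""
    n, m = len(mt), len(pe)
    # lcs[i][j] = length of the longest common subsequence of mt[i:] and pe[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if mt[i] == pe[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    ops, i, j = [], 0, 0
    while i < n or j < m:
        if i < n and j < m and mt[i] == pe[j]:
            ops.append(KEEP)
            i += 1
            j += 1
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            ops.append(pe[j])  # an insert is represented by the word itself
            j += 1
        else:
            ops.append(DELETE)
            i += 1
    return ops

def apply_edits(mt, ops):
    """Apply a (possibly model-generated) operation sequence to `mt`."""
    out, i = [], 0
    for op in ops:
        if op == KEEP:
            if i < len(mt):  # ignore keeps that run past the MT sentence
                out.append(mt[i])
            i += 1
        elif op == DELETE:
            i += 1
        else:
            out.append(op)
    return out

# Hypothetical toy pair (not the example from Figure 2).
mt = ["das", "ist", "ein", "kleine", "Test"]
pe = ["das", "ist", "ein", "kleiner", "Test"]
ops = encode_edits(mt, pe)
```

At test time only `apply_edits` is needed; it tolerates operation sequences that do not match the length of the MT output, which a trained model may well produce.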
After applying the generated edit operations on the machine-translated sentences in the test phase, we perform a few rule-based orthographic fixes for punctuation. The performance of the system is given in Table 1. The system was able to slightly improve upon the baseline (keeping the translation as it is) in both the HTER and BLEU score. The system was able to deal very well with the frequent error of keeping a word from the source in the translated sentence. Although neural sequential models usually learn the basic output structure very quickly, in this case the system made many errors in pairing parentheses correctly. We ascribe this to the edit-operation notation, which obfuscated the basic orthographic patterns in the target sentences.

Multimodal Translation
The goal of the multimodal translation task is to generate an image caption in a target language (German) given the image itself and one or more captions in the source language (English).

[Figure 2 example, partially recovered. Source: Choose Uncached Refresh from the Histogram panel menu. MT: Wählen ...]
Recent experiments of Elliott et al. (2015) showed that including the information from the images can help disambiguate the source-language captions.
The participants were provided with the Multi30k dataset (Elliott et al., 2016), which is an extension of the Flickr30k dataset (Plummer et al., 2015). In the original dataset, 31,014 images were taken from the users' collections on the image hosting service Flickr. Each of the images was given five independent crowd-sourced captions in English. For the Multi30k dataset, one of the English captions for each image was translated into German and five other independent German captions were provided. The data are split into a training set of 29,000 images, a validation set of 1,014 images and a test set of 1,000 images.
The two ways in which the image annotations were collected also lead to two sub-tasks. The first one is called Multimodal Translation and its goal is to generate a translation of an image caption into the target language given the caption in the source language and the image itself. The second task is Cross-Lingual Image Captioning. In this setting, the system is provided with five captions in the source language and should generate one caption in the target language given both the source-language captions and the image itself. Both tasks are evaluated using the BLEU score (Papineni et al., 2002) and the METEOR score (Denkowski and Lavie, 2011). The translation task is evaluated against a single reference sentence which is the direct human translation of the source sentence. The cross-lingual captioning task is evaluated against the five reference captions in the target language created independently of the source captions.

Related Work
The state-of-the-art image caption generators use a remarkable property of the Convolutional Neural Network (CNN) models originally designed for ImageNet classification to capture the semantic features of the images. Although the images in ImageNet (Deng et al., 2009; Russakovsky et al., 2015) always contain a single object to classify, the networks manage to learn a representation that is usable in many other cases, including image captioning, which usually concerns multiple objects in the image and also needs to describe complex actions and spatial and temporal relations within the image.
Prior to CNN models, image classification used to be based on finding some visual primitives in the image and transcribing automatically estimated relations between the primitives. Soon after Kiros et al. (2014) showed that the CNN features could be used in a neural language model, Vinyals et al. (2015) developed a model that used an RNN decoder known from neural MT for generating captions from the image features instead of the vector encoding the source sentence. Xu et al. (2015) later improved the model further by adapting the soft alignment model (Bahdanau et al., 2014), nowadays known as the attention model. Since then, these models have become a benchmark for works trying to improve neural sequence to sequence models (Bengio et al., 2015; Gülçehre et al., 2016; Ranzato et al., 2015).

Phrase-Based System
For the translation task, we trained Moses SMT (Koehn et al., 2007) with additional language models based on coarse bitoken classes. We follow the approach of Stewart et al. (2014): each target word is concatenated with its aligned source word into one bitoken (e.g. "Katze-cat"). For unaligned target words, we create a bitoken with NULL as the source word (e.g. "wird-NULL"). Unaligned source words are dropped. For alignments that are not one-to-one, we join all aligned word pairs into one bitoken (e.g. "hat-had+gehabt-had"). These word-level bitokens are afterwards clustered into coarse classes (Brown et al., 1992) and a standard n-gram language model is trained on these classes.
Following the notation of Stewart et al. (2014), "400bi" indicates an LM trained on 400 bitoken classes, "200bi" stands for 200 bitoken classes, etc. Besides bitokens based on aligned words, we also use class-level bitokens. For example, "(200,400)" means that we clustered source words into 200 classes and target words into 400 classes and only then used the alignment to extract bitokens of these coarser words. The last type is "100bi(200,400)", a combination of both the independent clustering in the source and target "(200,400)" and the bitoken clustering "100bi".
Altogether, we tried 26 configurations combining various coarse language models. The best three were "200bi" (a single bitoken LM), "200bi&(1600,200)&100tgt" (three LMs, each with its own weight, where 100tgt means a language model over 100 word classes trained on the target side only) and "200bi&100tgt". Manual inspection of these three best configurations reveals almost no differences; often the outputs are identical. Compared to the baseline (a single word-based LM), the coarse models evidently prefer to ensure agreement and are much more likely to allow for a different word or preposition choice to satisfy the agreement.
Table 2: Results of experiments with the multimodal translation task on the validation data. At the time of the submission, the models were not tuned as well as our final models. The first six systems are targeted at the translation task. They were trained against one reference, a German translation of one English caption. The last four systems are targeted at the cross-lingual captioning task. They were trained with 5 independent German captions (5 times more data).
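The word-level bitoken construction described in this section can be sketched as follows, before any clustering into classes. The sketch assumes that the alignment is given as a set of (source index, target index) pairs and that a bitoken joining several word pairs is emitted once, at the position of its first target word; both the interface and that placement rule are assumptions for illustration, and the sentences are hypothetical toy examples.

```python
def bitokens(src, tgt, alignment):
    """Build the bitoken sequence along the target sentence.

    alignment: iterable of (source_index, target_index) pairs.
    """
    alignment = set(alignment)
    parent = {}

    def find(x):
        # Union-find with path halving over source and target positions.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Words connected through alignment links end up in the same group.
    for i, j in alignment:
        parent[find(("s", i))] = find(("t", j))

    # Collect the word pairs of each group in target order.
    groups = {}
    for i, j in sorted(alignment, key=lambda p: (p[1], p[0])):
        groups.setdefault(find(("t", j)), []).append(f"{tgt[j]}-{src[i]}")

    out, emitted = [], set()
    for j, word in enumerate(tgt):
        if ("t", j) not in parent:
            out.append(f"{word}-NULL")  # unaligned target word
            continue
        root = find(("t", j))
        if root not in emitted:  # emit each joined bitoken only once
            emitted.add(root)
            out.append("+".join(groups[root]))
    return out

# One source word aligned to two target words yields a joined bitoken.
src = ["she", "had", "a", "cat"]
tgt = ["sie", "hat", "eine", "Katze", "gehabt"]
links = {(0, 0), (1, 1), (1, 4), (2, 2), (3, 3)}
result = bitokens(src, tgt, links)
```

Unaligned source words never appear in the output because only alignment links and target positions are traversed. The resulting sequences would then be clustered into classes and used to train an n-gram language model.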

Neural System
For the multimodal translation task, we combine the RNN encoders with image features. The image features are extracted from the 4096-dimensional penultimate layer (fc7) of the VGG-16 ImageNet network (Simonyan and Zisserman, 2014) before applying non-linearity. We keep the weights of the convolutional network fixed during the training. We do not use attention over the image features, so the image information is fed to the network only via the initial state.
We also try a system combination and add an encoder for the phrase-based output. The SMT encoder shares the vocabulary and word embeddings with the decoder. For the combination with SMT output, we experimented with the CopyNet architecture (Gu et al., 2016) and with encoding the sequence in the same way as in the APE task (see Section 3.2). Since neither of these variations seems to have any effect on the performance, we report only the results of the simple encoder combination.
Figure 3: Sample outputs of our multimodal translation (MMMT) system and cross-lingual captioning (CLC) system in comparison with phrase-based MT and the reference. The MMMT system refers to the 'NMT + Moses + image' row and CLC system to the '5 captions + image' row in Table 2.
[Figure 3 example text, partially recovered. Unlabeled: A group of men are loading something onto a truck. CLC: Mehrere Personen stehen an einem LKW. Gloss: More persons stand on a truck. Source: A man sleeping in a green room on a couch. Reference: Ein Mann schläft in einem grünen Raum auf einem Sofa. Annotation: No error, a correctly used synonym for "couch".]
Systems targeted for the multimodal translation task have a single English caption (and optionally its SMT translation and the image representation) as input and produce a single sentence which is a translation of the original caption. Every input appears exactly once in the training data, paired with exactly one target sentence. On the other hand, systems targeted for cross-lingual captioning use all five reference sentences as targets, i.e. every input is present five times in the training data with five different target sentences, which are all independent captions in German. In the case of cross-lingual captioning, we use five parallel encoders sharing all weights, combined with the image features in the initial state.
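The difference between the two training set-ups can be made concrete with a small sketch of how training pairs might be assembled. The record layout and all field names (`image`, `en_captions`, `en_translated`, `de_translation`, `de_captions`) are hypothetical and chosen only for this illustration.

```python
def translation_pairs(examples):
    """Multimodal translation: one pair per image.

    Input is the translated English caption plus the image;
    the target is its German translation.
    """
    return [((ex["en_translated"], ex["image"]), ex["de_translation"])
            for ex in examples]

def captioning_pairs(examples):
    """Cross-lingual captioning: five pairs per image.

    The same input (five English captions plus the image) is paired
    with each of the five independent German captions in turn.
    """
    pairs = []
    for ex in examples:
        source = (tuple(ex["en_captions"]), ex["image"])
        pairs.extend((source, target) for target in ex["de_captions"])
    return pairs

# A single hypothetical record with placeholder strings.
example = {
    "image": "img_0001",
    "en_captions": [f"english caption {i}" for i in range(5)],
    "en_translated": "english caption 0",
    "de_translation": "deutsche Übersetzung",
    "de_captions": [f"deutsche Beschreibung {i}" for i in range(5)],
}
mt_pairs = translation_pairs([example])
clc_pairs = captioning_pairs([example])
```

The captioning set-up therefore sees five times as many training pairs, but the model receives conflicting targets for identical inputs.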
Results of the experiments with different input combinations are summarized in the next section.

Results
The results for both tasks are given in Table 2. Our system has improved significantly since the competition submission; we therefore report both the performance of the current system and of the submitted systems. Examples of the system output can be found in Figure 3.
The best performance was achieved by the neural system that combined all available inputs, both for multimodal translation and for cross-lingual captioning. Although using the image as the only source of information led to poor results, adding the image information helped to improve the performance in both tasks. This supports the hypothesis that, for the translation of an image caption, knowing the image can add a substantial piece of information.
The system for cross-lingual captioning tended to generate very short descriptions, which were usually true statements about the images, but the sentences were often too general or missing important information. We also needed to truncate the vocabulary, which brought out-of-vocabulary tokens into the system output. Unlike the translation task, where the vocabulary size was around 20,000 different forms for both languages, having 5 source and 5 reference sentences more than doubled the vocabulary size.
Similarly to the automatic post-editing task, we were not able to find a setting where the combination with the phrase-based system would improve over the very strong Moses system with the bitoken-class language model. We can therefore hypothesize that the weakest point of the models is the weighted combination of the inputs for the initial state of the decoder. The difficulty of learning relatively large combination weighting matrices, which are used just once during the model execution (unlike the recurrent connections, which have approximately the same number of parameters), probably outweighed the benefits of having more information on the input. In the case of system combination, a more careful exploration of an explicit copy mechanism such as CopyNet (Gu et al., 2016) may be useful.

Conclusion
We applied state-of-the-art neural machine translation models to two WMT shared tasks. We showed that neural sequential models could be successfully applied to the APE task. We also showed that information from the image can significantly help while producing a translation of an image caption. Still, with the limited amount of data provided, the neural system performed comparably to a very well tuned SMT system.
There is still considerable room for improving the performance using model ensembles or recently introduced techniques for neural sequence to sequence learning. Extensive hyper-parameter testing could also be helpful.

Figure 1: Multi-encoder architecture used for the multimodal translation.

Figure 2: An example of the sequence of edit operations that our system should learn to produce when given the candidate MT translation. The colors and subscripts denote the alignment between the edit operations and the machine-translated and post-edited sentence.

Table 1: Results of experiments on the APE task on the validation data. The '+' sign indicates the additional regular-expression rules; this is the system that has been submitted.