Ensembling Factored Neural Machine Translation Models for Automatic Post-Editing and Quality Estimation

This work presents a novel approach to Automatic Post-Editing (APE) and Word-Level Quality Estimation (QE) using ensembles of specialized Neural Machine Translation (NMT) systems. Word-level features that have proven effective for QE are included as input factors, expanding the representation of the original source and the machine translation hypothesis, which are used to generate an automatically post-edited hypothesis. We train a suite of NMT models that use different input representations, but share the same output space. These models are then ensembled together, and tuned for both the APE and the QE task. We thus attempt to connect the state-of-the-art approaches to APE and QE within a single framework. Our models achieve state-of-the-art results in both tasks, with the only difference being the tuning step, which learns weights for each component of the ensemble.


Introduction
Translation destined for human consumption often must pass through multiple editing stages. In one common scenario, human translators edit machine translation (MT) output, correcting errors and omissions until a perfect translation has been produced. Several studies have shown that this process, referred to as "post-editing", is faster than translation from scratch (Specia, 2011) or interactive machine translation (Green et al., 2013).
A relatively recent line of research has tried to build models which correct errors in MT automatically (Simard et al., 2007; Bojar et al., 2015; Junczys-Dowmunt and Grundkiewicz, 2016). Automatic Post-Editing (APE) typically views the system that produced the original translation as a black box, which cannot be modified or inspected. An APE system has access to the same data that a human translator would see: a source sentence and a translation hypothesis. The job of the system is to output a corrected hypothesis, attempting to fix errors made by the original translation system. This can be viewed as a sequence-to-sequence task (Sutskever et al., 2014), and is also similar to multi-source machine translation (Zoph and Knight, 2016; Firat et al., 2016). However, APE intuitively tries to make the minimum number of edits required to transform the hypothesis into a satisfactory translation, because we would like our system to mimic human translators in attempting to minimize the time spent correcting each MT output. This additional constraint on APE models differentiates the task from multi-source MT.
The word-level QE task is ostensibly a simpler version of APE, where a system must only decide whether or not each word in an MT hypothesis belongs in the post-edited version; it is not necessary to propose a fix for errors. Most recent work has considered word-level QE to be a sequence labeling task, and has employed structured predictors such as CRFs or structured SVMs, which take advantage of sparse representations and very large feature sets, as well as dependencies between labels in the output sequence (Logacheva et al., 2016; Martins et al., 2016). However, Martins et al. (2017) recently proposed a new method of word-level QE using APE, which simply uses an APE system to produce a "pseudo-post-edit" given a source sentence and an MT hypothesis.
Their approach, which we call APE-QE, is the basis of the work presented here. In APE-QE, the original MT hypothesis is aligned with the pseudo-post-edit from the APE system using word-level edit distance, and words which correspond to Insert or Delete operations are labeled as incorrect. Note that this also corresponds exactly to the way QE datasets are currently created, with the only difference being that human post-edits are typically used to create gold-standard data (Bojar et al., 2015).
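The labeling step of APE-QE can be illustrated with a minimal sketch. It assumes whitespace-tokenized inputs and uses Python's `difflib.SequenceMatcher` as a stand-in aligner; this is a simplification of a TER-style alignment (no shift operations), and the function name `ape_qe_labels` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def ape_qe_labels(mt_tokens, pseudo_pe_tokens):
    """Label each MT token OK if the alignment keeps it unchanged, else BAD."""
    labels = ["BAD"] * len(mt_tokens)
    # autojunk=False disables difflib's heuristic that ignores frequent tokens.
    matcher = SequenceMatcher(a=mt_tokens, b=pseudo_pe_tokens, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":  # MT tokens copied verbatim into the pseudo-post-edit
            for i in range(i1, i2):
                labels[i] = "OK"
    return labels


mt = "das ist ein kleine Haus".split()
pe = "das ist ein kleines Haus".split()
print(ape_qe_labels(mt, pe))  # ['OK', 'OK', 'OK', 'BAD', 'OK']
```

Tokens of the pseudo-post-edit that have no counterpart in the MT hypothesis receive no label, since word-level QE only tags the words of the MT output.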
A key similarity between the QE and APE tasks is that both use information from two sequences: (1) the original source input, and (2) an MT hypothesis. Martins et al. (2017) showed that APE systems with no knowledge about the QE task already provide a very strong baseline for QE. Because the essential training data for the APE and QE tasks is identical, consisting of parallel triples of (SRC, MT, PE), it is also natural to consider these tasks as two subtasks that make use of a single underlying model.
In this work, we explicitly design ensembles of NMT models for both word-level QE and APE. This approach builds upon the approach presented in Martins et al. (2017) by incorporating features which have proven effective for word-level QE as "factors" in the input to Neural Machine Translation (NMT) systems. We achieve state-of-the-art results in both Automatic Post-Editing and Word-Level Quality Estimation, matching the performance of much more complex QE systems, and significantly outperforming the current state-of-the-art in APE.
The main contributions of this work are:
• Novel Input Representations for Neural APE models
• New tuned ensembles for APE-QE
• An open-source decoder supporting ensembles of models with different inputs¹

The following sections discuss our approach to creating hybrid models for APE-QE, which should be able to solve both tasks with minimal modification.

Related Work
Two important lines of research have recently made breakthroughs in QE and APE.

¹ Code available at https://github.com/chrishokamp/constrained_decoding

Automatic Post-Editing
APE and QE training datasets consist of (SRC, MT, PE) triples, where the post-edited reference is created by a human translator in the workflow described above. However, publicly available APE datasets are relatively small in comparison to parallel datasets used to train machine translation systems. Junczys-Dowmunt and Grundkiewicz (2016) introduce a method for generating a large synthetic training dataset from a parallel corpus of (SRC, REF) by first translating the reference to the source language, and then translating this "pseudo-source" back into the target language, resulting in a "pseudo-hypothesis" which is likely to be more similar to the reference than a direct translation from source to target. The release of this synthetic training data was a major contribution towards improving APE.
Junczys-Dowmunt and Grundkiewicz (2016) also present a framework for ensembling SRC → PE and MT → PE NMT models together, and tuning for APE performance. Our work extends this idea with several new input representations, which are inspired by the goal of solving both QE and APE with the same model.

Quality Estimation
Martins et al. (2016) introduced a stacked architecture, using a very large feature set within a structured prediction framework to achieve a large jump in the state of the art for Word-Level QE. Some features are actually the outputs of standalone feedforward and recurrent neural network models, which are then stacked into the final system. Although their approach creates a very good final model, the training and feature extraction steps are quite complicated. An additional disadvantage of this approach is that it requires "jackknifing" the training data for the standalone models that provide features to the stacked model, in order to avoid overfitting in the stacked ensemble. This requires training k versions of each model type, where k is the number of jackknife splits.
Our approach is most similar to Martins et al. (2017). The major differences are that we do not use any internal features from the original MT system, and we do not need to "jackknife" in order to create a stacked ensemble. Using only NMT with attention, we are able to surpass the state-of-the-art in APE and match it in QE.

Factored Inputs
Alexandrescu and Kirchhoff (2006) introduced linguistic factors for neural language models. The core idea is to learn embeddings for linguistic features such as part-of-speech (POS) tags and dependency labels, augmenting the word embeddings of the input with additional features. Recent work has shown that NMT performance can also be improved by concatenating embeddings for additional word-level "factors" to source-word input embeddings. The input representation $e_j$ for each source input $x_j$ with factors $F$ thus becomes Eq. 1:

$$e_j = \Big\Vert_{k=1}^{|F|} E_k x_{jk} \tag{1}$$

where $\Vert$ indicates vector concatenation, $E_k$ is the embedding matrix of factor $k$, and $x_{jk}$ is a one-hot vector for the $k$-th input factor.
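The concatenation of per-factor embeddings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy embedding tables, their dimensions, and the function name `embed_factored_token` are hypothetical, and each table stores one embedding per row so that multiplying by a one-hot vector reduces to a lookup.

```python
def embed_factored_token(factor_ids, embedding_tables):
    """Concatenate one embedding per factor for a single input position."""
    if len(factor_ids) != len(embedding_tables):
        raise ValueError("need exactly one id per factor")
    vector = []
    for idx, table in zip(factor_ids, embedding_tables):
        # E_k times a one-hot vector selects a single embedding: a table lookup.
        vector.extend(table[idx])
    return vector


# Toy tables (assumed sizes): 2 words with dimension 3, 2 POS tags with dimension 2.
word_table = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
pos_table = [[1.0, 0.0], [0.0, 1.0]]

e_j = embed_factored_token([1, 0], [word_table, pos_table])
print(e_j)  # [0.4, 0.5, 0.6, 1.0, 0.0]
```

The resulting vector has a dimension equal to the sum of the per-factor embedding sizes, so each added factor widens the encoder input.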

Models
In this section we describe the five model types used for APE-QE, as well as the ensembles of these models which turn out to be the best-performing overall. We design several features to be included as inputs to APE. The operating hypothesis is that features which have proven useful for Quality Estimation should also have a positive impact upon APE performance. Our baseline models are the same models used in Junczys-Dowmunt and Grundkiewicz (2016). The authors provide trained SRC → PE and MT → PE models, which correspond to the last four checkpoints from fine-tuning the models on the 500K training data concatenated with the task-internal APE data upsampled 20 times. These models are referred to as SRC and MT.

Word Alignments
Previous work has shown that alignment information between source and target is a critical component of current state-of-the-art word-level QE systems (Kreutzer et al., 2015; Martins et al., 2016). The sequential inputs for structured prediction, as well as the feedforward and recurrent models in existing work, obtain the source-side features for each target word using the word alignments provided by the WMT task organizers. However, this information is not likely to be available in many real-world use cases for Quality Estimation, and the use of this information also means that the MT system used to produce the hypotheses is not actually a "black box", which is part of the definition of the QE task. Clearly, access to the word-alignment information of an SMT system provides a lot of insight into the underlying model.
Because our models rely upon synthetic training data, and because we wish to view the MT system as a true black box, we instead use the SRC NMT system to obtain these alignments. The attention model for NMT produces a normalized vector of weights at each timestep, where the weights can be viewed as the "alignment probabilities" for each source word (Bahdanau et al., 2014). In order to obtain the input representation shown in Table 3, we use the source word with the highest weight from the attention model as an additional factor in the input to another MT-aligned → PE system. The MT-aligned → PE system thus depends upon the SRC → PE system to produce the additional alignment factor.
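A minimal sketch of deriving the alignment factor from attention weights follows. It assumes an attention matrix with one row per MT token and one column per source token is already available (for instance from scoring the MT hypothesis with the SRC model; the text does not specify the exact procedure). The function name, the toy weights, and the `|` factor separator are illustrative assumptions.

```python
def alignment_factors(src_tokens, attention):
    """For each row of attention weights, pick the most-attended source token."""
    factors = []
    for weights in attention:
        if len(weights) != len(src_tokens):
            raise ValueError("each attention row needs one weight per source token")
        best = max(range(len(src_tokens)), key=lambda i: weights[i])
        factors.append(src_tokens[best])
    return factors


src = ["a", "small", "house"]
mt = ["ein", "kleines", "Haus"]
# Toy attention weights (assumed): attention[t][i] is the weight on src[i] at step t.
attention = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]

aligned = alignment_factors(src, attention)
factored_input = " ".join(f"{m}|{s}" for m, s in zip(mt, aligned))
print(factored_input)  # ein|a kleines|small Haus|house
```

Taking the argmax discards the rest of the attention distribution, which keeps the factor vocabulary equal to the source vocabulary.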

Inputting Both Source and Target
Following Crego et al. (2016), we train a model which takes the concatenated source and MT as input. The two sequences are separated by a special BREAK token. We refer to this system as SRC+MT.

Part-of-Speech and Dependency Labels
Sennrich and Haddow (2016) showed that information such as POS tags, NER labels, and syntactic roles can be included in the input to NMT models, generally improving performance. Inspired by this idea, we select some of the top-performing features from Martins et al. (2016), and include them as input factors to the SRC+MT-factor model. The base representation is the concatenated SRC+MT (again with a special BREAK token). For each word in the English source and the German hypothesis, we obtain the part-of-speech tag, the dependency relation, and the part-of-speech of the head word, and include these as input factors. For both English and German, we use spaCy to extract these features for all training, development, and test data. The resulting model is illustrated in Figure 1.

Extending Factors to Subword Encoding
Our NMT models use subword encoding, but the additional factors are computed at the word level. Therefore, the factors must also be segmented to match the BPE segmentation. We use the {BILOU}-prefixes common in sequence-labeling tasks such as NER to extend factor vocabularies and map each word-level factor to the subword segmentation of the source or target text. Table 3 shows the input representations for each of the model types using an example from the WMT 2016 test data.
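The mapping from word-level factors to subword units can be sketched as follows. The sketch assumes each word has already been split into BPE pieces, and the exact tag format (`B-`, `I-`, `L-`, `U-` followed by the factor value) is an assumption for illustration; the `O` tag of the BILOU scheme is not needed here because every subword belongs to some word.

```python
def bilou_factors(word_pieces, word_factors):
    """Spread one factor value per word over that word's subword pieces."""
    if len(word_pieces) != len(word_factors):
        raise ValueError("need exactly one factor value per word")
    out = []
    for pieces, factor in zip(word_pieces, word_factors):
        n = len(pieces)
        for k, piece in enumerate(pieces):
            if n == 1:
                prefix = "U"  # Unit: the word was not split
            elif k == 0:
                prefix = "B"  # Beginning of a split word
            elif k == n - 1:
                prefix = "L"  # Last piece of a split word
            else:
                prefix = "I"  # Inside a split word
            out.append((piece, f"{prefix}-{factor}"))
    return out


# Hypothetical BPE segmentation with "@@" continuation markers.
pieces = [["Aus@@", "wahl@@", "bereich"], ["löschen"]]
pos_tags = ["NOUN", "VERB"]
print(bilou_factors(pieces, pos_tags))
```

Each factor vocabulary grows by at most a factor of four under this scheme, since every original value can appear with the B, I, L, or U prefix.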

Ensembling NMT Models
We average the parameters of the four best checkpoints of each model type, and create an ensemble of the resulting five models, called Avg-All Baseline. We then tune this ensemble for TER (APE) and F1-Mult (QE), using MERT (Och, 2003). The tuned models are called Avg-All APE-Tuned and Avg-All QE-Tuned, respectively. After observing that source-only models have the best single-model QE performance (see section 5), we created a final F1-Mult tuned ensemble, consisting of the four individual SRC models, and the averaged models from each other type (an ensemble of eight models total), called 4-SRC+Avg-All QE-Tune.
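The two operations involved, checkpoint averaging and weighted log-linear combination of model predictions, can be sketched as below. This is not the released decoder: parameters are represented as plain lists of floats, each model is reduced to a next-token distribution over a shared output vocabulary, and the weights are set by hand rather than tuned with MERT. All names and numbers are hypothetical.

```python
import math


def average_checkpoints(checkpoints):
    """Element-wise mean of parameters with the same name across checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def ensemble_next_token(distributions, weights):
    """Score each candidate as sum_m weight_m * log p_m(token); return the best."""
    vocab = distributions[0].keys()
    scores = {
        tok: sum(w * math.log(p[tok]) for w, p in zip(weights, distributions))
        for tok in vocab
    }
    return max(scores, key=scores.get), scores


averaged = average_checkpoints([{"W": [1.0, 2.0]}, {"W": [3.0, 4.0]}])
print(averaged)  # {'W': [2.0, 3.0]}

# Two models with different inputs but the same output space (toy probabilities).
p_src = {"Haus": 0.6, "Hause": 0.4}
p_mt = {"Haus": 0.3, "Hause": 0.7}
print(ensemble_next_token([p_src, p_mt], [1.0, 1.0])[0])  # Hause
print(ensemble_next_token([p_src, p_mt], [2.0, 0.5])[0])  # Haus
```

The second call shows why tuning matters: changing only the component weights flips the ensemble's decision, which is the single degree of freedom that separates the APE-tuned and QE-tuned ensembles.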

Experiments
All of our models are trained using Nematus (Sennrich et al., 2017). At inference time we use our own decoder, which supports weighted log-linear ensembles of Nematus models. Following Junczys-Dowmunt and Grundkiewicz (2016), we first train each model type on the large (4M) synthetic training data, then fine-tune using the 500K dataset, concatenated with the task-internal training data upsampled 20x. Finally, for SRC+MT and SRC+MT-factor we continued fine-tuning each model for a small number of iterations using the min-risk training implementation available in Nematus (Shen et al., 2016). Table 4 shows the best dev result after each stage of training. For both APE and QE, we use only the task-specific training data provided for the WMT 2017 APE task, including the extra synthetic training data. However, note that the spaCy models used to extract features for the factored models are trained with external data; we only use the off-the-shelf models provided by the spaCy developers.
To convert the output sequence from an APE system into OK/BAD labels for QE, we use the APE hypothesis as a "pseudo-reference", which is then aligned with the original MT hypothesis using TER (Snover et al., 2006). Table 1 shows the results of our experiments using the WMT 2016 development and test sets. For each system, we measure performance on BLEU and TER, which are the metrics used in the APE task, and also on F1-Mult, which is the primary metric used for the word-level QE task. Overall tagging accuracy is included as a secondary metric for QE.
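For reference, F1-Mult is the product of the F1 scores of the OK and BAD classes. A minimal sketch of the two QE metrics is given below, assuming gold and predicted labels are flat lists of equal length; the function names and the example labels are hypothetical.

```python
def f1_for_label(gold, pred, label):
    """F1 score for one class, computed from token-level counts."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_mult(gold, pred):
    """Product of F1-OK and F1-BAD."""
    return f1_for_label(gold, pred, "OK") * f1_for_label(gold, pred, "BAD")


def accuracy(gold, pred):
    """Fraction of tokens whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


gold = ["OK", "OK", "BAD", "BAD"]
pred = ["OK", "BAD", "BAD", "BAD"]
print(round(f1_mult(gold, pred), 4))  # 0.5333
print(accuracy(gold, pred))           # 0.75
```

Because the score is a product, a system that labels every word OK gets an F1-Mult of zero regardless of its accuracy, which is why accuracy is only a secondary metric.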

Results
All systems with input factors significantly improve APE performance over the baselines. For QE, the trends are less clear, but point to a key difference between optimizing for TER vs. F1-Mult: F1-Mult optimization probably lowers the threshold for "changing" a word, as opposed to copying it from the MT hypothesis. This hypothesis is supported by the observation that the source-only APE system outperforms all other single models on the QE metrics. Because the source-only systems cannot resort to copying words from the input, they are forced to make the best guess about the final output, and words which are more likely to be wrong are less likely to be present in the output. If input factors were used with a source-only APE system, the performance on word-level QE could likely be further improved. However, this hypothesis needs more analysis and experimentation to confirm.

Table 4: Best BLEU score on dev set after each of the training stages. General is training with 4M instances, Fine-tune is training with 500K + upsampled in-domain data, Min-Risk uses the same dataset as Fine-tune, but uses a minimum-risk loss with BLEU score as the target metric.

Conclusion
This work has presented APE-QE, unifying models for APE and word-level QE by leveraging the flexibility of NMT to take advantage of informative features from QE. Models with different input representations are ensembled together and tuned for either APE or QE, achieving state-of-the-art performance in both tasks. The complementary nature of these tasks points to future avenues of exploration, such as joint training using both QE labels and reference translations, as well as the incorporation of other features as input factors.