OpenKiwi: An Open Source Framework for Quality Estimation

We introduce OpenKiwi, a PyTorch-based open-source framework for translation quality estimation. OpenKiwi supports training and testing of word-level and sentence-level quality estimation systems, implementing the winning systems of the WMT 2015-18 quality estimation campaigns. We benchmark OpenKiwi on two datasets from WMT 2018 (English-German SMT and NMT), yielding state-of-the-art performance on the word-level tasks and near state-of-the-art performance on the sentence-level tasks.


Introduction
Quality estimation (QE) provides the missing link between machine and human translation: its goal is to evaluate a translation system's quality without access to reference translations (Specia et al., 2018b). Among its potential uses are: informing an end user about the reliability of automatically translated content; deciding whether a translation is ready for publishing or requires human post-editing; and highlighting the words that need to be post-edited.
While there has been tremendous progress in QE in recent years (Martins et al., 2016, 2017; Wang et al., 2018), the ability of researchers to reproduce state-of-the-art systems has been hampered by the fact that these are based on complex ensembles or complicated architectures, or require pretraining and fine-tuning of components that are not well documented. Existing open-source frameworks such as QuEST++ (Specia et al., 2015), Marmot (Logacheva et al., 2016), or deepQuest (Ive et al., 2018), while helpful, currently lag behind the recent best systems in the WMT QE shared tasks. To address these shortcomings, this paper presents OpenKiwi: a new open-source framework for QE that implements the best QE systems from the WMT 2015-18 shared tasks, making it easy to combine and modify their key components while experimenting under the same framework.
Training a new OpenKiwi model is as simple as running the following command line:

    kiwi train --config config.yml

where config.yml is a configuration file with training and model options. A stacked combination of these models yields state-of-the-art results for word-level QE on the widely used English-German SMT and NMT datasets. For ease of use by the research community, we keep our codebase clean and modular, with detailed documentation and high test coverage. The main features of OpenKiwi are:
• Implementation of five QE systems: QUETCH (Kreutzer et al., 2015), NuQE (Martins et al., 2016), Predictor-Estimator (Kim et al., 2017; Wang et al., 2018), APE-QE (Martins et al., 2017), and a stacked ensemble with a linear system (Martins et al., 2016, 2017);
• Implementation in Python 3 using PyTorch as the deep learning framework;
• An easy-to-use API: it can be imported as a package in other projects or run from the command line;
• Ability to train new QE models on new data;
• Ability to run pre-trained QE models on data from the WMT 2018 campaign.
This project is hosted at https://github.com/Unbabel/OpenKiwi. We welcome and encourage contributions from the research community.
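For illustration, a configuration file might look like the sketch below; the key names here are hypothetical stand-ins, not OpenKiwi's actual schema (consult the repository documentation for the real options):

```yaml
# Hypothetical config.yml sketch -- illustrative keys only.
model: nuqe                      # which of the five systems to train
train-source: data/train.src     # source sentences
train-target: data/train.mt      # machine-translated sentences
train-target-tags: data/train.tags  # word-level OK/BAD labels
epochs: 10
```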

Quality Estimation
The goal of word-level QE (Figure 1) is to assign quality labels (OK or BAD) to each machine-translated word, as well as to gaps between words (to account for context that needs to be inserted), and to source words (to denote words in the original sentence that have been mistranslated or omitted in the target). In recent years, the most accurate systems developed for this task combine linear and neural models (Kreutzer et al., 2015; Martins et al., 2016), use automatic post-editing as an intermediate step (Martins et al., 2017), or develop specialized neural architectures (Kim et al., 2017; Wang et al., 2018).
Sentence-level QE, on the other hand, aims to predict the quality of the whole translated sentence, for example based on the time it takes for a human to post-edit it, or on how many edit operations are required to fix it (Specia et al., 2018b). The most successful approaches to sentence-level QE to date are based on conversions from word-level predictions (Martins et al., 2017) or joint training with multi-task learning (Kim et al., 2017; Wang et al., 2018).

Datasets
To benchmark OpenKiwi, we use the following datasets from the WMT 2018 quality estimation shared task, all English-German (En-De):
• Two quality estimation datasets of sentence triplets, each consisting of a source sentence (SRC), its machine translation (MT), and a human post-edition (PE) of the machine translation. The data also contains word-level quality labels and sentence-level scores obtained from the post-editions using TERCOM (Snover et al., 2006).
-A larger dataset of 26,273 training and 1,000 development triplets, where the MT is generated by a phrase-based statistical machine translation (SMT) system.
-A smaller dataset of 13,442 training and 1,000 development triplets, where the MT is generated by a neural machine translation (NMT) system.
• A corpus of 526,368 artificially generated sentence triplets, obtained by first cross-entropy filtering a much larger monolingual corpus for in-domain sentences, then applying round-trip translation and a final stratified sampling step.
• A parallel dataset of 3,396,364 in-domain sentences used for pre-training of the predictor-estimator model.
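The word-level labels and sentence-level scores above are derived by comparing the MT output with its post-edition. As a rough sketch of that process, a simplified stand-in for TERCOM (which additionally handles block shifts), one can align MT to PE with an edit-distance matcher and count edits:

```python
import difflib

def word_labels_and_hter(mt_tokens, pe_tokens):
    """Label each MT token OK/BAD against a post-edition and compute an
    HTER-style score (edits / post-edition length). Simplified: plain
    edit alignment, no shift operations, no gap labels."""
    matcher = difflib.SequenceMatcher(None, mt_tokens, pe_tokens)
    labels, edits = [], 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            labels.extend(["OK"] * (i2 - i1))
        elif op == "replace":          # substituted words
            labels.extend(["BAD"] * (i2 - i1))
            edits += max(i2 - i1, j2 - j1)
        elif op == "delete":           # MT words absent from the post-edition
            labels.extend(["BAD"] * (i2 - i1))
            edits += i2 - i1
        elif op == "insert":           # words the post-editor had to add
            edits += j2 - j1
    hter = edits / max(len(pe_tokens), 1)
    return labels, hter

labels, hter = word_labels_and_hter(
    "the cat sat on mat".split(), "the dog sat on the mat".split())
print(labels, round(hter, 3))
```

Here one substitution and one insertion give two edits over a six-token post-edition; gap labels, which mark positions where words must be inserted between MT tokens, are omitted for brevity.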

Implemented Systems
OpenKiwi implements five popular systems proposed in recent years, which we now describe briefly.
QUETCH. QUETCH (Kreutzer et al., 2015) is designed as a multilayer perceptron with one hidden layer, non-linear tanh activation functions and a lookup-table layer mapping words to continuous dense vectors. For each position in the MT, a window of fixed size surrounding that position, as well as a windowed representation of aligned words from the source language are concatenated as model input. The output layer scores OK/BAD probabilities for each word with a softmax activation. The model is trained independently to predict source tags, gap tags, and target tags. QUETCH is a very simple model and does not rely on any kind of external auxiliary data for training, only the shared task data sets.
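A minimal PyTorch sketch of a QUETCH-style model follows; the window size, dimensions, and the shared lookup table for both languages are illustrative assumptions, not the exact choices of Kreutzer et al. (2015):

```python
import torch
import torch.nn as nn

class QUETCH(nn.Module):
    """QUETCH-style sketch: windowed target and aligned-source embeddings
    -> one tanh hidden layer -> OK/BAD log-softmax."""
    def __init__(self, vocab_size, emb_dim=64, window=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one window of target words + one window of aligned source words
        self.hidden = nn.Linear(2 * window * emb_dim, hidden)
        self.out = nn.Linear(hidden, 2)  # OK / BAD

    def forward(self, tgt_window, src_window):
        # tgt_window, src_window: (batch, window) token ids
        x = torch.cat([self.embed(tgt_window), self.embed(src_window)], dim=1)
        x = torch.tanh(self.hidden(x.flatten(1)))
        return torch.log_softmax(self.out(x), dim=-1)

model = QUETCH(vocab_size=100)
scores = model(torch.randint(0, 100, (4, 3)), torch.randint(0, 100, (4, 3)))
print(scores.shape)  # one OK/BAD distribution per target position
```

As in the description above, separate instances of this model would be trained for source tags, gap tags, and target tags.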
NuQE. NuQE is a neural quality estimation system proposed by Martins et al. (2016). Its architecture consists of a lookup layer containing embeddings for target words and their source-aligned words. These embeddings are concatenated and fed into two consecutive sets of two feed-forward layers followed by a bi-directional GRU layer. The output contains a softmax layer that produces the final OK/BAD decisions. Gap tags are handled separately from target tags, requiring standalone models to be trained; likewise for source tags. NuQE is a black-box system, meaning we train it with the shared task data only (i.e., no auxiliary parallel or round-trip data).

Table 1: Benchmarking of the different models implemented in OpenKiwi on the WMT 2018 development set, along with an ensembled system that averages the outputs of the other systems (ENSEMBLED) and a stacked architecture that stacks their predictions into a linear feature-based model (STACKED). For each system, we report the five official scores used in WMT 2018: word-level F1-mult for MT, gaps, and source tokens, and sentence-level Pearson's r and Spearman's ρ rank correlations.
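A minimal sketch of a NuQE-style network (embeddings, two feed-forward/BiGRU blocks, and an OK/BAD softmax); all sizes are illustrative, not the tuned values of Martins et al. (2016):

```python
import torch
import torch.nn as nn

class NuQE(nn.Module):
    """NuQE-style sketch: target + source-aligned embeddings -> two blocks
    of (feed-forward, feed-forward, BiGRU) -> OK/BAD log-softmax."""
    def __init__(self, vocab_size, emb_dim=64, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff1 = nn.Sequential(nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.gru1 = nn.GRU(hidden, hidden // 2, bidirectional=True,
                           batch_first=True)
        self.ff2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.gru2 = nn.GRU(hidden, hidden // 2, bidirectional=True,
                           batch_first=True)
        self.out = nn.Linear(hidden, 2)  # OK / BAD

    def forward(self, tgt_ids, src_ids):
        # tgt_ids, src_ids: (batch, seq_len) target tokens and their
        # source-aligned tokens
        x = torch.cat([self.embed(tgt_ids), self.embed(src_ids)], dim=-1)
        x, _ = self.gru1(self.ff1(x))
        x, _ = self.gru2(self.ff2(x))
        return torch.log_softmax(self.out(x), dim=-1)

model = NuQE(vocab_size=100)
logp = model(torch.randint(0, 100, (2, 7)), torch.randint(0, 100, (2, 7)))
print(logp.shape)  # one OK/BAD distribution per target token
```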

APE-QE. Automatic post-editing (APE) has been used by Martins et al. (2017) as an intermediate step for quality estimation: an APE system is trained on the human post-edits, and its outputs are used as pseudo-post-editions to generate word-level quality labels and sentence-level scores in the same way the original labels were created. Since OpenKiwi's focus is not on implementing a sequence-to-sequence model, we used external software, OpenNMT-py (Klein et al., 2017), to train two separate translation models:
• SRC → PE: trained first on the in-domain corpus provided, then fine-tuned on the shared task data.
• MT → PE: trained on the concatenation of the corpus of artificially created sentence triplets and the shared task data oversampled by a factor of 20.
These predictions are then combined in the ensemble and stacked systems as explained below.
Predictor-Estimator. Our implementation follows closely the architecture proposed by Kim et al. (2017), which consists of two modules:
• a predictor, trained to predict each token of the target sentence given the source and the left and right contexts of the target sentence;
• an estimator, which takes features produced by the predictor and uses them to classify each word as OK or BAD.
(In Table 2, deepQuest is the open-source system developed by Ive et al. (2018), and UNQE is the unpublished system from Jiangxi Normal University, described by Specia et al. (2018a).)
Our predictor uses a bidirectional LSTM to encode the source, and two unidirectional LSTMs processing the target in left-to-right (LSTM-L2R) and right-to-left (LSTM-R2L) order. For each target token t_i, the representations of its left and right contexts are concatenated and used as a query to an attention module before a final softmax layer. It is trained on the large parallel corpora provided as additional data by the WMT shared task organizers. The estimator takes as input a sequence of features: for each target token t_i, the final layer before the softmax (before processing t_i), and the concatenation of the i-th hidden states of LSTM-L2R and LSTM-R2L (after processing t_i). In addition, we train this system with a multi-task architecture that allows us to predict sentence-level HTER scores. Overall, this system can predict sentence-level scores and all word-level labels (for MT words, gaps, and source words); the source word labels are produced by training a predictor in the reverse direction.
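The predictor described above can be sketched roughly as follows; the dimensions, the zero-padding of boundary contexts, and the plain dot-product attention are simplifying assumptions rather than the exact architecture of Kim et al. (2017):

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Predictor sketch: a BiLSTM encodes the source; L2R and R2L target
    LSTMs give the left/right context of each target token t_i, whose
    concatenation queries an attention over the source before a softmax
    over the vocabulary. The features fed to the estimator are returned."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.src_enc = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.l2r = nn.LSTM(dim, dim, batch_first=True)
        self.r2l = nn.LSTM(dim, dim, batch_first=True)
        self.query = nn.Linear(2 * dim, 2 * dim)
        self.out = nn.Linear(4 * dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src, _ = self.src_enc(self.embed(src_ids))      # (B, S, 2d)
        fwd, _ = self.l2r(self.embed(tgt_ids))          # left-to-right states
        bwd, _ = self.r2l(self.embed(tgt_ids.flip(1)))  # right-to-left states
        bwd = bwd.flip(1)
        # context of t_i: the state just left of i and just right of i
        left = torch.roll(fwd, 1, dims=1)
        left[:, 0] = 0
        right = torch.roll(bwd, -1, dims=1)
        right[:, -1] = 0
        q = self.query(torch.cat([left, right], dim=-1))   # (B, T, 2d)
        attn = torch.softmax(q @ src.transpose(1, 2), dim=-1)
        ctx = attn @ src                                   # (B, T, 2d)
        feats = torch.cat([q, ctx], dim=-1)                # estimator input
        return torch.log_softmax(self.out(feats), dim=-1), feats

pred = Predictor(vocab_size=50)
logp, feats = pred(torch.randint(0, 50, (2, 6)), torch.randint(0, 50, (2, 5)))
print(logp.shape, feats.shape)
```

An estimator would then run a classifier (e.g. a small recurrent network) over `feats` to emit per-token OK/BAD decisions and, in the multi-task setting, a sentence-level HTER score.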
Stacked Ensemble. Finally, we ensemble the systems above using a stacked architecture with a feature-based linear system, as described by Martins et al. (2017). The features include lexical and part-of-speech tags of words, their contexts, and their aligned words and contexts, as well as syntactic features and features from a language model (as provided by the organizers). This system is only used to produce word-level labels for MT words.
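As a toy illustration of the stacking step, a linear (logistic) model can be trained on per-token features such as the BAD probabilities output by the base systems; the feature set and data below are synthetic stand-ins for the real features listed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-token features: BAD probabilities from four base systems
# (e.g. QUETCH, NuQE, APE-QE, Predictor-Estimator), one row per MT token.
X = rng.random((500, 4))
# Toy gold labels: a token is BAD when the systems mostly agree it is bad.
y = (X.mean(axis=1) > 0.6).astype(float)

# Plain logistic regression trained by gradient descent: the linear
# feature-based model that stacks the base systems' predictions.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted BAD probability
    grad = p - y                              # gradient of the logistic loss
    w -= 0.1 * X.T @ grad / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```

In the real system the learned model is a richer linear sequence model over lexical, syntactic, and language-model features, but the stacking principle is the same: the neural systems' predictions enter as features of a linear classifier.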

Benchmark Experiments
We show benchmark numbers on the two English-German WMT 2018 datasets. In Table 1, we compare different configurations of OpenKiwi on the development datasets. Among the single systems, the predictor-estimator performs best, except for predicting the source and gap word-level tags, where APE-QE is superior. Overall, ensembled versions of these systems perform best, with a stacked architecture being very effective for predicting word-level MT labels, confirming the findings of Martins et al. (2017). Finally, in Table 2, we report numbers on the official test set. We compare OpenKiwi against the best systems in WMT 2018 (Specia et al., 2018a) and another existing open-source tool, deepQuest (Ive et al., 2018). Overall, OpenKiwi outperforms deepQuest on all word-level and sentence-level tasks, and attains the best results on all the word-level tasks.

Conclusions
We presented OpenKiwi, a new open-source framework for QE. OpenKiwi is implemented in PyTorch and supports training of word-level and sentence-level QE systems on new data. It outperforms other open-source toolkits at both the word level and the sentence level, and it yields state-of-the-art word-level QE results.