Multi-level Translation Quality Prediction with QuEst++

This paper presents QUEST++, an open source tool for quality estimation which can predict quality for texts at word, sentence and document level. It also provides pipelined processing, whereby predictions made at a lower level (e.g. for words) can be used as input to build models for predictions at a higher level (e.g. sentences). QUEST++ allows the extraction of a variety of features, and provides machine learning algorithms to build and test quality estimation models. Results on recent datasets show that QUEST++ achieves state-of-the-art performance.


Introduction
Quality Estimation (QE) of Machine Translation (MT) has become increasingly popular over the last decade. With the goal of providing a prediction on the quality of a machine translated text, QE systems have the potential to make MT more useful in a number of scenarios, for example, improving post-editing efficiency (Specia, 2011), selecting high quality segments (Soricut and Echihabi, 2010), selecting the best translation (Shah and Specia, 2014), and highlighting words or phrases that need revision (Bach et al., 2011).
Most recent work focuses on sentence-level QE. This variant is addressed as a supervised machine learning task using a variety of algorithms to induce models from examples of sentence translations annotated with quality labels (e.g. 1-5 Likert scores). Sentence-level QE has been covered in shared tasks organised by the Workshop on Statistical Machine Translation (WMT) annually since 2012. While standard algorithms can be used to build prediction models, the key to this task is feature engineering. Two open source feature extraction toolkits are available for that: ASIYA and QUEST. The latter has been used as the official baseline for the WMT shared tasks and extended by a number of participants, leading to improved results over the years (Callison-Burch et al., 2012; Bojar et al., 2013; Bojar et al., 2014).
QE at other textual levels has received much less attention. Word-level QE (Blatz et al., 2004; Luong et al., 2014) is seemingly a more challenging task where a quality label is to be produced for each target word. An additional challenge is the acquisition of sizable training sets. Although significant efforts have been made, there is considerable room for improvement. In fact, most WMT13-14 QE shared task submissions were unable to beat a trivial baseline.
Document-level QE consists of predicting a single label for entire documents, be it an absolute score (Scarton and Specia, 2014) or a relative ranking of translations by one or more MT systems (Soricut and Echihabi, 2010). While certain sentences are perfect in isolation, their combination in context may lead to an incoherent document. Conversely, while a sentence can be poor in isolation, when put in context, it may benefit from information in surrounding sentences, leading to a good quality document. Feature engineering is a challenge given the limited availability of tools to extract discourse-wide information. In addition, no datasets with human-created labels are available and thus scores produced by automatic metrics have to be used as an approximation (Scarton et al., 2015).
Some applications require fine-grained, word-level information on quality. For example, one may want to highlight words that need fixing. Document-level QE is needed particularly for gisting purposes where post-editing is not an option, for example, to predict whether translations of product reviews are understandable by readers.
We believe that the limited progress in word and document-level QE research is partially due to the lack of a basic framework that one can build upon and extend. QUEST++ is a significantly refactored and expanded version of an existing open source sentence-level toolkit, QUEST. Feature extraction modules for both word and document-level QE were added and the three levels of prediction were unified into a single pipeline, allowing for interactions between word, sentence and document-level QE. For example, word-level predictions can be used as features for sentence-level QE. Finally, sequence-labelling learning algorithms for word-level QE were added. QUEST++ can be easily extended with new features at any textual level.
The architecture of the system is described in Section 2. Its main component, the feature extractor, is presented in Section 3. Section 4 presents experiments using the framework with various datasets.
Architecture
QUEST++ has two main modules: a feature extraction module and a machine learning module. The first module is implemented in Java and provides a number of feature extractors, as well as abstract classes for features, resources and preprocessing steps so that extractors for new features can be easily added. The basic functioning of the feature extraction module requires raw text files with the source and translation texts, and a few resources (where available) such as the MT source training corpus and source and target language models (LMs). Configuration files are used to indicate paths for resources and the features that should be extracted. If one of the main resources (e.g. an LM) is missing, QUEST++ can generate it automatically.
Figure 1 depicts the architecture of QUEST++. Document and Paragraph classes are used for document-level feature extraction. A Document is a group of Paragraphs, each of which is in turn a group of Sentences. Sentence is used for both word- and sentence-level feature extraction. A Feature Processing Module was created for each level. Each processing level is independent and can deal with the peculiarities of its type of feature.
Machine learning QUEST++ provides scripts to interface with the Python toolkit scikit-learn (Pedregosa et al.). This module is independent from the feature extraction code and uses the extracted feature sets to build and test QE models. The module can be configured to run different regression and classification algorithms, feature selection methods and grid search for hyper-parameter optimisation. Algorithms from scikit-learn can be easily integrated by modifying existing scripts.
For word-level prediction, QUEST++ provides an interface for CRFSuite (Okazaki, 2007), a sequence labelling C++ library for Conditional Random Fields (CRF). One can configure CRFSuite training settings, produce models and test them.

Features
Features in QUEST++ can be extracted from either source or target (or both) sides of the corpus at a given textual level. In order to describe the features supported, we denote:
• S and T the source and target documents,
• s and t the source and target sentences, and
• $s_j$ and $t_i$ the source and target words in positions $j$ and $i$ of their sentences.
We concentrate on MT system-independent (black-box) features, which are extracted based on the output of the MT system rather than any of its internal representations. These allow for more flexible experiments and comparisons across MT systems. System-dependent features can be extracted as long as they are represented using a predefined XML scheme. Most of the existing features are either language-independent or depend on linguistic resources such as POS taggers. The latter can be extracted for any language, as long as the resource is available. For a pipelined approach, predictions at a given level can become features for a higher-level model, e.g. features based on word-level predictions for sentence-level QE.
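As an illustration of the pipelined approach, the following minimal sketch turns word-level predictions for one sentence into sentence-level features. The function name, the feature names and the choice of features are hypothetical and are not taken from QUEST++.

```python
from typing import Dict, List


def word_predictions_to_sentence_features(labels: List[str],
                                          bad_label: str = "Bad") -> Dict[str, float]:
    """Summarise the word-level labels of one sentence as sentence-level features.

    The two features returned here are illustrative assumptions only.
    """
    total = len(labels)
    bad = sum(1 for label in labels if label == bad_label)
    return {
        "num_bad_words": float(bad),
        # Guard against empty sentences to avoid a division by zero.
        "proportion_bad_words": bad / total if total else 0.0,
    }


features = word_predictions_to_sentence_features(["Good", "Bad", "Bad", "Good"])
```

Such values could then be appended to the regular sentence-level feature vector before training the higher-level model.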

Word level
We explore a range of features from recent work (Bicici and Way, 2014; Camargo de Souza et al., 2014; Luong et al., 2014; Wisniewski et al., 2014), totalling 40 features of seven types:
Target context These are features that explore the context of the target word. Given a word $t_i$ in position $i$ of a target sentence, we extract: $t_i$, i.e., the word itself, and the bigrams $t_{i-1} t_i$ and $t_i t_{i+1}$.
Figure 1: Architecture of QUEST++
Alignment context These features explore the word alignment between source and target sentences. They require the 1-to-N alignment between source and target sentences to be provided. Given a word $t_i$ in position $i$ of a target sentence and a word $s_j$ aligned to it in position $j$ of a source sentence, the features are: the aligned word $s_j$ itself, target-source bigrams $s_{j-1} t_i$ and $t_i s_{j+1}$, and source-target bigrams $t_{i-2} s_j$, $t_{i-1} s_j$, $s_j t_{i+1}$ and $s_j t_{i+2}$.
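The target and alignment context features can be sketched as follows. This is a simplified illustration rather than the QUEST++ implementation: the feature names, the `<pad>` placeholder for out-of-sentence positions, and the use of a single aligned source position per target word are all assumptions.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical placeholder for positions outside the sentence


def token_at(tokens: List[str], index: int) -> str:
    """Return the token at `index`, or PAD when the index is out of range."""
    return tokens[index] if 0 <= index < len(tokens) else PAD


def context_features(target: List[str], source: List[str],
                     alignment: Dict[int, int], i: int) -> Dict[str, str]:
    """Target and alignment context features for the target word at position i.

    `alignment` maps a target position i to one aligned source position j
    (an assumption; the text mentions 1-to-N alignments).
    """
    def t(k: int) -> str:
        return token_at(target, k)

    feats = {
        "word": t(i),
        "tgt_bigram_left": f"{t(i - 1)} {t(i)}",
        "tgt_bigram_right": f"{t(i)} {t(i + 1)}",
    }
    if i in alignment:
        j = alignment[i]

        def s(k: int) -> str:
            return token_at(source, k)

        feats.update({
            "aligned_word": s(j),
            "src_prev_tgt": f"{s(j - 1)} {t(i)}",
            "tgt_src_next": f"{t(i)} {s(j + 1)}",
            "tgt_m2_src": f"{t(i - 2)} {s(j)}",
            "tgt_m1_src": f"{t(i - 1)} {s(j)}",
            "src_tgt_p1": f"{s(j)} {t(i + 1)}",
            "src_tgt_p2": f"{s(j)} {t(i + 2)}",
        })
    return feats


example = context_features(
    target=["el", "gato", "negro"],
    source=["the", "black", "cat"],
    alignment={0: 0, 1: 2, 2: 1},
    i=1,
)
```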
Lexical These features explore POS information on the source and target words. Given the POS tag $P_{t_i}$ of word $t_i$ in position $i$ of a target sentence and the POS tag $P_{s_j}$ of word $s_j$ aligned to it in position $j$ of a source sentence, we extract: the POS tags $P_{t_i}$ and $P_{s_j}$ themselves, the bigrams $P_{t_{i-1}} P_{t_i}$ and $P_{t_i} P_{t_{i+1}}$, and the trigrams $P_{t_{i-2}} P_{t_{i-1}} P_{t_i}$, $P_{t_{i-1}} P_{t_i} P_{t_{i+1}}$ and $P_{t_i} P_{t_{i+1}} P_{t_{i+2}}$. Four binary features are also extracted, each with value 1 if $t_i$ is, respectively, a stop word, a punctuation symbol, a proper noun or a numeral.
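A minimal sketch of the target-side lexical features is given below. It assumes Penn Treebank style tags (`NNP`/`NNPS` for proper nouns, `CD` for numerals) and a caller-supplied stop word list; both are assumptions made for illustration, as are the feature names.

```python
import string
from typing import Dict, List, Set, Union


def lexical_features(words: List[str], tags: List[str], i: int,
                     stop_words: Set[str]) -> Dict[str, Union[str, int]]:
    """POS n-gram and binary lexical features for the target word at position i."""
    def tag(k: int) -> str:
        # Positions outside the sentence get a hypothetical padding tag.
        return tags[k] if 0 <= k < len(tags) else "<pad>"

    word = words[i]
    return {
        "pos": tag(i),
        "pos_bigram_left": f"{tag(i - 1)} {tag(i)}",
        "pos_bigram_right": f"{tag(i)} {tag(i + 1)}",
        "pos_trigram_left": f"{tag(i - 2)} {tag(i - 1)} {tag(i)}",
        "pos_trigram_centre": f"{tag(i - 1)} {tag(i)} {tag(i + 1)}",
        "pos_trigram_right": f"{tag(i)} {tag(i + 1)} {tag(i + 2)}",
        "is_stop_word": int(word.lower() in stop_words),
        "is_punctuation": int(bool(word) and all(ch in string.punctuation for ch in word)),
        "is_proper_noun": int(tag(i) in {"NNP", "NNPS"}),  # assumed tagset
        "is_numeral": int(tag(i) == "CD"),                 # assumed tagset
    }


lex = lexical_features(
    words=["Mary", "bought", "3", "books", "."],
    tags=["NNP", "VBD", "CD", "NNS", "."],
    i=2,
    stop_words={"the", "a", "of"},
)
```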
LM These features are related to the n-gram frequencies of a word's context with respect to an LM (Raybaud et al., 2011). Six features are extracted: lexical and syntactic backoff behavior, as well as lexical and syntactic longest preceding n-gram for both a target word and an aligned source word. Given a word $t_i$ in position $i$ of a target sentence, the lexical backoff behavior is calculated by checking for the existence of n-grams from the context of $t_i$ in the LM. The syntactic backoff behavior is calculated in an analogous fashion: it checks for the existence of n-grams of POS tags in a POS-tagged LM. The POS tags of the target sentence are produced by the Stanford Parser (integrated in QUEST++).
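The sketch below illustrates how such LM-based features could be computed over a set of known n-grams. The seven-point scale in `backoff_behavior` and the definition of `longest_preceding_ngram` (the order of the longest n-gram ending at the word that is found in the LM) are assumptions made for illustration; the exact scoring used by QUEST++ is not given in this text. The same functions apply to the syntactic variants when POS tags are passed instead of words.

```python
from typing import List, Optional, Set, Tuple

NGrams = Set[Tuple[str, ...]]


def backoff_behavior(tokens: List[str], i: int, lm_ngrams: NGrams) -> int:
    """Score from 7 (full trigram known) down to 1 (word unknown). Assumed scale."""
    def seen(*words: Optional[str]) -> bool:
        # An n-gram reaching before the sentence start is never counted as seen.
        return None not in words and tuple(words) in lm_ngrams

    w2 = tokens[i - 2] if i >= 2 else None
    w1 = tokens[i - 1] if i >= 1 else None
    w = tokens[i]
    if seen(w2, w1, w):
        return 7
    if seen(w2, w1) and seen(w1, w):
        return 6
    if seen(w1, w):
        return 5
    if seen(w2, w1) and seen(w):
        return 4
    if seen(w1) and seen(w):
        return 3
    if seen(w):
        return 2
    return 1


def longest_preceding_ngram(tokens: List[str], i: int, lm_ngrams: NGrams,
                            max_order: int = 3) -> int:
    """Order of the longest n-gram ending at position i found in the LM (0 if none)."""
    for n in range(min(max_order, i + 1), 0, -1):
        if tuple(tokens[i - n + 1:i + 1]) in lm_ngrams:
            return n
    return 0


lm = {("the",), ("cat",), ("sat",), ("the", "cat"), ("the", "cat", "sat")}
sentence = ["the", "cat", "sat", "down"]
backoff_scores = [backoff_behavior(sentence, i, lm) for i in range(len(sentence))]
longest_orders = [longest_preceding_ngram(sentence, i, lm) for i in range(len(sentence))]
```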
Syntactic QUEST++ provides one syntactic feature that proved very promising in previous work: the Null Link (Xiong et al., 2010). It is a binary feature that receives value 1 if a given word $t_i$ in a target sentence has at least one dependency link with another word $t_j$, and 0 otherwise. The Stanford Parser is used for dependency parsing.
Semantic These features explore the polysemy of target and source words, i.e. the number of senses existing as entries in a WordNet for a given target word $t_i$ or a source word $s_i$. We employ the Universal WordNet, which provides access to WordNets of various languages.
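Following the definition above, the Null Link feature can be sketched over a list of dependency edges. The edge representation (pairs of word indices, with any index outside the sentence standing for an artificial root) is an assumption; in QUEST++ the edges come from the Stanford Parser.

```python
from typing import List, Tuple


def null_link_features(num_words: int,
                       dependencies: List[Tuple[int, int]]) -> List[int]:
    """Return 1 for each word that has a dependency link with another word, else 0.

    `dependencies` holds (head, dependent) index pairs. Links to an index
    outside the sentence (e.g. -1 for an artificial root) are ignored.
    """
    linked = set()
    for head, dependent in dependencies:
        in_range = 0 <= head < num_words and 0 <= dependent < num_words
        if in_range and head != dependent:
            linked.update((head, dependent))
    return [1 if i in linked else 0 for i in range(num_words)]


# Four words; word 1 is the root, word 3 is left without any link.
null_links = null_link_features(4, [(1, 0), (1, 2), (-1, 1)])
```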
Pseudo-reference This binary feature explores the similarity between the target sentence and a translation for the source sentence produced by another MT system. The feature is 1 if the given word $t_i$ in position $i$ of a target sentence is also present in a pseudo-reference translation R. In our experiments, the pseudo-reference is produced by Moses systems trained over parallel corpora.
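A minimal sketch of this feature, assuming exact, case-sensitive token matching (the matching criterion is not specified in the text):

```python
from typing import List


def pseudo_reference_features(target: List[str],
                              pseudo_reference: List[str]) -> List[int]:
    """Return 1 for each target word that also occurs in the pseudo-reference."""
    reference_vocab = set(pseudo_reference)
    return [1 if word in reference_vocab else 0 for word in target]


in_reference = pseudo_reference_features(
    target=["el", "gato", "negro"],
    pseudo_reference=["el", "gato", "oscuro"],
)
```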

Sentence level
Sentence-level QE features have been extensively explored and described in previous work. The number of QUEST++ features varies from 80 to 123 depending on the language pair. The complete list is given as part of the QUEST++ documentation. Some examples are:
• number of tokens in s & t and their ratio,
• LM probability of s & t,
• ratio of punctuation symbols in s & t,
• ratio of percentages of numbers, content/non-content words, nouns/verbs/etc. in s & t,
• proportion of dependency relations between (aligned) constituents in s & t,
• difference in depth of syntactic trees of s & t.
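Two of the simpler feature groups listed above (token counts with their ratio, and punctuation ratios) can be sketched as follows. Feature names, the direction of the ratio (source over target) and the punctuation test are assumptions for illustration.

```python
import string
from typing import Dict, List


def punctuation_ratio(tokens: List[str]) -> float:
    """Share of tokens that consist only of ASCII punctuation characters."""
    if not tokens:
        return 0.0
    punct = sum(1 for tok in tokens
                if tok and all(ch in string.punctuation for ch in tok))
    return punct / len(tokens)


def sentence_features(source: List[str], target: List[str]) -> Dict[str, float]:
    """A few surface features for a tokenised source/target sentence pair."""
    return {
        "source_tokens": float(len(source)),
        "target_tokens": float(len(target)),
        # Guard against empty translations to avoid a division by zero.
        "token_ratio": len(source) / len(target) if target else 0.0,
        "source_punct_ratio": punctuation_ratio(source),
        "target_punct_ratio": punctuation_ratio(target),
    }


sent_feats = sentence_features("Hello , world !".split(),
                               "Hola , mundo cruel !".split())
```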
In our experiments, we use the set of 80 features, as these can be extracted for all language pairs of our datasets.

Document level
Our document-level features follow from those in the work of Wong and Kit (2012) on MT evaluation and Scarton and Specia (2014) on document-level QE. Nine features are extracted, in addition to aggregated values of sentence-level features for the entire document:
• content words/lemmas/nouns repetition in S/T,
• ratio of content words/lemmas/nouns in S/T,

Experiments
In what follows, we evaluate QUEST++'s performance for the three prediction levels and various datasets.

Word-level QE
Datasets We use five word-level QE datasets: the WMT14 English-Spanish, Spanish-English, English-German and German-English datasets, and the WMT15 English-Spanish dataset.
Metrics For the WMT14 data, we evaluate performance in the three official classification tasks:
• Binary: A Good/Bad label, where Bad indicates the need for editing the token.
• Level 1: A Good/Accuracy/Fluency label, specifying the coarser level categories of errors for each token, or Good for tokens with no error.
• Multi-Class: One of 16 labels specifying the error type for the token (mistranslation, missing word, etc.).
The evaluation metric is the average F-1 of all but the Good class. For the WMT15 dataset, we consider only the Binary classification task, since the dataset does not provide other annotations.
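The metric can be sketched as below. This version takes the unweighted mean of the per-class F-1 scores over the gold labels other than Good; whether the official evaluation weights classes by frequency is not stated in this text, so the unweighted mean is an assumption.

```python
from typing import List


def average_f1(gold: List[str], predicted: List[str],
               excluded: str = "Good") -> float:
    """Unweighted mean of per-class F-1 over all gold classes except `excluded`."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = sorted(set(gold) - {excluded})
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


binary_f1 = average_f1(["Good", "Bad", "Bad", "Good"],
                       ["Good", "Bad", "Good", "Good"])
level1_f1 = average_f1(["Good", "Accuracy", "Fluency", "Fluency"],
                       ["Good", "Accuracy", "Accuracy", "Fluency"])
```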
Settings For all datasets, the models were trained with the CRF module in QUEST++. For the WMT14 German-English dataset we use the Passive Aggressive learning algorithm, while for the remaining datasets we use the Adaptive Regularization of Weight Vector (AROW) learning algorithm. Through experimentation, we found this setup to be the most effective. The hyper-parameters for each model were optimised through 10-fold cross validation. The baseline is the majority class in the training data, i.e. a system that always predicts "Unintelligible" for Multi-Class, "Fluency" for Level 1, and "Bad" for the Binary setup.
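The majority-class baseline described above amounts to the following minimal sketch (the function name and interface are hypothetical):

```python
from collections import Counter
from typing import List


def majority_class_baseline(train_labels: List[str],
                            num_test_tokens: int) -> List[str]:
    """Predict the most frequent training label for every test token."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * num_test_tokens


baseline_predictions = majority_class_baseline(["Bad", "Good", "Bad"], 2)
```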

Results
The F-1 scores for the WMT14 datasets are given in Tables 1-4, for QUEST++ and systems that officially participated in the task. The results show that QUEST++ was able to outperform all participating systems in WMT14 except for the English-Spanish baseline in the Binary and Level 1 tasks. The results in Table 5 also highlight the importance of selecting an adequate learning algorithm in CRF models.
Oracle word-level labels, as given in the original dataset, are also used in a separate experiment to study the potential of this pipelined approach.
Settings For learning sentence-level models, the SVR algorithm with RBF kernel and hyperparameters optimised via grid search in QUEST++ is used. Evaluation is done using MAE (Mean Absolute Error) as metric.
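For reference, MAE is the mean of the absolute differences between predicted and gold scores. A minimal sketch:

```python
from typing import List


def mean_absolute_error(gold: List[float], predicted: List[float]) -> float:
    """Mean of |gold - predicted| over all instances."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


mae = mean_absolute_error([0.2, 0.5, 0.9], [0.1, 0.7, 0.9])
```

Lower values are better: in this example the absolute errors are 0.1, 0.2 and 0.0, so the MAE is approximately 0.1.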
Results As shown in Table 6, the use of word-level predictions as features led to no improvement. However, the use of the oracle word-level labels as features substantially improved the results, lowering the baseline error by half. We note that the method used in these experiments is the same as that in Section 4.1, but with fewer instances for training the word-level models.
Features The 17 QUEST++ baseline features are aggregated to produce document-level features (Baseline). These are then combined with document-level features (Section 3.3) and finally with features from sentence-level predictions:
• maximum/minimum predicted HTER or Likert score,
• average predicted HTER or Likert score,
• median, first quartile and third quartile predicted HTER or Likert score.
Oracle sentence labels are not possible as they do not exist for the test set documents.
Settings For training and evaluation, we use the same settings as for sentence-level QE. Table 7 shows the results in terms of MAE. The best result was achieved with the baseline plus HTER features, but no significant improvements over the baseline were observed. Document-level prediction is a very challenging task: automatic metric scores used as labels do not seem to reliably distinguish translations of different source documents, since they were primarily designed to compare alternative translations for the same source document.
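The features derived from sentence-level predictions can be sketched with the Python standard library as follows. The feature names and the quartile method (`inclusive`) are assumptions; the text does not specify how quartiles are computed.

```python
import statistics
from typing import Dict, List


def aggregate_sentence_predictions(scores: List[float]) -> Dict[str, float]:
    """Document-level features from the predicted scores of its sentences.

    Requires at least two sentence scores, since quartiles are undefined
    for a single value in older Python versions.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "max": max(scores),
        "min": min(scores),
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "first_quartile": q1,
        "third_quartile": q3,
    }


doc_feats = aggregate_sentence_predictions([0.1, 0.2, 0.3, 0.4, 0.5])
```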

Remarks
The source code for the framework, the datasets and extra resources can be downloaded from https://github.com/ghpaetzold/questplusplus. The license for the Java code, Python and shell scripts is BSD, a permissive license with no restrictions on the use or extensions of the software for any purposes, including commercial. For pre-existing code and resources, e.g., scikit-learn, their licenses apply.