USHEF and USAAR-USHEF participation in the WMT15 QE shared task

We present the results of the USHEF and USAAR-USHEF submissions for the WMT15 shared task on document-level quality estimation. The USHEF submissions explored several document and discourse-aware features. The USAAR-USHEF submissions used an exhaustive search approach to select the best features from the official baseline. Results show slight improvements over the baseline with the use of discourse features. More interestingly, we found that a model of comparable performance can be built with only three features selected by the exhaustive search procedure.


Introduction
Evaluating the quality of Machine Translation (MT) system outputs is a challenging topic. Several metrics have been proposed so far that compare MT outputs to human translations (references) in terms of n-gram matches (such as BLEU (Papineni et al., 2002)) or error rates (such as TER (Snover et al., 2006)). However, in some scenarios, human references are not available. One example is the use of machine translation in a workflow where good enough translations are given to humans for post-editing. Another example is machine translation for gisting by users of online systems.
Quality Estimation (QE) approaches aim to predict the quality of MT outputs without relying on human references (Blatz et al., 2004; Specia et al., 2009). Features from the source (original document) and target (MT outputs) and, when available, from the MT system are used to train supervised machine learning models (classifiers or regressors). A number of data points need to be annotated for quality (by humans or automatically) for training, using a given quality metric.
Most QE research is done at sentence level. This task has been a track at the WMT shared task for the last four years (Callison-Burch et al., 2012; Bojar et al., 2013; Bojar et al., 2014). In addition to sentence level, the current edition offers for the first time a track on paragraph-level QE. Exploring quality beyond sentence level is interesting for completely automatic translation applications, i.e. without human review. For instance, consider a user looking for information on a product that has several reviews automatically translated into his/her language. This user has no knowledge of the source language. To ensure that the main message of the review is preserved, for this user the quality of each word or sentence individually is not as important as the quality of the review as a whole. Therefore, predicting the quality of the whole document (or paragraph, considering paragraphs as short documents) becomes necessary. This paper presents the University of Sheffield (USHEF) and University of Saarland (USAAR) submissions to Task 3 of the WMT15 QE shared task: paragraph-level scoring and ranking. We submitted systems for both language pairs: English-German (EN-DE) and German-English (DE-EN).
Little previous research has been done to address document-level QE. Soricut and Echihabi (2010) proposed document-aware features in order to rank machine translated documents. Soricut and Narsale (2012) use sentence-level features and predictions to improve document-level QE. Finally, Scarton and Specia (2014) and Scarton (2015) introduced discourse-aware features, which are combined with baseline features adapted from sentence-level work, in order to predict the quality of full documents. Previous work led to some improvements over the baselines used. However, several problems remain to be addressed for improving document-level QE, such as the choice of quality label, as discussed by Scarton et al. (2015).
Our approach focuses on extracting various features and building models with different combinations of these features. Two feature selection approaches are considered. The first one is based on Random Forests and backward feature selection. The second performs an exhaustive search on the entire feature space. Features are either based on previous work for sentence-level QE (e.g. number of tokens in the target document) or are discourse-aware (e.g. lexical repetition counts).

Document-level features
Along with the official baseline features, we use two different sets of features. The first set contains document-aware features, based on QuEst features for sentence-level QE (Specia et al., 2013). The second set consists of features that encompass discourse information, following previous work of Scarton and Specia (2014) and Scarton (2015).

Document-aware features
The 17 baseline features made available by the organisers are the same baseline features used for sentence-level QE, adapted for document level. However, as part of the QuEst framework, other sentence-level features can be easily adapted for document-level QE. Our complete set of document-aware features includes:
• ratio of number of tokens in source and target (and in target and source)
• absolute difference between number of tokens in source and target, normalised by source length
• language model (LM) perplexity of source/target document (with and without end of sentence marker)
• average number of translations per source word in the document (threshold: prob >0.01/0.05/0.1/0.2/0.5)
• average number of translations per source word in the document (threshold: prob >0.01/0.05/0.1/0.2/0.5) weighted by the frequency/inverse frequency of each word in the source corpus
• average unigram/bigram/trigram frequency in quartile 1/2/3/4 of frequency in the corpus of the source language
• percentage of distinct unigrams/bigrams/trigrams seen in a corpus of the source language (in all quartiles)
• average word frequency: on average, each type (unigram) in a source document appears n times in the corpus (in all quartiles)
• percentage of punctuation marks in source/target document
• percentage of content words in the source/target document
• ratio of percentage of content words in the source and target
• LM log probability of POS of the source/target document
• percentage of nouns in the source/target document
• percentage of verbs in the source/target document
• ratio of percentage of nouns in the source and target documents
• ratio of percentage of verbs in the source and target documents
• ratio of percentage of pronouns in the source and target documents
• number of dependencies with aligned constituents normalised by the total number of dependencies (maximum between source and target)
• number of sentences (source and target should be the same).
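To make the simpler surface features in this list concrete, the following is a minimal sketch of how three of them (token ratios, normalised absolute token difference, and punctuation percentage) might be computed. It is not the QuEst implementation: the function name, the dictionary keys, the pre-tokenised input format, and the ASCII-only definition of punctuation are all assumptions made for illustration.

```python
import string


def document_aware_features(source_tokens, target_tokens):
    """Compute a few surface features for one source/target document pair.

    Both arguments are lists of tokens covering the whole document
    (hypothetical input format; tokenisation is assumed to be done already).
    """
    n_src, n_tgt = len(source_tokens), len(target_tokens)
    if n_src == 0 or n_tgt == 0:
        raise ValueError("both documents must contain at least one token")

    def punctuation_percentage(tokens):
        # Assumption: a token is punctuation if all its characters are
        # ASCII punctuation marks.
        n_punct = sum(
            1 for t in tokens if t and all(c in string.punctuation for c in t)
        )
        return 100.0 * n_punct / len(tokens)

    return {
        "src_tgt_token_ratio": n_src / n_tgt,
        "tgt_src_token_ratio": n_tgt / n_src,
        # Absolute length difference, normalised by source length.
        "abs_token_diff_norm": abs(n_src - n_tgt) / n_src,
        "src_punct_pct": punctuation_percentage(source_tokens),
        "tgt_punct_pct": punctuation_percentage(target_tokens),
    }
```

The remaining features in the list depend on external resources (language models, translation tables, corpus frequency quartiles, POS taggers, parsers) and are not covered by this sketch.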

Discourse-aware features
Discourse is a linguistic phenomenon that spans the whole document and should be considered for document-level evaluation purposes. We considered the discourse-aware features presented in Scarton and Specia (2014), which are already implemented in the QuEst framework (referred to here as discourse repetition features):
• word/lemma/noun repetition in the source/target document
• ratio of word/lemma/noun repetition between source and target documents.
Other discourse features were also explored (following the work of Scarton (2015)):
• number of pronouns in the source/target document
• number of discourse connectives in the source/target document
• number of pronouns of each type according to Pitler and Nenkova (2009) (Charniak, 2000) (we count the PRP tags).
Discourse connectives are automatically extracted by the parser of Pitler and Nenkova (2009). RST trees and EDUs are extracted by the discourse parser and discourse segmenter of Joty et al. (2013).
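As a rough illustration of the word repetition features, the sketch below counts how many token occurrences in a document repeat an earlier token, and takes the ratio between source and target. The exact definition used in QuEst and in Scarton and Specia (2014) is not given here, so this counting scheme, the lowercasing step, and the function names are assumptions, not the authors' implementation.

```python
from collections import Counter


def repetition_count(tokens):
    """Count token occurrences that repeat an earlier token in the document.

    Assumption: tokens are compared case-insensitively, and a word seen
    c times contributes c - 1 repetitions.
    """
    counts = Counter(t.lower() for t in tokens)
    return sum(c - 1 for c in counts.values() if c > 1)


def repetition_ratio(source_tokens, target_tokens):
    """Ratio of source to target repetition; None if the target has none."""
    target_repetitions = repetition_count(target_tokens)
    if target_repetitions == 0:
        return None
    return repetition_count(source_tokens) / target_repetitions
```

The lemma and noun variants would apply the same counting to lemmatised tokens or to tokens tagged as nouns, which requires a lemmatiser or POS tagger not shown here.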

Experiments and results
Our systems use only the data provided by the task organisers. For features that require corpora or resources, only those provided by the organisers were used.
Tasks: We participated in both subtasks of Task 3 (paragraph-level QE), scoring and ranking. The scoring task was evaluated using Mean Absolute Error (MAE) and the ranking task was evaluated using DeltaAvg (the official metrics of the competition).
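For reference, the scoring metric can be sketched in a few lines. This is a plain implementation of the standard MAE definition, not the official evaluation script, and the function name is an assumption; DeltaAvg is not covered here.

```python
def mean_absolute_error(gold, predicted):
    """Average absolute difference between gold and predicted scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)
```

Lower values are better: an MAE of 0 means every predicted quality score matches its label exactly.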
Data: The official data of Task 3 of the WMT15 QE shared task consists of 1215 paragraphs for EN-DE and DE-EN, extracted from the corpora of the WMT13 machine translation shared task (Bojar et al., 2013). For training, 800 paragraphs were used and, for test, 415 paragraphs were considered. METEOR (Banerjee and Lavie, 2005) scores were used as quality labels.
Feature combination: We experimented with different feature sets: • baseline (17 baseline (Pedregosa et al., 2011), to rank the features. Once this feature ranking is produced, we apply a backward feature selection approach. Starting with the lowest-ranked features, the method successively eliminates features, aiming to obtain a feature set that better fits the predictions.
Exhaustive search: We investigate the efficacy of the baseline features by learning one Bayesian Ridge classifier for each feature and evaluating the classifiers based on MAE.
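Returning to the backward feature selection step described above, the following is a minimal sketch of one way to implement it. It assumes the ranking has already been produced (for example from Random Forest feature importances) and that a removal is kept only when it lowers the error; the text does not state the exact acceptance criterion, so that rule is an assumption. The `evaluate` callback and `toy_error` function are hypothetical stand-ins for training and validating a regressor on a feature subset.

```python
def backward_feature_selection(ranked_features, evaluate):
    """Greedy backward elimination guided by a feature ranking.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable mapping a list of feature names to an error such
        as MAE (lower is better). Hypothetical stand-in for model training.
    """
    selected = list(ranked_features)
    best_error = evaluate(selected)
    # Try dropping features one at a time, lowest-ranked first.
    for feature in reversed(ranked_features):
        if len(selected) == 1:
            break  # always keep at least one feature
        candidate = [f for f in selected if f != feature]
        error = evaluate(candidate)
        if error < best_error:  # assumption: keep the removal only if it helps
            selected, best_error = candidate, error
    return selected, best_error


def toy_error(features):
    """Toy error: 'a' and 'b' help, 'c' and 'd' hurt (illustration only)."""
    effect = {"a": -3, "b": -2, "c": 1, "d": 1}
    return 10 + sum(effect[f] for f in features)
```

With the toy error function, `backward_feature_selection(["a", "b", "c", "d"], toy_error)` drops the two harmful features and keeps `["a", "b"]`.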
To find the best set of features among the baseline features, we implemented an exhaustive feature selection search that enumerates all possible feature combinations. Given a set S of n features, there are $2^n - 1$ possible non-empty feature combinations, since a k-combination of S is a subset of k distinct elements of S. For a set of n elements, the number of k-combinations is equal to the binomial coefficient:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

And the sum over all possible k-combinations, for k = 1, ..., n, is:

$$\sum_{k=1}^{n} \binom{n}{k} = 2^n - 1$$

We note that exhaustive search for feature selection is only feasible in a small feature space. However, the single-feature results suggest that the best feature combination can be approximated by using the N best-performing features, ranked by the performance of a classifier trained on each feature alone.
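The enumeration described above can be sketched with the standard library. The function names and the `evaluate` callback (a stand-in for training a model on a subset and returning its MAE) are assumptions for illustration, not the authors' code.

```python
from itertools import combinations


def all_feature_subsets(features):
    """Yield every non-empty subset of `features` (2^n - 1 subsets in total)."""
    features = list(features)
    for k in range(1, len(features) + 1):
        # All k-combinations: subsets of k distinct elements.
        for subset in combinations(features, k):
            yield subset


def exhaustive_search(features, evaluate):
    """Return the subset with the lowest error according to `evaluate`.

    evaluate: callable mapping a tuple of features to an error (lower is
        better). Hypothetical stand-in for model training and validation.
    """
    return min(all_feature_subsets(features), key=evaluate)
```

For the 17 baseline features this means 2^17 - 1 = 131,071 candidate subsets, each requiring one model to be trained, which is why the approach only scales to small feature sets.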
For both languages, the exhaustive search selected only three features. For EN-DE:
• average source token length
• percentage of unigrams in quartile 4 of frequency of source words in a corpus of the source language
• percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language.
For DE-EN:
• type/token ratio
• percentage of unigrams in quartile 1 of frequency of source words in a corpus of the source language
• percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language.
Machine learning algorithms: For the feature combination experiments (with backward feature selection), we used the SVR implementation in the scikit-learn toolkit, with parameters optimised via grid search.
Table 1 shows the results of all experiments, for both language directions (EN-DE and DE-EN) and for the scoring (MAE) and ranking (DeltaAvg) subtasks. For EN-DE, BFF showed the best result for scoring, and Baseline + discourse repetition showed the best result for ranking. For DE-EN, Backward feature selection showed the best results for both scoring and ranking (although BFF showed similar results for scoring).

Results
However, no statistically significant difference was found between the systems. This means that the use of sophisticated discourse-aware features did not lead to improvements, with a simple combination of three features from the baseline set able to produce similar results. The reason for these results is most likely connected to the data. We expect the discourse-aware features to work better with full documents, since they naturally contain discourse phenomena. However, the data of the shared task consists of short paragraphs, many with only one sentence. In this case, discourse-aware features are less effective.
BFF systems investigate the efficacy of the baseline features by learning one Bayesian Ridge classifier for each feature and evaluating the classifiers based on the Mean Absolute Error (MAE).

#   Feature                                                                                MAE (DE-EN)  MAE (EN-DE)
    percentage of unigrams in quartile 1 of frequency in a corpus of the source language   6.61         10.11
10  percentage of unigrams in quartile 4 of frequency in a corpus of the source language   6.72         9.81
11  percentage of bigrams in quartile 1 of frequency in a corpus of the source language    6.62         10.00
12  percentage of bigrams in quartile 4 of frequency in a corpus of the source language    6.64         10.05
13  percentage of trigrams in quartile 1 of frequency in a corpus of the source language   6.59         10.01
14  percentage of trigrams in quartile 4 of frequency in a corpus of the source language   6.62         9.97
15  percentage of unigrams in the source document seen in a corpus (SMT training corpus)   6.76         9.75
16  number of punctuation marks in source document                                         6.71         10.10
17  number of punctuation marks in target document                                         6.72         10.00

Table 2: MAE of classifiers trained with one baseline feature; the top three features are shown in bold.

Table 2 shows the MAE of these classifiers. We note that the exhaustive feature selection search is only feasible in small feature spaces. However, the results above suggest that the best feature combination can be approximated by using the N best-performing features, ranked by the performance of a classifier trained on each feature alone. Unsurprisingly, the best feature set for DE-EN corresponds to the top three features that are most effective individually (when classifiers were built for these features individually). In the reverse direction (EN-DE), the best feature combination corresponds to the top 6 features that are most effective individually. The classifier trained on the top 3 features (8, 10, 15) for EN-DE yielded an MAE of 9.72.
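The N-best approximation mentioned above amounts to sorting features by their single-feature MAE and keeping the first N. The sketch below illustrates this on the rows of Table 2 that carry a feature number (10 to 17), under the assumption that the second numeric column holds the EN-DE scores; the function and variable names are hypothetical. Because features 1 to 8 are not among the rows shown, the example only ranks the visible rows.

```python
def n_best_features(single_feature_mae, n):
    """Return the n feature ids with the lowest single-feature MAE."""
    return sorted(single_feature_mae, key=single_feature_mae.get)[:n]


# MAE of single-feature classifiers for features 10-17, copied from Table 2.
# Assumption: these are the EN-DE scores (second numeric column).
en_de_mae = {
    10: 9.81, 11: 10.00, 12: 10.05, 13: 10.01,
    14: 9.97, 15: 9.75, 16: 10.10, 17: 10.00,
}

top_two = n_best_features(en_de_mae, 2)  # features 15 and 10
```

Features 15 and 10 come out on top among the visible rows, which agrees with their presence in the top three features (8, 10, 15) reported for EN-DE.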