Eyes Don’t Lie: Predicting Machine Translation Quality Using Eye Movement

Poorly translated text is often disfluent and difficult to read. In contrast, well-formed translations require less time to process. In this paper, we model the differences in reading patterns of Machine Translation (MT) evaluators using novel features extracted from their gaze data, and we learn to predict the quality scores given by those evaluators. We test our predictions in a pairwise ranking scenario, measuring Kendall's tau correlation with the judgments. We show that our features provide information beyond fluency, and can be combined with BLEU for better predictions. Furthermore, our results show that reading patterns can be used to build semi-automatic metrics that anticipate the scores given by the evaluators.


Introduction
Human evaluation has been the preferred method for tracking the progress of MT systems. In the past, the prevalent criterion was to judge the quality of a translation in terms of fluency and adequacy, on an absolute scale (White et al., 1994). However, different evaluators focused on different aspects of the translations, which increased the subjectivity of their judgments. As a result, evaluations suffered from low inter- and intra-annotator agreement (Turian et al., 2003; Snover et al., 2006). This caused a shift towards a ranking-based approach (Callison-Burch et al., 2007). Unfortunately, the disagreement between evaluators is still a challenge that cannot be easily resolved, because the thought process that evaluators follow to make a judgment is not transparent.
The eye-mind hypothesis (Just and Carpenter, 1980; Potter, 1983) states that when completing a task, people cognitively process objects that are in front of their eyes (i.e., where they fixate their gaze), except in cases of covert attention. Based on this assumption, it has been possible to study reading behavior and patterns (Rayner, 1998; Garrod, 2006; Hansen and Ji, 2010).
The overall difficulty of a sentence and its syntactic complexity affect reading behavior (Coco and Keller, 2015). Ill-formed sentences take longer to process, and may cause the reader to jump back while reading. Hence, by looking into how evaluators read the translations and their accompanying references, we can learn about: (i) the complexity of a reference sentence, and (ii) the quality of a translation sentence.
Using reading patterns from evaluators could be a useful tool for MT evaluation: (i) to shed light on the evaluation process, e.g., the general reading behavior that evaluators follow to complete their task; (ii) to understand which parts of a translation are more difficult for the annotator; and (iii) to develop semi-automatic evaluation systems that use reading patterns to predict translation quality.
In this paper, we make a first step towards (iii): using reading patterns as a method for distinguishing between good and bad translations. Our hypothesis is that bad translations are difficult to read, which may be reflected in the reading patterns of the evaluators. Motivated by the notion of reading difficulty, we extracted novel features from the evaluators' gaze data, and used them to model and predict the quality of translations as perceived by evaluators.

Features and Model
A perfectly grammatical sentence can be difficult to read for several reasons: unfamiliar vocabulary, complex syntactic structure, syntactic or semantic ambiguity, etc. (Harley, 2013). Reading automatic translations is even more challenging due to untranslated words, incorrect word order, morphological disagreements, etc. Cognitively processing difficult sentences generally results in modified reading patterns (Garrod, 2006; Coco and Keller, 2015).
In this paper, we analyze the reading patterns of human judges in terms of the word transitions (jumps) and the time spent on each word (dwell time), and use them as features to predict the quality score of a specific translation. For the sake of simplicity, as recommended by Guzmán et al. (2015), we only consider a monolingual evaluation scenario and ignore the source text. However, our features and experimental setup can be extended to include source-side features.

Features
Jump features While reading text, the gaze of a person does not visit every single word, but advances in jumps called saccades. These jumps can go forwards (progressions) or backwards (regressions). The number of regressions correlates with the reading difficulty of a sentence (Garrod, 2006; Schotter et al., 2014; Metzner, 2015). In an evaluation scenario, fluent reading would mean monotonic gaze movement. In contrast, the reader may need to jump back multiple times while reading a poor translation. We classify the word transitions according to the direction of the jump and the distance between the start and end words. For consecutive words n, n + 1, this would mean a forward jump of distance equal to 1. All jumps with distance greater than 4 were sorted into a 5+ bucket. Additionally, we separate the features for reference and translation jumps. We also count the total number of jumps.
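The bucketing of jumps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `jump_features`, the feature labels, and the choice to ignore refixations on the same word are all assumptions. The input is assumed to be the sequence of word indices fixated within a single region (reference or translation).

```python
from collections import Counter


def jump_features(fixated_positions, max_bucket=5):
    """Count word-to-word gaze transitions by direction and distance.

    fixated_positions: 0-based word indices in the order they were fixated,
    all within one region (hypothetical input format).
    """
    counts = Counter()
    for start, end in zip(fixated_positions, fixated_positions[1:]):
        delta = end - start
        if delta == 0:
            continue  # assumption: a refixation on the same word is not a jump
        direction = "fwd" if delta > 0 else "bwd"
        distance = min(abs(delta), max_bucket)
        # Distances greater than 4 share a single "5+" bucket.
        label = f"{max_bucket}+" if distance == max_bucket else str(distance)
        counts[f"{direction}_{label}"] += 1
        counts["total_jumps"] += 1
    return dict(counts)


# Toy gaze path: mostly monotonic, with one short and one long regression.
features = jump_features([0, 1, 2, 1, 2, 3, 9, 4])
print(features)
```

Running the same function separately on the reference and translation fixations would give the two separate feature groups mentioned in the text.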
Total jump distance We additionally aggregate jump distances to count the total distance covered while evaluating a sentence (jump count and distance features have also been shown to be useful in SMT decoders; Durrani et al., 2011). We have reference distance and translation distance features. Again, the idea is that for a well-formed sentence, gaze distance should be less, compared to a poorly-formed one.
Inter-region jumps While reading a translation, evaluators can jump between the translation and a reference to compare them. Intuitively, more jumps of this type could signify that the translation is harder to evaluate. Here we count the number of transitions between reference and translation.
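A small sketch of how total jump distance and inter-region jumps could be counted from one gaze path. The input format (a list of `(region, word_index)` pairs with regions named `"ref"` and `"trans"`) and the function name are hypothetical; the text does not specify whether a region switch contributes to any distance feature, so here it is assumed that it does not.

```python
def region_features(fixations):
    """Aggregate within-region jump distance and count region switches.

    fixations: list of (region, word_index) in gaze order, where region is
    "ref" or "trans" (assumed labels).
    """
    feats = {"inter_region_jumps": 0, "ref_distance": 0, "trans_distance": 0}
    for (r0, w0), (r1, w1) in zip(fixations, fixations[1:]):
        if r0 != r1:
            # Gaze moved between reference and translation.
            feats["inter_region_jumps"] += 1
        else:
            # Gaze stayed in the same region: add the distance in words.
            feats[f"{r0}_distance"] += abs(w1 - w0)
    return feats


gaze = [("trans", 0), ("trans", 3), ("ref", 1), ("ref", 2), ("trans", 3), ("trans", 1)]
feats = region_features(gaze)
print(feats)
```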
Dwell time The amount of time a person fixates on a region is a crucial marker for processing difficulty in sentence comprehension (Clifton et al., 2007) and moderately correlates with the quality of a translation (Doherty et al., 2010). Our feature counts the time spent by the reader on each particular word. We separate reference and translation features.
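As an illustration of the dwell time feature, the sketch below accumulates time per word from raw gaze samples. It assumes that each sample has already been mapped to the word under the gaze (or to `None` when the gaze is off the text), and that samples arrive at a fixed rate; the 30 Hz default matches the sampling frequency reported in the experimental setup, while the data layout and function name are hypothetical.

```python
from collections import defaultdict


def dwell_times(sample_words, sampling_hz=30.0):
    """Sum the time spent on each word, in seconds.

    sample_words: one entry per gaze sample, either a (region, word_index)
    tuple or None when the gaze is not on any word (assumed format).
    """
    seconds_per_sample = 1.0 / sampling_hz
    dwell = defaultdict(float)
    for hit in sample_words:
        if hit is not None:
            dwell[hit] += seconds_per_sample
    return dict(dwell)


samples = [("trans", 0)] * 6 + [None] * 2 + [("trans", 1)] * 15 + [("ref", 0)] * 3
dwell = dwell_times(samples)
print(dwell)
```

Summing the values by region would give aggregate reference and translation dwell times.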
Lexicalized Features The features discussed above do not associate gaze movements with the words being read. We believe that this information can be critical to judge the overall difficulty of the reference sentence, and to evaluate which translation fragments are problematic to the reader. To compute the lexicalized features, we extract streams of reference and translation lexical sequences based on the gaze jumps, and score them using a tri-gram language model. Let R_i = r_1, r_2, ..., r_m be a sub-sequence of gaze movement over the reference, and let R_1, R_2, ..., R_n be all such sequences. The lex(R) feature aggregates the language model scores of these sequences, where each score is divided by the normalization factor |R_i| to make the probabilities of sequences of different lengths comparable. We also use unnormalized scores as an additional feature. A similar set of features lex(T) is computed for the translations. All features are normalized by the length of the sentence.
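One plausible way to write the lexicalized feature, given the definitions above, is shown below. It is a sketch under stated assumptions rather than a formula confirmed by the text: the use of log-probabilities, the summation over the n sequences, and the exact tri-gram conditioning are all assumed.

```latex
% Assumed form: sum of length-normalized tri-gram log-probabilities
\mathrm{lex}(R) = \sum_{i=1}^{n} \frac{\log p(R_i)}{|R_i|},
\qquad
p(R_i) = \prod_{j=1}^{m} p(r_j \mid r_{j-2}, r_{j-1})
```

Under this reading, the unnormalized variant would simply drop the division by |R_i|.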

Model
For predicting the quality scores given by an evaluator, we use a linear regression model with ridge regularization. The ridge coefficient $\hat{\beta}$ is the value of $\beta$ that minimizes the error:

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

Here the parameter λ controls the amount of shrinkage applied to the regression coefficients. A high value of λ shrinks the coefficients close to zero (Hastie et al., 2001). We used the implementation provided in the glmnet package of R (Friedman et al., 2010), which includes a cross-validation mechanism that finds the best value of λ on the training data.
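To make the effect of λ concrete, here is a minimal closed-form ridge solver in pure Python. It is not the glmnet implementation used in the paper: glmnet fits the model by coordinate descent and scales the loss and penalty differently, so λ values are not interchangeable between the two. The helper names and the toy data are assumptions for illustration only.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_ridge(X, y, lam):
    """Return (intercept, coefficients); centering leaves the intercept unpenalized."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + lam * I) beta = Xc^T yc
    A = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in range(p)] for a in range(p)]
    rhs = [sum(Xc[i][a] * yc[i] for i in range(n)) for a in range(p)]
    beta = solve_linear(A, rhs)
    intercept = y_mean - sum(m * b for m, b in zip(x_mean, beta))
    return intercept, beta


# Toy data generated exactly by y = 1 + 2*x1 - 3*x2.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]]
y = [1 + 2 * a - 3 * b for a, b in X]
b0_small, beta_small = fit_ridge(X, y, lam=1e-9)  # almost no shrinkage
b0_large, beta_large = fit_ridge(X, y, lam=1e6)   # coefficients pushed towards zero
print(beta_small, beta_large)
```

With a tiny λ the fit recovers the generating coefficients, while a very large λ drives them towards zero, which is the shrinkage behavior described above.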

Experimental Setup
We used a subset of the Spanish-English portion of the WMT'12 Evaluation task. We selected 60 medium-length sentences which have been evaluated previously by at least 2 different annotators.
For each sentence we selected the best and worst translations according to a human evaluation score based on the expected wins (Callison-Burch et al., 2012). As a result, we had 60 references with two corresponding translations each, adding up to a total of 120 evaluation tasks. Each evaluation task was performed by 6 different evaluators, resulting in 720 evaluations. The annotators were presented with one translation-reference pair at a time. The two evaluation tasks corresponding to the same reference were presented at two different times with at least 40 other tasks in between. This was done to prevent any possible spurious effects that may arise from remembering the content of a first translation when evaluating the second translation of the same sentence. During each evaluation task, the evaluators were asked to assess the quality of a translation by providing a score between 0 and 100 (Graham et al., 2013). The observed inter-annotator agreement (Cohen's kappa) among our annotators was 0.321. This is slightly higher than the overall inter-annotator agreement of 0.284 reported in WMT'12 for Spanish-English. This comparison is only rough: the two numbers are not exactly comparable, given that they are calculated on different subsets of the same data. Still, there is fair agreement between our evaluators and the expected wins from WMT'12 (avg. pairwise kappa of 0.381). To record reading patterns, we used the EyeTribe eye-tracker at a sampling frequency of 30Hz. Please refer to Abdelali et al. (2016) for our eye-tracking setup and for details about iAppraise, an evaluation environment that supports eye-tracking.

Evaluation
In our evaluation, we used eye-tracking features to predict the quality of a translation in a pairwise scenario, in a protocol similar to the one from WMT'12. First, we obtained the predicted scores $\hat{y}^k_A$, $\hat{y}^k_B$ for translations A and B when evaluated by evaluator k. Then, we computed the agreements w.r.t. the scores $y^k_A$, $y^k_B$ provided by the evaluator for the same pair of translations. That is, we considered it an agreement when the rankings were in the same order, e.g. $\hat{y}^k_A > \hat{y}^k_B \iff y^k_A > y^k_B$. Otherwise, we considered it a disagreement. Finally, we computed Kendall's tau correlation coefficient as follows:

$$\tau = \frac{agg - dis}{agg + dis}$$

We evaluated the performance using 10-fold cross-validation. While the folds were selected randomly, we ensured that all translations corresponding to the same sentence were included in the same fold, to prevent any overlap between train and test.
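The pairwise protocol and the sentence-grouped folds can be sketched as follows. The function names, the tuple layout, and the tie handling are assumptions: pairs with tied human scores are skipped, and a tied prediction counts as a disagreement. Neither choice is stated in the text.

```python
import random


def pairwise_tau(pairs):
    """Kendall's tau over pairs of (pred_a, pred_b, gold_a, gold_b).

    Each tuple holds the predicted and evaluator-given scores of
    translations A and B for one evaluator (assumed layout).
    """
    agree = disagree = 0
    for pred_a, pred_b, gold_a, gold_b in pairs:
        gold_sign = (gold_a > gold_b) - (gold_a < gold_b)
        pred_sign = (pred_a > pred_b) - (pred_a < pred_b)
        if gold_sign == 0:
            continue  # assumption: ties in the human scores are skipped
        if pred_sign == gold_sign:
            agree += 1
        else:
            disagree += 1
    return (agree - disagree) / (agree + disagree)


def grouped_folds(sentence_ids, n_folds=10, seed=0):
    """Assign a fold to every item so that items sharing a sentence share a fold."""
    unique_ids = sorted(set(sentence_ids))
    random.Random(seed).shuffle(unique_ids)
    fold_of = {sid: i % n_folds for i, sid in enumerate(unique_ids)}
    return [fold_of[sid] for sid in sentence_ids]


pairs = [(70, 40, 80, 30), (55, 60, 20, 65), (50, 45, 10, 90), (30, 80, 25, 75)]
tau = pairwise_tau(pairs)  # 3 agreements, 1 disagreement

# 60 sentences with two translations each, as in the experimental setup.
sentence_ids = [s for s in range(60) for _ in range(2)]
folds = grouped_folds(sentence_ids)
print(tau, folds[:6])
```

Note that `pairwise_tau` raises a `ZeroDivisionError` if every pair is tied in the human scores; a production version would need to handle that case explicitly.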

Results
In this section, we first analyze the results of coherent feature sets to measure their predictive power and to validate the intuitions about the information they capture. Later, we use combination of features and assess their suitability as evaluation metrics.

Gaze as a translation quality predictor
In Table 1, we show the results for the predictive models trained on different feature sets. For simplicity, we divide the features into the following groups: reference-only features (I), translation-only features (II), translation and reference features (III), and lexicalized features (IV). In the last group, we also add tri-gram language model scores for comparison purposes.
Reference only features In section I of the table, we observe the prediction results for the models that only used features from the references. Unsurprisingly, most of these features lack the predictive power to determine whether translation A is better than translation B (τ from 0.06 to 0.13). One would expect that important phenomena that can be observed only on the reference (e.g. the overall difficulty of the sentence) are neutralized in a pairwise setting, because an evaluator would read both instances of the reference text similarly. However, some features like the dwell time (τ = 0.13) yield better results than others. This could be explained by the need to go back to the reference when reading a confusing translation, thus spending more time reading the reference.
Translation only features In section II, we observe the results for the translation features. At first glance, the correlation results are much higher than for the reference features (τ from 0.17 to 0.23). This supports the hypothesis that reading patterns can help to distinguish good from bad translations. Furthermore, it also supports specific intuitions about these reading patterns. For example, the fluency of a sentence is important (forward jumps, τ = 0.17), but the number of regressions is a better predictor of the quality of a sentence (τ = 0.22). Additionally, the time spent reading a translation (dwell time) is a good predictor of the quality (τ = 0.22). All of the above validate the intuition that reading patterns capture information about the quality of a translation. In general, using translation eye-tracking features in a pairwise evaluation can help to predict which translation is better.
Translation and reference features Reference and translation features are not independent. Inter-region jumps capture the number of times that evaluators go between translation and reference before making a judgment. In section III, we observe that these features can be useful to predict the quality of a translation (τ = 0.18).
Lexicalized features In the last rows of the table, we show that reading patterns help to evaluate more than just the fluency of a translation. A simple language model score (B_LM) is a weaker quality predictor (τ = 0.17) than most of the eye-tracking translation features. Using the lexicalized version of the jump features gives additional predictive power (τ = 0.22). Furthermore, by adding the total number of jumps and backward jumps to the LM features, we obtain a considerable gain in correlation (τ = 0.30). This suggests that the reading patterns capture information about more than just fluency.
Table 1: Results of individual eye-tracking features based on reference region, translation region, inter-region and lexicalized information.

Gaze to build an evaluation metric
So far, we've shown that the individual sets of features based on reading patterns can help to predict translation quality, and that this goes beyond simple fluency. One question that remains to be answered is whether these features could be used as a whole to evaluate the quality of a translation semi-automatically. That is, whether we can use the gaze information, and other lexical information to anticipate the score that an evaluator will assign to a translation. Here, we present evaluation results combining several of these gaze features, and compare them against BLEU (Papineni et al., 2002), which uses lexical information and is designed to measure not only fluency but also adequacy.
In Table 2, we present results in the following way: in (I) we present the best non-lexicalized feature combinations that improve the predictive power of the model. In (II) we re-introduce the results of the lexicalized jumps feature. In (III) we present results of BLEU and the combination of eye-tracking features with it. Finally, in (IV) we present the human-to-human agreement, measured as the average and the maximum human-to-human Kendall's tau.
Combinations of translation jumps In section I we present several combinations of features. All of them include the backward jumps feature. This feature provides predictive power (τ = 0.22), which is orthogonal to other features. This is in line with our initial hypothesis that for a bad translation, an evaluator needs to go back and forth several times to understand it. Combining the backward jumps with the total number of jumps (CTJ_1) slightly increases the correlation to τ = 0.25. Adding the jump distance (CTJ_2) further increases τ to 0.27. While this correlation is lower than that of BLEU (τ = 0.34), it does showcase the predictive power of the reading patterns.
Combinations with BLEU When we combined BLEU with the translation jumps, we observed an increase in τ to 0.37. Combining BLEU with the lexicalized jumps yields the best combination (τ = 0.42). Although moderate, these increments suggest that the reading patterns could be capturing additional phenomena besides adequacy and fluency, such as structural complexity. These phenomena remain to be explored in future work.
Human performance On average, the evaluators' agreement with each other is fair (τ = 0.33) and below the best combination (CB_3), while the maximum agreement of any two evaluators is relatively higher (τ = 0.53). This tells us that, on average, the semi-automatic approach to evaluation that we propose here is already competitive with predictions made by another (average) human. However, there is still room for improvement with respect to the most-agreeing pair of evaluators.
Table 2: Sets shows the names of the systems whose features are combined for that particular run. We also included the average and maximum observed tau between any two evaluators, as a reference.

Related Work
Doherty et al. (2010) conducted a study using eye-tracking for MT evaluation and showed a correlation between fixations and BLEU scores. Doherty and O'Brien (2014) evaluated the quality of machine translation output in terms of its usability by an end user. Guzmán et al. (2015) used eye-tracking to show that having a monolingual environment improves the consistency of the evaluation.
Our work is different, as we: (i) propose novel eye-tracking features and (ii) model gaze movements to predict human judgment.

Conclusion
We have shown that the reading patterns detected through eye-tracking can be used to predict human judgments of automatic translations. To this end, we extracted novel lexicalized and non-lexicalized features from the eye-tracking data, motivated by notions of reading difficulty, and used them to predict the quality of a translation. We have shown that these features capture more than just the fluency of a translation, and provide complementary information to BLEU. In combination, these features can be used to produce semi-automatic metrics with improved correlation with human judgments.
In the future, we plan to extend our experiments to a larger set of users and different language pairs. Additionally, we plan to improve the feature set to take into account phenomena such as early termination, i.e. when an evaluator makes a judgment before finishing reading a translation. We also plan to deepen our analysis to determine what kind of information is being used beyond fluency and adequacy.