UAlacant word-level machine translation quality estimation system at WMT 2015

This paper describes the Universitat d'Alacant submissions (labelled as UAlacant) for the machine translation quality estimation (MTQE) shared task in WMT 2015, where we participated in the word-level MTQE sub-task. The method used to produce our submissions exploits external sources of bilingual information as a black box in order to spot sub-segment correspondences between a source segment S and the translation hypothesis T produced by a machine translation system. This is done by segmenting both S and T into overlapping sub-segments of variable length and translating them in both translation directions, using the available sources of bilingual information on the fly. For our submissions, two sources of bilingual information were used: machine translation (Apertium and Google Translate) and the bilingual concordancer Reverso Context. After obtaining the sub-segment correspondences, a collection of features is extracted from them and fed to a binary classifier that assigns the final "GOOD" or "BAD" word-level quality labels. We prepared two submissions for this year's edition of WMT 2015: one using the features produced by our system, and one combining them with the baseline features published by the organisers of the task, which were ranked third and first in the sub-task, respectively.


Introduction
Machine translation (MT) post-editing is nowadays an indispensable step that makes it possible to use machine translation for dissemination. Consequently, MT quality estimation (MTQE) (Blatz et al., 2004; Specia et al., 2010; Specia and Soricut, 2013) has emerged as a means to minimise the post-editing effort by developing techniques that estimate the quality of the translation hypotheses produced by an MT system. In order to boost the scientific efforts on this problem, the WMT 2015 MTQE shared task proposed three sub-tasks that allow different approaches to be compared at three different levels: segment level (sub-task 1), word level (sub-task 2), and document level (sub-task 3).
Our submissions tackle the word-level MTQE sub-task, which proposes a framework for evaluating and comparing different approaches. This year, the sub-task used a dataset obtained by translating segments in English into Spanish using MT. The task consists in identifying which words in the translation hypothesis had to be post-edited and which of them had to be kept unedited, by applying the labels "BAD" and "GOOD", respectively. In this paper we describe the approach behind the two submissions of the Universitat d'Alacant team to this sub-task. For our submissions we applied the approach proposed by Esplà-Gomis et al. (2015b), which uses black-box bilingual resources from the Internet for word-level MTQE. In particular, we combined two on-line MT systems, Apertium1 and Google Translate,2 and the bilingual concordancer Reverso Context3 to spot sub-segment correspondences between a sentence S in the source language (SL) and a given translation hypothesis T in the target language (TL). To do so, both S and T are segmented into all possible overlapping sub-segments up to a certain length and translated into the TL and the SL, respectively, by means of the sources of bilingual information mentioned above. These sub-segment correspondences are used to extract a collection of features that is then used by a binary classifier to determine the final word-level MTQE labels.
One of the novelties of the task this year is that the organisers provided a collection of baseline features for the published dataset. Therefore, we submitted two systems: one using only the features defined by Esplà-Gomis et al. (2015b), and another combining them with the baseline features published by the organisers of the shared task. The results obtained by our submissions were ranked third and first, respectively.
The rest of the paper is organised as follows. Section 2 describes the approach used to produce our submissions. Section 3 describes the experimental setting and the results obtained. The paper ends with some concluding remarks.

Sources of bilingual information for word-level MTQE
The approach proposed by Esplà-Gomis et al. (2015b), which is the one we have followed in our submissions for the MTQE shared task in WMT 2015, uses binary classification based on a collection of features computed for each word by using available sources of bilingual information. These sources of bilingual information are obtained from on-line tools and are used on the fly to detect relations between the original SL segment S and a given translation hypothesis T in the TL. This method has previously been used by the authors in other cross-lingual NLP tasks, such as word-keeping recommendation (Esplà-Gomis et al., 2015a) or cross-lingual textual entailment (Esplà-Gomis et al., 2012), and consists of the following steps: first, all the overlapping sub-segments σ of S up to a given length L are obtained and translated into the TL using the sources of bilingual information available. The same process is carried out for all the overlapping sub-segments τ of T, which are translated into the SL. The resulting collections of sub-segment translations M_S→T and M_T→S are then used to spot sub-segment correspondences between T and S. In this section we describe a collection of features designed to identify these relations and exploit them for word-level MTQE.
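As an illustration, the sub-segment enumeration step (the seg_n sets used below) can be sketched as follows; the function name and word-tuple representation are ours, not part of the original system:

```python
def subsegments(words, max_len):
    """All overlapping sub-segments (as word tuples) of up to max_len words."""
    return [tuple(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

# A 4-word segment has 4 sub-segments of length 1 and 3 of length 2
segment = "Associació Europea per a".split()
print(subsegments(segment, 2))
```

Each sub-segment in this list would then be translated with every available source of bilingual information to build M_S→T (and symmetrically M_T→S).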

Positive features
Given a collection of sub-segment translations M = {(σ, τ)}, such as the collections M_S→T and M_T→S described above, one of the most obvious features consists in computing the fraction of sub-segment translations (σ, τ) ∈ M that confirm that word t_j in T should be kept in the translation of S. We consider that a sub-segment translation (σ, τ) confirms t_j if σ is a sub-segment of S, and τ is a sub-segment of T that covers position j. Based on this idea, we propose the collection of positive features Pos_n:

Pos_n(j, S, T, M) = |{τ : (σ, τ) ∈ conf_n(j, S, T, M)}| / |{τ : τ ∈ seg_n(T) ∧ j ∈ span(τ, T)}|

where seg_n(X) represents the set of all possible n-word sub-segments of segment X and function span(τ, T) returns the set of word positions spanned by the sub-segment τ in the segment T.4 Function conf_n(j, S, T, M) returns the collection of sub-segment pairs (σ, τ) that confirm a given word t_j, and is defined as:

conf_n(j, S, T, M) = {(σ, τ) ∈ M : σ ∈ seg_*(S) ∧ τ ∈ seg_n(T) ∧ j ∈ span(τ, T)}

where seg_*(X) is similar to seg_n(X) but without length constraints.5 We illustrate this collection of features with an example.
Suppose the Catalan segment S = "Associació Europea per a la Traducció Automàtica", an English translation hypothesis T = "European Association for the Automatic Translation", and the most adequate (reference) translation T′ = "European Association for Machine Translation". According to the reference, the words the and Automatic in the translation hypothesis should be marked as BAD: the should be removed and Automatic should be replaced by Machine. Finally, suppose that the collection M_S→T of sub-segment pairs (σ, τ) is obtained by applying the available sources of bilingual information to translate into English the sub-segments in S up to length 3:6

M_S→T = {("Associació", "Association"), ("Europea", "European"), ("per", "for"), ("a", "to"), ("la", "the"), ("Traducció", "Translation"), ("Automàtica", "Automatic"), ("Associació Europea", "European Association"), ("Europea per", "European for"), ("per a", "for"), ("a la", "to the"), ("la Traducció", "the Translation"), ("Traducció Automàtica", "Machine Translation"), ("Associació Europea per", "European Association for"), ("Europea per a", "European for the"), ("per a la", "for the"), ("a la Traducció", "to the Translation"), ("la Traducció Automàtica", "the Machine Translation")}

Note that some of these sub-segment pairs (σ, τ) confirm the translation hypothesis T, while the rest contradict some parts of it. For the word Automatic (word position 5 in T), there is only one sub-segment pair confirming it, ("Automàtica", "Automatic"), of length 1, and none of lengths 2 or 3. Therefore, we have that:

Pos_1(5, S, T, M_S→T) = 1/1 = 1 and Pos_2(5, S, T, M_S→T) = 0/2 = 0

In addition, the sub-segments τ in seg_3(T) covering the word Automatic are:

{"for the Automatic", "the Automatic Translation"}

Therefore, the resulting positive feature for length 3 would be:

Pos_3(5, S, T, M_S→T) = 0/2 = 0

A second collection of features, which uses the information about the translation frequency between the pairs of sub-segments in M, is also used.
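A direct, unoptimised transcription of the Pos_n feature on this running example, using 1-based word positions; all function names are ours:

```python
def span(tau, T):
    """1-based word positions covered by the first occurrence of word
    tuple tau in T (empty set if tau is not a sub-segment of T)."""
    for i in range(len(T) - len(tau) + 1):
        if tuple(T[i:i + len(tau)]) == tau:
            return set(range(i + 1, i + len(tau) + 1))
    return set()

def pos(j, S, T, M, n):
    """Pos_n(j, S, T, M): fraction of the n-word sub-segments of T covering
    position j that appear as the target side of a pair (sigma, tau) in M
    whose source side sigma is a sub-segment of S."""
    covering = {tuple(T[i:i + n]) for i in range(len(T) - n + 1)
                if i + 1 <= j <= i + n}
    confirmed = {tau for sigma, tau in M
                 if tau in covering and span(sigma, S)}
    return len(confirmed) / len(covering) if covering else 0.0
```

With the running example (word Automatic at position 5), only the length-1 pair ("Automàtica", "Automatic") confirms the word, so pos(...) returns 1 for n = 1 and 0 for n = 2 and n = 3.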
This information is not available for MT, but it is for the bilingual concordancer we have used (see Section 3). This frequency determines how often σ is translated as τ and, therefore, how reliable this translation is. We define Pos_freq_n to obtain these features as:

Pos_freq_n(j, S, T, M) = Σ_{(σ,τ) ∈ conf_n(j,S,T,M)} occ(σ, τ, M) / Σ_{(σ,τ) ∈ conf_n(j,S,T,M)} Σ_{τ′ : (σ,τ′) ∈ M} occ(σ, τ′, M)

where function occ(σ, τ, M) returns the number of occurrences in M of the sub-segment pair (σ, τ).
Following the running example, we may have an alternative and richer source of bilingual information, such as a sub-segmental translation memory, which contains 99 occurrences of the word Automàtica translated as Automatic, as well as the following alternative translations: Machine (11 times) and Mechanic (10 times). Therefore, the positive feature using these frequencies for sub-segments of length 1 would be:

Pos_freq_1(5, S, T, M_S→T) = 99 / (99 + 11 + 10) = 0.825

Both positive features, Pos_n(·) and Pos_freq_n(·), are computed for t_j for all values of the sub-segment length n ∈ [1, L]. In addition, they can be computed for both M_S→T and M_T→S; this yields 4L positive features in total for each word t_j.
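This frequency-weighted feature can be sketched as below. This is one plausible reading consistent with the worked example (the exact denominator, summing the occurrences of every translation of each confirming source side, is our assumption), and all helper names are ours:

```python
def is_subseg(x, Y):
    """True if word tuple x occurs contiguously in word list Y."""
    return any(tuple(Y[i:i + len(x)]) == x
               for i in range(len(Y) - len(x) + 1))

def covers(tau, T, j):
    """True if tau occurs in T covering 1-based word position j."""
    return any(tuple(T[i:i + len(tau)]) == tau and i + 1 <= j <= i + len(tau)
               for i in range(len(T) - len(tau) + 1))

def pos_freq(j, S, T, occ, n):
    """Occurrence-weighted positive feature. `occ` maps sub-segment pairs
    (sigma, tau) to their frequency in the bilingual resource."""
    confirming = [(s, t) for (s, t) in occ
                  if len(t) == n and covers(t, T, j) and is_subseg(s, S)]
    num = sum(occ[p] for p in confirming)
    den = sum(occ[(s2, t2)] for (s, _) in confirming
              for (s2, t2) in occ if s2 == s)
    return num / den if den else 0.0
```

On the running example (99 occurrences of Automàtica → Automatic, 11 of → Machine, 10 of → Mechanic) this yields 99/120 = 0.825 for length 1.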

Negative features
The negative features, i.e. those that help to identify words that should be post-edited in the translation hypothesis T, are also based on sub-segment translations (σ, τ) ∈ M, but they are used in a different way. Negative features use those sub-segments τ that meet two criteria: (a) they are the translation of a sub-segment σ of S but are not sub-segments of T; and (b) when they are aligned to T using the edit-distance algorithm (Wagner and Fischer, 1974), both their first word θ_1 and last word θ_|τ| can be aligned, therefore delimiting a sub-segment τ′ of T. Our hypothesis is that those words t_j in τ′ which cannot be aligned to τ are likely to need post-editing. We define our negative feature collection Neg_mn as:

Neg_mn(j, S, T, M) = Σ_{τ ∈ NegEvidence_mn(j,S,T,M)} 1 / alignmentsize(τ, T)

where alignmentsize(τ, T) returns the length of the sub-segment τ′ delimited by τ in T. Function NegEvidence_mn(·) returns the set of sub-segments τ that are considered negative evidence for t_j, and is defined as:

NegEvidence_mn(j, S, T, M) = {τ : (σ, τ) ∈ M ∧ σ ∈ seg_m(S) ∧ τ ∉ seg_*(T) ∧ |{θ_k ∈ seg_1(τ) : ∃t_i ∈ T : aligned(t_i, θ_k)}| = n ∧ IsNeg(j, τ, T)}

In this function, length constraints are set so that sub-segments σ take lengths m ∈ [1, L]. While for the positive features only the length of τ was constrained, the experiments carried out by Esplà-Gomis et al. (2015b) indicate that for the negative features it is better to constrain also the length of σ.
On the other hand, the case of the sub-segments τ is slightly different: n does not stand for the length of the sub-segment, but for the number of words in τ which are aligned to T.7 Function IsNeg(·) defines the set of conditions required to consider a sub-segment τ negative evidence for word t_j:

IsNeg(j, τ, T) = ∃j′, j″ ∈ [1, |T|] : j′ < j < j″ ∧ aligned(t_j′, θ_1) ∧ aligned(t_j″, θ_|τ|) ∧ ¬∃θ_k ∈ seg_1(τ) : aligned(t_j, θ_k)

where aligned(X, Y) is a binary function that checks whether words X and Y are aligned or not. For our running example, only two sub-segment pairs (σ, τ) meet the conditions set by function IsNeg(j, τ, T) for the word Automatic: ("la Traducció", "the Translation") and ("la Traducció Automàtica", "the Machine Translation"). As can be seen, for both (σ, τ) pairs, the words the and Translation in the sub-segments τ can be aligned to the words in positions 4 and 6 of T, respectively, which makes the number of aligned words n = 2. In this way, we would have the evidence:

NegEvidence_2,2(5, S, T, M) = {"the Translation"}
NegEvidence_3,2(5, S, T, M) = {"the Machine Translation"}

As can be seen, in the case of the sub-segment τ = "the Translation", these alignments suggest that the word Automatic should be removed, while for the sub-segment τ = "the Machine Translation" they suggest that Automatic should be replaced by Machine. In both cases τ delimits the three-word sub-segment "the Automatic Translation" of T, so the resulting negative features are:

Neg_2,2(5, S, T, M) = 1/3
Neg_3,2(5, S, T, M) = 1/3

Negative features Neg_mn(·) are computed for t_j for all values of the SL sub-segment length m ∈ [1, L] and of the number n ∈ [2, L] of TL words aligned to words θ_k in the sub-segment τ. Note that the number of aligned words between T and τ cannot be smaller than 2, given the constraints set by function IsNeg(j, τ, T). This results in a collection of L × (L − 1) negative features. Obviously, for these features only M_S→T is used, since in M_T→S all the sub-segments τ can be found in T.
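The negative-evidence test and its contribution to Neg_mn can be sketched as follows, assuming the word alignment between τ and T has already been computed (by the edit-distance algorithm in the paper); the alignment representation and function names are our own:

```python
def is_neg(j, tau_len, alignment):
    """IsNeg: `alignment` is a set of pairs (position in T, position in tau),
    both 1-based. tau is negative evidence for position j if its first and
    last words align to positions strictly before and after j, and no word
    of tau aligns to j itself."""
    before = any(tj < j for (tj, k) in alignment if k == 1)
    after = any(tj > j for (tj, k) in alignment if k == tau_len)
    unaligned = all(tj != j for (tj, _) in alignment)
    return before and after and unaligned

def neg_contribution(j, tau_len, alignment):
    """1 / alignmentsize(tau, T) for one piece of negative evidence:
    the reciprocal of the length of the sub-segment of T delimited by
    the alignments of the first and last words of tau."""
    if not is_neg(j, tau_len, alignment):
        return 0.0
    j1 = max(tj for (tj, k) in alignment if k == 1 and tj < j)
    j2 = min(tj for (tj, k) in alignment if k == tau_len and tj > j)
    return 1.0 / (j2 - j1 + 1)
```

For τ = "the Translation" aligned to positions 4 and 6 of T and j = 5 (Automatic), the delimited sub-segment has length 3, so the contribution is 1/3, matching the worked example.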

Experiments
This section describes the dataset provided for the word-level MTQE sub-task and the results obtained by our method on it. This year, the task consisted in measuring word-level MTQE on a collection of segments in Spanish obtained by machine-translating segments in English. The organisers provided a dataset consisting of:

• training set: a collection of 11,272 segments in English (S) and their corresponding machine translations into Spanish (T); for every word in T, a label was provided: BAD for the words to be post-edited, and GOOD for those to be kept unedited;

• development set: 1,000 pairs of segments (S, T) with the corresponding MTQE labels, which can be used to optimise the binary classifier trained on the training set;

• test set: 1,817 pairs of segments (S, T) for which the MTQE labels have to be estimated with the binary classifier trained on the training and development sets.

Binary classifier
A multilayer perceptron (Duda et al., 2000, Section 6) was used for classification, as implemented in Weka 3.6 (Hall et al., 2009), following the approach of Esplà-Gomis et al. (2015b). A subset of 10% of the training examples was extracted from the training set before starting the training process and used as a validation set. The weights were iteratively updated on the basis of the error computed on the other 90%, but the decision to stop the training (usually referred to as the convergence condition) was based on this validation set, in order to minimise the risk of overfitting. The error function used was based on the optimisation of the metric used for ranking, i.e. the F1^BAD metric. Hyperparameter optimisation was carried out on the development set by using a grid search (Bergstra et al., 2011) in order to choose the hyperparameters optimising the results for the metric used for comparison, F1 for class BAD:

• Number of nodes in the hidden layer: Weka (Hall et al., 2009) makes it possible to choose from among a collection of predefined network designs; the design performing best in most cases happened to have a single hidden layer with as many nodes as features.
• Learning rate: this parameter allows the size of the weight updates to be regulated by applying a factor to the error function after each iteration; the value that performed best for most of our training data sets was 0.1.
• Momentum: when updating the weights at the end of a training iteration, momentum smooths the training process for faster convergence by making it dependent on the previous weight value; in the case of our experiments, it was set to 0.03.
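As a rough equivalent of this setup (the paper used Weka 3.6, not scikit-learn), the same configuration could be sketched with scikit-learn's MLPClassifier: one hidden layer as wide as the feature vector, learning rate 0.1, momentum 0.03, and a 10% validation split for the stopping condition. The feature count below is hypothetical.

```python
from sklearn.neural_network import MLPClassifier

n_features = 40  # hypothetical size of the feature vector

clf = MLPClassifier(
    hidden_layer_sizes=(n_features,),  # one hidden layer, one node per feature
    solver='sgd',                      # stochastic gradient descent
    learning_rate_init=0.1,            # learning rate
    momentum=0.03,                     # momentum term
    early_stopping=True,               # stop based on a held-out validation set
    validation_fraction=0.1,           # the 10% validation split
)
```

Note that scikit-learn's early stopping monitors validation accuracy rather than F1^BAD, so this sketch approximates but does not reproduce the paper's convergence condition.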

Evaluation
As already mentioned, two configurations of our system were submitted: one using only the features defined in Section 2, and one combining them with the baseline features. In order to obtain our features we used two sources of bilingual information: MT and a bilingual concordancer. As explained above, for our experiments we used two MT systems which are freely available on the Internet: Apertium and Google Translate. The bilingual concordancer Reverso Context was also used for translating sub-segments. Actually, only the sub-sentential translation memory of this system was used; it provides the collection of TL translation alternatives for a given SL sub-segment, together with the number of occurrences of the sub-segment pair in the translation memory.

Four evaluation metrics were proposed for this task:

• the precision P_c, i.e. the fraction of instances correctly labelled among all the instances labelled as c, where c is the class assigned (either GOOD or BAD in our case);

• the recall R_c, i.e. the fraction of instances correctly labelled as c among all the instances that should be labelled as c in the test set;

• the F1^c score, which is defined as

F1^c = 2 · P_c · R_c / (P_c + R_c)

although the F1^c score is computed both for GOOD and for BAD, it is worth noting that the F1 score for the less frequent class in the dataset (label BAD, in this case) is used as the main comparison metric;

• the F1^w score, which is the version of F1^c weighted by the proportion of instances of each class c in the dataset:

F1^w = (N_BAD / N_TOTAL) · F1^BAD + (N_GOOD / N_TOTAL) · F1^GOOD

where N_BAD is the number of instances of the class BAD, N_GOOD is the number of instances of the class GOOD, and N_TOTAL is the total number of instances in the test set.

Table 1 shows the results obtained by our system, both on the development set during the training phase and on the test set.
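These four metrics can be computed directly from the gold and predicted label sequences; a compact sketch (function and variable names ours):

```python
def task_metrics(gold, pred):
    """Per-class precision, recall and F1, plus the weighted F1 (F1^w)."""
    classes = ("GOOD", "BAD")
    prf = {}
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        labelled_c = sum(p == c for p in pred)   # instances labelled as c
        gold_c = sum(g == c for g in gold)       # instances that should be c
        P = tp / labelled_c if labelled_c else 0.0
        R = tp / gold_c if gold_c else 0.0
        F1 = 2 * P * R / (P + R) if P + R else 0.0
        prf[c] = (P, R, F1)
    # Weighted F1: each class's F1 weighted by its share of the gold labels
    f1_w = sum(sum(g == c for g in gold) / len(gold) * prf[c][2]
               for c in classes)
    return prf, f1_w
```

In the shared task, prf["BAD"][2] (F1^BAD) is the main comparison metric, and f1_w corresponds to F1^w.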
The table also includes the results for the baseline system as published by the organisers of the shared task, which uses the baseline features provided by them and a standard logistic regression binary classifier. As can be seen in Table 1, the results obtained on the development set and the test set are quite similar and coherent, which highlights the robustness of the approach. The results obtained clearly outperform the baseline on the main evaluation metric (F1^BAD). It is worth noting that, on this metric, the SBI and SBI+baseline submissions scored third and first, respectively, among the 16 submissions to the shared task.8 The submission scoring second obtained very similar results: for F1^BAD it obtained 43.05%, while our best submission obtained 43.12%. On the other hand, using the metric F1^w for comparison, our submissions ranked 10th and 11th in the shared task, although it is worth noting that our system was optimised using only the F1^BAD metric, which is the one chosen by the organisers for ranking submissions.

Table 1: Results of the two systems submitted to the WMT 2015 sub-task on word-level MTQE: the one using only sources of bilingual information (SBI) and the one combining these sources of information with the baseline features (SBI+baseline). The table also includes the results of the baseline system proposed by the organisation; in this case only the F1 scores are provided because, at the time of writing this paper, the rest of the metrics remain unpublished.

Concluding remarks
In this paper we described the submissions of the UAlacant team for sub-task 2 (word-level MTQE) of the WMT 2015 MTQE shared task. Our submissions, which were ranked first and third, used freely available online sources of bilingual information in order to extract relations between the words in the original SL segments and their TL machine translations. The approach employed aims to be system-independent, since it only uses resources produced by external systems. In addition, adding new sources of information is straightforward, which leaves considerable room for improvement. In general, the results obtained support the conclusions drawn by Esplà-Gomis et al. (2015b) regarding the feasibility of this approach and its performance.