Bilingual Embeddings and Word Alignments for Translation Quality Estimation

This paper describes our submission UFAL MULTIVEC to the WMT16 Quality Estimation Shared Task, for English-German sentence-level post-editing effort prediction and ranking. Our approach exploits the power of bilingual distributed representations, word alignments and also manual post-edits to boost the performance of the baseline QuEst++ set of features. Our model outperforms the baseline, as well as the winning system in WMT15, Referential Translation Machines (RTM), in both scoring and ranking sub-tasks.


Introduction
Recently, the task of quality estimation (QE) for machine translation (MT) output has attracted interest among researchers in the machine translation community. QE systems play an important role in improving post-editing efficiency (in terms of time and effort) in different ways, e.g. by filtering out low-quality translations to avoid spending time post-editing them, or by providing end-users with an estimate of how good or bad the translation is.
In 2012, WMT established the first sentence-level quality estimation shared task (Callison-Burch et al., 2012). Since then, new sub-tasks, language pairs and datasets in different domains have been introduced every year (Bojar et al., 2013, 2014, 2015). In contrast to automatic evaluation (the "metrics task"), the QE task aims to develop systems that provide predictions on the quality of machine-translated text without access to reference translations (Blatz et al., 2004; Specia et al., 2009).
Sentence-level QE is the most popular track in the WMT QE shared task, due to its presence in all editions of the task since the beginning. Many features have been explored by participating systems, including lexical, syntactic, semantic, and embedding-based features (Shah et al., 2015), as well as features dependent on any details the particular MT systems may provide (Soricut et al., 2012; Camargo de Souza et al., 2013). In our model, we try to exploit the power of bilingual distributed representations combined with word alignment information to boost the performance of translation quality estimation. For this purpose, we use the implementation provided by the MultiVec tool (Bérard et al., 2016) for the bilingual distributed representation model described by Luong et al. (2015), and the GIZA++ word alignment model (Och and Ney, 2003).
The rest of this paper is organized as follows. In Sections 2 and 3, we give an overview of the bilingual distributional model and word alignment for our purposes. Section 4 gives a detailed description of our feature set, including the features derived from manual post-edits of other sentences. Section 5 describes the datasets and resources we used to build our model. Section 6 discusses the experiments conducted and the official results. The final Section 7 concludes the paper.

Bilingual Distributed Representations
Word embeddings have recently shown great potential in tackling various NLP tasks, including multilingual tasks. However, there is a major problem with using word embeddings in a multilingual setting: models are trained independently for each of the languages and the resulting representations can use the vector space very differently. Measuring similarity between words in different languages is therefore difficult, because even similar words would likely have very different representations. Much research has been conducted to address this problem. According to Luong et al. (2015), the approaches developed to learn bilingual models fall into three categories: Bilingual Mapping, where word representations are trained for each language independently and a linear mapping is then learned to transform representations from one language to another (Mikolov et al., 2013a); Monolingual Adaptation, where the representations of one language are learned starting from well-trained representations of the other language under bilingual constraints; and Bilingual Training, where the representations of both languages are learned jointly from scratch. In our submission, we use the BiSkip bilingual model, belonging to the Bilingual Training category, to measure the similarity between the source and target sentences using their compositional vector representations, where the term compositional indicates that the vector for the sentence is a simple sum of the vectors of all its words.
The BiSkip model adapts the skip-gram model of Mikolov et al. (2013b) to the bilingual case. The joint representations are learnt using Algorithm 1 to optimize the following objective:

$$\alpha\,(\mathrm{Mono}_1 + \mathrm{Mono}_2) + \beta\,\mathrm{Bi}$$

where $\mathrm{Mono}_1$ and $\mathrm{Mono}_2$ are the monolingual components of each language, $\mathrm{Bi}$ is used to tie the two monolingual spaces, and the hyperparameters $\alpha$ and $\beta$ balance the influence of the monolingual components against the bilingual one.
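To make the four prediction steps of Algorithm 1 concrete, the following minimal sketch enumerates the (center word, context word) pairs that one word-aligned sentence pair would contribute. It does not train any model; the function names, the `window` parameter and the direction labels are illustrative assumptions, not part of MultiVec or of the original BiSkip implementation.

```python
def neighbors(tokens, position, window):
    """Return the words within `window` positions of `position`, excluding it."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[k] for k in range(start, end) if k != position]


def biskip_pairs(src, tgt, links, window=1):
    """List (direction, center, context) triples for one aligned sentence pair.

    `links` holds (source_index, target_index) alignment links.
    """
    pairs = []
    for i, j in links:
        w_s, w_t = src[i], tgt[j]
        src_context = neighbors(src, i, window)
        tgt_context = neighbors(tgt, j, window)
        # Monolingual steps: each word predicts its own neighbors.
        pairs += [("src->src", w_s, c) for c in src_context]
        pairs += [("tgt->tgt", w_t, c) for c in tgt_context]
        # Bilingual steps: each word predicts the neighbors of its aligned word.
        pairs += [("src->tgt", w_s, c) for c in tgt_context]
        pairs += [("tgt->src", w_t, c) for c in src_context]
    return pairs
```

In this reading, the first two kinds of pairs correspond to the monolingual terms of the objective and the last two to the bilingual term.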

Data: Word-aligned parallel corpus
Output: BiSkip vector representation
for each source-target sentence pair do
    for each a(w_s, w_t) ∈ set of alignment links do
        Predict neighbors of w_s;
        Predict neighbors of w_t;
        Use w_s to predict neighbors of w_t;
        Use w_t to predict neighbors of w_s;
    end
end
Algorithm 1: BiSkip learning algorithm by Luong et al. (2015)

Word Alignments
For cross-lingual semantic similarity, a word alignment model is an important component. According to the evaluation of the semantic textual similarity task in SemEval 2015, the best performing systems in both the English and Spanish subtasks relied mainly on word alignment techniques (Sultan et al., 2015; Hänig et al., 2015). Inspired by these results, we add features based on word alignment to the QE system.
So far, alignment-based features have been used for word-level QE only, and no alignment-based features are included in the baseline feature set for sentence-level QE.
We use GIZA++ (Och and Ney, 2003) to obtain the alignments. By default, GIZA++ alignments are not symmetric. We symmetrize them by taking the intersection of the two directions, leading to high-precision alignments. For pre-processing, we lowercase and stem words (naively taking just the first four letters) on both sides of the input.
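The pre-processing and the symmetrization described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the alignments are assumed to be already parsed into index pairs, and running GIZA++ itself is not shown.

```python
def naive_stem(token):
    """Lowercase a token and keep only its first four letters."""
    return token.lower()[:4]


def intersect_alignments(src_to_tgt, tgt_to_src):
    """Symmetrize two directional alignments by intersection.

    `src_to_tgt` holds (source_index, target_index) links from one direction,
    `tgt_to_src` holds (target_index, source_index) links from the other.
    Only links proposed by both directions survive.
    """
    reversed_links = {(i, j) for (j, i) in tgt_to_src}
    return sorted(set(src_to_tgt) & reversed_links)
```

Because each directional alignment links every word to at most one word on the other side, the intersection leaves at most one link per word, which is why the resulting alignments are high-precision and unambiguous.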
Some of our features rely on the alignments of our training data (the ITcorpus and the training part of the QEcorpus, see Section 5 below) and some need alignments between the source and the evaluated translation candidates (the development and test part of the QEcorpus). We thus use two sets of alignments:
Run-1, obtained by aligning only the ITcorpus.
Run-2, obtained by aligning the ITcorpus concatenated with the QEcorpus.

Features
This section describes the different types of features we use in our QE system. We extend the set of baseline features (Section 4.1) with features based on bilingual embeddings (Section 4.2), word alignment (Section 4.3) and also n-grams seen in a collection of manually post-edited texts (Section 4.4).

QuEst++ Baseline Features
A set of 17 system-independent features serves as the baseline system for QE tasks. The feature set is extracted using QuEst++, an open-source implementation of the baseline for quality estimation at different granularities (sentence-, word-, and document-level QE). QuEst++ extracts features from either or both of the source and target sides (i.e. the source sentence and the candidate translation), and also language model features relying on large monolingual data.

Bilingual Embedding Features (BE)
In our submission, we use three features derived from bilingual embeddings:
SentSim simply takes the value of cosine similarity between the source and target sentences in the bilingual compositional vector space.
WordSim uses the bilingual vector model and also word-alignment links. We take the average value of cosine similarity between source words and their aligned counterparts in the target sentence. The alignment links between the source and target are established automatically. Specifically, we use Run-2 alignments as defined in Section 3.
NounSim is similar to WordSim, but instead of taking all alignment links, we compute the average cosine similarity of only the links where the source (English) word is a noun. The POS tags were produced by the Stanford POS Tagger (Toutanova et al., 2003).
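A minimal sketch of the three features, assuming the bilingual embeddings are available as plain dictionaries from words to lists of floats. The function names, the `keep` filter and the handling of out-of-vocabulary words (skipped here) are assumptions made for illustration; the MultiVec API is not used.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_vector(tokens, embeddings, dim):
    """Compositional sentence vector: the sum of the known word vectors."""
    total = [0.0] * dim
    for token in tokens:
        for k, value in enumerate(embeddings.get(token, [0.0] * dim)):
            total[k] += value
    return total


def sent_sim(src, tgt, src_emb, tgt_emb, dim):
    """SentSim: cosine similarity of the two summed sentence vectors."""
    return cosine(sentence_vector(src, src_emb, dim),
                  sentence_vector(tgt, tgt_emb, dim))


def word_sim(src, tgt, links, src_emb, tgt_emb, keep=None):
    """WordSim: average cosine over alignment links (source_index, target_index).

    Passing `keep`, a predicate on the source index, restricts the links;
    a noun filter on the source POS tags gives NounSim.
    """
    sims = []
    for i, j in links:
        if keep is not None and not keep(i):
            continue
        if src[i] in src_emb and tgt[j] in tgt_emb:
            sims.append(cosine(src_emb[src[i]], tgt_emb[tgt[j]]))
    return sum(sims) / len(sims) if sims else 0.0
```

With Penn Treebank tags in a list `tags`, NounSim would correspond to `word_sim(..., keep=lambda i: tags[i].startswith("NN"))`.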

Alignment-Based Features
We propose several features based on automatic word alignments as obtained in Section 3. We assume that a good translation aligns well word-by-word with the source. While this need not be the case for human translations, it usually holds for machine-translated text. To assess the translation quality of a segment, we thus take an alignment quality score. In our submission, alignment quality scores are inspired by components of the conditional probability $P(t_1 \ldots t_l \mid s_1 \ldots s_m, a_1 \ldots a_m)$, where $s_i$ denotes the source words, $t_j$ denotes the target words and $a_i$ are the alignment links from each source word to the target (unambiguous, due to the intersection). We define the score as:

$$\mathrm{score} = \sum_{i \,:\, s_i \text{ aligned}} P(t_{a_i} \mid s_i), \qquad P(t \mid s) = \frac{c(s, t)}{\sum_{t'} c(s, t')}$$

The score is a simple sum of lexical translation probabilities (longer sentences with more aligned words thus get a higher score), and the lexical translation probabilities $P(t \mid s)$ are estimated from the count $c(s, t)$ of how often the source word $s$ and the target word $t$ were aligned in our word-aligned corpus.
The formulas resemble IBM Model 1 (Brown et al., 1993), but the counts used to compute our probability estimates are based on the whole sequence of GIZA++ models and after the heuristic symmetrization.
Run-1 alignment (see Section 3) is used in this step to avoid unreliable alignments that could be produced from aligning the poor machine translation examples in the QE datasets.
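The alignment quality score can be sketched as below. The relative-frequency normalization of the counts is an assumption made for this sketch (the text only states that the probabilities are estimated from alignment counts), and the function names and data layout are hypothetical.

```python
from collections import Counter


def estimate_lexical_probs(aligned_corpus):
    """Estimate P(t|s) from link counts c(s, t) in a word-aligned corpus.

    `aligned_corpus` yields (src_tokens, tgt_tokens, links) triples, where
    links are (source_index, target_index) pairs, e.g. from Run-1.
    """
    pair_counts = Counter()
    source_counts = Counter()
    for src, tgt, links in aligned_corpus:
        for i, j in links:
            pair_counts[(src[i], tgt[j])] += 1
            source_counts[src[i]] += 1
    return {(s, t): count / source_counts[s]
            for (s, t), count in pair_counts.items()}


def alignment_score(src, tgt, links, probs):
    """Sum P(t|s) over the alignment links of one sentence pair.

    Word pairs never seen aligned in the training corpus contribute 0.
    """
    return sum(probs.get((src[i], tgt[j]), 0.0) for i, j in links)
```

Since the score is an unnormalized sum, it grows with the number of aligned words, which matches the remark that longer sentences get higher scores.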

POS Alignment Features
Two more alignment-based features were introduced to estimate translation quality of each source-target sentence pair with the help of their POS tags. In our experiments, we restrict the range of POS tags used to produce our features to only nouns, verbs, adverbs, and adjectives. The POS tags for both English and German come again from Stanford POS Tagger.
The two introduced features are:
Number of correctly matched tags, i.e. the number of source words that are aligned to target words with the same POS tag.
Number of wrongly matched tags, i.e. the number of source words that are aligned to target words with a different POS tag.
Since the alignments are needed for the source and candidate translation, they come from Run-2.
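The two counts can be illustrated with the sketch below. Since English and German taggers use different tagsets, the sketch first maps tags to four coarse classes; this mapping (based on Penn Treebank and STTS tag prefixes) and the decision to count only links whose source word is a content word are assumptions of this example, not details given in the text.

```python
def coarse_class(tag):
    """Map a Penn Treebank or STTS tag to NOUN/VERB/ADV/ADJ, else None.

    Hypothetical mapping for illustration only.
    """
    if tag.startswith("N"):
        return "NOUN"
    if tag.startswith("V"):
        return "VERB"
    if tag.startswith("RB") or tag == "ADV":
        return "ADV"
    if tag.startswith("JJ") or tag.startswith("ADJ"):
        return "ADJ"
    return None


def pos_match_counts(src_tags, tgt_tags, links):
    """Return (matched, mismatched) counts over alignment links.

    Links are (source_index, target_index) pairs; links whose source word
    is not a noun, verb, adverb or adjective are ignored.
    """
    matched = mismatched = 0
    for i, j in links:
        source_class = coarse_class(src_tags[i])
        if source_class is None:
            continue
        if source_class == coarse_class(tgt_tags[j]):
            matched += 1
        else:
            mismatched += 1
    return matched, mismatched
```

For example, with source tags `VB DT NN` and target tags `ART NN VVINF`, the links verb-to-verb and noun-to-noun both count as correct matches, while the determiner link is ignored.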

Post-Edited N-grams
As mentioned earlier, in quality estimation, there is no access to reference translations. However, the QE task organizers provided the participants with training data (called "QEcorpus" in Section 5) consisting of 12k training segments and 1k development segments, machine-translated and manually post-edited. To benefit from this valuable resource, we introduce another set of features representing the most frequent bigrams in the translated text that were changed through post-editing.
The list of bigrams was extracted on the basis of GIZA++ alignment, pre-processing tokens and symmetrizing the two directions in the same way as in Section 3. We extract all word-aligned bigrams occurring more than 10 times in the 13k training and development sentences, greatly reducing the number of bigrams to a few dozen of the most general ones. Each of the bigrams serves as an independent boolean feature in the model.
Although lowercasing seems to be more helpful during the alignment, we avoid it during the actual bigram extraction, since case changes are mostly legitimate and important post-edits when translating into German. On the other hand, we check that the order in which the words and their alignments occur in the text is preserved (e.g. bigrams with the second target word positioned before the first target word are excluded). Table 1 summarizes the number of extracted bigrams. Lowercased n-grams would be more general, so more would survive the thresholding, but we opt to use the cased n-grams.

Lowercasing   Extracted Bigrams   Thresholded (>10)
On            71294               80
Off           73313               74
Table 1: Number of extracted bigrams with and without lowercasing.

Since the 1k development segments are used to extract the n-gram features, we report the performance of these features on the 2k test segments only.
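One possible reading of the bigram extraction is sketched below: a bigram of the machine translation is collected when both of its words are aligned to the post-edit, the alignment preserves their order, and the aligned post-edited words differ from the original ones. This interpretation, the function names and the data layout are assumptions of the sketch; only the frequency threshold (more than 10 occurrences) and the boolean features come from the text.

```python
from collections import Counter


def changed_bigrams(mt, pe, links):
    """Collect MT bigrams whose order-preserving aligned post-edit differs.

    `links` holds (mt_index, pe_index) pairs with at most one link per word.
    """
    link_of = dict(links)
    found = []
    for i in range(len(mt) - 1):
        if i in link_of and i + 1 in link_of and link_of[i] < link_of[i + 1]:
            mt_bigram = (mt[i], mt[i + 1])
            pe_bigram = (pe[link_of[i]], pe[link_of[i + 1]])
            if mt_bigram != pe_bigram:  # cased comparison: case edits count
                found.append(mt_bigram)
    return found


def frequent_bigrams(corpus, min_count=10):
    """Keep bigrams seen more than `min_count` times across the corpus."""
    counts = Counter()
    for mt, pe, links in corpus:
        counts.update(changed_bigrams(mt, pe, links))
    return sorted(bigram for bigram, count in counts.items() if count > min_count)


def bigram_features(mt, vocabulary):
    """One boolean feature per listed bigram: does the MT output contain it?"""
    present = set(zip(mt, mt[1:]))
    return [bigram in present for bigram in vocabulary]
```

The comparison is deliberately case-sensitive, in line with the decision to keep the cased n-grams during extraction.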

Data
Our experiments use the following corpora: QEcorpus (our name) denotes the English-German corpus released by the WMT16 QE task organizers. This is the first time this language pair appears in segment-level QE. QEcorpus consists of 15k source sentences in the IT domain, divided into 12k training, 1k development and 2k test segments. Source sentences are provided with their machine translations, post-edits and HTER (Snover et al., 2006) as post-editing effort scores.
As pre-processing, the corpus used in each setup is first cleaned of hyperlinks and then tokenized using the Moses tokenizer.

Experiments
In our submission, we use the Python wrapper for BiSkip provided in the MultiVec tool (https://github.com/eske/multivec; Bérard et al., 2016). To train the model, we use the ITcorpus with the default configuration of the tool. The model was trained using a learning rate α set to 0.05 and sample (a threshold on words' frequency) set to 0.001. As a prediction model, we use a linear regression model to predict the post-editing effort needed for each translation. In our experiments, we tried different combinations of the introduced features. The best results are obtained by training the model using all the features.
Tables 3 and 4 list the results of the examined feature combinations on the development and test parts of QEcorpus, respectively. (The gold-standard labels of the test part were made available only after the submission deadline.) The models are evaluated in terms of Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson's correlation (Pearson's r) for post-editing effort prediction, and Spearman's rank correlation coefficient (Spearman's ρ) for the ranking task.
Results show that adding the alignment quality score to the set of baseline features gives the best performance compared to the other introduced features on the test set.
When added alone, features based on POS tags or bilingual embeddings do not help and sometimes even slightly degrade the performance, but they are apparently useful in combination with the other features. Our submission to the task corresponds to the line "All Features" in Table 4.
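For reference, the four evaluation measures used in the result tables can be computed as in the following dependency-free sketch. The function names are our own, the inputs are assumed to be non-empty and non-constant, and the official task scripts may differ in details such as tie handling (ties receive average ranks here).

```python
import math


def mae(gold, pred):
    """Mean Absolute Error."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


def rmse(gold, pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))


def pearson(x, y):
    """Pearson's r (assumes neither sequence is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))
```

MAE, RMSE and Pearson's r compare predicted and gold HTER scores directly, whereas Spearman's ρ only looks at the induced ordering of the segments, which is why it is used for the ranking sub-task.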
Additionally, we experimented with replacing the parallel ITcorpus with the merely comparable (but larger) ComparableNews when extracting bilingual embeddings. As documented in Table 5, the size of the monolingual data is apparently more important than the quality of the alignments. MultiVec, given two corpora, extracts the word alignments automatically, and it will obviously fail most of the time when given a non-parallel corpus. Nevertheless, the few random alignments are probably sufficient to blend the source and target subspaces of the vector representation of words, because the setup with all BE features trained on ComparableNews instead of ITcorpus works better.
The official results of the WMT16 Sentence-Level QE task use Pearson's correlation as the primary evaluation metric for the Scoring sub-task and Spearman's rank correlation as the primary evaluation metric for the Ranking sub-task. According to the official evaluation, our model is ranked 7th (out of 14) and 6th (out of 11) in the scoring and ranking sub-tasks, respectively. As illustrated in Tables 6 and 7, our model outperforms the baseline system as well as the Referential Translation Machine model (RTM), the best performing system in WMT15 (Bicici et al., 2015), in both scoring and ranking sub-tasks on the WMT16 IT-domain datasets.

Conclusion
In this paper, we described our submission to the WMT16 Quality Estimation Shared Task for English-German sentence-level post-editing effort prediction and ranking. We introduced a new set of system-independent features using bilingual distributed representations, word alignments and also frequent n-grams appearing in manually post-edited texts. Combined with baseline features, our features show an improvement in the performance of post-editing effort prediction in the QE task.
An interesting observation is that the bilingual embeddings perform better when trained on a larger but only comparable corpus than on an in-domain parallel corpus. The bilingual embeddings are not trained specifically for the QE prediction and their contribution is thus arguably limited.
In the future, we plan to investigate more variants of the core learning model as well as training the embeddings for the specific task.