Strategy-Based Technology for Estimating MT Quality

This paper introduces our SAU-KERC system, which achieved an F1 score of 0.39 in the word-level quality estimation task at WMT2015. The goal is to assign each translated word an "OK" or "BAD" label indicating its translation quality. We adopt a sequence labeling model, conditional random fields (CRF), to predict the labels. Since "BAD" labels are rare in the training and development sets, the recognition rate of "BAD" is low. To solve this problem, we propose two strategies. One is to replace the "OK" label with sub-labels to balance the label distribution. The other is to reconstruct the training set so that it contains a higher proportion of "BAD" words.


Introduction
The QE task is to estimate the quality of machine translation without relying on reference translations. It is defined at three levels (word, sentence, and document), and our work focuses on the word-level task. The word-level task was introduced in 2013 and was divided into binary classification and multi-class classification. In WMT2015, only binary classification is considered:
OK/BAD: if a word needs editing, it is BAD; otherwise it is OK.
Before 2013, the problem was mostly studied as confidence estimation. Researchers had been investigating confidence measures for machine translation for nearly a decade (Gandrabur and Foster, 2003; Quirk, 2004; Ueffing et al., 2003). Many different confidence measures are investigated in (Blatz et al., 2003). They are based on source and target language model features, n-best lists, word lattices, translation tables, and so on. The authors combine several features and classify words as "correct" or "incorrect" using naive Bayes and single- or multi-layer perceptrons. (Xiong et al., 2010) combine syntactic features, lexical features, and word posterior probability features, which are extracted based on LG parsing, and use a binary classifier based on the Maximum Entropy Model to predict the label (OK or BAD) of each word in the machine translation.
Several good ideas have been proposed in the word-level QE task of WMT. (Luong et al., 2013) feed both internal and external features into a conditional random fields (CRF) model to predict the label of each word in the MT hypothesis. (Wisniewski et al., 2014) rely on a random forest classifier and 16 features to predict the label of a word. (Souza et al., 2014) train two classifiers, one based on bidirectional long short-term memory recurrent neural networks and one based on CRF, for the word-level QE task.
In WMT2015, the high ratio of OK labels in the training and development sets makes the task an unbalanced classification problem. In general, unbalanced classification problems are hard to solve effectively with common machine learning algorithms and features. To balance the label distribution, we propose two strategies: refining the OK label (ROL) and changing the training set structure (CTS). We augment the CRF model with these two strategies to improve performance.
The rest of this paper is organized as follows. Section 2 describes the selected features. Section 3 introduces the learning algorithm and the strategies we used. Section 4 describes the experimental data. Section 5 analyzes the experimental results. The last section summarizes our work on this task.

Feature
The features used in this paper are drawn partly from the features provided by the organizers and partly from the features of (Luong et al., 2014).

Organizer's Feature
Target word: the combinations of target words in a window of ±2 (two words before and two words after the current word).
First aligned word: the source word with the maximum alignment probability with the target word.
Is stop word: whether the target word is a stop word, punctuation symbol, proper name, or number.
Back-off: a score assigned to the word according to how many times the target language model has to back off in order to assign a probability to the word sequence, as described in (Raybaud et al., 2011).
Target/source POS: the POS of the target word and the POS of the source word, and the bigram and trigram sequences.
Polysemy count: the number of senses of each word.
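As an illustration of the ±2 window used by the target word feature, the following minimal sketch collects the two words before and after the current word. The function name `context_window` and the `<PAD>` symbol for positions outside the sentence are assumptions made for this example; the paper does not state how sentence boundaries are handled.

```python
def context_window(tokens, i, size=2, pad="<PAD>"):
    """Return the tokens at positions i-size .. i+size around position i.

    Positions that fall outside the sentence are filled with `pad`
    (an assumption; the boundary handling is not specified in the paper).
    """
    return [
        tokens[j] if 0 <= j < len(tokens) else pad
        for j in range(i - size, i + size + 1)
    ]


if __name__ == "__main__":
    sentence = ["el", "gato", "come", "pescado"]
    print(context_window(sentence, 0))
    print(context_window(sentence, 2))
```

Combinations of these window elements (for example the current word with its left neighbor) can then be written as entries of the CRF feature template.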

LIG System Feature
Target POS/target LM: the length of the longest target word n-gram and the length of the longest target POS n-gram.
Is in Google: taking the Google translation as a pseudo-reference translation, we check whether a target word appears in the sentence generated by Google.

Other Feature
Target word frequency: the number of times the word appears in the machine translation result.
Distance between source and target word: the distance between the position of a target word and the position of its aligned source word in the sentence; if a target word has no aligned word, the distance is set to the maximum.
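The two features above can be sketched as follows. This is only an illustration under stated assumptions: the frequency is counted within a single MT sentence, the alignment is given as a dictionary from target position to source position, and the "maximum" distance for unaligned words is a value supplied by the caller. None of these details is fixed by the paper, and the function names are hypothetical.

```python
from collections import Counter


def target_word_frequency(mt_tokens):
    """Number of occurrences of each token in the MT sentence (assumed scope)."""
    counts = Counter(mt_tokens)
    return [counts[tok] for tok in mt_tokens]


def alignment_distance(mt_tokens, alignment, max_distance):
    """Absolute position difference between each target word and its aligned
    source word; unaligned words receive `max_distance` (assumed constant).

    `alignment` maps a target index to a source index.
    """
    return [
        abs(i - alignment[i]) if i in alignment else max_distance
        for i in range(len(mt_tokens))
    ]


if __name__ == "__main__":
    mt = ["la", "casa", "de", "la"]
    print(target_word_frequency(mt))
    print(alignment_distance(mt, {0: 0, 1: 1, 3: 1}, max_distance=99))
```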

Feature selection
In the CRF feature template, we chose 85 feature combinations in total. Thousands of combinations can be derived from the ten basic features, but combining too many features does not help the MT estimation system and can even hurt it. In addition, when too many features are combined, the model may perform well on the current data set but poorly on other data sets, which is characteristic of over-fitting. Feature selection is therefore critical for any system, as it directly affects the accuracy and generalization ability of the classifier.
According to (Yu S H et al. 2007), feature selection methods can be divided into three strategies by how feature subsets are formed: global optimization, random search, and heuristic search. Global optimization commonly uses the branch and bound algorithm, whose search space is O(2^n), where n is the number of features. Random search commonly uses a genetic algorithm, whose search space is smaller than O(2^n). Heuristic search commonly uses individually best feature combination, sequential forward selection (SFS), or sequential backward selection (SBS), with a search space of O(n^2). Although heuristic search is efficient, its result is not guaranteed to be the global optimum (Yao Xu et al. 2012).
The selection method used in this paper adds one feature at a time and checks whether it contributes to the system. We eventually keep 85 features, although this is not necessarily the optimal combination. To guard against over-fitting, we evaluate with ten-fold cross-validation.
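The add-one-feature-at-a-time procedure can be sketched as a simple greedy loop. This is a minimal sketch, not the exact procedure used in our system: `evaluate` is a hypothetical callback that would train the CRF with the given feature templates and return a cross-validated score, and `k_fold_indices` is an assumed helper that shows one way to split the data into ten folds.

```python
def k_fold_indices(n_items, k=10):
    """Split item indices 0..n_items-1 into k folds (round-robin assignment)."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


def greedy_forward_selection(candidates, evaluate):
    """Add candidate features one at a time, keeping each one only if it
    improves the score returned by `evaluate` (a hypothetical callback)."""
    selected = []
    best_score = evaluate(selected)
    for feature in candidates:
        score = evaluate(selected + [feature])
        if score > best_score:
            selected.append(feature)
            best_score = score
    return selected, best_score


if __name__ == "__main__":
    # Toy scores standing in for cross-validated F1 gains (illustrative only).
    gains = {"target_word": 0.10, "noisy_feature": -0.05, "target_pos": 0.20}
    toy_evaluate = lambda feats: sum(gains[f] for f in feats)
    print(greedy_forward_selection(list(gains), toy_evaluate))
    print(k_fold_indices(25, k=10)[:2])
```

Because each candidate is tried only once, in a fixed order, the result depends on that order, which is consistent with the selected set not being the optimal combination.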

Labeling Method
The word-level QE task of WMT2015 aims at marking each word in the MT output as OK or BAD. Since the words in an MT output depend on one another, we can also regard the word-level QE task as a sequence labeling task. We combine a CRF (using the Pocket CRF toolkit) with the features described in Section 2 to train a sequence labeling model that predicts word labels.
The CRF is parameterized as follows: t_k(y_{i-1}, y_i, x, i) is a feature function defined on an edge, called a transition feature, which depends on the current position and the previous one; s_l(y_i, x, i) is a feature function defined on a node, called a state feature, which depends only on the current position. The conditional probability of a tag sequence is proportional to the exponential of the weighted sum of the state features and transition features over the input sequence.
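For reference, the standard linear-chain CRF can be written as below. The notation is the conventional one and is an assumption about the symbols intended here: t_k denotes a transition feature on the edge between positions i-1 and i, s_l denotes a state feature on the node at position i, \lambda_k and \mu_l are their learned weights, and Z(x) is the normalization term over all tag sequences.

```latex
\begin{align}
P(y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k \, t_k(y_{i-1}, y_i, x, i)
             + \sum_{i,l} \mu_l \, s_l(y_i, x, i) \Big) \\
Z(x) &= \sum_{y'} \exp\Big( \sum_{i,k} \lambda_k \, t_k(y'_{i-1}, y'_i, x, i)
             + \sum_{i,l} \mu_l \, s_l(y'_i, x, i) \Big)
\end{align}
```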
In the QE task, the ratio between OK and BAD labels is roughly 4:1, which is very unbalanced. This leads to two phenomena in the training corpus: 1. the probability of labeling a word OK is much larger than the probability of labeling it BAD; 2. the probability of a transition to OK is much larger than the probability of a transition to BAD. Both result in model bias, so the performance of a model trained with only the CRF and the features of Section 2 is not satisfactory.
To address the imbalance of word labels, we propose two strategies: 1. refine the OK label (ROL); 2. change the training set structure (CTS).

Refine OK Label
We divide OK into OK_B, OK_I, OK_E and OK. OK_B marks the start of a continuous sequence of OK labels; OK_I marks the middle of such a sequence; OK_E marks its end; OK marks an isolated OK label that is not part of a continuous sequence, as shown in Figure 1. ROL can reduce, to a certain extent, the probability that a word is marked as OK. If we regard each word label as a state, we can conclude that ROL reduces the probability of a transition to OK and increases the probability of a transition to BAD in each output.
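The refinement can be sketched as a relabeling pass over each training sentence, with the sub-labels collapsed back to OK after prediction. This is a minimal sketch under one assumption not spelled out above: a run of exactly two OK labels is written as OK_B followed by OK_E. The function names are illustrative.

```python
def refine_ok_labels(labels):
    """Replace each OK with OK_B, OK_I, OK_E, or OK according to its
    position inside a run of consecutive OK labels."""
    refined = []
    n = len(labels)
    for i, label in enumerate(labels):
        if label != "OK":
            refined.append(label)
            continue
        prev_ok = i > 0 and labels[i - 1] == "OK"
        next_ok = i + 1 < n and labels[i + 1] == "OK"
        if prev_ok and next_ok:
            refined.append("OK_I")   # middle of a run
        elif next_ok:
            refined.append("OK_B")   # start of a run
        elif prev_ok:
            refined.append("OK_E")   # end of a run
        else:
            refined.append("OK")     # isolated OK
    return refined


def collapse_labels(refined):
    """Map the predicted sub-labels back to the binary OK/BAD labels."""
    return ["BAD" if label == "BAD" else "OK" for label in refined]


if __name__ == "__main__":
    gold = ["OK", "OK", "OK", "BAD", "OK", "BAD", "OK", "OK"]
    print(refine_ok_labels(gold))
    print(collapse_labels(refine_ok_labels(gold)) == gold)
```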

Change Train Set Structure
Our first strategy smooths the ratio between labels by refining the OK label. However, even after refining, the proportion of BAD is still much smaller than that of the other labels. Our second strategy therefore raises the proportion of BAD by changing the structure of the training set. It is implemented as follows:
a. Calculate the proportion of BAD labels in each MT sentence in the training set.
b. Delete the MT sentences that have no BAD label from the training set.
c. Add the MT sentences whose BAD ratio is greater than a threshold K to the training set repeatedly.
This strategy will reduce the number of OK and increase the number of BAD, consequently reducing the ratio between OK and BAD.
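The CTS procedure can be sketched as follows. This is an illustration under stated assumptions: each training sentence is a pair of token and label lists, and the number of extra copies added for sentences above the threshold is exposed as a hypothetical `copies` parameter, since the number of repetitions is not specified above.

```python
def bad_ratio(labels):
    """Proportion of BAD labels in one sentence (0.0 for an empty sentence)."""
    return labels.count("BAD") / len(labels) if labels else 0.0


def change_training_set(sentences, k, copies=1):
    """Rebuild the training set from (tokens, labels) pairs.

    Sentences without any BAD label are dropped; sentences whose BAD ratio
    exceeds the threshold `k` are kept and duplicated `copies` more times
    (`copies` is an assumed parameter).
    """
    rebuilt = []
    for tokens, labels in sentences:
        ratio = bad_ratio(labels)
        if ratio == 0.0:
            continue                      # step b: no BAD label, drop
        rebuilt.append((tokens, labels))
        if ratio > k:
            rebuilt.extend([(tokens, labels)] * copies)  # step c: repeat
    return rebuilt


if __name__ == "__main__":
    train = [
        (["a", "b"], ["OK", "OK"]),
        (["c", "d"], ["OK", "BAD"]),
        (["e", "f", "g", "h"], ["OK", "OK", "OK", "BAD"]),
    ]
    for tokens, labels in change_training_set(train, k=0.3):
        print(tokens, labels)
```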

Data
The word-level QE task of WMT2015 provides only one translation corpus, from English to Spanish.
The detail information of corpus shows in

Threshold K Determination
The strategy of changing the training set structure depends on a threshold K. Its value influences the MT estimation performance, so we conducted a series of tests to analyze the choice of K. A meaningful value of K must ensure that the ratio of OK to BAD is reduced. As Figure 1 shows, the F1 score of BAD is highest when K is 0.3. However, because of time constraints, we set K to 0.6 in our submission to the QE task. We believe that the score would be higher with K equal to 0.3.
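Such a threshold search can be sketched as a sweep over candidate values of K, scoring each by the F1 of the BAD label on held-out data. In this minimal sketch, `train_and_predict` is a hypothetical callback that would rebuild the training set with the given K, train the CRF, and return the predicted labels; the predictions in the usage example are toy values, not experimental results.

```python
def f1_for_label(gold, predicted, label):
    """F1 score of one label, computed from flat gold and predicted sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(candidates, train_and_predict, dev_gold):
    """Return the K with the highest F1 for BAD, plus all scores.

    `train_and_predict` is a hypothetical callback: K -> predicted labels.
    """
    scores = {
        k: f1_for_label(dev_gold, train_and_predict(k), "BAD")
        for k in candidates
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    gold = ["BAD", "OK", "BAD", "OK"]
    toy_predictions = {          # illustrative outputs only
        0.3: ["BAD", "OK", "BAD", "OK"],
        0.6: ["BAD", "BAD", "OK", "OK"],
    }
    print(best_threshold([0.3, 0.6], lambda k: toy_predictions[k], gold))
```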

QE Experimental Analysis
We run four comparative experiments to verify the strategies proposed in this paper. Experiment names are as follows:
In the QE task of WMT2015, the unbalanced label distribution can lead to a "paranoid" problem, which seriously degrades the performance of the QE system. As shown in Table 4 and Table 5, refining the OK label and changing the structure of the training set both alleviate the label imbalance to a certain degree. F_BAD is 34.28 when using the strategy of refining the OK label alone, and 32.69 when using the strategy of changing the structure of the training set. Refining the OK label is thus more effective than changing the structure of the training set.

Conclusion
To address the unbalanced label distribution in the word-level QE task of WMT2015, we proposed two strategies: refining the OK label and changing the structure of the training set. Combined with these strategies, a CRF trained with some grammatical features correctly predicts more BAD labels, and ROL is the more effective of the two strategies. However, as Table 5 shows, the original method obtains an F_BAD of 28.34 and an F_OK of 88.75, whereas adding the two strategies increases F_BAD to 39.11 but reduces F_OK to 86.36. In future work, we hope to overcome this shortcoming of the two strategies and improve the F1 scores of both labels.