Top-Rank Enhanced Listwise Optimization for Statistical Machine Translation

Pairwise ranking methods are the most widely used discriminative training approaches for structure prediction problems in natural language processing (NLP). Decomposing the problem of ranking hypotheses into pairwise comparisons enables simple and efficient solutions. However, neglecting the global ordering of the hypothesis list may hinder learning. We propose a listwise learning framework for structure prediction problems such as machine translation. Our framework directly models the entire translation list’s ordering to learn parameters which may better fit the given listwise samples. Furthermore, we propose top-rank enhanced loss functions, which are more sensitive to ranking errors at higher positions. Experiments on a large-scale Chinese-English translation task show that both our listwise learning framework and top-rank enhanced listwise losses lead to significant improvements in translation quality.


Introduction
Discriminative training methods for structured prediction in natural language processing (NLP) aim to estimate the parameters of a model that assigns a score to each hypothesis in the (possibly very large) search space. For example, in statistical machine translation (SMT), the model assigns a score to each possible translation, and in syntactic parsing, the model assigns a score to each possible syntactic tree. Ideally, the model should assign scores that rank hypotheses according to their true quality. In this paper, we consider the problem of discriminative training for SMT.
Traditional SMT systems use log-linear models with only about a dozen features, such as translation probabilities and language model probabilities (Yamada and Knight, 2001; Koehn et al., 2003; Chiang, 2005; Liu et al., 2006). These models can be tuned by minimum error rate training (MERT) (Och, 2003), which directly optimizes BLEU using coordinate ascent combined with a global line search.
To enable training of modern SMT systems, which can have thousands of features or more, many research efforts have been made towards scalable discriminative training methods (Chiang et al., 2008; Hopkins and May, 2011; Bazrafshan et al., 2012). Most of these methods either define loss functions that push the model to correctly compare pairs of hypotheses, or use approximate optimization methods that effectively do the same. For practical reasons, only a subset of the pairs is considered; these pairs are selected by either sampling (Hopkins and May, 2011) or heuristic methods (Watanabe et al., 2007; Chiang et al., 2008).
But this pairwise approach neglects the global ordering of the list of hypotheses, which may make it harder to learn good parameter values. Inspired by research in information retrieval (IR) (Cao et al., 2007; Xia et al., 2008), we propose to directly model the ordering of the whole translation list, instead of decomposing it into translation pairs. Previous research has tried to integrate listwise methods into SMT, but almost all of it focuses on the reranking task, which aims to rescore the fixed translation lists generated by a baseline system. These approaches either use listwise methods to train the reranking model (Li et al., 2013; Niehues et al., 2015) or replace the pointwise ranking function, i.e. the log-linear model, with a listwise ranking function by introducing listwise features (Zhang et al., 2016).
In this paper, we focus on listwise approaches that can learn better discriminative models for SMT. We present a listwise learning framework for tuning translation systems that uses two listwise ranking objectives originally developed for IR, ListNet (Cao et al., 2007) and ListMLE (Xia et al., 2008). But unlike standard IR problems, structured prediction problems usually have a huge search space, and at each training iteration, the list of search results can vary. The usual strategy is to form the union of all lists of search results, but this can lead to a "patchy" list that doesn't represent the full search space well. Listwise approaches are always based on the permutation probability distribution over the list. Modeling the distribution over a "patchy" list, whose elements were generated by different parameters, will affect the performance of listwise approaches. To address this issue, we design an instance-aggregating method: instead of treating the data as a fixed-size set of lists that each grow over time as new translations are added at each iteration, we treat the data as a growing set of lists; each time a sentence is translated, the k-best list of translations is added as a new list.
We also extend standard listwise training by considering the importance of different instances in the list. Based on the intuition that instances at the top may be more important for ranking, we propose top-rank enhanced loss functions, which incorporate a position-dependent cost that penalizes errors occurring at the top of the list more strongly.
We conduct large-scale Chinese-to-English translation experiments showing that our top-rank enhanced listwise learning methods significantly outperform other tuning methods with high-dimensional feature sets. Additionally, even with a small basic feature set, our methods still obtain better results than MERT.

Log-linear models
In this paper, we assume a log-linear model, which defines a scoring function on target translation hypotheses e, given a source sentence f:
$$s(e \mid f) = w \cdot h(e \mid f)$$
where h(e | f) is the feature vector and w is the feature weight vector.
Figure 1: An example of word-phrase features for a phrase translation. $f_i$ and $e_j$ denote the i-th word in the source phrase and the j-th word in the target phrase, respectively.
The process of training an SMT system includes both learning the sub-models, which are included in the feature vector h, and learning the weight vector w.
Decoding in an SMT system can then be formulated as a search for the translation ê with the highest model score:
$$\hat{e} = \arg\max_{e \in E} s(e \mid f)$$
where E is the set of all reachable hypotheses.
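To make the scoring and search step concrete, here is a minimal Python sketch that scores hypotheses with a linear model and picks the best one from an explicit candidate list. It is only an illustration: a real decoder searches the set E with beam search rather than enumerating it, and the feature names, weights and candidates below are made up for the example.

```python
from typing import Dict, List, Tuple

FeatureVector = Dict[str, float]  # sparse feature vector h(e | f)


def score(weights: Dict[str, float], features: FeatureVector) -> float:
    """Log-linear model score: dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_hypothesis(weights: Dict[str, float],
                    candidates: List[Tuple[str, FeatureVector]]) -> str:
    """Return the highest-scoring translation among the given candidates."""
    return max(candidates, key=lambda c: score(weights, c[1]))[0]


# Hypothetical weights and candidates, for illustration only.
weights = {"lm": 0.5, "tm": 1.0, "word_penalty": -0.2}
candidates = [
    ("the cat sat", {"lm": -2.0, "tm": -1.0, "word_penalty": 3.0}),
    ("cat the sat", {"lm": -6.0, "tm": -1.0, "word_penalty": 3.0}),
]
best = best_hypothesis(weights, candidates)
```

Using a dictionary for the feature vector keeps the same code usable for a dozen dense features or for thousands of sparse ones.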

SMT Features
In this paper, we use a hierarchical phrase-based translation system (Chiang, 2005). For convenient comparison, we divide the features of SMT into the following three sets.
Basic Features: The basic features are those commonly used in hierarchical phrase-based translation systems, including a language model, four translation model features, word, phrase and rule penalties, and penalties for unknown words, the glue rule and null translations.
Extended Features: Inspired by Chen et al. (2013), we manually group the parallel training data into 15 sets, according to their genre and origin. The translation models trained on each set are used as separate features. We also add an indicator feature for each individual training set to mark where the translation rule comes from. The extended features provide an additional 60 translation model features and 16 indicator features, which is too many to be tuned with MERT.
Sparse Features: We use word-phrase pair features as our sparse features, which reflect the word-phrase correspondence in a hierarchical phrase (Watanabe et al., 2007). Figure 1 illustrates an example of word-phrase pair features for a phrase translation pair $f_i, \ldots, f_{i+3}$ and $e_j, \ldots, e_{j+4}$. Word-phrase pair features $(f_i, e_{j+1})$, $(f_{i+1}, e_j)$, $(f_{i+2}, e_{j+2} e_{j+3})$, $(f_{i+3}, e_{j+4})$ will be fired for the translation rule with the given word alignment. In practice, these features fire only when all the source and target words in the feature are among the 100 most frequent words.
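The firing rule for word-phrase pair features can be sketched as follows. This is a hedged illustration rather than the system's feature extractor: the alignment representation, the `wp:` feature-name format and the frequent-word sets are assumptions made for the example.

```python
from typing import Dict, List, Set


def word_phrase_features(
    source: List[str],
    target: List[str],
    alignment: Dict[int, List[int]],
    frequent_source: Set[str],
    frequent_target: Set[str],
) -> List[str]:
    """Emit one feature per aligned (source word, target phrase) pair.

    `alignment` maps a source position to the target positions it aligns to.
    A feature fires only if every word in it is in the frequent-word sets.
    """
    features = []
    for i in sorted(alignment):
        phrase = [target[j] for j in sorted(alignment[i])]
        if source[i] in frequent_source and all(w in frequent_target for w in phrase):
            features.append(f"wp:{source[i]}=>{'_'.join(phrase)}")
    return features


# Mirrors the alignment pattern of the example in the text, with placeholder words.
source = ["f0", "f1", "f2", "f3"]
target = ["e0", "e1", "e2", "e3", "e4"]
alignment = {0: [1], 1: [0], 2: [2, 3], 3: [4]}
all_fired = word_phrase_features(source, target, alignment, set(source), set(target))
# If "e4" is not a frequent target word, the last feature does not fire.
filtered = word_phrase_features(source, target, alignment, set(source), set(target[:4]))
```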

Tuning via Pairwise Ranking
The beam search strategy used in SMT decoding makes it convenient to get a k-best translation list for each source sentence. Given a set of source sentences and their corresponding translation lists, the tuning problem can be regarded as a ranking task. Many recently proposed SMT tuning methods are based on the pairwise ranking framework (Chiang et al., 2008; Hopkins and May, 2011; Bazrafshan et al., 2012).
Pairwise ranking optimization (PRO) (Hopkins and May, 2011) is a commonly used tuning method. The idea of PRO is to sample pairs (e, e') from the k-best list, and train a linear binary classifier to predict whether eval(e) > eval(e') or eval(e) < eval(e'), where eval(·) is an extrinsic metric like BLEU. In this paper, we use sentence-level BLEU with add-one smoothing (Lin and Och, 2004).
The method achieves BLEU scores comparable to MERT and MIRA (Chiang et al., 2008), and scales well to large feature sets. Other pairwise ranking methods employ similar procedures.
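The pair-sampling idea behind PRO can be sketched in a few lines of Python. This is a simplified illustration, not the configuration used in the experiments: the sample count, the minimum score difference and the number of kept pairs are illustrative defaults that are not reported in this text, and the resulting examples would still need to be passed to a binary classifier.

```python
import random
from typing import List, Optional, Sequence, Tuple

Hypothesis = Tuple[Sequence[float], float]  # (feature vector h, metric score eval)


def sample_pro_examples(
    kbest: List[Hypothesis],
    n_samples: int = 5000,
    min_diff: float = 0.05,
    n_keep: int = 50,
    rng: Optional[random.Random] = None,
) -> List[Tuple[List[float], int]]:
    """Turn one k-best list into binary classification examples.

    Each kept pair (e, e') yields the difference vector h(e) - h(e') labeled
    +1 if eval(e) > eval(e'), plus the mirrored example with the opposite label.
    """
    rng = rng or random.Random(0)
    pairs = []
    for _ in range(n_samples):
        (h1, g1), (h2, g2) = rng.choice(kbest), rng.choice(kbest)
        if abs(g1 - g2) > min_diff:
            pairs.append((abs(g1 - g2), h1, g1, h2, g2))
    # Keep the pairs with the largest metric difference.
    pairs.sort(key=lambda p: p[0], reverse=True)
    examples = []
    for _, h1, g1, h2, g2 in pairs[:n_keep]:
        diff = [a - b for a, b in zip(h1, h2)]
        label = 1 if g1 > g2 else -1
        examples.append((diff, label))
        examples.append(([-d for d in diff], -label))
    return examples


# Toy k-best list: the first hypothesis is better under the metric.
toy_kbest = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)]
toy_examples = sample_pro_examples(toy_kbest, n_samples=100, n_keep=5)
```

Note that each example only encodes one pairwise comparison; the position of either hypothesis in the full list is not visible to the classifier.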

Listwise Learning Framework
Although ranking methods have shown their effectiveness in tuning for SMT systems (Hopkins and May, 2011; Watanabe, 2012; Dreyer and Dong, 2015), most proposed ranking approaches view tuning as pairwise ranking. These approaches decompose the ranking of the hypothesis list into pairs, which might limit the training method's ability to learn better parameters. To preserve the ranking information, we first formulate training as an instance of the listwise ranking problem. Then we propose a learning method based on the iterative learning framework of SMT tuning and further investigate the top-rank enhanced losses.

The Permutation Probability Model
In order to directly model the translation list, we first introduce a probabilistic model proposed by Guiver and Snelson (2009). A ranking of a list of k translations can be thought of as a function π from [1, k] to translations, where each π(t) is the t-th translation candidate in the ranking. A scoring function z (which could be either the model score, s, or the BLEU score, eval) induces a probability distribution over rankings:
$$P_z(\pi) = \prod_{j=1}^{k} \frac{\exp(z(\pi(j)))}{\sum_{t=j}^{k} \exp(z(\pi(t)))} \tag{4}$$
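As a quick numerical illustration, the sketch below computes the probability of a ranking under a Plackett-Luce style permutation model, assuming that each position is filled by a softmax over the exponentiated scores of the items not yet placed. The function name and the example scores are ours, chosen for illustration.

```python
import math
from itertools import permutations
from typing import Sequence


def permutation_probability(scores_in_rank_order: Sequence[float]) -> float:
    """Probability of a ranking, given the scores z(pi(1)), ..., z(pi(k))."""
    prob = 1.0
    for j in range(len(scores_in_rank_order)):
        tail = scores_in_rank_order[j:]
        m = max(tail)  # subtract the max for numerical stability
        denom = sum(math.exp(z - m) for z in tail)
        prob *= math.exp(scores_in_rank_order[j] - m) / denom
    return prob


# Illustrative scores for a list of three hypotheses.
scores = [2.0, 1.0, 0.5]
all_probs = {p: permutation_probability(p) for p in permutations(scores)}
total = sum(all_probs.values())          # the k! rankings form a distribution
most_likely = max(all_probs, key=all_probs.get)
```

Enumerating all k! rankings is only feasible for tiny lists, which is why the losses that follow avoid summing over permutations.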

Loss Functions
Based on the probabilistic model above, the loss function can be defined as the difference between the distributions over rankings induced by eval(·) and s(·). We introduce the following two standard listwise losses.
ListNet: The ListNet loss is the cross entropy between the distributions calculated from eval(•) and s(•), respectively, over all permutations.
Due to the exponential number of permutations, Cao et al. (2007) propose a top-one loss instead. Given the functions eval(·) and s(·), the top-one loss is defined as:
$$L_{Net} = -\sum_{j=1}^{k} P_{eval}(e_j) \log P_s(e_j)$$
where $e_j$ is the j-th element in the k-best list, and $P_z(e_j) = \exp(z(e_j)) / \sum_{j'=1}^{k} \exp(z(e_{j'}))$ is the probability that translation $e_j$ is ranked at the top by the function z.
ListMLE: The ListMLE loss is the negative log-likelihood of the permutation probability of the correct ranking $\pi_{eval}$, calculated according to s(·) (Xia et al., 2008):
$$L_{MLE} = -\log P_s(\pi_{eval}) = -\sum_{j=1}^{k} \log \frac{\exp(s(\pi_{eval}(j)))}{\sum_{t=j}^{k} \exp(s(\pi_{eval}(t)))} \tag{5}$$
The training objective, which we want to minimize, is simply the total loss over all the lists in the tuning set.
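The two losses can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the top-one distributions are taken to be softmaxes over the raw scores (any scaling of the metric scores is not modeled), and the function names are ours.

```python
import math
from typing import List, Sequence


def log_softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable log of the softmax distribution."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def listnet_top_one_loss(model_scores: Sequence[float],
                         eval_scores: Sequence[float]) -> float:
    """Cross entropy between the top-one distributions of eval and s."""
    log_p_s = log_softmax(model_scores)
    p_eval = [math.exp(v) for v in log_softmax(eval_scores)]
    return -sum(p * lp for p, lp in zip(p_eval, log_p_s))


def listmle_loss(model_scores: Sequence[float],
                 eval_scores: Sequence[float]) -> float:
    """Negative log-likelihood of the eval-induced ranking under s."""
    order = sorted(range(len(eval_scores)), key=lambda j: eval_scores[j], reverse=True)
    ranked = [model_scores[j] for j in order]
    # Position j contributes -log softmax(ranked[j:])[0].
    return -sum(log_softmax(ranked[j:])[0] for j in range(len(ranked)))


# A model that agrees with the metric ordering gets a lower loss than one that reverses it.
eval_scores = [0.6, 0.3, 0.1]
agree, reverse = [3.0, 2.0, 1.0], [1.0, 2.0, 3.0]
net_agree, net_reverse = (listnet_top_one_loss(m, eval_scores) for m in (agree, reverse))
mle_agree, mle_reverse = (listmle_loss(m, eval_scores) for m in (agree, reverse))
```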

Training with Instance Aggregating
Because there can be exponentially many possible translations of a sentence, it's only feasible to rank the k best translations rather than all of them; because the feature weights change at each iteration, we have a different k-best list to rank at each iteration. This is different from standard ranking problems in which the training instances stay the same each iteration.
Many previous tuning methods address this problem by merging the k-best list at the current iteration with the k-best lists at all previous iterations into a single list (Hopkins and May, 2011). We call this k-best merging. More formally, if $E_f^i$ is the k-best list of source sentence f at iteration i, then at each iteration, the model is trained on the set of lists:
$$T = \Big\{ \bigcup_{j=1}^{i} E_f^{j} \;\Big|\; f \text{ in the tuning set} \Big\}$$
For each source sentence f, T has only one training sample, which is a better and better approximation to the full hypothesis set of f as more iterations pass.
Unlike previous tuning methods, our tuning method focuses on the distribution over permutations of the whole list. Moreover, unlike with listwise optimization methods used in IR, the k-best list produced for a source sentence at one iteration can differ dramatically from the k-best list produced at the next iteration. Merging k-best lists across iterations, each of which represents only a tiny fraction of the full search space, will lead to a "patchy" list that may hurt the learning performance of the listwise optimization algorithms.
To address this challenge, we propose instance aggregating: instead of merging k-best lists across different iterations, we view the translation lists from different iterations as individual training instances:
$$T = \{ E_f^{j} \mid f \text{ in the tuning set},\ 1 \le j \le i \}$$
With this method, each source sentence f has i training instances at the i-th training iteration. In this way, we avoid "patchy" lists and obtain a better set of instances for tuning.
The above instance aggregating method can be used in a MERT-like iterative tuning algorithm as shown in Algorithm 1, which can be easily integrated into current open source systems. The two standard listwise losses can be easily optimized using gradient-based methods (Algorithm 2); both losses are convex, so convergence to a global optimum is guaranteed. The gradients of the ListNet loss $L_{Net}$ and the ListMLE loss $L_{MLE}$ with respect to the parameters w for a single sentence are:
$$\frac{\partial L_{Net}}{\partial w} = -\sum_{j=1}^{k} P_{eval}(e_j) \frac{\partial s(e_j)}{\partial w} + \sum_{j=1}^{k} P_s(e_j) \frac{\partial s(e_j)}{\partial w}$$
$$\frac{\partial L_{MLE}}{\partial w} = -\sum_{j=1}^{k} \left( \frac{\partial s(\pi_{eval}(j))}{\partial w} - \frac{\sum_{t=j}^{k} \exp(s(\pi_{eval}(t))) \frac{\partial s(\pi_{eval}(t))}{\partial w}}{\sum_{t=j}^{k} \exp(s(\pi_{eval}(t)))} \right)$$
For optimization, we use mini-batch stochastic gradient descent (SGD) together with the AdaDelta algorithm (Zeiler, 2012) to adaptively set the learning rate.
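For a linear model, where the derivative of s(e) with respect to w is simply the feature vector h(e), both gradients reduce to differences of feature expectations. The following Python sketch illustrates this under the assumption that the top-one and tail distributions are softmaxes over the scores; the helper names are ours.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(values: Sequence[float]) -> Vector:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [x / total for x in exps]


def dot(w: Vector, h: Vector) -> float:
    return sum(a * b for a, b in zip(w, h))


def listnet_gradient(w: Vector, feats: List[Vector], eval_scores: Vector) -> Vector:
    """Top-one ListNet gradient: sum over j of (P_s(e_j) - P_eval(e_j)) * h(e_j)."""
    p_s = softmax([dot(w, h) for h in feats])
    p_eval = softmax(eval_scores)
    return [sum((ps - pe) * h[d] for ps, pe, h in zip(p_s, p_eval, feats))
            for d in range(len(w))]


def listmle_gradient(w: Vector, feats: List[Vector], eval_scores: Vector) -> Vector:
    """ListMLE gradient for the ranking induced by the metric scores."""
    order = sorted(range(len(feats)), key=lambda j: eval_scores[j], reverse=True)
    ranked = [feats[j] for j in order]
    grad = [0.0] * len(w)
    for j in range(len(ranked)):
        tail = ranked[j:]
        p_tail = softmax([dot(w, h) for h in tail])
        for d in range(len(w)):
            # Expected feature over the items from rank j on, minus the feature at rank j.
            grad[d] += sum(p * h[d] for p, h in zip(p_tail, tail)) - ranked[j][d]
    return grad


# If the model scores already equal the metric scores, the ListNet gradient vanishes.
net_grad = listnet_gradient([0.7, 0.2], [[1.0, 0.0], [0.0, 1.0]], [0.7, 0.2])
# With zero weights, ListMLE pushes the weight of the better hypothesis's feature up.
mle_grad = listmle_gradient([0.0], [[1.0], [0.0]], [1.0, 0.0])
```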

Top-Rank Enhanced Losses
In evaluating an SMT system, one naturally cares much more about the top-ranked results than the lower-ranked results. We therefore think that getting the ranking right at the top of a list is more relevant for tuning, and that we should pay more attention to the top-ranked translations instead of forcing the model to rank the entire list correctly.
Position-dependent Attention: To do this, we assign a higher cost to ranking errors that occur at the top and a lower cost to errors at the bottom. To make the cost sensitive to position, we define it as: where j is the position in the ranking and k is the size of the list.
Based on this cost function, we propose simple top-rank enhanced listwise losses as extensions of both the ListNet loss and the ListMLE loss. The loss functions are defined as follows: .
Along similar lines, Xia et al. (2008) also proposed a top-n ranking method, which assumes that only the correct ranking of top-n hypotheses is useful. Compared to our top-rank enhanced losses, it may be too harsh to discard information about the rest of the ordering altogether; our method retains the whole ordering but weights it by position.
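The contrast between position weighting and top-n truncation can be sketched as follows. This is an illustration under explicit assumptions, not the exact losses of this paper: the linearly decreasing cost in `linear_position_cost` is a hypothetical choice, and weighting each position's ListMLE term by its cost is one natural way to realize a position-dependent loss.

```python
import math
from typing import List, Optional, Sequence


def _log_prob_first(tail: Sequence[float]) -> float:
    """log of exp(tail[0]) / sum_t exp(tail[t])."""
    m = max(tail)
    return tail[0] - (m + math.log(sum(math.exp(v - m) for v in tail)))


def linear_position_cost(k: int) -> List[float]:
    """Hypothetical cost: decreases linearly with rank and sums to 1."""
    total = k * (k + 1) / 2
    return [(k - j) / total for j in range(k)]  # j is the 0-based rank


def weighted_listmle_loss(
    model_scores: Sequence[float],
    eval_scores: Sequence[float],
    costs: Optional[Sequence[float]] = None,
    top_n: Optional[int] = None,
) -> float:
    """ListMLE with a per-position weight; top_n truncates the sum instead."""
    k = len(model_scores)
    order = sorted(range(k), key=lambda j: eval_scores[j], reverse=True)
    ranked = [model_scores[j] for j in order]
    if costs is None:
        costs = [1.0] * k
    limit = k if top_n is None else min(top_n, k)
    return -sum(costs[j] * _log_prob_first(ranked[j:]) for j in range(limit))


eval_scores = [4.0, 3.0, 2.0, 1.0]
top_swapped = [3.0, 4.0, 2.0, 1.0]      # model confuses the two best hypotheses
bottom_swapped = [4.0, 3.0, 1.0, 2.0]   # model confuses the two worst hypotheses
costs = linear_position_cost(4)
weighted_top = weighted_listmle_loss(top_swapped, eval_scores, costs)
weighted_bottom = weighted_listmle_loss(bottom_swapped, eval_scores, costs)
plain_top = weighted_listmle_loss(top_swapped, eval_scores)
plain_bottom = weighted_listmle_loss(bottom_swapped, eval_scores)
```

In this toy case the weighted loss penalizes the error at the top more than the error at the bottom, whereas the unweighted ListMLE loss does the opposite. Passing `top_n` instead of `costs` gives the truncated variant, which ignores everything below rank n.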

Data and Preparation
We conduct experiments on a large-scale Chinese-English translation task. The parallel data comes from LDC corpora and consists of 8.2 million sentence pairs. Monolingual data includes the Xinhua portion of the Gigaword corpus. We use the NIST MT03 evaluation test data as the development set, and MT02, MT04 and MT05 as the test sets.
The Chinese side of the corpora is word segmented using ICTCLAS. Word alignments are obtained (Och and Ney, 2003) in both directions and refined with the "grow-diag-final-and" method. We train a 5-gram language model on the monolingual data with modified Kneser-Ney smoothing (Chen and Goodman, 1999). Throughout the experiments, our translation system is an in-house implementation of the hierarchical phrase-based translation system (Chiang, 2005). Translation quality is evaluated by 4-gram case-insensitive BLEU (Papineni et al., 2002). Statistical significance testing between systems is conducted by bootstrap resampling as implemented by Clark et al. (2011).

Tuning Settings
We build baselines for the extended and sparse feature sets with two different tuning methods. First, we tune with PRO (Hopkins and May, 2011). As reported by Cherry and Foster (2012), it is hard to find a setting that performs well in general.
We use the MegaM version (Daumé III, 2004) with 30 iterations for the basic feature set and 100 iterations for the extended and sparse feature sets. Second, we run k-best batch MIRA (KB-MIRA), which shows results comparable to the online version of MIRA (Cherry and Foster, 2012; Green et al., 2013). In our experiments, we run KB-MIRA with the standard settings in Moses. For the basic feature set, the baseline is tuned with MERT (Och, 2003). For all our listwise tuning methods, we set the batch size to 10. We could not find an epoch size that performs well in general, so we set the epoch size to 100 for ListMLE with basic features, 200 for ListMLE with extended and sparse features, and 300 for ListNet. These values are set to achieve the best performance on the development set.
We set the beam size to 20 throughout our experiments unless otherwise noted. Following Clark et al. (2011), we run the same training procedure 3 times and present the average results for stability. All tuning methods are executed for 40 iterations of the outer loop and return the weights that achieve the best development BLEU scores. For all tuning methods on the sparse feature set, we use the weight vector tuned by PRO on the extended feature set as initial weights.

Experiments of Listwise Learning Framework
We first investigate the effectiveness of our instance aggregating training procedure. The results are presented in Table 2, which compares training with instance aggregating and with k-best merging.
As the results suggest, the instance aggregating method improves the performance of both listwise tuning approaches. For the rest of this paper, we use instance aggregating as the standard setting for listwise tuning approaches.
To verify the performance of our proposed listwise learning framework, we first compare systems with standard listwise losses to the baseline systems. The first four rows in Table 3 show the results. ListNet outperforms PRO by 0.55 and 0.26 BLEU on the extended and sparse feature sets, respectively. The main reason is that our listwise methods can capture structured order information when the complete translation list is taken as an instance.
We also observe that ListMLE achieves only modest performance compared to ListNet. We think the objective of standard ListMLE, which forces the whole list to be ranked in the correct order, is too hard. ListNet mainly benefits from its top-one permutation probability, which is concerned only with which object is ranked first.

Effect of Top-rank Enhanced Losses
To verify our assumption that the correct rank in the top portion of a list is more informative, we conduct this set of experiments. Figure 2 shows the results of top-n ListMLE with different n. Compared to ListMLE in Table 2, we find that top-n ListMLE makes significant improvements, which means that the top rank is more important. We observe an improvement on all test sets when we increase n from 1 to 5, but when we further increase n, the results drop. This indicates that the correct ranking at the top of the list is more informative, and that forcing the model to treat the ranking of the bottom as being as important as the top sacrifices the ability to guide better search.
In Table 3, top-5 ListMLE, which only aims to rank the top five translations correctly, outperforms the baseline and standard ListMLE. With our position-dependent attention, the top-rank enhanced ListMLE makes a further improvement over the baseline system (+1.07 and +0.73 BLEU on the extended and sparse feature sets, respectively) and achieves the best performance.
The top-n loss might be too loose as an approximation of BLEU. Compared to top-n ListMLE, our top-rank enhanced ListMLE can further utilize the different portions of the list by giving them different weights. To verify this claim, we further examined the learning processes of the two losses. For simplicity, the experiment is conducted on a translation list generated by random parameters. The results are shown in Figure 3. We can see that our top-rank enhanced loss almost completely inversely correlates with BLEU after iteration 70. In contrast, after iteration 150, although the top-5 loss is still decreasing, BLEU starts to drop.
Due to the high computation cost of ListNet, we only apply the top-rank enhancement to ListMLE in this paper. Our preliminary experiments indicate that the performance of ListNet can be further improved with a top-2 loss. We think our top-rank enhanced method is also useful for ListNet, but due to its computational demands it needs to be further investigated.

Impact of the Size of Candidate Lists
Because our listwise tuning methods directly model the order of the translation list, the choice of the translation list size k clearly has an impact on our methods. A larger candidate list may make more information available during tuning.
In order to verify our tuning methods' capability of handling a larger translation list, we increase k from 20 to 100. The comparison results are shown in Table 4. With a larger k, our tuning methods still perform better than the baselines. For ListNet and top-5 ListMLE, we observe that the improvements over the baseline are smaller than with size 20. These results suggest that the loss of order information caused by directly dropping the bottom of the list is aggravated with a larger list size. However, our top-rank enhanced method still obtains a slightly better result than with size 20 and a significant improvement over the baseline of 1.1 BLEU. This indicates that our top-rank enhanced method is more stable and can still effectively exploit a larger translation list.

Performance on Basic Feature Set
Because of the effectiveness of high-dimensional feature sets, recent work pays more attention to this scenario. Although previous discriminative tuning methods can effectively handle high-dimensional feature sets, MERT is still the dominant tuning method for basic features. Here, we investigate our top-rank enhanced tuning methods' capability of handling the basic feature set. As shown in Table 5, our top-rank enhanced method outperforms MERT by 0.25 BLEU. These results show that our top-rank enhanced tuning method can learn more information from the translation list even with a basic feature set.

Related Work
The ranking problem is well studied in the IR community. Many methods have been proposed, including pointwise (Nallapati, 2004), pairwise (Herbrich et al., 1999; Burges et al., 2005) and listwise (Cao et al., 2007; Xia et al., 2008) algorithms. Experimental results show that listwise methods generally deliver better performance than pointwise and pairwise methods (Liu, 2010).
Most NLP research treats ranking as an extra step after searching the output space (Charniak and Johnson, 2005; Collins and Koo, 2005; Duh, 2008). In SMT research, listwise approaches have also been employed for reranking tasks. For example, Li et al. (2013) utilized two listwise approaches to rerank the translation outputs and achieved the best segment-level correlation with human judgments. Niehues et al. (2015) employed ListNet to rescore the k-best translations, which significantly outperforms MERT, KB-MIRA and PRO. Zhang et al. (2016) viewed the log-linear model as a pointwise ranking function, shifted it to a listwise ranking function by introducing listwise features, and outperformed the log-linear model. Compared to these efforts, our method takes a further step by integrating listwise ranking methods into the iterative training.
There is also research that uses ranking methods for tuning to guide better search. In SMT, previous attempts at using ranking as a tuning method usually perform pairwise comparisons on a subset of translation pairs (Chiang et al., 2008; Hopkins and May, 2011; Watanabe, 2012; Bazrafshan et al., 2012; Guzmán et al., 2015). Dreyer and Dong (2015) even took all translation pairs of the k-best list as training instances, which only obtained results comparable to PRO with a more complicated implementation. In this paper, we model the entire list as a whole unit, and propose training objectives that are sensitive to different parts of the list.

Conclusion
In this paper, we propose a listwise learning framework for statistical machine translation. In order to adapt listwise approaches, we use an iterative training framework in which instances from different iterations are aggregated into the training set. To emphasize the top of the list, we further propose top-rank enhanced listwise learning losses. Compared to previous efforts in SMT tuning, our method directly models the order information of the complete translation list. Experiments show that our method leads to significant improvements in translation quality across different feature sets and beam sizes.
Our current work focuses on the traditional SMT task. For future work, it will be interesting to integrate our methods into modern neural machine translation systems or other structured prediction problems. It may also be interesting to explore more methods within the listwise tuning framework, such as methods that enhance the top of the translation list directly with respect to a given evaluation metric.
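Algorithm 1 MERT-like tuning algorithm
Require: Training sentences {f}, maximum number of iterations I, randomly initialized model parameters w_0.
1: for i = 0 to I do
2:   for source sentences f do
3:     decode f with w_i to obtain the k-best list E_f^i
4:     add E_f^i to the training set T as a new instance
5:   end for
6:   Training: w_{i+1} = Optimization(T, w_i)
7: end for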
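The difference between instance aggregating and k-best merging in the outer tuning loop can be sketched in Python as follows. This is a hedged illustration: `decode_kbest` and `optimize` are hypothetical caller-supplied functions standing in for the decoder and the inner optimizer, and hypotheses are represented as plain strings.

```python
from typing import Callable, Dict, List, Sequence

KBest = List[str]  # stand-in for a k-best list of hypotheses


def tune(
    sentences: Sequence[str],
    w0: List[float],
    iterations: int,
    decode_kbest: Callable[[str, List[float]], KBest],
    optimize: Callable[[List[KBest], List[float]], List[float]],
    aggregate_instances: bool = True,
) -> List[float]:
    """Outer tuning loop; `decode_kbest` and `optimize` are caller-supplied."""
    w = w0
    instances: List[KBest] = []      # instance aggregating: one list per (f, iteration)
    merged: Dict[str, KBest] = {}    # k-best merging: one growing list per f
    for _ in range(iterations):
        for f in sentences:
            kbest = decode_kbest(f, w)
            if aggregate_instances:
                instances.append(kbest)
            else:
                pool = merged.setdefault(f, [])
                for e in kbest:
                    if e not in pool:
                        pool.append(e)
        training_set = instances if aggregate_instances else list(merged.values())
        w = optimize(training_set, w)
    return w
```

With two source sentences and three iterations, the aggregating variant trains on 2, 4 and then 6 lists, while the merging variant always trains on 2 lists that grow in length.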

Algorithm 2 Listwise Optimization Algorithm
Require: Training instances T, model parameters w, maximum number of epochs J, batch size b, number of batches B
1: for j = 0 to J do
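Since only the header of Algorithm 2 is available here, the following Python sketch shows one plausible shape of a mini-batch loop with AdaDelta updates. It is an assumption-laden illustration, not the algorithm itself: the shuffling scheme, the decay rate `rho` and the constant `eps` are illustrative choices not reported in this text, and `gradient` is a caller-supplied function returning the loss gradient for one training instance.

```python
import math
import random
from typing import Callable, List, Sequence


def adadelta_minibatch(
    instances: Sequence,
    w: List[float],
    gradient: Callable[[List[float], object], List[float]],
    epochs: int,
    batch_size: int,
    rho: float = 0.95,
    eps: float = 1e-6,
    seed: int = 0,
) -> List[float]:
    """Mini-batch SGD where AdaDelta sets the per-dimension step size."""
    rng = random.Random(seed)
    w = list(w)
    avg_sq_grad = [0.0] * len(w)   # running average of squared gradients
    avg_sq_step = [0.0] * len(w)   # running average of squared updates
    order = list(range(len(instances)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = [instances[i] for i in order[start:start + batch_size]]
            grad = [0.0] * len(w)
            for inst in batch:
                for d, g in enumerate(gradient(w, inst)):
                    grad[d] += g / len(batch)
            for d, g in enumerate(grad):
                avg_sq_grad[d] = rho * avg_sq_grad[d] + (1 - rho) * g * g
                step = -math.sqrt(avg_sq_step[d] + eps) / math.sqrt(avg_sq_grad[d] + eps) * g
                avg_sq_step[d] = rho * avg_sq_step[d] + (1 - rho) * step * step
                w[d] += step
    return w


# Toy problem: each instance is a target t with loss (w - t)^2.
toy_w = adadelta_minibatch([1.0] * 20, [0.0], lambda w, t: [2 * (w[0] - t)],
                           epochs=50, batch_size=10)
```

In a listwise setting, `instances` would be the k-best lists in T and `gradient` would return the gradient of the chosen listwise loss for one list.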

Table 2: Comparison of instance aggregating and k-best merging on the extended feature set. (Net_m and MLE_m denote ListNet and ListMLE with k-best merging, respectively.)
Figure 2: Effect of different n for top-n ListMLE.

Table 3: BLEU4 (in percent) comparing baseline systems and systems with listwise losses. + and * mark results that are significantly better than the baseline system with p < 0.01 and p < 0.05, respectively. (ListMLE-T5 and ListMLE-TE refer to top-5 ListMLE and our top-rank enhanced ListMLE, respectively.)
Figure 3: Listwise losses vs. BLEU in (a) top-5 ListMLE and (b) top-rank enhanced ListMLE.

Table 4: Comparison of baselines and listwise approaches with a larger k-best list on the extended feature set.

Table 5: Comparison of baseline and listwise approaches on the basic feature set.