Non-linear Learning for Statistical Machine Translation

Modern statistical machine translation (SMT) systems usually use a linear combination of features to model the quality of each translation hypothesis. The linear combination assumes that all the features are in a linear relationship and constrains each feature to interact with the remaining features in a linear manner, which might limit the expressive power of the model and lead to an under-fit model on the current data. In this paper, we propose a non-linear modeling of the quality of translation hypotheses based on neural networks, which allows more complex interaction between features. A learning framework is presented for training the non-linear models. We also discuss possible heuristics in designing the network structure which may improve the non-linear learning performance. Experimental results show that with the basic features of a hierarchical phrase-based machine translation system, our method produces translations that are better than those of a linear model.


Introduction
One of the core problems in the research of statistical machine translation is the modeling of translation hypotheses. Each modeling method defines a score of a target sentence $e = e_1, e_2, ..., e_i, ..., e_I$, given a source sentence $f = f_1, f_2, ..., f_j, ..., f_J$, where each $e_i$ is the ith target word and $f_j$ is the jth source word. The well-known modeling method starts from the Source-Channel model (Brown et al., 1993) (Equation 1). The scoring of e decomposes to the calculation of a translation model and a language model.

$$\Pr(e \mid f) = \Pr(e)\Pr(f \mid e) / \Pr(f) \tag{1}$$

The modeling method is extended to log-linear models by Och and Ney (2002), as shown in Equation 2, where $h_m(e \mid f)$ is the mth feature function and $\lambda_m$ is the corresponding weight.

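$$\Pr(e \mid f) = \frac{\exp\left(\sum_m \lambda_m h_m(e \mid f)\right)}{\sum_{e'} \exp\left(\sum_m \lambda_m h_m(e' \mid f)\right)} \tag{2}$$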
Because the normalization term in Equation 2 is the same for all translation hypotheses of the same source sentence, the score of each hypothesis, denoted by $s_L$, is actually a linear combination of all features, as shown in Equation 3.

$$s_L(e) = \sum_m \lambda_m h_m(e \mid f) \tag{3}$$

The log-linear models are flexible enough to incorporate new features and show a significant advantage over the traditional source-channel models; they have thus become the state-of-the-art modeling method and are applied in various translation settings (Yamada and Knight, 2001; Koehn et al., 2003; Chiang, 2005; Liu et al., 2006).
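As a small illustration of why the normalization term does not affect the choice of hypothesis, the following sketch computes both the linear score and the normalized log-linear probability for a few hypotheses of one source sentence. The feature values and weights are hypothetical and chosen only for illustration.

```python
import math

def linear_score(features, weights):
    """s_L(e): weighted sum of feature values (the linear score)."""
    return sum(w * h for w, h in zip(weights, features))

def posteriors(hypotheses, weights):
    """Pr(e|f) over the hypotheses of one source sentence (log-linear model)."""
    scores = [linear_score(h, weights) for h in hypotheses]
    top = max(scores)                       # shift by the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                           # normalization term, shared by all hypotheses
    return [x / z for x in exps]

# Hypothetical feature vectors for three hypotheses of the same source sentence.
weights = [0.5, -1.0, 0.2]
hypotheses = [[1.0, 0.3, 2.0], [0.8, 0.1, 1.0], [1.2, 0.9, 3.0]]

scores = [linear_score(h, weights) for h in hypotheses]
probs = posteriors(hypotheses, weights)
best_by_score = max(range(len(hypotheses)), key=lambda i: scores[i])
best_by_prob = max(range(len(hypotheses)), key=lambda i: probs[i])
print(best_by_score, best_by_prob)
```

Both rankings select the same hypothesis, since the exponential is monotonic and the denominator is constant for a given source sentence.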
It is worth noticing that log-linear models try to separate good and bad translation hypotheses using a linear hyper-plane. However, complex interactions between features make it difficult to linearly separate good translation hypotheses from bad ones (Clark et al., 2014).
Taking features in a typical phrase-based machine translation system (Koehn et al., 2003) as an example, the language model feature favors shorter hypotheses, while the word penalty feature encourages longer hypotheses. The phrase translation probability feature selects phrases that occur more frequently in the training corpus; these are sometimes long with lower translation probability, as in translating named entities or idioms, and sometimes short but with high translation probability, as in translating verbs or pronouns. These three features jointly decide the choice of translations. Simply using the weighted sum of their values may not be the best choice for modeling translations.
As a result, log-linear models may under-fit the data. This under-fitting may prevent further improvement of translation quality.
In this paper, we propose a non-linear modeling of translation hypotheses based on neural networks. The traditional features of a machine translation system are used as the input to the network. By feeding input features to nodes in a hidden layer, complex interactions among features are modeled, resulting in much stronger expressive power than traditional log-linear models (Section 3).
Employing a neural network as a non-linear model for SMT raises two issues. The first issue is parameter learning. Log-linear models rely on minimum error rate training (MERT) (Och, 2003) to achieve best performance. When the scoring function becomes non-linear, the intersection points of these non-linear functions cannot be effectively calculated and enumerated. Thus MERT is no longer suitable for learning the parameters. To solve this problem, we present a framework for effective training, including several criteria to transform the training problem into a binary classification task, a unified objective function and an iterative training algorithm (Section 4).
The second issue is the structure of the neural network. Single-layer neural networks are equivalent to linear models; two-layer networks with sufficient nodes are capable of learning any continuous function (Bishop, 1995). Adding more layers to the network could model complex functions with fewer nodes, but also brings the problem of vanishing gradients (Erhan et al., 2009). We adopt a two-layer feed-forward neural network to keep the training process efficient. We notice that one major problem that prevents neural network training from reaching a good solution is that there are too many local minima in the parameter space. Thus we discuss how to constrain the learning of neural networks with our intuition and observations of the features (Section 5).
Experiments are conducted to compare various settings and verify the effectiveness of our proposed learning framework. Experimental results show that our framework could achieve better translation quality even with the same traditional features as previous linear models (Section 6).

Related work
Much research has attempted to bring non-linearity into the training of SMT. These efforts can be roughly divided into the following three categories.
The first line of research attempted to reinterpret original features via feature transformation or additional learning. For example, Maskey and Zhou (2012) use a deep belief network to learn representations of the phrase translation and lexical translation probability features. Clark et al. (2014) used discretization to transform real-valued dense features into a set of binary indicator features. Lu et al. (2014) learned new features using a semi-supervised deep auto encoder. These works focus on the explicit representation of the features and usually employ an extra learning procedure. Our proposed method only takes the original features, with no transformation, as input. Feature transformation or combination is performed implicitly during the training of the network and integrated with the optimization of translation quality.
The second line of research attempted to use non-linear models instead of log-linear models, which is most similar in spirit to our work. Duh and Kirchhoff (2008) used the boosting method to combine several results of MERT and achieved improvement in a re-ranking setting. Liu et al. (2013) proposed an additive neural network which employed a two-layer neural network for embedding-based features. To avoid local minima, they still rely on pre-training and post-training from MERT or PRO. Compared to these efforts, our proposed method takes a further step in that it is integrated with iterative training, instead of re-ranking, and works without the help of any pre-trained linear models.
The third line of research attempted to add non-linear features/components into the log-linear learning framework. Neural network based models are trained as language models (Vaswani et al., 2013; Auli and Gao, 2014), translation models (Gao et al., 2014) or joint language and translation models (Auli et al., 2013; Devlin et al., 2014). Liu et al. (2013) also introduced word embeddings for the source and target sides of translation rules as local features. In this paper we focus on enhancing the expressive power of the modeling, which is independent of the research on enhancing translation systems with newly designed features. We believe additional improvement could be achieved by incorporating more features into our framework.

Non-linear Translation
The non-linear modeling of translation hypotheses could be used in both phrase-based and syntax-based systems. In this paper, we take the hierarchical phrase-based machine translation system (Chiang, 2005) as an example and introduce how we fit the non-linearity into the system.

Decoding
The basic decoding algorithm could be kept almost the same as in traditional phrase-based or syntax-based translation systems (Yamada and Knight, 2001; Koehn et al., 2003; Chiang, 2005; Liu et al., 2006). For example, in the experiments of this paper, we use a CKY style decoding algorithm following Chiang (2005).
Our non-linear translation system differs from traditional systems in the way the score of each hypothesis is calculated. Instead of calculating the score as a linear combination, we use neural networks (Section 3.2) to perform a non-linear combination of feature values.
We also use the cube-pruning algorithm (Chiang, 2005) to keep the decoding efficient. Although the non-linearity in model scores may cause more search errors in finding the highest scoring hypothesis, in practice it still achieves reasonable results.

Two-layer Neural Networks
We employ a two-layer neural network as the non-linear model for scoring translation hypotheses. The structure of a typical two-layer feed-forward neural network includes an input layer, a hidden layer, and an output layer (as shown in Figure 1).
We use the input layer to accept input features, the hidden layer to combine different input features, and the output layer, with only one node, to output the model score for each translation hypothesis based on the values of the hidden nodes. More specifically, the score of hypothesis e, denoted as $s_N$, is defined as:

$$s_N(e) = \sigma_o\left(M_o \cdot \sigma_h\left(M_h \cdot h(e \mid f) + b_h\right) + b_o\right) \tag{4}$$

where $h(e \mid f)$ is the vector of input feature values; M and b are the weight matrix and bias vector of the neural nodes, respectively; σ is the activation function, which is often set to a non-linear function such as the tanh function or the sigmoid function; and the subscripts h and o indicate the parameters of the hidden layer and the output layer, respectively.
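The scoring function of such a two-layer network can be sketched as follows. This is a minimal illustration in plain Python, assuming tanh as the activation for both layers; the function name `two_layer_score` and the toy parameter values are hypothetical and not taken from the system described here.

```python
import math

def two_layer_score(features, M_h, b_h, M_o, b_o, act=math.tanh):
    """Score one hypothesis from its feature vector.

    M_h: one weight row per hidden node, b_h: one bias per hidden node,
    M_o: one weight per hidden node (single output node), b_o: output bias.
    """
    # Hidden layer: each node applies the activation to a weighted sum of inputs.
    hidden = [act(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(M_h, b_h)]
    # Output layer: a single node combines the hidden values into the model score.
    return act(sum(w * h for w, h in zip(M_o, hidden)) + b_o)

# Toy example: two input features, two hidden nodes.
features = [1.0, 2.0]
M_h = [[1.0, 0.0], [0.0, 1.0]]
b_h = [0.0, 0.0]
M_o = [1.0, 1.0]
b_o = 0.0
score = two_layer_score(features, M_h, b_h, M_o, b_o)
print(score)
```

Because the score is only used to rank hypotheses, a monotonic output activation does not change which hypothesis is selected; it mainly affects the scale of the score seen by the loss during training.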

Features
We use the standard features of a typical hierarchical phrase-based translation system (Chiang, 2005). Adding new features into the framework is left as a future direction. The features are listed as follows:
• $p(\alpha \mid \gamma)$ and $p(\gamma \mid \alpha)$: conditional probability of translating α as γ and translating γ as α, where α and γ are the left and right hand side of an initial phrase (or hierarchical translation rule), respectively;
• $p_w(\alpha \mid \gamma)$ and $p_w(\gamma \mid \alpha)$: lexical probability of translating words in α as words in γ and translating words in γ as words in α;
• $p_{lm}$: language model probability;
• wc: accumulated count of individual words generated during translation;
• pc: accumulated count of initial phrases used;
• rc: accumulated count of hierarchical rule phrases used;
• gc: accumulated count of glue rules used in this hypothesis;
• uc: accumulated count of unknown source words;
• nc: accumulated count of source phrases that translate into null.

Non-linear Learning Framework
Traditional machine translation systems rely on MERT to tune the weights of different features. MERT performs efficient search by enumerating the score functions of all the hypotheses and using intersections of these linear functions to form the "upper-envelope" of the model score function (Och, 2003). When the scoring function is non-linear, it is not feasible to find the intersections of these functions. In this section, we discuss alternatives for training the parameters of non-linear models.

Training Criteria
The task of machine translation is a complex problem with a structured output space. Decoding algorithms search for the translation hypothesis with the highest score, according to a given scoring function, from an exponentially large set of candidate hypotheses. The purpose of training is to select the scoring function so that it scores the hypotheses "correctly". Correctness is often defined by some extrinsic metric, such as BLEU (Papineni et al., 2002).
We denote the scoring function as $s(f, e; \theta)$, or simply s, which is parametrized by θ; denote the set of all candidate hypotheses as C; and denote the extrinsic metric as eval(·). Note that, in linear cases, s is a linear function as in Equation 3, while in the non-linear case described in this paper, s is the scoring function in Equation 4.
Ideally, the training objective is to select a scoring function s, from all functions S, that scores the correct translation (or reference), denoted as ê, higher than any other hypothesis (Equation 5).
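In the notation introduced above, one way to write this ideal objective is the following. This is a reconstruction from the surrounding description only, so the exact form of the original equation may differ.

```latex
\text{find } s \in S \quad \text{s.t.} \quad
s(f, \hat{e}; \theta) > s(f, e; \theta)
\quad \forall e \in C,\ e \neq \hat{e}
```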
In practice, the candidate set C is exponentially large and hard to enumerate; the correct translation ê may not even exist in the current search space for various reasons, e.g. unknown source words. As a result, we seek the following three alternatives as approximations to the ideal objective.

Best v.s. Rest (BR)
To score the best hypothesis in the n-best set, ẽ, higher than the remaining hypotheses. This objective is very similar to MERT in that it tries to optimize the score of ẽ and is not concerned with the ranking of the remaining hypotheses. In this case, the n-best set $C_{nbest}$ is used to approximate C, and ẽ to approximate ê.

Best v.s. Worst (BW)
To score the best hypothesis higher than the worst hypothesis in the n-best set. This objective is motivated by the practice of separating the "hope" and "fear" translation hypotheses (Chiang, 2012). We take a simpler strategy which uses the best and worst hypotheses in $C_{nbest}$ as the "hope" and "fear" hypotheses, respectively, in order to avoid multi-pass decoding.

Pairwise (PW)
To score the better hypothesis in each sampled hypothesis pair higher than the worse one in the same pair. This objective is adapted from Pairwise Ranking Optimization (PRO) (Hopkins and May, 2011), which tries to rank all the hypotheses instead of selecting the best one. We use the same sampling strategy as their original paper.
Note that each of the above criteria transforms the original problem of selecting the best hypothesis from an exponential space into a pair-wise comparison problem, which can be easily trained as a standard binary classifier.
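To make the first two criteria concrete, the sketch below builds hypothesis pairs from an n-best list under the BR and BW criteria. It assumes each entry of the n-best list is a (hypothesis, eval score) tuple; the function names and the toy scores are hypothetical. The PW criterion is not shown, since it depends on the sampling details of PRO.

```python
def best_vs_rest(nbest):
    """BR: pair the best hypothesis with every other hypothesis in the n-best list."""
    best_hyp, _ = max(nbest, key=lambda item: item[1])
    return [(best_hyp, hyp) for hyp, _ in nbest if hyp != best_hyp]

def best_vs_worst(nbest):
    """BW: pair the best hypothesis with the worst one only."""
    best_hyp, _ = max(nbest, key=lambda item: item[1])
    worst_hyp, _ = min(nbest, key=lambda item: item[1])
    return [(best_hyp, worst_hyp)] if best_hyp != worst_hyp else []

# Toy n-best list: (hypothesis, sentence-level eval score); values are made up.
nbest = [("a", 0.3), ("b", 0.5), ("c", 0.1)]
br_pairs = best_vs_rest(nbest)
bw_pairs = best_vs_worst(nbest)
print(br_pairs, bw_pairs)
```

In every pair the first element is the hypothesis that should receive the higher model score, which is what turns each pair into a binary classification instance.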

Training Objective
For the binary classification task, we use a hinge loss following Watanabe (2012). Because the network has many more parameters than the linear model, we use an $L_1$ norm instead of an $L_2$ norm as the regularization term, to favor sparse solutions. We define our training objective function in Equation 6.
D is the given training data; $(e_1, e_2)$ is a training hypothesis pair, with the assumption that $e_1$ is the one with the higher eval(·) score; N is the total number of hypothesis pairs in D; T is the set of hypothesis pairs for each source sentence. The set T is decided by the criterion used for training. For the BR setting, the best hypothesis is paired with every other hypothesis in the n-best list (Equation 7), while for the BW setting, it is only paired with the worst hypothesis (Equation 8). The generation of T in the PW setting is the same as PRO sampling; we refer the readers to the original paper of Hopkins and May (2011).
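An objective of this kind, an average hinge loss over hypothesis pairs plus an $L_1$ penalty, can be sketched as follows. This is an illustration under stated assumptions rather than the exact objective used here: the margin of 1, the regularization weight `l1_weight` and the linear `dot_score` used for the demonstration are all hypothetical choices.

```python
def hinge_loss(score_better, score_worse, margin=1.0):
    """Zero when the better hypothesis outscores the worse one by at least the margin."""
    return max(0.0, margin - (score_better - score_worse))

def objective(pairs, score_fn, theta, l1_weight=0.01):
    """Average hinge loss over hypothesis pairs plus an L1 penalty on the parameters."""
    data_term = sum(hinge_loss(score_fn(e1, theta), score_fn(e2, theta))
                    for e1, e2 in pairs) / len(pairs)
    reg_term = l1_weight * sum(abs(t) for t in theta)
    return data_term + reg_term

def dot_score(features, theta):
    """Stand-in scoring function; any s(f, e; theta) could be plugged in instead."""
    return sum(t * h for t, h in zip(theta, features))

theta = [1.0, -1.0]
# Each pair is (features of the better hypothesis, features of the worse one).
pairs = [([2.0, 0.0], [0.0, 0.0]),   # correctly ordered with enough margin: loss 0
         ([0.0, 0.0], [1.0, 0.0])]   # wrongly ordered: loss 2
value = objective(pairs, dot_score, theta)
print(value)
```

The $L_1$ term pushes individual weights to exactly zero, which is the sense in which it favors sparse solutions.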

Training Procedure
In standard training algorithms for classification, the training instances stay the same in each iteration. In machine translation, decoding algorithms usually return a very different n-best set with different parameters. This is due to the exponentially large size of the search space. MERT and PRO extend the current n-best set by merging the n-best sets of all previous iterations into a pool (Papineni et al., 2002; Hopkins and May, 2011). In this way, the enlarged n-best set may give a better approximation of the true hypothesis set C and may lead to better and more stable training results.
We argue that the training should still focus on hypotheses obtained in the current round, because in each iteration the search for the n-best set is independent of previous iterations. As a compromise between these two goals, in our practice, training hypothesis pairs are first generated from the current n-best set, then merged with the pairs generated from all previous iterations. In order to make the model focus more on pairs from the current iteration, we assign pairs from previous iterations a small constant weight and assign pairs from the current iteration a relatively large constant weight. This is inspired by the AdaBoost algorithm (Schapire, 1999) in weighting instances.
Following the spirit of MERT, we propose an iterative training procedure (Algorithm 1).
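As shown in Algorithm 1, the training procedure starts by randomly initializing the model parameters $\theta_0$ (line 1). In the ith iteration, the decoding algorithm decodes each sentence f to get the n-best set $C_{nbest}$ (line 5). Training hypothesis pairs T are extracted from $C_{nbest}$ according to the training criterion described in Section 4.2 (line 6). Newly collected pairs $T_i$ are combined with pairs from previous iterations before being used for training (line 9). $\theta_{i+1}$ is obtained by solving Equation 6 using the Conjugate Sub-Gradient method (Le et al., 2011), i.e. $\theta_{i+1} \leftarrow Optimize(T_{all}, \theta_i)$ (line 10).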
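The overall loop can be sketched in Python as follows. The decoder, pair generator and optimizer are passed in as callables because their implementations are outside the scope of this sketch; the names `iterative_training` and `weighted_combine` and the weights 0.1 and 1.0 are hypothetical choices for illustration, not values reported here.

```python
def weighted_combine(previous_pairs, current_pairs, old_weight=0.1, new_weight=1.0):
    """Attach a small weight to pairs from earlier iterations and a large one to new pairs."""
    return ([(pair, old_weight) for pair in previous_pairs] +
            [(pair, new_weight) for pair in current_pairs])

def iterative_training(sentences, theta, decode_nbest, generate_pairs, optimize, max_iter):
    previous_pairs = []
    for _ in range(max_iter):
        current_pairs = []
        for f in sentences:
            nbest = decode_nbest(f, theta)            # n-best decoding with current parameters
            current_pairs.extend(generate_pairs(nbest))
        pool = weighted_combine(previous_pairs, current_pairs)
        theta = optimize(pool, theta)                 # e.g. minimize the training objective
        previous_pairs.extend(current_pairs)
    return theta

# Toy stand-ins so the loop can be run end to end; they do no real decoding or learning.
def toy_decode(f, theta):
    return [(f + "-x", 0.2), (f + "-y", 0.8)]

def toy_pairs(nbest):
    best = max(nbest, key=lambda item: item[1])[0]
    worst = min(nbest, key=lambda item: item[1])[0]
    return [(best, worst)]

def toy_optimize(pool, theta):
    return theta + sum(weight for _, weight in pool)  # just accumulates instance weights

final_theta = iterative_training(["s1", "s2"], 0.0, toy_decode, toy_pairs, toy_optimize, 2)
print(final_theta)
```

With the toy stand-ins, the second iteration sees two old pairs at weight 0.1 and two new pairs at weight 1.0, which shows how earlier instances stay in the pool while contributing less.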

Structure of the Network
Although neural networks bring strong expressive power to the modeling of translation hypotheses, training a neural network is prone to ending in a local minimum, which may affect the training results. We speculate that one reason for these local minima is that a fully connected network has too many parameters. Take a neural network with k nodes in the input layer and m nodes in the hidden layer as an example. Every node in the hidden layer is connected to each of the k input nodes. This simple structure results in at least k × m parameters.
In Section 4.2, we use an $L_1$ norm in the objective function in order to get sparser solutions. In this section, we propose some constrained network structures according to our prior knowledge of the features. These structures have far fewer parameters or simpler structures compared to the original neural networks, and thus reduce the possibility of getting stuck in local minima.

Network with two-degree Hidden Layer
We find that the first pitfall of the standard two-layer neural network is that each node in the hidden layer receives input from every input layer node. Features used in SMT are usually manually designed and have concrete meanings. For a network with several hidden nodes, combining every feature into every hidden node may be redundant and unnecessary for representing the quality of a hypothesis.
As a result, we take a harsh step and constrain the nodes in the hidden layer to have an in-degree of two, which means each hidden node only accepts inputs from two input nodes. We do not use any other prior knowledge about features in this setting. So for a network with k nodes in the input layer, the hidden layer should contain $C_k^2 = k(k-1)/2$ nodes to accept all combinations from the input layer. We name this network structure the Two-Degree Hidden Layer Network (TDN).
It is easy to see that a TDN has $C_k^2 \times 2 = k(k-1)$ parameters for the hidden layer because of the constrained degree. This is one order of magnitude less than a standard two-layer network with the same number of hidden nodes, which has $C_k^2 \times k = k^2(k-1)/2$ parameters for the hidden layer.
Note that we perform a 2-degree combination that looks similar in spirit to the combination of atomic features in large scale discriminative learning for other NLP tasks, such as POS tagging and parsing. However, unlike the practice in these tasks, which directly combines values of different features to generate a new feature type, we first linearly combine the values of these features and then perform a non-linear transformation on the result via an activation function.
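The connectivity and parameter counts of a TDN can be checked with a few lines of Python. The sketch below enumerates one hidden node per unordered pair of input nodes and compares the number of hidden-layer weights with that of a fully connected layer of the same width (biases are not counted); the function names are hypothetical, and k = 11 matches the number of features listed earlier.

```python
from itertools import combinations

def tdn_connections(k):
    """One hidden node per unordered pair of input nodes (in-degree two)."""
    return list(combinations(range(k), 2))

def tdn_hidden_weight_count(k):
    """Each hidden node has exactly two incoming weights."""
    return 2 * len(tdn_connections(k))

def dense_hidden_weight_count(k, m):
    """A fully connected hidden layer with m nodes has k weights per node."""
    return k * m

k = 11
connections = tdn_connections(k)
m = len(connections)                       # number of hidden nodes, k(k-1)/2
tdn_weights = tdn_hidden_weight_count(k)   # k(k-1)
dense_weights = dense_hidden_weight_count(k, m)
print(m, tdn_weights, dense_weights)
```

For 11 input features this gives 55 hidden nodes, 110 hidden-layer weights for the TDN and 605 for a fully connected layer of the same width.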

Network with Grouped Features
Requiring every hidden node to have an in-degree of 2 might be too strong a constraint. In order to relax this constraint, we need more prior knowledge about the features. Our first observation is that there are different types of features. These types differ from each other in terms of value ranges, sources, importance, etc. For example, language model features usually take very small probability values, while the word count feature takes an integer value and usually has a much higher weight in the linear case than other count features.
The second observation is that features in the same group are basically of the same type and may not have complex interactions with each other. For example, it is reasonable to combine language model features with word count features in a hidden node. But it may not be necessary to combine the count of initial phrases and the count of unknown words in a hidden node.
Based on the above two intuitions, we design a new network structure that has the following constraint: given a disjoint partition of features $G_1, G_2, ..., G_k$, every hidden node takes input from a set of input nodes, where any two nodes in this set come from two different feature groups. We name this network structure the Grouped Network (GN).
In practice, we divide the basic features in Section 3.3 into five groups: language model features, translation probability features, lexical probability features, the word count feature, and the rest of count features.
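The grouping constraint can be expressed as a simple validity check on the inputs of a hidden node. The sketch below uses the five groups named above with hypothetical string labels for the features, and, purely as an example, counts how many two-input hidden nodes satisfy the constraint; the actual number and in-degree of hidden nodes in GN are not specified here.

```python
from itertools import combinations

# Hypothetical labels for the 11 basic features, partitioned into five groups.
GROUPS = {
    "language_model": ["p_lm"],
    "translation_prob": ["p(a|g)", "p(g|a)"],
    "lexical_prob": ["pw(a|g)", "pw(g|a)"],
    "word_count": ["wc"],
    "other_counts": ["pc", "rc", "gc", "uc", "nc"],
}
group_of = {feat: name for name, feats in GROUPS.items() for feat in feats}

def is_valid_hidden_node(inputs):
    """True if no two inputs of the hidden node come from the same feature group."""
    groups = [group_of[feat] for feat in inputs]
    return len(groups) == len(set(groups))

features = sorted(group_of)
valid_pairs = [pair for pair in combinations(features, 2) if is_valid_hidden_node(pair)]
print(len(valid_pairs))
```

Under this partition, a node combining the language model and word count features is allowed, while a node combining the initial phrase count and the unknown word count is not, which matches the example given above.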

General Settings
We conduct experiments on a large scale machine translation task. The parallel data comes from LDC, including LDC2002E18, LDC2003E14, LDC2004E12, LDC2004T08, LDC2005T10 and LDC2007T09, which together consist of 8.6 million sentence pairs. Monolingual data includes the Xinhua portion of the Gigaword corpus. We use the multi-reference data MT03 as training data, MT02 as development data, and MT04 and MT05 as test data. These data are mainly in the same genre, avoiding the extra consideration of domain adaptation. The Chinese side of the corpora is word segmented using ICTCLAS. Our translation system is an in-house implementation of the hierarchical phrase-based translation system (Chiang, 2005). We set the beam size to 20. We train a 5-gram language model on the monolingual data with MKN smoothing (Chen and Goodman, 1998). For each parameter tuning experiment, we ran the same training procedure 3 times and present the average results.

Data
The translation quality is evaluated using 4-gram case-insensitive BLEU (Papineni et al., 2002). Significance testing is performed using bootstrap re-sampling as implemented by Clark et al. (2011).

Experiments of Training Criteria
This set of experiments evaluates the different training criteria discussed in Section 4.1. We generate hypothesis pairs according to the BW, BR and PW criteria, respectively, and perform training with these pairs. For the PW criterion, we use the sampling method of PRO (Hopkins and May, 2011) and get 50 hypothesis pairs for each sentence. We use 20 hidden nodes for all three settings to make a fair comparison.
The results are presented in Table 2. The first two rows compare training with and without the weighted combination of hypothesis pairs discussed in Section 4.3. As the results suggest, with the weighted combination of hypothesis pairs from previous iterations, the performance improves significantly on both test sets.
Although the system performance on the dev set varies, the performance on the test sets is almost comparable. This suggests that although the three training criteria are based on different assumptions, they are basically equivalent for training translation systems.
We also compare the three training criteria in terms of their number of new instances per iteration and final training accuracy (Table 3). The accuracies in Table 3 are those each system achieves after training stops. They are not calculated on the same set of instances and are thus not directly comparable. We use the differences in accuracy as an indicator of the difficulty of the corresponding learning task. For the rest of this paper, we use the BW criterion because it is much simpler than the sampling method of PRO (Hopkins and May, 2011).

Experiments of Network Structures
We make several comparisons of the network structures and compare them with a baseline hierarchical phrase-based translation system (HPB) (Table 4).
We first compare neural networks with different numbers of hidden nodes. The systems TLayer 20, TLayer 30 and TLayer 50 are standard two-layer feed-forward neural networks with 20, 30 and 50 hidden layer nodes, respectively (TLayer 20 is the same system as BW in the training criteria experiments). We can see that training a larger network does lead to an improvement in translation quality. However, training a larger network is often time-consuming. We experimented with neural networks with 100 or more hidden nodes (TLayer 100), but TLayer 100 takes 10 times longer in training time for each iteration than TLayer 20 and did not finish by the submission deadline.
Table 4: BLEU4 in percentage for comparing systems using different network structures (HPB refers to the baseline hierarchical phrase-based system. TLayer, TDN and GN refer to the standard two-layer network, Two-Degree Hidden Layer Network and Grouped Network, respectively. The subscript of TLayer indicates the number of nodes in the hidden layer). + and * mark results that are significantly better than the baseline system with p < 0.01 and p < 0.05, respectively.
We then compare the two network structures proposed in Section 5. The Two-Degree Hidden Layer Network (TDN) already performs comparably to the baseline system. But it constrains every hidden node to have an in-degree of 2, which is likely to be too restrictive. With the grouped features, we can design networks such as GN, which shows significant improvement over the baseline system and achieves the best performance among all neural systems. Note that GN is much larger in scale, but is also sparse in parameters and takes significantly less training time than standard neural networks.

Conclusion
In this paper, we discuss a non-linear framework for modeling translation hypotheses in statistical machine translation systems. We also present a learning framework, including training criteria and algorithms, to integrate our modeling into a state-of-the-art hierarchical phrase-based machine translation system. Compared to previous efforts to bring non-linearity into machine translation, our method uses a single two-layer neural network and performs training independently of any previous linear training method (e.g. MERT). Our method also trains its parameters without any pre-training or post-training procedure. Experiments show that our method can improve over the baseline system even with the same features as input, in a large scale Chinese-English machine translation task.
In training neural networks with hidden nodes, we use heuristics to reduce the complexity of network structures and obtain extra advantages over standard networks. This shows that heuristics and intuitions about the data and features are still important to a machine translation system.
As future work, it is necessary to integrate more features into our learning framework. It would also be interesting to see how the non-linear modeling fits into more complex learning tasks which involve domain specific learning techniques.

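$$T_{BR} = \{(e_1, e_2) \mid e_1 = \arg\max_{e \in C_{nbest}} eval(e),\ e_2 \in C_{nbest} \text{ and } e_1 \neq e_2\} \tag{7}$$

$$T_{BW} = \{(e_1, e_2) \mid e_1 = \arg\max_{e \in C_{nbest}} eval(e),\ e_2 = \arg\min_{e \in C_{nbest}} eval(e)\} \tag{8}$$

Algorithm 1 Iterative Training Algorithm
Input: the set of training sentences D, max number of iterations I
1: $\theta_0 \leftarrow RandomInit()$
2: for i = 0 to I do
...
5: decode sentence f to get the n-best set $C_{nbest}$
6: extract training hypothesis pairs T from $C_{nbest}$ according to the training criterion described in Section 4.2
...
9: combine the newly collected pairs $T_i$ with pairs from previous iterations
10: $\theta_{i+1} \leftarrow Optimize(T_{all}, \theta_i)$
11: end for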

Table 1: Experimental data and statistics.

We employ a two-layer neural network with 11 input layer nodes,

(Table 3). Compared to BR, which tries to separate the best hypothesis from the remaining hypotheses in the n-best set, and PW, which tries to obtain a correct ranking of all hy-