The Geometry of Statistical Machine Translation

Most modern statistical machine translation systems are based on linear statistical models. One extremely effective method for estimating the model parameters is minimum error rate training (MERT), which is an efficient form of line optimisation adapted to the highly non-linear objective functions used in machine translation. We describe a polynomial-time generalisation of line optimisation that computes the error surface over a plane embedded in parameter space. The description of this algorithm relies on convex geometry, which is the mathematics of polytopes and their faces. Using this geometric representation of MERT we investigate whether the optimisation of linear models is tractable in general. Previous work on finding optimal solutions in MERT (Galley and Quirk, 2011) established a worst-case complexity that was exponential in the number of sentences. In contrast, we show that the exponential dependence in the worst-case complexity is mainly in the number of features. Although our work is framed with respect to MERT, the convex geometric description is also applicable to other error-based training methods for linear models. We believe our analysis has important ramifications because it suggests that the current trend in building statistical machine translation systems by introducing a very large number of sparse features is inherently not robust.


Introduction
The linear model of Statistical Machine Translation (SMT) (Och and Ney, 2002) casts translation as a search for the hypothesis $e$ that maximises the model score of a source sentence $f$:
$$\hat{e}(f; w) = \operatorname{argmax}_{e} \{ w\, h(e, f) \} \qquad (1)$$
where translation scores are a linear combination of the $D \times 1$ feature vector $h(e, f) \in \mathbb{R}^D$ under the $1 \times D$ model parameter vector $w$. Convex geometry (Ziegler, 1995) is the mathematics of such linear equations presented as the study of convex polytopes. We use convex geometry to show that the behaviour of training methods such as MERT (Och, 2003; Macherey et al., 2008), MIRA (Crammer et al., 2006), PRO (Hopkins and May, 2011), and others converges at high feature dimension. In particular we analyse how robustness decreases in linear models as feature dimension increases. We believe that severe overtraining is a problem in many current linear model formulations due to this lack of robustness.
In the process of building this geometric representation of linear models we discuss algorithms such as the Minkowski sum algorithm (Fukuda, 2004) and projected MERT (Section 4.2) that could be useful for designing new and more robust training algorithms for SMT and other natural language processing problems.

Training Linear Models
Let $f_1 \ldots f_S$ be a set of $S$ source language sentences with reference translations $r_1 \ldots r_S$. The goal is to estimate the model parameter vector $w$ so as to minimise an error count based on an automated metric, such as BLEU (Papineni et al., 2002), assumed to be additive over sentences:
$$\hat{w} = \operatorname{argmin}_{w} \sum_{s=1}^{S} E(\hat{e}(f_s; w), r_s) \qquad (2)$$
Optimisation can be made tractable by restricting the search to rescoring of K-best lists of translation hypotheses, $\{e_{s,i}, 1 \le i \le K\}_{s=1}^{S}$. For $f_s$, let $h_{s,i} = h(e_{s,i}, f_s)$ be the feature vector associated with hypothesis $e_{s,i}$. Restricted to these lists, the general decoder of Eqn. (1) becomes
$$\hat{e}(f_s; w) = \operatorname{argmax}_{e_{s,i}} \{ w\, h(e_{s,i}, f_s) \} \qquad (3)$$
Although the objective function in Eqn. (2) cannot be solved analytically, MERT as described by Och (2003) can be performed over the K-best lists. The line optimisation procedure considers a subset of parameters defined by the line $w^{(0)} + \gamma d$, where $w^{(0)}$ corresponds to an initial point in parameter space and $d$ is the direction along which to optimise. Eqn. (3) can be rewritten as:
$$\hat{e}(f_s; \gamma) = \operatorname{argmax}_{e_{s,i}} \{ w^{(0)} h_{s,i} + \gamma\, d\, h_{s,i} \} \qquad (4)$$
Line optimisation reduces the D-dimensional procedure in Eqn. (2) to a 1-dimensional problem that can be easily solved using a geometric algorithm for many source sentences (Macherey et al., 2008).
More recently, Galley and Quirk (2011) have introduced linear programming MERT (LP-MERT) as an exact search algorithm that reaches the global optimum of the training criterion. A hypothesis $e_{s,i}$ from the sth K-best list can be selected by the decoder only if
$$w (h_{s,j} - h_{s,i}) \le 0 \quad \text{for } 1 \le j \le K \qquad (5)$$
for some parameter vector $w \ne 0$. If such a solution exists then the system of inequalities is feasible, and defines a convex region in parameter space within which any parameter $w$ will yield $e_{s,i}$. Testing the system of inequalities in (5) and finding a parameter vector can be cast as a linear programming feasibility problem (Galley and Quirk, 2011), and this can be extended to find a parameter vector that optimises Eqn. (2) over a collection of K-best lists. We discuss the complexity of this operation in Section 4.1.
Hopkins and May (2011) note that for the sth source sentence, the parameter $w$ that correctly ranks its K-best list must satisfy the following set of constraints for $1 \le i, j \le K$:
$$w (h_{s,j} - h_{s,i}) \le 0 \quad \text{if } \Delta(e_{s,i}, e_{s,j}) \le 0 \qquad (6)$$
where $\Delta$ computes the difference in error between two hypotheses. The difference vectors $(h_{s,j} - h_{s,i})$ associated with each constraint can be used as input vectors for a binary classification problem in which the aim is to predict whether the difference in error $\Delta(e_{s,i}, e_{s,j})$ is positive or negative. Hopkins and May (2011) call this algorithm Pairwise Ranking Optimisation (PRO). Because there are $SK^2$ difference vectors across all source sentences, a subset of constraints is sampled in the original formulation; with efficient calculation of rankings, sampling can be avoided (Dreyer and Dong, 2015).
The online error-based training algorithm MIRA (Crammer et al., 2006) is also used for SMT (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012).
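To make the line optimisation step concrete: along the line $w^{(0)} + \gamma d$ each hypothesis scores as a straight line in $\gamma$, so the decoder output as a function of $\gamma$ is the upper envelope of those lines. The Python below is a minimal sketch under that reading for a single K-best list. It is not the implementation of Och (2003) or Macherey et al. (2008); the function name `upper_envelope` and the toy feature values are hypothetical.

```python
from math import inf


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def upper_envelope(feats, w0, d):
    """Return [(gamma_start, hypothesis_index), ...] in increasing gamma.

    Hypothesis i scores w0.h_i + gamma * d.h_i, i.e. a line with
    intercept w0.h_i and slope d.h_i. Each returned entry says which
    hypothesis is the decoder output from gamma_start onwards.
    """
    lines = sorted((dot(d, h), dot(w0, h), i) for i, h in enumerate(feats))
    # Among lines with equal slope only the highest intercept can win.
    unique = []
    for slope, icpt, idx in lines:
        if unique and unique[-1][0] == slope:
            unique.pop()
        unique.append((slope, icpt, idx))
    envelope = []  # entries: (gamma_start, slope, intercept, index)
    for slope, icpt, idx in unique:
        start = -inf
        while envelope:
            g0, s0, c0, _ = envelope[-1]
            # Crossing point with the current right-most segment.
            start = (c0 - icpt) / (slope - s0)
            if start > g0:
                break
            envelope.pop()  # that segment is never the maximum
            start = -inf
        envelope.append((start, slope, icpt, idx))
    return [(g, idx) for g, _, _, idx in envelope]


# Toy 2-feature K-best list (hypothetical values).
feats = [(1.0, -1.0), (2.0, 0.0), (0.0, 1.0), (0.5, 0.0)]
segments = upper_envelope(feats, w0=(1.0, 0.0), d=(0.0, 1.0))
print(segments)
```

In this toy list the fourth hypothesis never appears in the output, because it has the same slope as the second but a lower intercept. Merging the segment boundaries of all sentences gives the points at which the corpus-level error count can change.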
Using a sentence-level error function, a set of $S$ oracle hypotheses is indexed with the vector $\hat{i}$. For a given $s$, the objective at iteration $n + 1$ is minimised subject to $\xi_j \ge 0$ and to a margin constraint (Eqn. (7)) for each $1 \le j \le K$, $\hat{i}_s \ne j$, where $\{\xi\}$ are slack variables added to allow infeasible solutions, and $C$ controls the trade-off between error minimisation and margin maximisation. The online nature of the optimiser results in complex implementations; therefore batch versions of MIRA have been proposed (Cherry and Foster, 2012; Gimpel and Smith, 2012).
Although MERT, LP-MERT, PRO, and MIRA carry out their search in very different ways, we can compare them in terms of the constraints they are attempting to satisfy. A feasible solution for LP-MERT is also an optimal solution for MERT, and vice versa. The constraints (Eqn. (5)) that define LP-MERT are a subset of the constraints (Eqn. (6)) that define PRO, and so a feasible solution for PRO will also be feasible for LP-MERT; however, the converse is not necessarily true. The constraints that define MIRA (Eqn. (7)) are similar to the LP-MERT constraints (5), although with the addition of slack variables and the $\Delta$ function to handle infeasible solutions. However, if a feasible solution is available for MIRA, then these extra quantities are unnecessary. With these quantities removed, we recover a 'hard-margin' optimiser, which utilises the same constraint set as in LP-MERT. In the feasible case, the solution found by MIRA is also a solution for LP-MERT.
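The relationship between the LP-MERT and PRO constraint sets can be checked on a toy K-best list. The sketch below assumes that the LP-MERT condition means "the oracle hypothesis has the highest model score" and that the PRO condition means "a hypothesis with strictly lower error never scores below one with higher error". The function names, feature values and error counts are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def lp_mert_feasible(w, feats, i):
    """True if w(h_j - h_i) <= 0 for every j, i.e. the decoder can pick i."""
    return all(dot(w, sub(h, feats[i])) <= 0 for h in feats)


def pro_feasible(w, feats, errors):
    """True if w ranks every pair consistently with its error counts."""
    K = len(feats)
    for i in range(K):
        for j in range(K):
            # i is strictly better than j, so j must not outscore i.
            if errors[i] < errors[j] and dot(w, sub(feats[j], feats[i])) > 0:
                return False
    return True


# Toy 2-feature list: hypothesis 2 is the oracle (lowest error).
feats = [(10, 0), (0, 10), (6, 6)]
errors = [2, 1, 0]
oracle = errors.index(min(errors))

for w in [(1, 1), (11, 10), (2, 1)]:
    print(w, lp_mert_feasible(w, feats, oracle), pro_feasible(w, feats, errors))
```

With these values, `(1, 1)` satisfies both constraint sets, `(11, 10)` selects the oracle but misranks the two remaining hypotheses, and `(2, 1)` satisfies neither. This matches the statement that PRO feasibility implies LP-MERT feasibility but not the converse.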

Survey of Recent Work
One avenue of SMT research has been to add as many features as possible to the linear model, especially in the form of sparse features (Chiang et al., 2009; Hopkins and May, 2011; Cherry and Foster, 2012; Gimpel and Smith, 2012; Flanigan et al., 2013; Galley et al., 2013; Green et al., 2013). The assumption is that the addition of new features will improve translation performance. It is interesting to read the justification for many of these works as stated in their abstracts. For example, Hopkins and May (2011) state: "We establish PRO's scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems". Cherry and Foster (2012) state: "Among other results, we find that a simple and efficient batch version of MIRA performs at least as well as training online". Along similar lines, Gimpel and Smith (2012) state: "[We] present a training algorithm that is easy to implement and that performs comparable to others". In defence of MERT, Galley et al. (2013) state: "Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets". Green et al. (2013) also note that feature-rich models are rarely used in annual MT evaluations, an observation they use to motivate an investigation into adaptive learning rate algorithms.
Why do such different methods give such remarkably 'comparable' performance in research settings? And why is it so difficult to get general and unambiguous improvements through the use of high-dimensional, sparse features? We believe that the explanation lies in feasibility. If the oracle index vector $\hat{i}$ is feasible then all training methods will find very similar solutions. Our belief is that as the feature dimension increases, the chance of an oracle index vector being feasible also increases.

Convex Geometry
We now build on the description of LP-MERT to give a geometric interpretation to training linear models. We first give a concise summary of the fundamentals of convex geometry as presented by Ziegler (1995), after which we work through the example in Cer et al. (2008) to provide an intuition behind these concepts.

Convex Geometry Fundamentals
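In this section we reference definitions from convex geometry (Ziegler, 1995) in a form that allows us to describe SMT model parameter optimisation.
Vector Space The real valued vector space $\mathbb{R}^D$ represents the space of all finite D-dimensional feature vectors.
Dual Vector Space The dual vector space $(\mathbb{R}^D)^*$ is the space of real linear functions $\mathbb{R}^D \to \mathbb{R}$.
Polytope The polytope $H_s \subseteq \mathbb{R}^D$ is the convex hull of the finite set of feature vectors associated with the $K$ hypotheses for the sth sentence, i.e. $H_s = \operatorname{conv}(h_{s,1}, \ldots, h_{s,K})$.
Face For $w \in (\mathbb{R}^D)^*$, a face of $H_s$ is the set of points that attain the maximum model score under $w$:
$$F = \{ h \in H_s : w h = \max_{h' \in H_s} w h' \} \qquad (8)$$
Vertex A face consisting of a single point is called a vertex. The set of vertices of $H_s$ is denoted $\operatorname{vert}(H_s)$.
Edge An edge is a face in the form of a line segment between two vertices $h_{s,i}$ and $h_{s,j}$, written $[h_{s,i}, h_{s,j}]$. If the edge exists, then a modified form of the system in (5), in which $h_{s,i}$ and $h_{s,j}$ tie for the highest model score, is feasible, and the edge defines a decision boundary in $(\mathbb{R}^D)^*$ between the parameters that maximise $h_{s,i}$ and those that maximise $h_{s,j}$.
Normal Cone For the face $F$ in polytope $H_s$ the normal cone $N_F$ is the set of parameters under which every point of $F$ attains the maximum model score:
$$N_F = \{ w \in (\mathbb{R}^D)^* : w h = \max_{h' \in H_s} w h' \ \ \forall h \in F \}$$
If the face is a vertex $F = \{h_{s,i}\}$ then its normal cone $N_{\{h_{s,i}\}}$ is the set of feasible parameters that satisfy the system in (5).
Normal Fan The set of all normal cones associated with the faces of $H_s$ is called the normal fan $N(H_s)$.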

Drawing a Normal Fan
Following the example in Cer et al. (2008) we analyse a system based on two features: the translation model $P_{TM}(f|e)$ and the language model $P_{LM}(e)$. For brevity we omit the common sentence index, so that $h_i = h_{s,i}$. The system produces a set of four hypotheses which yield four feature vectors (Table 1). To this set of four hypotheses, we add a fifth hypothesis and feature vector $h_5$ to illustrate an infeasible solution. These feature vectors are plotted in Figure 1. Each of the feature vectors $h_1, \ldots, h_4$ satisfies the conditions for a vertex in Eqn. (8), because we can draw a decision boundary that intersects the vertex and no other $h \in H$. We also note that $h_5$ is not a vertex, and is redundant to the description of $H$.
Figure 1 of Cer et al. (2008) actually shows a normal fan, although it is not described as such. We now describe how this geometric object is constructed step by step in Figure 2. In Part (a) we identify the edge $[h_4, h_1]$ in $\mathbb{R}^2$ with a decision boundary represented by a dashed line. We have also drawn a vector $w$ normal to the decision boundary that satisfies Eqn. (8). This parameter would result in a tied model score such that $w h_4 = w h_1$. When moving to $(\mathbb{R}^2)^*$ we see that the normal cone $N_{[h_4, h_1]}$ is a ray parallel to $w$. This ray can be considered as the set of parameter vectors that yield the edge $[h_4, h_1]$. The ray is also a decision boundary in $(\mathbb{R}^2)^*$, with parameters on either side of the decision boundary maximising either $h_4$ or $h_1$. Any vector parallel to the edge $[h_4, h_1]$, such as $(h_1 - h_4)$, can be used to define this decision boundary in $(\mathbb{R}^2)^*$.
Next in Part (b), with the same procedure we define the normal cone for the edge $[h_3, h_1]$. Now both the edges from Parts (a) and (b) share the vertex $h_1$. This implies that any parameter vector that lies between the two decision boundaries (i.e. between the two rays $N_{[h_3, h_1]}$ and $N_{[h_4, h_1]}$) would maximise the vertex $h_1$: this is the set of vectors that comprise the normal cone of the vertex, $N_{\{h_1\}}$. Finally in Part (d) we draw the full fan. We have omitted the axes in $(\mathbb{R}^2)^*$ for clarity. The normal cones for all 4 vertices have been identified.
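The construction can be mimicked numerically for two features. The sketch below finds the vertices of a small point set with a standard convex hull routine (Andrew's monotone chain) and bounds each vertex's normal cone by the outward normals of its two incident edges. The feature values are hypothetical and are not the values of Table 1; the names `normal_fan` and `selected` are likewise illustrative.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def normal_fan(points):
    """Map each vertex to the two rays that bound its normal cone."""
    hull = convex_hull(points)
    fan = {}
    for k, v in enumerate(hull):
        prev, nxt = hull[k - 1], hull[(k + 1) % len(hull)]
        # Outward normal of a counter-clockwise edge (dx, dy) is (dy, -dx).
        ray_in = (v[1] - prev[1], -(v[0] - prev[0]))
        ray_out = (nxt[1] - v[1], -(nxt[0] - v[0]))
        fan[v] = (ray_in, ray_out)
    return fan


def selected(w, fan):
    """Vertices whose normal cone contains w (two vertices on a boundary)."""
    origin = (0, 0)
    return [v for v, (a, b) in fan.items()
            if cross(origin, a, w) >= 0 and cross(origin, w, b) >= 0]


# Four vertices plus one interior point that no parameter can select.
points = [(3, 0), (0, 3), (-3, 0), (0, -3), (0, 0)]
fan = normal_fan(points)
print(selected((1, 0), fan), selected((1, 1), fan))
```

A parameter inside a cone selects a single vertex, whereas a parameter lying on a shared ray, such as `(1, 1)` here, ties two vertices and so sits on the decision boundary of the corresponding edge. The interior point plays the role of $h_5$: it has no normal cone.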

Training Set Geometry
The previous discussion treated only a single sentence. For a training set of $S$ input sentences, let $\mathbf{i}$ be an index vector that contains $S$ elements. Each element is an index $i_s$ to a hypothesis and a feature vector for the sth sentence. A particular $\mathbf{i}$ specifies a set of hypotheses drawn from each of the K-best lists. LP-MERT builds a set of $K^S$ feature vectors associated with S-dimensional index vectors $\mathbf{i}$ of the form $h_{\mathbf{i}} = h_{1,i_1} + \ldots + h_{S,i_S}$. The polytope of these feature vectors is then constructed.
In convex geometry this operation is called the Minkowski sum and for the polytopes $H_s$ and $H_t$ is defined as (Ziegler, 1995)
$$H_s + H_t := \{ h + h' : h \in H_s, h' \in H_t \}$$
We illustrate this operation in the top part of Figure 3. The Minkowski sum is commutative and associative and generalises to more than two polytopes (Gritzmann and Sturmfels, 1992).
For the polytopes $H_s$ and $H_t$ the common refinement (Ziegler, 1995) is
$$N(H_s) \wedge N(H_t) := \{ N \cap N' : N \in N(H_s), N' \in N(H_t) \}$$
Each cone in the common refinement is the set of parameter vectors that maximise two faces, one in $H_s$ and one in $H_t$. This operation is shown in the bottom part of Figure 3. As suggested by Figure 3, the Minkowski sum and common refinement are linked by the following proposition.
Proposition 1. $N(H_s + H_t) = N(H_s) \wedge N(H_t)$.
Proof. See Gritzmann and Sturmfels (1992).
This implies that, with $h_{\mathbf{i}}$ defined for the index vector $\mathbf{i}$, the Minkowski sum defines the parameter vectors that satisfy the following (Tsochantaridis et al., 2005, Eqn. 3)
$$w (h_{s,j} - h_{s,i_s}) \le 0 \quad \text{for } 1 \le s \le S,\ 1 \le j \le K \qquad (13)$$
Figure 3: An example of the equivalence between the Minkowski sum and the common refinement.
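A small two-feature example may help here. The sketch below forms all pairwise sums of two hypothetical polytopes (a square and a triangle, not the polytopes drawn in Figure 3), takes their convex hull, and checks that the vertex selected by a parameter vector in the sum is the sum of the vertices selected in each summand, which is the practical content of the link between the Minkowski sum and the common refinement.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def argmax_under(w, points):
    """Feature vector with the highest model score w.h."""
    return max(points, key=lambda h: w[0] * h[0] + w[1] * h[1])


H1 = [(0, 0), (2, 0), (2, 2), (0, 2)]  # hypothetical 4-best list
H2 = [(0, 0), (1, 0), (0, 1)]          # hypothetical 3-best list

# All K^S = 12 candidate sums, one per index vector.
sums = [(a[0] + b[0], a[1] + b[1]) for a in H1 for b in H2]
vertices = convex_hull(sums)
print(len(sums), len(vertices), vertices)

# The maximiser of the sum is the sum of the per-sentence maximisers.
w = (1, 2)
best1, best2 = argmax_under(w, H1), argmax_under(w, H2)
best_sum = argmax_under(w, sums)
print(best1, best2, best_sum)
```

Only five of the twelve index vectors in this example correspond to vertices of the sum, so only those five joint selections can be produced by any single parameter vector.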

Computing the Minkowski Sum
In the top part of Figure 3 we see that computing the Minkowski sum directly gives 12 feature vectors, 10 of which are unique. Each feature vector would have to be tested under LP-MERT. In general there are $K^S$ such feature vectors and exhaustive testing is impractical. LP-MERT performs a lazy enumeration of feature vectors as managed through a divide and conquer algorithm. We believe that in the worst case the complexity of this algorithm could be $O(K^S)$.
The lower part of Figure 3 shows the computation of the common refinement. The common refinement appears as if one normal fan were superimposed on the other. We can see there are six decision boundaries associated with the six edges of the Minkowski sum. Even in this simple example, we can see that the common refinement is an easier quantity to compute than the Minkowski sum.
We now briefly describe the algorithm of Fukuda (2004) that computes the common refinement. Consider the example in Figure 3. For $H_1$ and $H_2$ we have drawn an edge in each polytope with a dashed line. The corresponding decision boundaries in their normal fans have also been drawn with dashed lines. Now consider the vertex $h_{1,3} + h_{2,2}$ in $H = H_1 + H_2$ and note it has two incident edges. These edges are parallel to edges in the summand polytopes and correspond to decision boundaries in the normal cone $N_{\{h_{1,3} + h_{2,2}\}}$.
We can find the redundant edges in the Minkowski sum by testing the edges suggested by the summand polytopes. If a decision boundary in $(\mathbb{R}^D)^*$ is redundant, then we can ignore the feature vector that shares the decision boundary. For example $h_{1,4} + h_{2,2}$ is redundant and the decision boundary $N_{[h_{1,3}, h_{1,4}]}$ is also redundant to the description of the normal cone $N_{\{h_{1,3} + h_{2,2}\}}$. The test for redundant edges can be performed by a linear program.
Given a Minkowski sum $H$ we can define an undirected cyclic graph $G(H) = (\operatorname{vert}(H), E)$ where $E$ is the set of edges. The degree of a vertex in $G(H)$ is the number of edges incident to that vertex; $\delta$ denotes the maximum degree of the vertices.
The linear program for testing redundancy of decision boundaries has a runtime of $O(D^{3.5} \delta)$ (Fukuda, 2004). Enumerating the vertices of the graph $G(H)$ is not trivial because the graph is undirected and cyclic. The solution is to use a reverse search algorithm (Avis and Fukuda, 1993). Essentially, reverse search transforms the graph into a tree. The vertex associated with $w^{(0)}$ is denoted as the root of the tree, and from this root vertices are enumerated in reverse order of model score under $w^{(0)}$. Each branch of the tree can be enumerated independently, which means that the enumeration can be parallelised. The complexity of the full algorithm is $O(\delta (D^{3.5} \delta) |\operatorname{vert}(H)|)$ (Fukuda, 2004). In comparison with the $O(K^S)$ for LP-MERT, the worst-case complexity of the reverse search algorithm is linear with respect to the size of $\operatorname{vert}(H)$.

Two Dimensional Projected MERT
We now explore whether the reverse search algorithm is a practical method for performing MERT using an open source implementation of the algorithm (Weibel, 2010). For reasons discussed in the next section, we wish to reduce the feature dimension. For $M < D$, we can define a projection matrix that maps the D-dimensional feature vectors to a lower-dimensional representation. There are technical constraints to be observed, discussed in Waite (2014). We note that when $M = 1$ we obtain Eqn. (4).
For our demonstration, we plot the error count over a plane in $(\mathbb{R}^D)^*$. Using the CUED Russian-to-English (Pino et al., 2013) entry to WMT'13 (Bojar et al., 2013) we build a tune set of 1502 sentences. The system uses 12 features which we initially tune with lattice MERT (Macherey et al., 2008) to get a parameter $w^{(0)}$. Using this parameter we generate 1000-best lists. We then project the feature functions in the 1000-best lists to a 3-dimensional representation that includes the source-to-target phrase probability (UtoV), the word insertion penalty (WIP), and the model score due to $w^{(0)}$. We use the Minkowski sum algorithm to compute BLEU as $\gamma \in (\mathbb{R}^2)^*$ is applied to the parameters from $w^{(0)}$.
Figure 4 displays some of the characteristics of the algorithm. This plot can be interpreted as a 3-dimensional version of Figure 3 in Macherey et al. (2008) where we represent the BLEU score as a heatmap instead of a third axis. Execution was on 12 CPU cores, leading to the distinct search regions, demonstrating the parallel nature of the algorithm. Weibel (2010) uses a depth-first enumeration order of $G(H)$, hence the narrow and deep exploration of $(\mathbb{R}^D)^*$. A breadth-first ordering would focus on cones closer to $w^{(0)}$. To our knowledge, this is the first description of a generalised line optimisation algorithm that can search all the parameters in a plane in polynomial time. Extensions to higher dimensional search are straightforward.
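One way to read the 3-dimensional representation is sketched below. It assumes that the projected vector keeps the two chosen features and appends the model score under $w^{(0)}$, and that $\gamma$ is added to the weights of the two chosen features while all other weights stay fixed. This reading, the function names and the toy numbers are assumptions made for illustration; the exact projection and its technical constraints are those of Waite (2014).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(h, w0, keep):
    """Map a D-dim feature vector to (kept features..., score under w0)."""
    return tuple(h[k] for k in keep) + (dot(w0, h),)


def projected_score(h_proj, gamma):
    """Model score under the projected parameter (gamma_1, ..., gamma_M, 1)."""
    return dot(gamma, h_proj[:-1]) + h_proj[-1]


def full_parameter(w0, keep, gamma):
    """The D-dim parameter that the projected search point corresponds to."""
    w = list(w0)
    for k, g in zip(keep, gamma):
        w[k] += g
    return tuple(w)


# Toy 4-feature example; features 0 and 2 stand in for UtoV and WIP.
h = (2, -1, 3, 5)
w0 = (1, 2, 0, -1)
keep = (0, 2)
gamma = (3, -2)

h_proj = project(h, w0, keep)
low_dim = projected_score(h_proj, gamma)
high_dim = dot(full_parameter(w0, keep, gamma), h)
print(h_proj, low_dim, high_dim)
```

Under these assumptions the projected score equals the score of the corresponding full parameter vector, so searching over $\gamma$ explores a plane through $w^{(0)}$ in the original parameter space. With a single direction instead of two kept features, the same construction reduces to the line search over $\gamma$.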

Robustness of Linear Models
In the previous section we described the Minkowski sum polytope. Let us consider the following upper bound theorem.
Theorem 1. Let $H_1, \ldots, H_S$ be polytopes in $\mathbb{R}^D$ with at most $K$ vertices each. Then for $D > 2$ the upper bound on the number of vertices of $H_1 + \ldots + H_S$ is $O(S^{D-1} K^{2(D-1)})$.
Proof. See Gritzmann and Sturmfels (1992).
Each vertex $h_{\mathbf{i}}$ corresponds to a single index vector $\mathbf{i}$, which itself corresponds to a single set of selected hypotheses. Therefore the number of distinct sets of hypotheses that can be drawn from the $S$ K-best lists is bounded above by $O(\min(K^S, S^{D-1} K^{2(D-1)}))$.
For low-dimensional features, i.e. for $D$ such that $S^{D-1} K^{2(D-1)} \ll K^S$, the optimiser is therefore tightly constrained. It cannot pick arbitrarily from the individual K-best lists to optimise the overall BLEU score. We believe this acts as an inherent form of regularisation.
For example, in the system of Section 4.2 ($D = 12$, $S = 1502$, $K = 1000$), only $10^{-4403}$ percent of the $K^S$ possible index vectors are feasible. However, if the feature dimension $D$ is increased to $D = 493$, then $S^{D-1} K^{2(D-1)} > K^S$ and this inherent regularisation is no longer at work: any index vector is feasible, and sentence hypotheses can be chosen arbitrarily to optimise the overall BLEU score.
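These figures can be reproduced with a few lines of arithmetic in log space. The sketch below treats the big-O bound as if its hidden constant were 1, which is an assumption made only to obtain concrete numbers; the function names are illustrative.

```python
from math import log10


def log10_vertex_bound(S, K, D):
    """log10 of S^(D-1) * K^(2(D-1)), the bound on feasible index vectors."""
    return (D - 1) * log10(S) + 2 * (D - 1) * log10(K)


def log10_possible(S, K):
    """log10 of K^S, the number of possible index vectors."""
    return S * log10(K)


def log10_feasible_percent(S, K, D):
    """log10 of the percentage of index vectors that can be feasible."""
    ratio = min(0.0, log10_vertex_bound(S, K, D) - log10_possible(S, K))
    return ratio + 2  # multiply the ratio by 100


def smallest_unconstrained_dimension(S, K):
    """Smallest D > 2 at which the bound is no longer below K^S."""
    D = 3
    while log10_vertex_bound(S, K, D) < log10_possible(S, K):
        D += 1
    return D


S, K = 1502, 1000
print(log10_feasible_percent(S, K, 12))
print(smallest_unconstrained_dimension(S, K))
```

For $D = 12$ the exponent is close to -4403, and the bound first exceeds $K^S$ at $D = 493$, in agreement with the values quoted in the text.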
This exponential relationship of feasible solutions with respect to feature dimension can be seen in Figure 6 of Galley and Quirk (2011). At low feature dimension, they find that the LP-MERT algorithm can run to completion for a training set size of hundreds of sentences. As feature dimension increases, the runtime increases exponentially.
PRO and other ranking methods are similarly constrained for low-dimensional feature vectors.
Theorem 2. If $H$ is a D-dimensional polytope, then for $D \ge 3$ the following is an upper bound on the number of edges $|E|$:
$$|E| \le \binom{|\operatorname{vert}(H)|}{2}$$
Proof. This is a special case of the upper bound theorem. See Ziegler (1995, Theorem 8.23).
Each feasible pairwise ranking of pairs of hypotheses corresponds to an edge in the Minkowski sum polytope. Therefore in low dimension ranking methods also benefit from this inherent regularisation.
For higher dimensional feature vectors, these upper bounds no longer guarantee that this inherent regularisation is at work. The analysis suggests, but does not imply, that index vectors, and their corresponding solutions, can be picked arbitrarily from the K-best lists. For MERT overtraining is clearly a risk.
MIRA and related methods have a regularisation mechanism due to the margin maximisation term in their objective functions. Although this form of regularisation may be helpful in practice, there is no guarantee that it will prevent overtraining due to the exponential increase in feasible solutions. For example the adaptive learning rate method of Green et al. (2013) finds gains of over 13 BLEU points in the training set with the addition of 390,000 features, yet only 2 to 3 BLEU points are found in the test set.
Figure 5: We redraw the normal fan from Figure 2 with potential optimal parameters under the $\ell_2$ regularisation scheme of Galley et al. (2013) marked. The thick red line is the subspace of $(\mathbb{R}^2)^*$ optimised. The dashed lines mark the distances between the decision boundaries and $w^{(0)}$.

A Note on Regularisation
The above analysis suggests a need for regularisation in training with high-dimensional feature vectors. Galley et al. (2013) note that regularisation is hard to apply to linear models due to the magnitude invariance of $w$ in Eqn. (1). Figure 2 makes the difficulty clear: the normal cones are determined entirely by the feature vectors of the training samples, and within any particular normal cone a parameter vector can be chosen with arbitrary magnitude. This renders schemes such as L1 or L2 normalisation ineffective. To avoid this, Galley et al. (2013) describe a regularisation scheme for line optimisation that encourages the optimal parameter to be found close to $w^{(0)}$. The motivation is that $w^{(0)}$ should be a trusted initial point, perhaps taken from a lower-dimensional model. We briefly discuss the challenges of doing this sort of regularisation in MERT.
In Figure 5 we reproduce the normal fan from Figure 2. In this diagram we represent the set of parameters considered by a line optimisation as a thick red line. Let us assume that both $e_1$ and $e_2$ have a similarly low error count. Under the regularisation scheme of Galley et al. (2013) we have a choice of $w^{(1)}$ or $w^{(2)}$, which are equidistant from $w^{(0)}$. In this affine projection of parameter space it is unclear which one is the optimum. If we consider the normal fan as a whole, we can clearly see that $\hat{w} \in N_{\{h_i\}}$ is the optimal point under the regularisation, but it is not obvious in the projected parameter space that $\hat{w}$ is the better choice. This analysis suggests that direct intervention, e.g. monitoring BLEU on a held-out set, may be more effective in avoiding overtraining.

Discussion
The main contribution of this work is to present a novel geometric description of MERT. We show that it is possible to enumerate all the feasible solutions of a linear model in polynomial time using this description. The immediate conclusion from this work is that the current methods for estimating linear models, as done in SMT, work best for low-dimensional feature vectors.
We can consider the SMT linear model as a member of a family of linear models where the output values are highly structured, and where each input yields a candidate space of possible output values. We have already noted that the constraints in (13) are shared with the structured-SVM (Tsochantaridis et al., 2005), and we can also see the same constraints in Eqn. 3 of Collins (2002). It is our belief that our analysis is applicable to all models in this family and extends far beyond the discussion of SMT here.
We note that the upper bound on feasible solutions increases polynomially in training set size $S$, whereas the number of possible solutions increases exponentially in $S$. The result is that the ratio of feasible to possible solutions decreases with $S$. Our analysis suggests that inherent regularisation should be improved by increasing training set size. This confirms most researchers' intuition, with perhaps even larger training sets needed than previously believed.
Another avenue to prevent overtraining would be to project high-dimensional feature sets to low-dimensional feature sets using the technique described in Section 4.2. We could then use existing training methods to optimise over the projected feature vectors.
We also note that non-linear modelling methods, such as neural networks (Schwenk et al., 2006; Kalchbrenner and Blunsom, 2013; Devlin et al., 2014; Cho et al., 2014) and decision forests (Criminisi et al., 2011), are not bound by these analyses. In particular, neural networks are non-linear functions of the features, and decision forests actively reduce the number of features for individual trees in the forest. From the perspective of this paper, the recent improvements in SMT due to neural networks are well motivated.