A Coactive Learning View of Online Structured Prediction in Statistical Machine Translation

We present a theoretical analysis of online parameter tuning in statistical machine translation (SMT) from a coactive learning view. This perspective allows us to give regret and generalization bounds for latent perceptron algorithms that are common in SMT, but fall outside of the standard convex optimization scenario. Coactive learning also introduces the concept of weak feedback, which we apply in a proof-of-concept experiment to SMT, showing that learning from feedback that consists of slight improvements over predictions leads to convergence in regret and translation error rate. This suggests that coactive learning might be a viable framework for interactive machine translation. Furthermore, we find that surrogate translations replacing references that are unreachable in the decoder search space can be interpreted as weak feedback and lead to convergence in learning, if they admit an underlying linear model.


Introduction
Online learning has become the tool of choice for large-scale machine learning scenarios. Compared to batch learning, its advantages include memory efficiency, due to parameter updates being performed on the basis of single examples, and runtime efficiency, where a constant number of passes over the training sample is sufficient for convergence (Bottou and Bousquet, 2004). Statistical Machine Translation (SMT) has embraced the potential of online learning, both to handle millions of features and/or millions of training examples in parameter tuning via online structured prediction (see Liang et al. (2006) for seminal early work), and in interactive learning from user post-edits (see Cesa-Bianchi et al. (2008) for pioneering work on online computer-assisted translation). Online learning algorithms can be given a theoretical analysis in the framework of online convex optimization (Shalev-Shwartz, 2012). However, the application of online learning techniques to SMT sacrifices convexity because of latent derivation variables, and because of surrogate translations replacing human references that are unreachable in the decoder search space. For example, the objective function actually optimized in Liang et al.'s (2006) application of Collins' (2002) structured perceptron has been analyzed by Gimpel and Smith (2012) as a non-convex ramp loss function (McAllester and Keshet, 2011; Do et al., 2008; Collobert et al., 2006). Since online convex optimization does not provide convergence guarantees for the algorithm of Liang et al. (2006), Gimpel and Smith (2012) recommend CCCP (Yuille and Rangarajan, 2003) instead for optimization, but do not provide a theoretical analysis of Liang et al.'s (2006) actual algorithm under the new objective.
The goal of this paper is to present an alternative theoretical analysis of online learning algorithms for SMT from the viewpoint of coactive learning (Shivaswamy and Joachims, 2012). This framework allows us to make three main contributions: • Firstly, the proof techniques of Shivaswamy and Joachims (2012) are a simple and elegant tool for a theoretical analysis of perceptron-style algorithms that date back to the perceptron mistake bound of Novikoff (1962). These techniques provide an alternative to an online gradient descent view of perceptron-style algorithms, and can easily be extended to obtain regret bounds for a latent perceptron algorithm at a rate of O(1/√T), with possible improvements by using re-scaling. This bound can be directly used to derive generalization guarantees for online and online-to-batch conversions of the algorithm, based on well-known concentration inequalities. Our analysis covers the approach of Liang et al. (2006) and supersedes Sun et al.'s (2013) analysis of the latent perceptron by providing simpler proofs and by adding a generalization analysis. Furthermore, an online learning framework such as coactive learning covers problems such as changing n-best lists after each update, which were explicitly excluded from the batch analysis of Gimpel and Smith (2012) and considered fixed in the analysis of Sun et al. (2013).
• Our second contribution is an extension of the online learning scenario in SMT to include a notion of "weak feedback" for the latent perceptron: Coactive learning follows an online learning protocol where, at each round t, the learner predicts a structured object y_t for an input x_t, and the user corrects the learner by responding with an improved, but not necessarily optimal, object ȳ_t with respect to a utility function U. The key asset of coactive learning is the ability of the learner to converge to predictions that are close to optimal structures y*_t, although the utility function is unknown to the learner and only weak feedback in the form of slightly improved structures ȳ_t is seen in training. We present a proof-of-concept experiment in which translation feedback of varying grades is chosen from the n-best list of an "optimal" model that has access to full information. We show that weak feedback structures correspond to improvements in TER (Snover et al., 2006) over predicted structures, and that learning from weak feedback minimizes regret and TER.
• Our third contribution is to show that certain practices of computing surrogate references can actually be understood as a form of weak feedback. Coactive learning decouples the learner (performing prediction and updates) from the user (providing feedback in the form of an improved translation), so that we can compare different surrogacy modes as different ways of approximate utility maximization. We show experimentally that learning from surrogate "hope" derivations (Chiang, 2012) minimizes regret and TER, thus favoring surrogacy modes that admit an underlying linear model over "local" updates (Liang et al., 2006) or "oracle" derivations (Sokolov et al., 2013), for which learning does not converge.
It is important to note that the goal of our experiments is not to present improvements of coactive learning over the "optimal" full-information model in terms of standard SMT performance. Instead, our goal is to present experiments that serve as a proof of concept of the feasibility of coactive learning from weak feedback for SMT, and to propose a new perspective on standard practices of learning from surrogate translations. The rest of this paper is organized as follows. After a review of related work (Section 2), we present a latent perceptron algorithm and analyze its convergence and generalization properties (Section 3). Our first set of experiments (Section 4.1) confirms our theoretical analysis by showing convergence in regret and TER for learning from weak and strong feedback. Our second set of experiments (Section 4.2) analyzes the relation of different surrogacy modes to minimization of regret and TER.

Related Work
Our work builds on the framework of coactive learning, introduced by Shivaswamy and Joachims (2012). We extend their algorithms and proofs to the area of SMT, where latent variable models are appropriate, and additionally present generalization guarantees and an online-to-batch conversion. Our theoretical analysis is easily extendable to the full information case of Sun et al. (2013). We also extend our own previous work (Sokolov et al., 2015) with theory and experiments for online-to-batch conversion, and with experiments on coactive learning from surrogate translations.
Learning from weak feedback is related to binary response-based learning, where a meaning representation is "tried out" by iteratively generating system outputs, receiving feedback from world interaction, and updating the model parameters. Such world interaction consists of database access in semantic parsing (Kwiatkowski et al. (2013), Berant et al. (2013), or Goldwasser and Roth (2013), inter alia). Feedback in response-based learning is given by a user accepting or rejecting system predictions, but not by user corrections.
Lastly, feedback in form of numerical utility values for actions is studied in the frameworks of reinforcement learning (Sutton and Barto, 1998) or in online learning with limited feedback, e.g., multi-armed bandit models (Cesa-Bianchi and Lugosi, 2006). Our framework replaces quantitative feedback with immediate qualitative feedback in form of a structured object that improves upon the utility of the prediction.

Notation and Background
Let X denote a set of input examples, e.g., sentences, and let Y(x) denote a set of structured outputs for x ∈ X, e.g., translations. We define Y = ∪_x Y(x). Furthermore, by H(x, y) we denote a set of possible hidden derivations for a structured output y ∈ Y(x); e.g., for phrase-based SMT, the hidden derivation is determined by a phrase segmentation and a phrase alignment between source and target sentences. Every hidden derivation h ∈ H(x, y) deterministically identifies an output y ∈ Y(x). We define H = ∪_{x,y} H(x, y).
Let φ : X × Y × H → R^d denote a feature function that maps a triplet (x, y, h) to a d-dimensional vector. For phrase-based SMT, we use 14 features, defined by phrase translation probabilities, language model probability, distance-based and lexicalized reordering probabilities, and word and phrase penalty. We assume that the feature function has a bounded radius, i.e., that ‖φ(x, y, h)‖ ≤ R for all x, y, h. By ∆_{h,h′} we denote a distance function that is defined for any h, h′ ∈ H, and is used to scale the step size of updates during learning. In our experiments, we use the ordinary Euclidean distance between the feature vectors of derivations. We assume a linear model with fixed parameters w* such that each input example is mapped to its correct derivation and structured output by (y*, h*) = arg max_{y∈Y(x), h∈H(x,y)} w*ᵀφ(x, y, h). For each given input x, we define its highest-scoring derivation over all outputs Y(x) as h(x; w) = arg max_{y∈Y(x), h′∈H(x,y)} wᵀφ(x, y, h′), and the highest-scoring derivation for a given output y ∈ Y(x) as h(x|y; w) = arg max_{h′∈H(x,y)} wᵀφ(x, y, h′). In the following theoretical exposition, we assume that the arg max operations can be computed exactly.

Algorithm 1 Feedback-based Latent Perceptron
1: Initialize w ← 0
2: for t = 1, …, T do
3:   Observe x_t
4:   (y_t, h_t) ← arg max_{(y,h)} w_tᵀ φ(x_t, y, h)
5:   Obtain weak feedback ȳ_t
6:   if y_t ≠ ȳ_t then
7:     h̄_t ← arg max_{h∈H(x_t,ȳ_t)} w_tᵀ φ(x_t, ȳ_t, h)
8:     w_{t+1} ← w_t + ∆_{h̄_t,h_t} (φ(x_t, ȳ_t, h̄_t) − φ(x_t, y_t, h_t))
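As a concrete illustration, the joint arg max over outputs and hidden derivations can be sketched as follows over an explicitly enumerated candidate set (the enumeration, names, and data layout are ours; a real decoder performs this search implicitly over its search graph):

```python
import numpy as np

def best_output_and_derivation(w, candidates):
    """Joint argmax over outputs y and hidden derivations h (line 4 of
    Algorithm 1): returns the (y, h) pair whose feature vector phi(x, y, h)
    scores highest under the current weights w.

    `candidates` is a list of (y, h, phi) triples enumerating the search
    space for one input x.
    """
    scores = [np.dot(w, phi) for (_, _, phi) in candidates]
    y, h, _ = candidates[int(np.argmax(scores))]
    return y, h

def best_derivation_for_output(w, candidates, y):
    """h(x|y; w): highest-scoring derivation among those that yield y."""
    filtered = [(h, phi) for (cy, h, phi) in candidates if cy == y]
    scores = [np.dot(w, phi) for (_, phi) in filtered]
    return filtered[int(np.argmax(scores))][0]
```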

Feedback-based Latent Perceptron
We assume an online setting in which examples are presented one by one. The learner observes an input x_t, predicts an output structure y_t, and is presented with feedback ȳ_t about its prediction, which is used to update the existing parameter vector. Algorithm 1 is called "Feedback-based Latent Perceptron" to stress the fact that it only uses weak feedback to its predictions for learning, but does not necessarily observe optimal structures as in the full information case (Sun et al., 2013). Learning from full information can be recovered by setting the informativeness parameter α to 1 in Equation (2) below, in which case the feedback structure ȳ_t equals the optimal structure y*_t. Algorithm 1 differs from the algorithm of Shivaswamy and Joachims (2012) by a joint maximization over output structures y and hidden derivations h in prediction (line 4), by choosing a hidden derivation h̄ for the feedback structure ȳ (line 7), and by the use of the re-scaling factor ∆_{h̄_t,h_t} in the update (line 8), where h̄_t = h(x_t|ȳ_t; w_t) and h_t = h(x_t; w_t) are the derivations of the feedback structure and the prediction at time t, respectively. In our theoretical exposition, we assume that ȳ_t is reachable in the search space of possible outputs, that is, ȳ_t ∈ Y(x_t).
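The full update loop of Algorithm 1 can be sketched as follows, again over explicit candidate lists; the `get_feedback` callback standing in for the user, and all other names, are our own illustrative devices:

```python
import numpy as np

def feedback_latent_perceptron(examples, get_feedback, dim, rescale=True):
    """Sketch of Algorithm 1 (Feedback-based Latent Perceptron).

    `examples` yields, per round t, a list of (y, h, phi) candidates for x_t;
    `get_feedback` maps the prediction and the candidate list to a weak
    feedback structure ybar_t.  Names and data layout are illustrative only.
    """
    w = np.zeros(dim)                                    # line 1: w <- 0
    for cands in examples:                               # line 2
        scores = [np.dot(w, phi) for (_, _, phi) in cands]
        y_t, h_t, phi_t = cands[int(np.argmax(scores))]  # line 4: predict
        ybar = get_feedback(y_t, cands)                  # line 5: weak feedback
        if ybar != y_t:                                  # line 6
            # line 7: highest-scoring derivation of the feedback structure
            _, _, phi_bar = max((c for c in cands if c[0] == ybar),
                                key=lambda c: np.dot(w, c[2]))
            # line 8: perceptron update, re-scaled by the derivation distance
            delta = np.linalg.norm(phi_bar - phi_t) if rescale else 1.0
            w = w + delta * (phi_bar - phi_t)
    return w
```

With `rescale=False` the loop reduces to a plain latent perceptron update, which corresponds to setting the scaling factor to 1 as discussed in the convergence analysis.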

Feedback of Graded Utility
The key notion in the theoretical analysis of Shivaswamy and Joachims (2012) is a linear utility function, determined by a parameter vector w* that is unknown to the learner: Upon a system prediction, the user approximately maximizes utility and returns an improved object ȳ_t that has higher utility than the predicted y_t, such that

U(x_t, ȳ_t) > U(x_t, y_t).
Importantly, the feedback is typically not the optimal structure y*_t, defined as y*_t = arg max_{y∈Y(x_t)} U(x_t, y).
While not receiving optimal structures in training, the learning goal is to predict objects with utility close to that of the optimal structures y*_t. The regret suffered by the algorithm when predicting object y_t instead of the optimal object y*_t is

REG_T = (1/T) Σ_{t=1}^T (U(x_t, y*_t) − U(x_t, y_t)).    (1)

To quantify the amount of information in the weak feedback, Shivaswamy and Joachims (2012) define a notion of α-informative feedback, which we generalize as follows for the case of latent derivations. We assume that there exists a derivation h̄_t for the feedback structure ȳ_t, such that for all predictions y_t, the (re-scaled) utility of the weak feedback ȳ_t is higher than the (re-scaled) utility of the prediction y_t by a fraction α of the maximum possible utility range (under the given utility model). Thus, for all t and for α ∈ (0, 1], there exists h̄_t such that for all h:

∆_{h̄_t,h} (U(x_t, ȳ_t) − U(x_t, y_t)) ≥ α (U(x_t, y*_t) − U(x_t, y_t)) − ξ_t,    (2)

where ξ_t ≥ 0 are slack variables allowing for violations of (2) for a given α. For slack ξ_t = 0, user feedback is called strictly α-informative.
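A minimal check of this condition, with the utilities and the re-scaling factor passed in as plain numbers, can be written as follows (an illustrative sketch of the condition, not the paper's implementation):

```python
def is_alpha_informative(u_feedback, u_pred, u_opt, alpha, xi=0.0, delta=1.0):
    """Check the slack-relaxed alpha-informativeness condition of Eq. (2):
    the re-scaled utility gain of the feedback over the prediction must cover
    a fraction alpha of the remaining gap to the optimal utility, up to
    slack xi.  All utilities are scalars u = U(x, .); `delta` stands in for
    the re-scaling factor.
    """
    return delta * (u_feedback - u_pred) >= alpha * (u_opt - u_pred) - xi
```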

Convergence Analysis
A central theoretical result in learning from weak feedback is an analysis showing that Algorithm 1 minimizes an upper bound on the average regret (1), despite the fact that optimal structures are not used in learning:

Theorem 1. Let D_T = Σ_{t=1}^T ∆²_{h̄_t,h_t}. Then the average regret of the feedback-based latent perceptron can be upper bounded, for any α ∈ (0, 1] and any w* ∈ R^d, as

REG_T ≤ (1/(αT)) Σ_{t=1}^T ξ_t + (2R‖w*‖ √D_T) / (αT).

The proof of Theorem 1 is similar to the proof of Shivaswamy and Joachims (2012) and to the original mistake bound for the perceptron of Novikoff (1962). The theorem can be interpreted as follows: we expect lower average regret for higher values of α; since the second term vanishes as T grows, regret will approach the average accumulated slack (in case feedback structures violate Equation (2)) or 0 (in case of strictly α-informative feedback). The main difference between the above result and that of Shivaswamy and Joachims (2012) is the term D_T, which follows from the re-scaled distance of latent derivations. Their analysis is agnostic of latent derivations and can be recovered by setting this scaling factor to 1. This yields D_T = T, and thus recovers the main factor √T in their regret bound. In our algorithm, penalizing large distances between derivations can help to move derivations h_t closer to h̄_t, thereby decreasing D_T as learning proceeds. Thus, in case D_T < T, our bound is better than the original bound of Shivaswamy and Joachims (2012) for a perceptron without re-scaling. As we will show experimentally, re-scaling leads to faster convergence in practice.
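Numerically, a bound of this shape behaves as described above. The helper below assumes a bound of the standard form REG_T ≤ (1/(αT)) Σ ξ_t + (2R‖w*‖/(αT)) √D_T with D_T the sum of squared re-scaling factors; it is purely illustrative:

```python
import math

def regret_bound(alpha, R, w_norm, slacks, deltas):
    """Evaluate an upper bound on the average regret after T rounds.

    slacks: per-round slack values xi_t; deltas: per-round re-scaling
    factors, whose squared sum plays the role of D_T.  With zero slack and
    unit scaling this decays as 2*R*w_norm/(alpha*sqrt(T)).
    """
    T = len(deltas)
    D_T = sum(d * d for d in deltas)
    return sum(slacks) / (alpha * T) + 2 * R * w_norm * math.sqrt(D_T) / (alpha * T)
```

For example, with zero slack, unit scaling, α = 1, R = ‖w*‖ = 1, the bound halves when the number of rounds is quadrupled, reflecting the O(1/√T) rate.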

Generalization Analysis
Regret bounds measure how good the average prediction of the current model is on the next example in the given sequence; thus, it seems plausible that a low regret on a sequence of examples should imply good generalization performance on the entire domain of examples.
Generalization for Online Learning. First we present a generalization bound for the case of online learning on a sequence of random examples, based on generalization bounds for expected average regret as given by Cesa-Bianchi et al. (2004). Let probabilities P and expectations E be defined with respect to the fixed unknown underlying distribution according to which all examples are drawn. Furthermore, we assume that the per-example regret U(x_t, y*_t) − U(x_t, y_t) is bounded in [0, 1]. Plugging the bound on REG_T of Theorem 1 directly into Proposition 1 of Cesa-Bianchi et al. (2004) gives the following theorem:

Theorem 2. Let 0 < δ < 1, and let x_1, …, x_T be a sequence of examples that Algorithm 1 observes. Then with probability at least 1 − δ,

E[REG_T] ≤ (1/(αT)) Σ_{t=1}^T ξ_t + (2R‖w*‖ √D_T) / (αT) + √((2/T) ln(1/δ)).

The generalization bound tells us how far the expected average regret E[REG_T] (or average risk, in the terms of Cesa-Bianchi et al. (2004)) is from the average regret that we actually observe in a specific instantiation of the algorithm.
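The deviation term contributed by Proposition 1 of Cesa-Bianchi et al. (2004) for a [0, 1]-bounded loss is easy to compute in isolation:

```python
import math

def concentration_term(T, delta):
    """Deviation term from Cesa-Bianchi et al. (2004), Proposition 1, for a
    [0,1]-bounded loss: with probability >= 1 - delta, the expected average
    regret exceeds the observed average regret by at most
    sqrt((2/T) * ln(1/delta)).
    """
    return math.sqrt(2.0 * math.log(1.0 / delta) / T)
```

The term shrinks as O(1/√T), so the gap between observed and expected regret closes at the same rate as the regret bound itself.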
Generalization for Online-to-Batch Conversion. In practice, perceptron-type algorithms are often applied in a batch learning scenario, i.e., the algorithm is applied for K epochs to a training sample of size T and then used for prediction on an unseen test set (Freund and Schapire, 1999; Collins, 2002). The difference to the online learning scenario is that we treat the multi-epoch algorithm as an empirical risk minimizer that selects a final weight vector w_{T,K} whose expected loss on unseen data we would like to bound. We assume that the algorithm is fed with a sequence of examples x_1, …, x_T, and at each epoch k = 1, …, K it makes a prediction y_{t,k}. The correct label is y*_t. For k = 1, …, K and t = 1, …, T, let ℓ_{t,k} = U(x_t, y*_t) − U(x_t, y_{t,k}), and denote by ∆_{t,k} and ξ_{t,k} the distance and the slack at epoch k for example t, respectively. Finally, we denote by D_{T,K} = Σ_{t=1}^T ∆²_{t,K}, and by w_{T,K} the final weight vector returned after K epochs. We state a condition of convergence (this condition is too strong for large datasets; however, we believe that a weaker condition based on ideas from the perceptron cycling theorem (Block and Levin, 1970; Gelfand et al., 2010) should suffice to show a similar bound):

Condition 1. Algorithm 1 has converged on training instances x_1, …, x_T after K epochs if the predictions on x_1, …, x_T using the final weight vector w_{T,K} are the same as the predictions on x_1, …, x_T in the Kth epoch.
Denote by E_X[ℓ(x)] the expected loss on unseen data when using w_{T,K}, where ℓ(x) = U(x, y*) − U(x, ŷ), y* = arg max_y U(x, y), and ŷ = arg max_y max_h w_{T,K}ᵀ φ(x, y, h). We can now state the following result:

Theorem 3. Let 0 < δ < 1, and let x_1, …, x_T be a sample for the multiple-epoch perceptron algorithm such that the algorithm converged on it (Condition 1). Then, with probability at least 1 − δ, the expected loss of the feedback-based latent perceptron satisfies

E_X[ℓ(x)] ≤ (1/(αT)) Σ_{t=1}^T ξ_{t,K} + (2R‖w*‖ √D_{T,K}) / (αT) + √(ln(1/δ) / (2T)).

The theorem can be interpreted as bounding the generalization error (left-hand side) by the empirical error (the first two right-hand-side terms) and the variance caused by the finite sample (the third term). The result follows directly from McDiarmid's concentration inequality.

Experiments
We used the LIG corpus, which consists of 10,881 tuples of French-English post-edits (Potet et al., 2012). The corpus is a subset of the news-commentary dataset provided at WMT and contains input French sentences, MT outputs, post-edited outputs, and English references. To prepare SMT outputs for post-editing, the creators of the corpus used their own WMT10 system (Potet et al., 2010), based on the Moses phrase-based decoder (Koehn et al., 2007) with dense features. We replicated a similar Moses system using the same monolingual and parallel data: a 5-gram language model was estimated with the KenLM toolkit (Heafield, 2011) on news.en data (48.65M sentences, 1.13B tokens), pre-processed with the tools from the cdec toolkit (Dyer et al., 2010). Parallel data (europarl + news-commentary, 1.64M sentences) were similarly pre-processed and aligned with fast_align (Dyer et al., 2013). In all experiments, training is started with the Moses default weights. The size of the n-best list, where used, was set to 1,000. Irrespective of the use of re-scaling in perceptron training, a constant learning rate of 10^-5 was used for learning from simulated feedback, and 10^-4 for learning from surrogate translations. Our experiments on online learning require a random sequence of examples for learning. Following the techniques described in Bertsekas (2011) to generate random sequences for incremental optimization, we compared cyclic order (K epochs of T examples in fixed order), randomized order (sampling datapoints with replacement), and random shuffling of datapoints after each cycle, and found nearly identical regret curves for all three scenarios. In the following, all figures are shown for sequences in cyclic order, with re-decoding after each update.
Furthermore, note that in all three definitions of sequence, we never see the fixed optimal feedback y*_t in training, but instead, in general, a different feedback structure ȳ_t (and a different prediction y_t) every time we see the same input x_t.

Idealized Weak and Strong Feedback
In a first experiment, we apply Algorithm 1 to user feedback of varying utility grade. The goal of this experiment is to confirm our theoretical analysis by showing convergence in regret for learning from weak and strong feedback. We select feedback of varying grade by directly inspecting the optimal w*, thus this feedback is idealized. However, the experiment also has a realistic background, since we show that α-informative feedback corresponds to improvements under standard evaluation metrics such as lowercased and tokenized TER, and that learning from weak and strong feedback leads to convergence in TER on test data.

Table 1: Improved utility vs. improved TER distance to human post-edits for α-informative feedback ȳ_t compared to predictions y_t using default weights at α = 0.1, split into strict (ξ_t = 0) and slack (ξ_t > 0) feedback.

For this experiment, the post-edit data from the LIG corpus were randomly split into 3 subsets: PE-train (6,881 sentences), PE-dev, and PE-test (2,000 sentences each). PE-train was used for our online learning experiments. PE-test was held out for testing the algorithms' progress on unseen data. PE-dev was used to obtain w* to define the utility model. This was done by MERT optimization (Och, 2003) towards post-edits under the TER target metric. Note that the goal of our experiments is not to improve SMT performance over any algorithm that has access to full information to compute w*. Rather, we want to show that learning from weak feedback leads to convergence in regret with respect to the optimal model, albeit at a slower rate than learning from strong feedback. The feedback data in this experiment were generated by searching the n-best list for translations that are α-informative at α ∈ {0.1, 0.5, 1.0} (with possible non-zero slack). This is achieved by scanning the n-best list output for every input x_t and returning the first ȳ_t ≠ y_t that satisfies Equation (2). This setting can be thought of as an idealized scenario where a user picks translations from the n-best list that are considered improvements under the optimal w*.
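The n-best scanning procedure for simulating feedback can be sketched as follows (without re-scaling and with zero slack; the utility oracle `u` and all other names are our own):

```python
def pick_feedback(nbest, y_pred, u, alpha, u_opt):
    """Simulated weak feedback: scan the n-best list and return the first
    candidate ybar != y_pred that satisfies the alpha-informative condition,
    i.e., whose utility gain over the prediction covers a fraction alpha of
    the gap to the optimal utility.  `u` maps a candidate to its utility
    under the optimal model w*.
    """
    u_pred = u(y_pred)
    for y in nbest:
        if y != y_pred and u(y) - u_pred >= alpha * (u_opt - u_pred):
            return y
    return None  # no sufficiently informative candidate in the n-best list
```

Larger α demands candidates closer to the optimum, so the returned feedback moves down the utility ranking as α grows; a `None` result models the case where the finite n-best list forces non-zero slack.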
In order to verify that our notion of graded utility corresponds to a realistic concept of graded translation quality, we compared improvements in utility to improved TER distance to human post-edits. Table 1 shows that for predictions under default weights, we obtain strictly α-informative (for α = 0.1) feedback for 5,725 out of 6,881 datapoints in PE-train. These feedback structures improve utility by definition, and they also yield better TER distance to post-edits in the majority of cases. A positive slack has to be used for 1,155 datapoints. Here, the majority of feedback structures do not improve TER distance.
Convergence results for different learning scenarios are shown in Figure 1. The left upper part of Figure 1 shows average utility regret against iterations for a setup without re-scaling, i.e., setting ∆_{h̄,h} = 1 in the definition of α-informative feedback (Equation (2)) and in the update of Algorithm 1 (line 8). As predicted by our regret analysis, higher α leads to faster convergence, but all three curves converge towards a minimal regret. Also, the difference between the curves for α = 0.1 and α = 1.0 is much smaller than a factor of ten. (Note that feedback provided in this way might be stronger than required at a particular value of α, since for all β ≥ α, strictly β-informative feedback is also strictly α-informative. On the other hand, because of the limited size of the n-best list, we cannot assume strictly α-informative user feedback with zero slack ξ_t. In experiments where updates are only done if feedback is strictly α-informative, we found similar convergence behavior.) As expected from the correspondence of α-informative feedback to improvements in TER, similar relations are obtained when plotting TER scores on test data for training from weak feedback at different utility grades. This is shown in the right upper part of Figure 1.
The left lower part of Figure 1 shows average utility regret plotted against iterations for a setup that uses re-scaling. We define ∆_{h̄_t,h_t} by the ℓ2-distance between the feature vector φ(x_t, ȳ_t, h̄_t) of the derivation of the feedback structure and the feature vector φ(x_t, y_t, h_t) of the derivation of the predicted structure. We see that the curves for all grades of feedback converge faster than the corresponding curves for un-scaled feedback shown in the upper part of Figure 1. Furthermore, as shown in the right lower part of Figure 1, TER on test data also decreases at a faster rate. Lastly, we present an experimental validation of the online-to-batch application of our algorithm. That is, we would like to evaluate predictions that use the final weight vector w_{T,K} by comparing the generalization error with the empirical error stated in Theorem 3. The standard way to do this is to compare the average loss on heldout data with the average loss on the training sequence. Figure 3 shows these results for models trained on α-informative feedback at α ∈ {0.1, 0.5, 1.0} for 10 epochs. Similar to the online learning setup, higher α results in faster convergence. Furthermore, curves for training and heldout evaluation converge at the same rate.

Feedback from Surrogate Translations
In this section, we present experiments on learning from real human post-edits. The goal of this experiment is to investigate whether the standard practices for extracting feedback from observed user post-edits for discriminative SMT can be matched with the modeling assumptions of the coactive learning framework. The customary practice in discriminative learning for SMT is to replace observed user translations by surrogate translations, since the former are often not reachable in the search space of the SMT decoder. In our case, only 29% of the post-edits in the LIG corpus were reachable by the decoder. We compare four heuristics for generating surrogate translations: Oracle surrogates are generated using the lattice oracle approach of Sokolov et al. (2013), which returns the closest path in the decoder search graph as a reachable surrogate translation. (While the original algorithm is designed to maximize the BLEU score of the returned path, we tuned its two free parameters to maximize TER.) A local surrogate ỹ is chosen from the n-best list of the linear model as the translation that achieves the best TER score with respect to the actual post-edit y: ỹ = arg min_{y′ ∈ n-best(x_t; w_t)} TER(y′, y). This corresponds to the local update mode of Liang et al. (2006). A filtered surrogate translation ỹ is found by scanning down the n-best list and accepting as feedback the first translation that improves the TER score with respect to the human post-edit y over the 1-best prediction y_t of the linear model: TER(ỹ, y) < TER(y_t, y). Finally, a hope surrogate is chosen from the n-best list as the translation that jointly maximizes model score under the linear model and negative TER score with respect to the human post-edit: ỹ = arg max_{y′ ∈ n-best(x_t; w_t)} (−TER(y′, y) + w_tᵀ φ(x_t, y′, h)). This corresponds to what Chiang (2012) termed "hope derivations".
Informally, oracles are model-agnostic, as they can pick a surrogate even from outside of the n-best list; local is constrained to the n-best list, though it still ignores the ordering according to the linear model; finally, filtered and hope represent different ways of letting the model score influence the selected surrogate.
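The three n-best-based surrogacy modes can be sketched as follows, with a stand-in `ter` metric and `model_score` function (all names are ours, not the toolkit's):

```python
def local_surrogate(nbest, postedit, ter):
    """'Local' surrogate (Liang et al., 2006): the n-best entry closest to
    the post-edit under TER; lower TER is better."""
    return min(nbest, key=lambda y: ter(y, postedit))

def filtered_surrogate(nbest, y_pred, postedit, ter):
    """'Filtered' surrogate: the first n-best entry that improves TER over
    the 1-best prediction; None if no entry improves."""
    base = ter(y_pred, postedit)
    for y in nbest:
        if ter(y, postedit) < base:
            return y
    return None

def hope_surrogate(nbest, postedit, ter, model_score):
    """'Hope' surrogate (Chiang, 2012): jointly maximize model score and
    negative TER; `model_score` stands in for w_t . phi(x_t, y, h)."""
    return max(nbest, key=lambda y: model_score(y) - ter(y, postedit))
```

Only the hope surrogate lets the current linear model and the TER objective trade off against each other, which matches the observation below that it is the only mode for which learning reliably converges.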
As shown in Figure 2, regret and TER decrease with the increased amount of information about the assumed linear model that is induced by the surrogate translations: Learning from oracle surrogates does not converge in regret and TER. The local surrogates extracted from 1,000-best lists still do not make effective use of the linear model, while filtered surrogates enforce an improvement over the prediction under TER towards the human post-edit and improve convergence in learning. Empirically, convergence is achieved only for hope surrogates that jointly maximize negative TER and linear model score, with a convergence behavior that is very similar to learning from weak α-informative feedback at α = 0.1. We quantify this in Table 2, where we see that the improvement in TER over the prediction, which holds for any hope derivation, corresponds to an improvement in α-informativeness: hope surrogates are strictly α-informative in 83.3% of the cases in our experiment, whereas we find a correspondence to strict α-informativeness in only 45.74% or 39.46% of the cases for filtered and local surrogates, respectively.

Discussion
We presented a theoretical analysis of online learning for SMT from a coactive learning perspective. This viewpoint allowed us to give regret and generalization bounds for perceptron-style online learners that fall outside the convex optimization scenario because of latent variables and changing feedback structures. We introduced the concept of weak feedback into online learning for SMT, and provided proof-of-concept experiments whose goal was to show that learning from weak feedback converges to minimal regret, albeit at a slower rate than learning from strong feedback. Furthermore, we showed that the SMT standard of learning from surrogate hope derivations can be interpreted as a search for weak improvements under the assumed linear model. This justifies the importance of admitting an underlying linear model in computing surrogate derivations from a coactive learning perspective.
Finally, we hope that our analysis motivates further work in which the idea of learning from weak feedback is taken a step further. For example, our results could perhaps be strengthened by applying richer feature sets or dynamic phrase table extension in experiments on interactive SMT. Our theory would support a new post-editing scenario where users pick translations from the n-best list that they consider improvements over the prediction. Furthermore, it would be interesting to see whether "light" post-edits, which are better reachable and more easily elicited than "full" post-edits, provide a strong enough signal for learning.