Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation

The goal of counterfactual learning for statistical machine translation (SMT) is to optimize a target SMT system from logged data that consist of user feedback to translations that were predicted by another, historic SMT system. A challenge arises from the fact that risk-averse commercial SMT systems deterministically log the most probable translation. The lack of sufficient exploration of the SMT output space seemingly contradicts the theoretical requirements for counterfactual learning. We show that counterfactual learning from deterministic bandit logs is nevertheless possible by smoothing out deterministic components in learning. This can be achieved by additive and multiplicative control variates that avoid degenerate behavior in empirical risk minimization. Our simulation experiments show improvements of up to 2 BLEU points by counterfactual learning from deterministic bandit feedback.


Introduction
Commercial SMT systems make it possible to record large amounts of interaction log data at no cost. Such logs typically contain a record of the source, the translation predicted by the system, and the user feedback. The latter can be gathered directly if explicit user quality ratings of translations are supported, or inferred indirectly from the interaction of the user with the translated content. Indirect feedback in the form of user clicks on displayed ads has been shown to be a valuable feedback signal in response prediction for display advertising (Bottou et al., 2013). Similar to the computational advertising scenario, one could imagine a scenario where SMT systems are optimized from partial information in the form of user feedback to predicted translations, instead of from manually created reference translations. This learning scenario has been investigated in the areas of bandit learning (Bubeck and Cesa-Bianchi, 2012) and reinforcement learning (RL) (Sutton and Barto, 1998). Figure 1 illustrates the learning protocol using the terminology of bandit structured prediction (Sokolov et al., 2016; Kreutzer et al., 2017), where at each round, a system (corresponding to a policy in RL terms) makes a prediction (also called action in RL, or pulling an arm of a bandit), and receives a reward, which is used to update the system. Counterfactual learning attempts to reuse existing interaction data where the predictions have been made by a historic system different from the target system. This enables offline or batch learning from logged data, and is important if online experiments that deploy the target system are risky and/or expensive. Counterfactual learning tasks include policy evaluation, i.e. estimating how a target policy would have performed if it had been in control of choosing the predictions for which the rewards were logged, and policy optimization (also called policy learning), i.e. optimizing parameters of a target policy given the logged data from the historic system.
Both tasks are called counterfactual, or off-policy in RL terms, since the target policy was actually not in control during logging. Figure 2 shows the learning protocol for off-policy learning from partial feedback. The crucial trick to obtain unbiased estimators to evaluate and to optimize the off-policy system is to correct the sampling bias of the logging policy. This can be done by importance sampling where the estimate is corrected by the inverse propensity score (Rosenbaum and Rubin, 1983) of the historical algorithm, mitigating the problem that predictions that were favored by the historical system are over-represented in the logs. As shown by Langford et al. (2008) or Strehl et al. (2010), a sufficient exploration of the output space by the logging system is a prerequisite for counterfactual learning. If the logging policy acts stochastically in predicting outputs, this condition is satisfied, and inverse propensity scoring can be applied to correct the sampling bias. However, commercial SMT systems usually try to avoid any risk and only log the most probable translation. This effectively results in deterministic logging policies, which seemingly makes the theory and practice of off-policy methods inapplicable to counterfactual learning in SMT.
This paper presents a case study in counterfactual learning for SMT that shows that policy optimization from deterministic bandit logs is possible despite these seemingly contradictory theoretical requirements. We formalize our learning problem as an empirical risk minimization over logged data. While a simple empirical risk minimizer can show degenerate behavior where the objective is minimized by avoiding or over-representing training samples, thus suffering from decreased generalization ability, we show that the use of control variates can remedy this problem. Techniques such as doubly-robust policy evaluation and learning (Dudik et al., 2011) or weighted importance sampling (Jiang and Li, 2016; Thomas and Brunskill, 2016) can be interpreted as additive (Ross, 2013) or multiplicative control variates (Kong, 1992) that serve for variance reduction in estimation. We observe that a further effect of these techniques is that of smoothing out deterministic components by taking the whole output space into account. Furthermore, we conjecture that while outputs are logged deterministically, the stochastic selection of inputs serves as sufficient exploration in parameter optimization over a joint feature representation over inputs and outputs. We present experiments using simulated bandit feedback for two different SMT tasks, showing improvements of up to 2 BLEU in SMT domain adaptation from deterministically logged bandit feedback. This result, together with a comparison to the standard case of policy learning from stochastically logged simulated bandit feedback, confirms the effectiveness of our proposed techniques.

Related Work
Counterfactual learning has been known under the name of off-policy learning in various fields that deal with partial feedback, namely contextual bandits (Langford et al. (2008); Strehl et al. (2010); Dudik et al. (2011); Li et al. (2015), inter alia), reinforcement learning (Sutton and Barto (1998); Precup et al. (2000); Jiang and Li (2016); Thomas and Brunskill (2016), inter alia), and structured prediction (Swaminathan and Joachims (2015a,b), inter alia). The idea behind these approaches is to first perform policy evaluation and then policy optimization, under the assumption that better evaluation leads to better optimization. Our work puts a focus on policy optimization in an empirical risk minimization framework for deterministically logged data. Since our experiment is a simulation study, we can compare the deterministic case to the standard scenario of policy optimization and evaluation under stochastic logging.
Variance reduction by additive control variates has implicitly been used in doubly robust techniques (Dudik et al., 2011; Jiang and Li, 2016). However, the connection to Monte Carlo techniques was not made explicit until Thomas and Brunskill (2016), nor has the control variate technique of optimizing the variance reduction by adjusting a linear interpolation scalar (Ross, 2013) been applied in off-policy learning. Similarly, the technique of weighted importance sampling has been used as a variance reduction technique in off-policy learning (Precup et al., 2000; Jiang and Li, 2016; Thomas and Brunskill, 2016). The connection to multiplicative control variates (Kong, 1992) has been made explicit in Swaminathan and Joachims (2015b). To our knowledge, our analysis of both control variate techniques from the perspective of avoiding degenerate behavior in learning from deterministically logged data is novel.

Counterfactual Learning from Deterministic Bandit Logs
Problem Definition. The problem of counterfactual learning (in the following used in the sense of counterfactual optimization) for bandit structured prediction can be described as follows: Let $\mathcal{X}$ be a structured input space, let $\mathcal{Y}(x)$ be the set of possible output structures for input $x$, and let $\Delta : \mathcal{Y} \to [0, 1]$ be a reward function (and $\delta = -\Delta$ be the corresponding task loss function) quantifying the quality of structured outputs. We are given a data log of triples $D = \{(x_t, y_t, \delta_t)\}_{t=1}^{n}$ where outputs $y_t$ for inputs $x_t$ were generated by a logging system, and loss values $\delta_t$ were observed only at the generated data points. In case of stochastic logging with probability $\pi_0$, the inverse propensity scoring approach (Rosenbaum and Rubin, 1983) uses importance sampling to achieve an unbiased estimate of the expected loss under the parametric target policy $\pi_w$:
$$\hat{R}_{\mathrm{IPS}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \frac{\pi_w(y_t|x_t)}{\pi_0(y_t|x_t)} \qquad (1)$$
In case of deterministic logging, we are confined to empirical risk minimization:
$$\hat{R}_{\mathrm{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t|x_t) \qquad (2)$$
Equation (2) assumes deterministically logged outputs with propensity $\pi_0 = 1$, $t = 1, \ldots, n$, of the historical system. We call this objective the deterministic propensity matching (DPM) objective since it matches deterministic outputs of the logging system to outputs in the n-best list of the target system. For optimization under deterministic logging, a sampling bias is unavoidable since objective (2) does not correct it by importance sampling. Furthermore, the DPM estimator may show a degenerate behavior in learning. This problem can be remedied by the use of control variates, as we will discuss in Section 5.
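The difference between the two estimators can be made concrete with a minimal Python sketch. It assumes that the inverse propensity scoring objective is the average of logged losses weighted by the ratio of target to logging probability, and that the DPM objective is the same average with the propensity fixed to 1. The function names and all numbers are hypothetical and only serve as illustration.

```python
def ips_risk(losses, target_probs, logging_probs):
    """Importance-weighted average of the logged losses (stochastic logging)."""
    n = len(losses)
    weighted = (d * p / q for d, p, q in zip(losses, target_probs, logging_probs))
    return sum(weighted) / n


def dpm_risk(losses, target_probs):
    """Same average with the propensity fixed to 1 (deterministic logging)."""
    return ips_risk(losses, target_probs, [1.0] * len(losses))


# Toy log of three examples; losses are negated rewards (delta = -Delta).
losses = [-0.8, -0.2, -0.5]
target_probs = [0.5, 0.1, 0.3]     # pi_w(y_t | x_t), hypothetical values
logging_probs = [0.25, 0.5, 0.6]   # pi_0(y_t | x_t), hypothetical values

print("IPS estimate:", ips_risk(losses, target_probs, logging_probs))
print("DPM estimate:", dpm_risk(losses, target_probs))
```

The two values differ because only the first estimator compensates for how often the logging system chose each output.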
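Learning Principle: Doubly Controlled Empirical Risk Minimization. Our first modification of Equation (2) was originally motivated by the use of weighted importance sampling in inverse propensity scoring because of its observed stability and variance reduction effects (Precup et al., 2000; Jiang and Li, 2016; Thomas and Brunskill, 2016). We call this objective the reweighted deterministic propensity matching (DPM+R) objective:
$$\hat{R}_{\mathrm{DPM+R}}(\pi_w) = \sum_{t=1}^{n} \delta_t \, \bar{\pi}_w(y_t|x_t) = \frac{\frac{1}{n}\sum_{t=1}^{n} \delta_t \, \pi_w(y_t|x_t)}{\frac{1}{n}\sum_{t=1}^{n} \pi_w(y_t|x_t)}, \quad \text{where } \bar{\pi}_w(y_t|x_t) = \frac{\pi_w(y_t|x_t)}{\sum_{s=1}^{n} \pi_w(y_s|x_s)} \qquad (3)$$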
From the perspective of Monte Carlo simulation, the advantage of this modification can be explained by viewing reweighting as a multiplicative control variate (Swaminathan and Joachims, 2015b). Let $Z = \delta_t \pi_w(y_t|x_t)$ and $W = \pi_w(y_t|x_t)$ be two random variables, then the variance of the ratio of their sample means, $r = \bar{Z}/\bar{W}$, can be approximately written as follows (Kong, 1992): $\mathrm{Var}(r) \approx \frac{1}{n}\left(r^2 \mathrm{Var}(W) + \mathrm{Var}(Z) - 2r\,\mathrm{Cov}(W, Z)\right)$. This shows that a positive correlation between the variable $W$, representing the target model probability, and the variable $Z$, representing the target model scaled by the task loss function, will reduce the variance of the estimator. Since there are exponentially many outputs to choose from for each input during logging, variance reduction is useful in counterfactual learning even in the deterministic case. Under a stochastic logging policy, a similar modification can be made to objective (1) by reweighting the ratio $\rho_t = \frac{\pi_w(y_t|x_t)}{\pi_0(y_t|x_t)}$ as $\bar{\rho}_t = \frac{\rho_t}{\sum_{s=1}^{n} \rho_s}$. We will use this reweighted IPS objective, called IPS+R, in our comparison experiments that use stochastically logged data.
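The reweighting step can be illustrated with a short Python sketch. It assumes that reweighting divides each weight by the sum of all weights in the log, so that the reweighted objective equals the ratio of the sample mean of $Z$ to the sample mean of $W$. The same helper is applied to target probabilities (DPM+R) and to importance ratios (IPS+R). Function names and numbers are hypothetical.

```python
def reweight(weights):
    """Normalize weights over the whole log so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def reweighted_risk(losses, weights):
    """Sum of logged losses under the reweighted weights."""
    return sum(d * w for d, w in zip(losses, reweight(weights)))


def ratio_of_means(losses, weights):
    """Mean of Z = loss * weight divided by mean of W = weight."""
    n = len(losses)
    z_mean = sum(d * w for d, w in zip(losses, weights)) / n
    w_mean = sum(weights) / n
    return z_mean / w_mean


losses = [-0.8, -0.2, -0.5]
target_probs = [0.5, 0.1, 0.3]     # pi_w(y_t | x_t), hypothetical values
logging_probs = [0.25, 0.5, 0.6]   # pi_0(y_t | x_t), hypothetical values
ratios = [p / q for p, q in zip(target_probs, logging_probs)]

dpm_r = reweighted_risk(losses, target_probs)   # deterministic logging
ips_r = reweighted_risk(losses, ratios)         # stochastic logging
print("DPM+R:", dpm_r, "IPS+R:", ips_r)
```

Because the weights are normalized over the log, the reweighted objective only makes sense on a batch or mini-batch of logged examples, not on a single example.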
A further modification of Equation (3) is motivated by the incorporation of a direct reward estimation method in the inverse propensity scorer as proposed in the doubly-robust estimator (Dudik et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016). Let $\hat{\delta}(x_t, y_t)$ be a regression-based reward model trained on the logged data, and let $\hat{c}$ be a scalar that allows optimizing the estimator for minimal variance (Ross, 2013). We define a doubly controlled empirical risk minimization objective $\hat{R}_{\hat{c}\mathrm{DC}}$ as follows (for $\hat{c} = 1$ we arrive at a similar objective called $\hat{R}_{\mathrm{DC}}$):
$$\hat{R}_{\hat{c}\mathrm{DC}}(\pi_w) = \sum_{t=1}^{n} \left(\delta_t - \hat{c}\,\hat{\delta}(x_t, y_t)\right) \bar{\pi}_w(y_t|x_t) + \frac{\hat{c}}{n} \sum_{t=1}^{n} \sum_{y \in \mathcal{Y}(x_t)} \hat{\delta}(x_t, y)\, \pi_w(y|x_t) \qquad (4)$$
From the perspective of Monte Carlo simulation, the doubly robust estimator can be seen as variance reduction via additive control variates (Ross, 2013). Let $X = \delta_t$ and $Y = \hat{\delta}_t$ be two random variables. Then $\bar{Y} = \sum_{y \in \mathcal{Y}(x_t)} \hat{\delta}(x_t, y)\, \pi_w(y|x_t)$ is the expectation of $Y$, and Equation (4) can be rewritten as an expectation of $X - \hat{c}\,(Y - \bar{Y})$, whose variance is $\mathrm{Var}(X - \hat{c}\,(Y - \bar{Y})) = \mathrm{Var}(X) + \hat{c}^2\, \mathrm{Var}(Y) - 2\hat{c}\, \mathrm{Cov}(X, Y)$ (Ross (2013), Chap. 9.2). Again this shows that the variance of the estimator can be reduced if the variable $X$, representing the reward function, and the variable $Y$, representing the regression-based reward model, are positively correlated. The optimal scalar parameter $\hat{c}$ can be derived easily by taking the derivative of the variance term, leading to
$$\hat{c} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} \qquad (5)$$
In case of stochastic logging the reweighted target probability $\bar{\pi}_w(y_t|x_t)$ is replaced by a reweighted ratio $\bar{\rho}_t$. We will use such reweighted models of the original doubly robust model, with and without optimal $\hat{c}$, called DR and $\hat{c}$DR, in our experiments that use stochastic logging.
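The additive control variate can be sketched in Python as follows. The sketch assumes the standard control variate result that the variance of $X - \hat{c}(Y - \bar{Y})$ is minimized by the covariance of $X$ and $Y$ divided by the variance of $Y$, and it assumes one particular scaling of the objective: a sum over sum-normalized target probabilities for the correction term and a plain average for the direct estimate. All names and values are hypothetical, and the expected estimates per input would in practice be computed over the n-best list of the target system.

```python
def mean(values):
    return sum(values) / len(values)


def optimal_c(losses, estimates):
    """Variance-minimizing scalar: Cov(X, Y) / Var(Y) over the log."""
    mx, my = mean(losses), mean(estimates)
    cov = mean([(x - mx) * (y - my) for x, y in zip(losses, estimates)])
    var = mean([(y - my) ** 2 for y in estimates])
    return cov / var


def dc_risk(losses, estimates, target_probs, expected_estimates, c=1.0):
    """Doubly controlled risk (sketch).

    Correction term: residual losses under reweighted target probabilities.
    Direct term: average expected estimate under the target policy.
    """
    total = sum(target_probs)
    correction = sum((d - c * e) * p / total
                     for d, e, p in zip(losses, estimates, target_probs))
    return correction + c * mean(expected_estimates)


losses = [-0.8, -0.2, -0.5]              # observed delta_t
estimates = [-0.7, -0.3, -0.5]           # estimated loss of the logged output
target_probs = [0.5, 0.1, 0.3]           # pi_w(y_t | x_t)
expected_estimates = [-0.6, -0.4, -0.5]  # sum_y estimate(x_t, y) * pi_w(y | x_t)

c_hat = optimal_c(losses, estimates)
print("c_hat:", c_hat)
print("DC risk:", dc_risk(losses, estimates, target_probs, expected_estimates))
print("c_hat DC risk:",
      dc_risk(losses, estimates, target_probs, expected_estimates, c=c_hat))
```

Setting `c=0.0` removes the reward model entirely and recovers the reweighted objective, which is a quick sanity check for an implementation.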
Learning Algorithms. Applying a stochastic gradient descent update rule $w_{t+1} = w_t - \eta \nabla \hat{R}(\pi_w)_t$ to the objective functions defined above leads to a variety of algorithms. The gradients of the objectives can be derived by using the score function gradient estimator (Fu, 2006) and are shown in Table 1. Stochastic gradient descent algorithms apply to any differentiable policy $\pi_w$, thus our methods can be applied to a variety of systems, including linear and non-linear models. Since previous work on off-policy methods in RL and contextual bandits has been done in the area of linear classification, we start with an adaptation of off-policy methods to linear SMT models in our work. We assume a Gibbs model
$$\pi_w(y_t|x_t) = \frac{e^{\alpha (w^\top \phi(x_t, y_t))}}{\sum_{y \in \mathcal{Y}(x_t)} e^{\alpha (w^\top \phi(x_t, y))}} \qquad (6)$$
based on a feature representation $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, a weight vector $w \in \mathbb{R}^d$, and a smoothing parameter $\alpha \in \mathbb{R}^+$, yielding the following simple derivative: $\nabla \log \pi_w(y_t|x_t) = \alpha \left( \phi(x_t, y_t) - \sum_{y \in \mathcal{Y}(x_t)} \phi(x_t, y)\, \pi_w(y|x_t) \right)$.
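As an illustration of how such an update could look for a linear Gibbs model, the following Python sketch performs one stochastic gradient step on the DPM objective. It assumes the score function identity, i.e. that the gradient of the loss-weighted probability is the loss times the probability times the gradient of the log-probability, and it treats a small explicit candidate list as the output space. Feature vectors, weights and the learning rate are hypothetical.

```python
import math


def gibbs_probs(w, feats, alpha=1.0):
    """Softmax over alpha * (w . phi) for every candidate output."""
    scores = [alpha * sum(wi * fi for wi, fi in zip(w, f)) for f in feats]
    top = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def grad_log_prob(w, feats, idx, alpha=1.0):
    """alpha * (phi of the logged output - expected phi under the model)."""
    probs = gibbs_probs(w, feats, alpha)
    dims = range(len(w))
    expected = [sum(p * f[k] for p, f in zip(probs, feats)) for k in dims]
    return [alpha * (feats[idx][k] - expected[k]) for k in dims]


def dpm_sgd_step(w, feats, idx, loss, eta, alpha=1.0):
    """One SGD step on loss * pi_w(logged output), via the score function."""
    prob = gibbs_probs(w, feats, alpha)[idx]
    grad = grad_log_prob(w, feats, idx, alpha)
    return [wi - eta * loss * prob * gi for wi, gi in zip(w, grad)]


# Three candidate translations with two features each (hypothetical).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.1, -0.2]
logged = 0            # index of the logged translation in the candidate list
loss = -0.8           # negated reward of the logged translation

before = gibbs_probs(w, feats)[logged]
w_new = dpm_sgd_step(w, feats, logged, loss, eta=0.1)
after = gibbs_probs(w_new, feats)[logged]
print(before, "->", after)
```

Since the loss is a negated reward, the step moves probability mass towards the logged translation in proportion to its reward.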

Experiments
Setup. In our experiments, we aim to simulate the following scenario: We assume that it is possible to divert a small fraction of the user interaction traffic for the purpose of policy evaluation and to perform stochastic logging on this small data set. The main traffic is assumed to be logged deterministically, following a conservative regime where one-best translations are used for an SMT system that does not change frequently over time. Since our experiments are simulation studies, we will additionally perform stochastic logging, and compare policy learning for the (realistic) case of deterministic logging with the (theoretically motivated) case of stochastic logging.
In our policy learning experiments under deterministic logging, we evaluate the empirical risk minimization algorithms derived from objectives (3) (DPM+R) and (4). For the doubly controlled objective we employ two variants: First, $\hat{c}$ is set to 1 as in Dudik et al. (2011) (DC). Second, we calculate $\hat{c}$ as described in Equation (5) ($\hat{c}$DC). The algorithms used in policy evaluation and for policy learning under stochastic logging are variants of these objectives that replace $\bar{\pi}$ by $\bar{\rho}$ to yield estimators IPS+R, DR, and $\hat{c}$DR of the expected loss.
All objectives will be employed in a domain adaptation scenario for machine translation. A system trained on out-of-domain data will be used to collect feedback on in-domain data. This data will serve as the logged data $D$ in the learning experiments. We conduct two SMT tasks with hypergraph re-decoding: The first is German-to-English and is trained using a concatenation of the Europarl corpus (Koehn, 2005), the Common Crawl corpus and the News Commentary corpus (Koehn and Schroeder, 2007). The goal is to adapt the trained system to the domain of transcribed TED talks using the TED parallel corpus (Tiedemann, 2012). A second task uses the French-to-English Europarl data with the goal of domain adaptation to news articles with the News Commentary corpus (Koehn and Schroeder, 2007). We split off two parts from the TED corpus to be used as validation and test data for the learning experiments. For the News Commentary corpus we use the splits provided at the WMT shared task, namely nc-devtest2007 as validation data and nc-test2007 as test data. An overview of the data statistics can be seen in Table 2.
As baseline, an out-of-domain system is built using the SCFG framework CDEC (Dyer et al., 2010) with dense features (10 standard features and 2 for the language model). After tokenizing and lowercasing the training data, the data were word aligned using CDEC's fast_align. A 4-gram language model is built on the target language side of the out-of-domain data using KENLM (Heafield et al., 2013). For News, we additionally assume access to in-domain target language text and train another in-domain language model on that data, increasing the number of features to 14 for News.
The framework uses a standard linear Gibbs model whose distribution can be peaked using a parameter $\alpha$ (see Equation (6)): Higher values of $\alpha$ shift the probability of the one-best translation closer to 1 and all others closer to 0. Using $\alpha > 1$ during training encourages learning models that are optimal when outputting the one-best translation. In our experiments, we found $\alpha = 5$ to work well on validation data.
Additionally, we tune a system using CDEC's MERT implementation (Och, 2003) on the in-domain data with their references. This full-information in-domain system conveys the best possible improvement using the given training data. It can thus be seen as the oracle system for the systems which are learnt using the same input-side training data, but have only bandit feedback available to them as a learning signal. All systems are evaluated using the corpus-level BLEU metric (Papineni et al., 2002).
The logged data $D$ is created by translating the in-domain training data of the corpora using the original out-of-domain systems, and logging the one-best translation. For the stochastic experiments, the translations are sampled from the model distribution. The feedback to the logged translation is simulated using the reference and sentence-level BLEU (Nakov et al., 2012).
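The peaking effect of $\alpha$ can be checked numerically with a few lines of Python. The sketch assumes a softmax over model scores scaled by $\alpha$; the three scores are hypothetical.

```python
import math


def peaked_probs(scores, alpha):
    """Gibbs distribution over candidates with scores scaled by alpha."""
    top = max(scores)
    exps = [math.exp(alpha * (s - top)) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


scores = [2.0, 1.5, 1.0]     # hypothetical model scores w . phi(x, y)
for alpha in (1.0, 5.0):
    probs = peaked_probs(scores, alpha)
    print("alpha =", alpha, "one-best probability =", round(probs[0], 3))
```

With these scores the one-best probability rises from roughly one half at $\alpha = 1$ to above 0.9 at $\alpha = 5$.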
Direct Reward Estimation. When creating the logged data $D$, we also record the feature vectors of the translations to train the direct reward estimate that is needed for ($\hat{c}$)DC. Using the feature vector as input and the per-sentence BLEU as the output value, we train a regression-based random forest with 10 trees using scikit-learn (Pedregosa et al., 2011). To measure performance, we perform 5-fold cross-validation and measure the macro average between estimated rewards and the true rewards from the log: $\left| \frac{1}{n} \sum_{t=1}^{n} \delta(x_t, y_t) - \frac{1}{n} \sum_{t=1}^{n} \hat{\delta}(x_t, y_t) \right|$. We also report the micro average, which quantifies how far off one can expect the model to be for a random sample: $\frac{1}{n} \sum_{t=1}^{n} \left| \delta(x_t, y_t) - \hat{\delta}(x_t, y_t) \right|$. The final model used in the experiments is trained on the full training data. Cross-validation results for the regression-based direct reward model can be found in Table 3.
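The two error measures behave differently, which a small Python sketch makes visible. It assumes that the macro average is the absolute difference between the mean true reward and the mean estimated reward, and that the micro average is the mean absolute per-sample difference. The reward values are hypothetical.

```python
def macro_average_error(true_rewards, estimated_rewards):
    """Absolute difference between the two corpus-level means."""
    n = len(true_rewards)
    return abs(sum(true_rewards) / n - sum(estimated_rewards) / n)


def micro_average_error(true_rewards, estimated_rewards):
    """Mean absolute error per logged sample."""
    n = len(true_rewards)
    return sum(abs(t - e) for t, e in zip(true_rewards, estimated_rewards)) / n


true_rewards = [0.8, 0.2, 0.5]        # per-sentence BLEU from the log
estimated_rewards = [0.7, 0.3, 0.5]   # predictions of the reward model

print("macro:", macro_average_error(true_rewards, estimated_rewards))
print("micro:", micro_average_error(true_rewards, estimated_rewards))
```

In this toy case over- and underestimates cancel in the macro average, while the micro average still reports the per-sample error.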
Policy Evaluation. Policy evaluation aims to use the logged data $D$ to estimate the performance of the target system $\pi_w$. The small logged data set $D_{eval}$ that is diverted for policy evaluation is created by translating only 10k sentences of the in-domain training data with the out-of-domain system and sampling translations according to the model probability. Again we record the sentence-level BLEU as the feedback. The reference translations that also exist for those 10k sentences are used to measure the ground truth BLEU value for translations using the full-information in-domain system. The goal of evaluation is to achieve values of IPS+R, DR, and $\hat{c}$DR on $D_{eval}$ that are as close as possible to the ground truth BLEU value.
To be able to measure variance, we create five folds of $D_{eval}$, differing in random seeds. We report the average difference between the ground truth BLEU score and the value of the log-based policy evaluation, as well as the standard deviation, in Table 4. We see that IPS+R underestimates the BLEU value by 7.78 on News. DR overestimates instead. $\hat{c}$DR achieves the closest estimate, overestimating the true value by less than 1 BLEU. On TED, all policy evaluation results are overestimates. For the DR variants the overestimation result can be explained by the random forests' tendency to overestimate. Optimal $\hat{c}$DR can correct for this, but not always sufficiently.
Policy Learning. In our learning experiments, learning starts with the weights $w_0$ from the out-of-domain model. As this was the system that produced the logged data $D$, the first iteration will have the same translations in the one-best position. After some iterations, however, the translation that was logged may not be in the first position any more. In this case, the n-best list is searched for the logged translation. For speed reasons, the scores of the translation system are normalized to probabilities using the first 1,000 unique entries in the n-best list, rather than using the full hypergraph. Our experiments showed that this did not impact the quality of learning.
In order for the multiplicative control variate to be effective, the learning procedure has to utilize mini-batches. If the mini-batch size is chosen too small, the estimates of the control variates may not be reliable. We test mini-batch sizes of 30k and 10k examples, where 30k on News amounts to batch training since the mini-batch spans the entire training set. Mini-batch size $\beta$ and early stopping point were selected by choosing the setup and iteration that achieved the highest BLEU score on the one-best translations for the validation data. The learning rate $\eta$ was selected in the same way, where the possible values were 1e−4, 1e−5, 1e−6 or, alternatively, Adadelta (Zeiler, 2012), which sets the learning rate on a per-feature basis. The results on both validation and test set are reported in Table 5. Statistical significance of the out-of-domain system compared to all other systems is measured using Approximate Randomization testing (Noreen, 1989).
For the deterministic case, we see that in general DPM+R shows the lowest increase but can still significantly outperform the baseline. An explanation of why DPM+R cannot improve any further is given separately below. DC yields improvements of up to 1.5 BLEU points, while $\hat{c}$DC obtains improvements of up to 2 BLEU points over the out-of-domain baseline. In more detail, on the TED data, DC closes half of the gap of nearly 3 BLEU between the out-of-domain and the full-information in-domain system. $\hat{c}$DC improves by a further 0.6 BLEU, which is a significant improvement at p = 0.0017. Also note that, while $\hat{c}$DC takes more iterations to reach its best result on the validation data, $\hat{c}$DC already outperforms DC at the stopping iteration of DC. At this point $\hat{c}$DC is better by 0.18 BLEU on the validation set and continues to increase until its own stopping iteration. The final result of $\hat{c}$DC falls only 0.8 BLEU behind the oracle system that had references available during its learning process. Considering the substantial difference in information that both systems had available, this is remarkable. The improvements on the News corpus show similar tendencies. Again there is a gap of nearly 3 BLEU to close, and with an improvement of 1.05 BLEU points, DC can achieve a notable result. $\hat{c}$DC was able to further improve on this but not as successfully as was the case for the TED corpus. Analyzing the actual $\hat{c}$ values that were calculated in both experiments allows us to gain an insight as to why this was the case: For TED, $\hat{c}$ is on average 1.35. In the case of News, however, $\hat{c}$ has a maximum value of 1.14 and thus stays quite close to 1, which would equate to using DC. It is thus not surprising that there is no significant difference between DC and $\hat{c}$DC.
Comparison to the Stochastic Case. Even if not realistic for commercial applications of SMT, our simulation study allows us to stochastically log large amounts of data in order to compare learning from deterministic logs to the standard case. As shown in Table 5, the relations between algorithms and even the absolute improvements are similar for stochastic and deterministic logging. Significance tests between each deterministic/stochastic experiment pair show a significant difference only in case of DC/DR on TED data. However, the DR result still does not significantly outperform the best deterministic objective on TED ($\hat{c}$DC). The p values for all other experiment pairs lie above 0.1. From this we can conclude that it is indeed an acceptable practice to log deterministically.
Langford et al. (2008) show that counterfactual learning is impossible unless the logging system sufficiently explores the output space. This condition is seemingly not satisfied if the logging system acts according to a deterministic policy. Furthermore, since techniques such as "exploration over time" (Strehl et al., 2010) are not applicable to commercial SMT systems that are not frequently changed over time, the case of counterfactual learning for SMT seems hopeless. However, our experiments present evidence to the contrary. In the following, we present an analysis that aims to explain this apparent contradiction.
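For readers unfamiliar with approximate randomization, the following Python sketch shows the general idea of the test. It is a simplification that assumes the test statistic is the difference in mean per-sentence scores of two systems, whereas significance on corpus-level BLEU requires recomputing the corpus metric for every shuffle. The function name, the number of trials and the score lists are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Estimate a p-value for the difference in mean score of two systems.

    In each trial the outputs of the two systems are swapped per sentence
    with probability 0.5; the p-value is the share of trials in which the
    shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


baseline = [0.1] * 20          # hypothetical per-sentence scores
improved = [0.9] * 20
print("p-value:", approximate_randomization(improved, baseline))
```

A small p-value indicates that the observed difference is unlikely to arise from randomly reassigning outputs between the two systems.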

Analysis
Implicit Exploration. In an experimental comparison between stochastic and deterministic logging for bandit learning in computational advertising, Chapelle and Li (2011) observed that varying contexts (representing user and page visited) induces enough exploration into ad selection such that learning becomes possible. A similar implicit exploration can also be attributed to the case of SMT: An identical input word or phrase can lead, depending on the other words and phrases in the input sentence, to different output words and phrases. Moreover, an identical output word or phrase can appear in different output sentences. Across the entire log, this implicitly performs the exploration on phrase translations that seems to be missing at first glance.
Smoothing by Multiplicative Control Variates. The DPM estimator can show a degenerate behavior in that the objective can be minimized simply by setting the probability of every logged data point to 1.0. This over-represents logged data that received low rewards, which is undesired. Furthermore, systems optimized with this objective cannot properly discriminate between the translations in the output space. This can be seen as a case of translation invariance of the objective, as has been previously noted by Swaminathan and Joachims (2015b): Adding a small constant c to the probability of every data point in the log increases the overall value of the objective without improving the discriminative power between high-reward and low-reward translations. DPM+R solves the degeneracy of DPM by defining a probability distribution over the logged data by reweighting via the multiplicative control variate. After reweighting, the objective value will decrease if the probability of a low-reward translation increased, as it takes away probability mass from other, higher reward samples. Because of this trade-off, balancing the probabilities over low-reward and high-reward samples becomes important, as desired.
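The degenerate behavior and its remedy can be reproduced on a toy log of two examples with a minimal Python sketch. It assumes the DPM objective is the average of loss times target probability and the DPM+R objective is the same quantity normalized by the sum of target probabilities. Since losses are negated rewards, a lower value is better. All numbers are hypothetical.

```python
def dpm(losses, probs):
    """Average of loss * probability over the log."""
    return sum(d * p for d, p in zip(losses, probs)) / len(losses)


def dpm_r(losses, probs):
    """Same sum, normalized by the total probability mass on the log."""
    return sum(d * p for d, p in zip(losses, probs)) / sum(probs)


losses = [-0.9, -0.1]           # one high-reward and one low-reward translation
balanced = [0.5, 0.5]
all_ones = [1.0, 1.0]           # every logged output gets probability 1
low_reward_up = [0.5, 1.0]      # only the low-reward output is boosted

# DPM improves whenever any logged probability grows, even a low-reward one.
print(dpm(losses, balanced), dpm(losses, low_reward_up), dpm(losses, all_ones))

# DPM+R ignores uniform scaling and gets worse if the low-reward output grows.
print(dpm_r(losses, balanced), dpm_r(losses, low_reward_up), dpm_r(losses, all_ones))
```

Under DPM+R, probability mass given to the low-reward translation is taken away from the high-reward one, so the model has to trade them off.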

Smoothing by Additive Control Variates. Despite reweighting, DPM+R can still show a degenerate behavior by setting the probabilities of only the highest-reward samples to 1.0, while avoiding all other logged data points. This clearly hampers the generalization ability of the model since inputs that have been avoided in training will not receive a proper ranking of their translations.
The use of an additive control variate can solve this problem by using a reward estimate that takes the full output space into account. The objective will now be increased if the probability of translations with high estimated reward is increased, even if they were not seen in training. This will shift probability mass to unseen data with high estimated reward, and thus improve the generalization ability of the model.

Conclusion
In this paper, we showed that off-policy learning from deterministic bandit logs for SMT is possible if smoothing techniques based on control variates are used. These techniques will avoid degenerate behavior in learning and improve generalization of empirical risk minimization over logged data. Furthermore, we showed that standard off-policy evaluation is applicable to SMT under stochastic logging policies.
To our knowledge, this is the first application of counterfactual learning to a complex structured prediction problem like SMT. Since our objectives are agnostic of the choice of the underlying model $\pi_w$, it is also possible to transfer our techniques to non-linear models such as neural machine translation. This will be a desideratum for future work.