Distilling Knowledge for Search-based Structured Prediction

Many natural language processing tasks can be modeled as structured prediction and solved as a search problem. In this paper, we distill an ensemble of multiple models trained with different initialization into a single model. In addition to learning to match the ensemble's probability output on the reference states, we also use the ensemble to explore the search space and learn from the states encountered during exploration. Experimental results on two typical search-based structured prediction tasks, transition-based dependency parsing and neural machine translation, show that distillation effectively improves the single model's performance. The final model achieves improvements of 1.32 in LAS and 2.65 in BLEU score on these two tasks respectively over strong baselines, and it outperforms the greedy structured prediction models in the previous literature.


Introduction
Search-based structured prediction models the generation of natural language structure (part-of-speech tags, syntax trees, translations, semantic graphs, etc.) as a search problem (Collins and Roark, 2004; Liang et al., 2006; Zhang and Clark, 2008; Huang et al., 2012; Sutskever et al., 2014; Goodman et al., 2016). It has drawn a lot of research attention in recent years thanks to its competitive performance in both accuracy and running time. A stochastic policy that controls the whole search process is usually learned by imitating a reference policy. The imitation is usually addressed by training a classifier to predict the reference policy's search action on the states encountered when running the reference policy. Such an imitation process can sometimes be problematic. One problem is the ambiguity of the reference policy: multiple actions may lead to the optimal structure, but usually only one is chosen as the training instance (Goldberg and Nivre, 2012). Another problem is the discrepancy between training and testing: during the test phase, the learned policy enters non-optimal states whose search action is never learned (Ross and Bagnell, 2010; Ross et al., 2011). Both problems harm the generalization ability of search-based structured prediction and lead to poor performance.
Previous works tackle these problems from two directions. To overcome the ambiguities in data, techniques like ensemble are often adopted (Dietterich, 2000). To mitigate the discrepancy, exploration is encouraged during the training process (Ross and Bagnell, 2010; Ross et al., 2011; Goldberg and Nivre, 2012; Bengio et al., 2015; Goodman et al., 2016). In this paper, we propose to consider these two problems in an integrated knowledge distillation manner (Hinton et al., 2015). We distill a single model from the ensemble of several baselines trained with different initialization by matching the ensemble's output distribution on the reference states. We also let the ensemble randomly explore the search space and train the single model to mimic the ensemble's distribution on the states encountered during exploration. Combining the distillation from reference and exploration further improves our single model's performance. The workflow of our method is shown in Figure 1.

Table 1: The search-based structured prediction view of transition-based dependency parsing (Nivre, 2008) and neural machine translation (Sutskever et al., 2014). State s_t: for dependency parsing, (σ, β, A), where σ is a stack, β is a buffer, and A is the partially generated tree; for neural machine translation, ($, y1, y2, ..., yt), where $ is the start symbol.
We conduct experiments on two typical search-based structured prediction tasks: transition-based dependency parsing and neural machine translation. The results of both experiments show the effectiveness of our knowledge distillation method, which outperforms strong baselines. In the parsing experiments, an improvement of 1.32 in LAS is achieved, and in the machine translation experiments, the improvement is 2.65 in BLEU. Our model also outperforms the greedy models in previous works.
Major contributions of this paper include:
• We study knowledge distillation in search-based structured prediction and propose to distill the knowledge of an ensemble into a single model by learning to match its distribution on both the reference states (§3.2) and the exploration states encountered when using the ensemble to explore the search space (§3.3). A further combination of these two methods is also proposed to improve the performance (§3.4).
• We conduct experiments on two search-based structured prediction problems: transition-based dependency parsing and neural machine translation. In both problems, the distilled model significantly improves over strong baselines and outperforms other greedy structured prediction models (§4.2). Comprehensive analysis empirically shows the feasibility of our distillation method (§4.3).

Search-based Structured Prediction
Structured prediction maps an input $x = (x_1, x_2, \ldots, x_n)$ to its structural output $y = (y_1, y_2, \ldots, y_m)$, where the components of $y$ have internal dependencies. Search-based structured prediction (Collins and Roark, 2004; Daumé III et al., 2005; Daumé III et al., 2009; Ross and Bagnell, 2010; Ross et al., 2011; Doppa et al., 2014; Vlachos and Clark, 2014; Chang et al., 2015) models the generation of the structure as a search problem, which can be formalized as a tuple $(S, A, T(s, a), S_0, S_T)$, in which $S$ is a set of states, $A$ is a set of actions, $T$ is a function that maps $S \times A \to S$, $S_0$ is a set of initial states, and $S_T$ is a set of terminal states. Starting from an initial state $s_0 \in S_0$, the structured prediction model repeatedly chooses an action $a_t \in A$ by following a policy $\pi(s)$ and applies $a_t$ to $s_t$ to enter a new state $s_{t+1} \leftarrow T(s_t, a_t)$, until a final state $s_T \in S_T$ is reached. Several natural language structured prediction problems can be modeled under the search-based framework, including dependency parsing (Nivre, 2008) and neural machine translation (Liang et al., 2006; Sutskever et al., 2014). Table 1 shows the search-based structured prediction view of these two problems.
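The tuple formulation above can be read as a simple loop. The following is a minimal Python sketch on a toy left-to-right tagging problem; the names `greedy_search`, `toy_policy`, `toy_transition`, and `toy_is_terminal`, as well as the toy task itself, are assumptions made for illustration and do not come from the paper.

```python
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # toy state: the sequence of actions taken so far

WORDS = ("the", "dog", "barks")  # hypothetical input sentence


def greedy_search(
    initial_state: State,
    policy: Callable[[State], str],
    transition: Callable[[State, str], State],
    is_terminal: Callable[[State], bool],
) -> List[State]:
    """Apply s_{t+1} <- T(s_t, a_t) with a_t = policy(s_t) until a terminal state."""
    states = [initial_state]
    state = initial_state
    while not is_terminal(state):
        action = policy(state)
        state = transition(state, action)
        states.append(state)
    return states


def toy_policy(state: State) -> str:
    # Stand-in for argmax_a p(a | s): tag the next unread word.
    return "VERB" if WORDS[len(state)] == "barks" else "OTHER"


def toy_transition(state: State, action: str) -> State:
    return state + (action,)


def toy_is_terminal(state: State) -> bool:
    return len(state) == len(WORDS)


trajectory = greedy_search((), toy_policy, toy_transition, toy_is_terminal)
```

Here `trajectory` holds every visited state, from the empty initial state to the terminal state that carries one tag per word.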
In the data-driven settings, $\pi(s)$ controls the whole search process and is usually parameterized by a classifier $p(a \mid s)$ which outputs the probability of choosing an action $a$ on the given state $s$. The commonly adopted greedy policy can be formalized as choosing the most probable action with $\pi(s) = \operatorname{argmax}_a p(a \mid s)$ at test stage. To learn an optimal classifier, search-based structured prediction requires constructing a reference policy $\pi_R(s, y)$, which takes an input state $s$ and gold structure $y$ and outputs its reference action $a$, and training $p(a \mid s)$ to imitate the reference policy. Algorithm 1 shows the common practice in training $p(a \mid s)$, which involves: first, using $\pi_R(s, y)$ to generate a sequence of reference states and actions on the training data (line 1 to line 11 in Algorithm 1); second, using the states and actions on the reference sequences as examples to train $p(a \mid s)$ with the negative log-likelihood (NLL) loss (line 12 in Algorithm 1),

$$L_{NLL} = -\sum_{s \in D} \sum_{a} \mathbb{1}\{a = \pi_R(s, y)\} \cdot \log p(a \mid s),$$

where $D$ is a set of training data.

Algorithm 1: Generic learning algorithm for search-based structured prediction. The listing retains only its final steps: line 9, $t \leftarrow t + 1$; lines 10 and 11, the ends of the two loops; line 12, optimize $L_{NLL}$.

The reference policy is sometimes sub-optimal and ambiguous, which means that on one state there can be more than one action that leads to the optimal prediction. In transition-based dependency parsing, Goldberg and Nivre (2012) showed that one dependency tree can be reached by several search sequences using the arc-standard algorithm of Nivre (2008). In machine translation, the ambiguity problem also exists because one source language sentence usually has multiple semantically correct translations but only one reference translation is presented. Similar problems have also been observed in semantic parsing (Goodman et al., 2016). According to Frénay and Verleysen (2014), the widely used NLL loss is vulnerable to ambiguous data, which makes the problem worse for search-based structured prediction.
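The two stages of the generic learning procedure (rolling out the reference policy, then scoring the classifier with NLL) can be sketched as follows. This is a toy illustration under stated assumptions: the reference policy simply replays a gold action sequence, the state is the action history, and `collect_reference_states`, `nll_loss`, and `uniform_classifier` are hypothetical names rather than the paper's code.

```python
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]  # toy state: actions taken so far


def collect_reference_states(gold_actions: Tuple[str, ...]) -> List[Tuple[State, str]]:
    """Roll out a toy reference policy and record each (state, reference action) pair."""
    data: List[Tuple[State, str]] = []
    state: State = ()
    for reference_action in gold_actions:    # plays the role of pi_R(s_t, y)
        data.append((state, reference_action))
        state = state + (reference_action,)  # s_{t+1} <- T(s_t, a_t)
    return data


def nll_loss(
    data: List[Tuple[State, str]],
    classifier: Callable[[State], Dict[str, float]],
) -> float:
    """Sum of -log p(a | s) over the collected reference pairs."""
    return -sum(math.log(classifier(state)[action]) for state, action in data)


def uniform_classifier(state: State) -> Dict[str, float]:
    # Untrained stand-in for p(a | s) over two hypothetical actions.
    return {"SHIFT": 0.5, "REDUCE": 0.5}


reference_data = collect_reference_states(("SHIFT", "SHIFT", "REDUCE"))
loss = nll_loss(reference_data, uniform_classifier)
```

With a uniform classifier over two actions, each of the three reference decisions contributes log 2 to the loss; a trained classifier would be optimized to reduce this sum.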
Besides the ambiguity problem, the discrepancy between training and testing is another problem that hampers the performance of search-based structured prediction. Since the training process imitates the reference policy, all the states in the training data are optimal, which means they are guaranteed to reach the optimal structure. But during the test phase, the model can enter non-optimal states whose search action is never learned. Greedy search, which is prone to error propagation, worsens this problem.

Knowledge Distillation
A cumbersome model, which could be an ensemble of several models or a single model with a larger number of parameters, usually provides better generalization ability. Knowledge distillation (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015) is a class of methods for transferring the generalization ability of the cumbersome teacher model into a small student model. Instead of optimizing the NLL loss, knowledge distillation uses the distribution $q(y \mid x)$ output by the teacher model as a "soft target" and optimizes the knowledge distillation loss,

$$L_{KD} = -\sum_{x \in D} \sum_{y} q(y \mid x) \cdot \log p(y \mid x).$$

In the search-based structured prediction scenario, $x$ corresponds to the state $s$ and $y$ corresponds to the action $a$. Through optimizing the distillation loss, knowledge of the teacher model is learned by the student model $p(y \mid x)$. When the correct label is present, the NLL loss can be combined with the distillation loss via simple interpolation as

$$L = \alpha L_{KD} + (1 - \alpha) L_{NLL}. \tag{1}$$

Knowledge Distillation for Search-based Structured Prediction

Ensemble
As Hinton et al. (2015) pointed out, although the real objective of a machine learning algorithm is to generalize well to new data, models are usually trained to optimize the performance on training data, which biases the model toward the training data.
In search-based structured prediction, such biases can result from either the ambiguities in the training data or the discrepancy between training and testing. It is therefore problematic to train $p(a \mid s)$ using a loss that is not robust to ambiguities and that only considers the optimal states. The effect of ensemble on ambiguous data has been studied by Dietterich (2000), who empirically showed that ensemble can overcome the ambiguities in the training data. Daumé III et al. (2005) also use a weighted ensemble of parameters from different iterations as their final structured prediction model. Following these works, we use the ensemble technique to improve the generalization ability of our search-based structured prediction model. In practice, we train $M$ search-based structured prediction models with differently initialized weights and ensemble them by averaging their output distributions as $q(a \mid s) = \frac{1}{M} \sum_{m} q_m(a \mid s)$. In Section 4.3.1, we empirically show that the ensemble has the ability to choose a good search action in the optimal-yet-ambiguous states and the non-optimal states.
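The averaging step can be illustrated with a short sketch. It assumes each model's output on a state is available as a dictionary from action to probability; the function name `ensemble_distribution` and the example numbers are hypothetical.

```python
from typing import Dict, List


def ensemble_distribution(member_dists: List[Dict[str, float]]) -> Dict[str, float]:
    """Average M per-model action distributions: q(a|s) = (1/M) * sum_m q_m(a|s)."""
    num_models = len(member_dists)
    actions = member_dists[0].keys()
    return {a: sum(dist[a] for dist in member_dists) / num_models for a in actions}


# Two hypothetical models that disagree on one parser state.
q_members = [
    {"SHIFT": 0.75, "LEFT": 0.25, "RIGHT": 0.0},
    {"SHIFT": 0.25, "LEFT": 0.5, "RIGHT": 0.25},
]
q = ensemble_distribution(q_members)
```

Because every member is a probability distribution over the same actions, the average is also a distribution, so no renormalization is needed.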

Distillation from Reference
As we can see in Section 4, ensemble indeed improves the performance of baseline models. However, real-world deployment is usually constrained by computation and memory resources. Ensemble requires running the structured prediction models multiple times, which makes it less applicable to real-world problems. To take advantage of the ensemble model while avoiding running the models multiple times, we use the knowledge distillation technique to distill a single model from the ensemble. We start by changing the NLL learning objective in Algorithm 1 into the distillation loss (Equation 1), as shown in Algorithm 2. Since this method trains the model on the states produced by the reference policy, we name it distillation from reference. Blocks connected by dashed red lines in Figure 1 show the workflow of our distillation from reference.
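The change of objective can be sketched on a single state. The code below assumes that Equation 1 interpolates the two losses with weight α on the distillation term, so that α = 1 means learning from the soft target only; the function names and the numbers are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict


def distillation_loss(q: Dict[str, float], p: Dict[str, float]) -> float:
    """Cross-entropy of the student p against the teacher's soft target q on one state."""
    return -sum(q[a] * math.log(p[a]) for a in q if q[a] > 0.0)


def nll_loss(reference_action: str, p: Dict[str, float]) -> float:
    """Negative log-likelihood of the single reference action on one state."""
    return -math.log(p[reference_action])


def interpolated_loss(
    q: Dict[str, float], p: Dict[str, float], reference_action: str, alpha: float
) -> float:
    """alpha * L_KD + (1 - alpha) * L_NLL; alpha = 1 is pure distillation."""
    return alpha * distillation_loss(q, p) + (1.0 - alpha) * nll_loss(reference_action, p)


# Hypothetical ambiguous state: the teacher considers both actions equally good.
teacher = {"SHIFT": 0.5, "LEFT": 0.5}
student = {"SHIFT": 0.25, "LEFT": 0.75}
pure_kd = interpolated_loss(teacher, student, "SHIFT", alpha=1.0)
pure_nll = interpolated_loss(teacher, student, "SHIFT", alpha=0.0)
```

In this toy state the NLL term penalizes the student for preferring `LEFT`, while the soft target spreads credit over both actions, which is the kind of ambiguity the distillation loss is meant to tolerate.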

Distillation from Exploration
In the scenario of search-based structured prediction, transferring the teacher model's generalization ability into a student model includes not only matching the teacher model's soft targets on the reference search sequence, but also imitating the search decisions made by the teacher model. One way to accomplish the imitation is to sample a search sequence from the ensemble and learn from the soft target on the sampled states. More concretely, we change $\pi_R(s, y)$ into a policy $\pi_E(s)$ which samples an action $a$ from $q(a \mid s)^{1/T}$, where $T$ is the temperature that controls the sharpness of the distribution (Hinton et al., 2015). The algorithm is shown in Algorithm 2. Since such distillation generates training instances from exploration, we name it distillation from exploration. Blocks connected by solid blue lines in Figure 1 show the workflow of our distillation from exploration.

Algorithm 2: Knowledge distillation for search-based structured prediction. Input: training data $\{x^{(n)}, y^{(n)}\}_{n=1}^{N}$; the reference policy $\pi_R(s, y)$; the exploration policy $\pi_E(s)$, which samples an action from the annealed ensemble $q(a \mid s)^{1/T}$. The listing retains lines 6 and 7: if distilling from reference then $a_t \leftarrow \pi_R(s_t, y^{(n)})$.
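The exploration policy can be sketched as temperature annealing followed by sampling. This sketch assumes the powered distribution is renormalized before sampling, which the text implies but does not state; `anneal` and `explore_action` are hypothetical names.

```python
import random
from typing import Dict


def anneal(q: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Raise q to the power 1/T and renormalize; T < 1 sharpens, T > 1 flattens."""
    powered = {a: prob ** (1.0 / temperature) for a, prob in q.items()}
    total = sum(powered.values())
    return {a: value / total for a, value in powered.items()}


def explore_action(q: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one action from the annealed ensemble distribution."""
    annealed = anneal(q, temperature)
    actions = list(annealed)
    weights = [annealed[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


q = {"SHIFT": 0.6, "LEFT": 0.3, "RIGHT": 0.1}
sharp = anneal(q, temperature=0.5)  # T = 0.5 squares each probability before renormalizing
sampled = explore_action(q, temperature=0.5, rng=random.Random(0))
```

A small temperature concentrates the samples on the ensemble's preferred actions, while a temperature of 1 samples from the ensemble distribution unchanged.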
On the sampled states, the reference decision from $\pi_R$ is usually non-trivial to obtain, which makes learning from the NLL loss infeasible. In Section 4, we empirically show that fully distilling from the soft target, i.e. setting α = 1 in Equation 1, achieves performance comparable to learning from both the distillation and NLL losses.

Distillation from Both
Distillation from reference encourages the model to predict the action made by the reference policy, while distillation from exploration trains the model on arbitrary states. They transfer the generalization ability of the ensemble from different aspects, so combining them may further improve the performance. In this paper, we combine distillation from reference and exploration in the following manner: we use $\pi_R$ and $\pi_E$ to generate a set of training states. Then, we learn $p(a \mid s)$ on the generated states. If a state was generated by the reference policy, we minimize the interpolation of the distillation and NLL losses. Otherwise, we minimize the distillation loss only.
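The per-state choice of objective can be sketched as a weight on the distillation term. The record type `TrainingState`, the function `distillation_weight`, and the value α = 0.8 used below are illustrative assumptions.

```python
from typing import List, NamedTuple, Optional


class TrainingState(NamedTuple):
    state_id: str
    reference_action: Optional[str]  # None when the state came from exploration


def distillation_weight(example: TrainingState, alpha: float) -> float:
    """Weight on the distillation term: alpha on reference states, 1.0 otherwise.

    The remaining 1 - weight goes to the NLL term, which therefore vanishes
    on explored states where no reference action is available.
    """
    return alpha if example.reference_action is not None else 1.0


reference_states = [TrainingState("ref-0", "SHIFT"), TrainingState("ref-1", "LEFT")]
explored_states = [TrainingState("exp-0", None)]
training_set: List[TrainingState] = reference_states + explored_states
weights = [distillation_weight(example, alpha=0.8) for example in training_set]
```

Treating explored states as the α = 1 case lets a single loss function serve both kinds of training state.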

Experiments
We perform experiments on two tasks: transition-based dependency parsing and neural machine translation. Both tasks are converted to search-based structured prediction as described in Section 2.1.
For transition-based parsing, we use the stack-LSTM parsing model proposed by Dyer et al. (2015) to parameterize the classifier. For neural machine translation, we parameterize the classifier as an LSTM encoder-decoder model following Luong et al. (2015). We encourage the reader to refer to the corresponding papers for more details.

Transition-based Dependency Parsing
We perform experiments on the Penn Treebank (PTB) dataset with the standard data split (Sections 2-21 for training, Section 22 for development, and Section 23 for testing). Stanford dependencies are converted from the original constituent trees using Stanford CoreNLP 3.3.0, following Dyer et al. (2015). Automatic part-of-speech tags are assigned by 10-way jackknifing with an accuracy of 97.5%. Labeled attachment score (LAS) excluding punctuation is used in evaluation. For the other hyper-parameters, we use the same settings as Dyer et al. (2015). The best iteration and α are determined on the development set.
Reimers and Gurevych (2017) and others have pointed out that neural network training is nondeterministic and depends on the seed for the random number generator. To control for this effect, they suggest reporting the average of M differently seeded runs. In all our dependency parsing experiments, we set M = 20.

Figure 2: The effect of using different Ks when approximating distillation loss with K-most probable actions in the machine translation experiments (y-axis: BLEU score on the development set).

Neural Machine Translation
We conduct our experiments on a small machine translation dataset, the German-to-English portion of the IWSLT 2014 machine translation evaluation campaign. The dataset contains around 153K training sentence pairs, 7K development sentence pairs, and 7K testing sentence pairs. We use the same preprocessing as Ranzato et al. (2015), which leads to a German vocabulary of about 30K entries and an English vocabulary of 25K entries. Following Wiseman and Rush (2016), we use a one-layer LSTM with 256 hidden units for both the encoder and the decoder. BLEU (Papineni et al., 2002) is used to evaluate the translator's performance. As in the dependency parsing experiments, we run M = 10 differently seeded runs and report the averaged score.
Optimizing the distillation loss in Equation 1 requires enumerating over the action space. This is expensive for machine translation since the action space (the vocabulary) is large (25K in our experiments). In this paper, we use the K most probable actions (translations on the target side) on one state to approximate the whole probability distribution of $q(a \mid s)$ as

$$\sum_{a} q(a \mid s) \cdot \log p(a \mid s) \approx \sum_{k=1}^{K} q(\hat{a}_k \mid s) \cdot \log p(\hat{a}_k \mid s),$$

where $\hat{a}_k$ is the k-th most probable action. We fix α to 1, vary K, and evaluate the distillation model's performance. The results are shown in Figure 2: there is no significant difference between different values of K, so for speed we set K to 1 in the following experiments.

For dependency parsing, we tune the temperature T during exploration, and the results are shown in Figure 3. Sharpening the distribution during the sampling process generally performs better on the development set. Our distillation from exploration model gets almost the same performance as that from reference, but simply combining these two sets of data outperforms both models, achieving an LAS of 92.14.

Table 2: Entries retained from the parser comparison: Dozat and Manning (2016), 94.08; Kuncoro et al. (2016), 92.06; Kuncoro et al. (2017), 94.60. A significance test (Nilsson and Nivre, 2008) shows the improvement of our Distill (both) over Baseline is statistically significant with p < 0.01.
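The K-most-probable approximation used for machine translation can be sketched as follows. The sketch keeps the teacher probabilities of the selected actions as they are, without renormalizing, which is one reading of the approximation; the function name `topk_distillation_term` and the three-word vocabulary are hypothetical.

```python
import math
from typing import Dict


def topk_distillation_term(q: Dict[str, float], p: Dict[str, float], k: int) -> float:
    """Approximate sum_a q(a|s) * log p(a|s) with the k most probable teacher actions."""
    top_actions = sorted(q, key=q.get, reverse=True)[:k]
    return sum(q[a] * math.log(p[a]) for a in top_actions)


# Toy target-side vocabulary of three words on one decoder state.
teacher = {"the": 0.7, "a": 0.2, "this": 0.1}
student = {"the": 0.5, "a": 0.25, "this": 0.25}
full = topk_distillation_term(teacher, student, k=3)
top1 = topk_distillation_term(teacher, student, k=1)
```

With K = 1 only the teacher's most probable word contributes, so the per-state cost no longer depends on the vocabulary size.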

Transition-based Dependency Parsing
We also compare our parser with the other parsers in Table 2. The second group shows the greedy transition-based parsers in the previous literature. Andor et al. (2016) presented an alternative state representation and explored both greedy and beam search decoding. Training the greedy parser with a dynamic oracle has also been explored. Our distillation parser outperforms all these greedy counterparts. The third group shows parsers trained with different techniques, including decoding with beam search (Buckman et al., 2016; Andor et al., 2016), training a transition-based parser with beam search (Andor et al., 2016), graph-based parsing (Dozat and Manning, 2016), distilling a graph-based parser from the output of 20 parsers (Kuncoro et al., 2016), and converting constituent parsing results to dependencies (Kuncoro et al., 2017). Our distillation parser still outperforms its transition-based counterparts but lags behind the others. We attribute the gap between our parser and the other parsers to the difference in parsing algorithms.

Neural Machine Translation
Table 3 shows the experimental results on the IWSLT 2014 dataset. Similar to the PTB parsing results, the ensemble of 10 translators outperforms the baseline translator by 3.47 in BLEU score. Distilling from the ensemble by following the reference leads to a single translator with a BLEU score of 24.76. As in the parsing experiments, sharpening the distribution when exploring the search space is more helpful to the model's performance, but the differences when T ≤ 0.2 are not significant, as shown in Figure 3. We set T = 0.1 in our distillation from exploration experiments since it achieves the best development score. Table 3 shows that distillation from exploration reaches a BLEU score of 24.64, slightly behind the best reference model. Distilling from both the reference and exploration improves the single model's performance by a large margin and achieves a BLEU score of 25.44.
We also compare our model with other translation models, including one trained with reinforcement learning (Ranzato et al., 2015) and one using beam search in training (Wiseman and Rush, 2016). Our distillation translator outperforms these models.
Both the parsing and machine translation experiments confirm that it's feasible to distill a reasonable search-based structured prediction model by just exploring the search space. Combining the reference and exploration further improves the model's performance and outperforms its greedy structured prediction counterparts.

Analysis
In Section 4.2, improvements from distilling the ensemble were observed in both the transition-based dependency parsing and neural machine translation experiments. However, questions like "Why does the ensemble work better? Is it feasible to fully learn from the distillation loss without NLL? Is learning from the distillation loss stable?" are yet to be answered. In this section, we first study the ensemble's behavior on "problematic" states to show its generalization ability. Then, we empirically study the feasibility of fully learning from the distillation loss by studying the effect of α in the distillation from reference setting. Finally, we show that learning from the distillation loss is less sensitive to initialization and achieves a more stable model.

Table 4: The ranking performance of parsers' output distributions evaluated in MAP on "problematic" states.

Ensemble on "Problematic" States
As mentioned in previous sections, "problematic" states, which are either ambiguous or non-optimal, harm the performance of structured prediction. The ensemble is shown to improve the performance in Section 4.2, which indicates that it does better on these states. To verify this empirically, we use dependency parsing as a testbed and study the ensemble's output distribution using the dynamic oracle.
The dynamic oracle (Goldberg and Nivre, 2012; Goldberg et al., 2014) can be used to efficiently determine, given any state s, which transition action leads to the best achievable parse from s; that is, if some errors have already been made, what is the best the parser can do going forward? This allows us to analyze the accuracy of each parser's individual decisions in the "problematic" states. In this paper, we evaluate the output distributions of the baseline and ensemble parsers against the reference actions suggested by the dynamic oracle. Since the dynamic oracle yields more than one reference action due to ambiguities and previous mistakes, and the output distribution can be treated as a scoring of the actions, we evaluate them as a ranking problem. Intuitively, when multiple reference actions exist, a good parser should push probability mass to these actions. We draw problematic states by sampling from our baseline parser. The comparison in Table 4 shows that the ensemble model significantly outperforms the baseline on ambiguous and non-optimal states. This observation indicates that the ensemble's output distribution is more "informative", and thus generalizes well on problematic states and achieves better performance. We also observe that the distillation model performs better than both the baseline and the ensemble. We attribute this to the fact that the distillation model is learned from exploration.
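Treating the output distribution as a ranking can be sketched with mean average precision (MAP). The exact MAP definition used for Table 4 is not spelled out here, so the sketch assumes the standard information-retrieval form, with the oracle's actions as the relevant items; the function names and the example states are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def average_precision(dist: Dict[str, float], correct: Set[str]) -> float:
    """Average precision of actions ranked by probability against the oracle's action set."""
    ranked = sorted(dist, key=dist.get, reverse=True)
    hits = 0
    precisions = []
    for rank, action in enumerate(ranked, start=1):
        if action in correct:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(correct) if correct else 0.0


def mean_average_precision(examples: List[Tuple[Dict[str, float], Set[str]]]) -> float:
    """Mean of the per-state average precision."""
    return sum(average_precision(dist, correct) for dist, correct in examples) / len(examples)


# Two hypothetical problematic states: (parser distribution, oracle-approved actions).
examples = [
    ({"SHIFT": 0.5, "LEFT": 0.3, "RIGHT": 0.2}, {"SHIFT", "RIGHT"}),
    ({"SHIFT": 0.1, "LEFT": 0.6, "RIGHT": 0.3}, {"LEFT"}),
]
map_score = mean_average_precision(examples)
```

In the first state the parser ranks one approved action first and the other third, giving an average precision of 5/6; in the second it ranks the only approved action first, giving 1. A parser that pushes more probability mass onto all oracle-approved actions scores closer to 1.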

Effect of α
Using our distillation from reference model, we study the effect of α in Equation 1. We vary α from 0 to 1 in steps of 0.1 in both the transition-based dependency parsing and neural machine translation experiments and plot the model's performance on the development sets in Figure 4. Similar trends are observed in both experiments: a model configured with a larger α generally performs better than one with a smaller α. For the dependency parsing problem, the best development performance is achieved when we set α = 1, and for machine translation, the best α is 0.8. There is only a 0.2 point difference between the model with the best α and the one with α equal to 1. This observation indicates that, when distilling from the reference policy, paying more attention to the distillation loss than to the NLL loss is more beneficial. It also indicates that fully learning from the distillation loss given by the ensemble is reasonable, because models configured with α = 1 generally achieve good performance.

Learning Stability
Besides the improved performance, knowledge distillation also leads to more stable learning. The performance score distributions of the differently seeded runs are depicted as violin plots in Figure 5. Table 5 also reveals that smaller standard deviations are achieved by our distillation methods. Keskar et al. (2016) pointed out that the generalization gap is due not to overfitting but to the network converging to a sharp minimizer, which generalizes worse. We therefore attribute the more stable training of our distillation model to the distillation loss presenting less sharp minimizers.

Related Work
Several works have applied knowledge distillation to NLP problems. Kim and Rush (2016) presented a distillation model that focuses on distilling the structured loss from a large model into a small one and works at the sequence level. In contrast to their work, we pay more attention to action-level distillation and propose to improve it by distilling from both reference and exploration. Freitag et al. (2017) used an ensemble of 6 translators to generate training references. Exploration was tried in their work with beam search. We differ from their work by training the single model to match the distribution of the ensemble.
Using ensemble in exploration was also studied in the reinforcement learning community (Osband et al., 2016). In addition to distilling the ensemble on the labeled training data, a line of semi-supervised learning works shows that it is effective to transfer the knowledge of a cumbersome model into a simple one on unlabeled data (Liang et al., 2008; Li et al., 2014). Their extensions to knowledge distillation call for further study. Kuncoro et al. (2016) proposed to compile the knowledge from an ensemble of 20 transition-based parsers into a voting and distill the knowledge by introducing the voting results as a regularizer in learning a graph-based parser. Different from their work, we directly do the distillation on the classifier of the transition-based parser.
Besides direct applications of the knowledge distillation technique, Stahlberg and Byrne (2017) propose to first build the ensemble of several machine translators into one network by unfolding and then use SVD to shrink its parameters, which can be treated as another kind of knowledge distillation.

Conclusion
In this paper, we study knowledge distillation for search-based structured prediction and propose to distill an ensemble into a single model from both reference and exploration states. Experiments on transition-based dependency parsing and machine translation show that our distillation method significantly improves the single model's performance. Comparative analysis provides empirical support for our distillation method.