Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP

Syntax has been shown useful for various NLP tasks, while existing work mostly encodes a single syntactic tree using one hierarchical neural network. In this paper, we investigate a simple and effective method, Knowledge Distillation, to integrate heterogeneous structure knowledge into a unified sequential LSTM encoder. Experimental results on four typical syntax-dependent tasks show that our method outperforms tree encoders by effectively integrating rich heterogeneous structural syntax while reducing error propagation, and also outperforms ensemble methods in terms of both efficiency and accuracy.

Constituent and dependency representations of syntactic structure share underlying linguistic and computational characteristics, while also differing in various aspects. For example, the former focuses on revealing the continuity of phrases, while the latter is more effective in representing the dependencies among elements. By integrating the two representations from heterogeneous trees, their mutual benefit has been explored for joint parsing tasks (Collins, 1997; Charniak and Johnson, 2005; Farkas et al., 2011; Yoshikawa et al., 2017; Zhou and Zhao, 2019). Intuitively, the complementary advantages of heterogeneous trees can facilitate a range of NLP tasks, especially syntax-dependent ones such as SRL and NLI. Take the SRL example sentence in Figure 1, which shows (1) the constituency tree structure, (2) the semantic role labels, (3) the example sentence, and (4) the dependency tree structure. In this case, the dependency links can locate the relations between arguments and predicates more efficiently, while the constituency structure can aggregate the phrasal spans for arguments and guide the global path to the predicate. Integrating the features of the two structures can better guide the model to focus on the most suitable phrasal granularity (as circled by the dotted box), and also ensure route consistency between the semantic objective pairs.
In this paper, we investigate the Knowledge Distillation (KD) method, which has been shown to be effective for knowledge ensembling (Hinton et al., 2015; Kim and Rush, 2016; Furlanello et al., 2018), for heterogeneous structure integration. Specifically, we employ a sequential LSTM as the student for distilling heterogeneous syntactic structures from various teacher tree encoders, such as GCN (Kipf and Welling, 2017) and TreeLSTM (Tai et al., 2015a). We consider output distillation, syntactic feature injection and semantic learning. In addition, we introduce an alternating structure injection strategy to enhance the learning of heterogeneous syntactic representations within the shared sequential model. The distilled structure-aware student model can perform inference using sequential word inputs alone, reducing the error accumulation from external parse tree annotations.
We conduct extensive experiments on a wide range of syntax-dependent tasks, including semantic role labeling, relation classification, natural language inference and sentiment classification. Results show that the distilled student outperforms tree encoders, verifying the advantage of integrating heterogeneous structures. The proposed method also outperforms existing ensemble methods and strong baseline systems, demonstrating its high effectiveness on structure information integration.

Syntactic Structures for Text Modeling
Previous work shows that integrating syntactic structure knowledge can improve the performance of NLP tasks (Socher et al., 2013; Cho et al., 2014; Nguyen and Shirai, 2015; Looks et al., 2017; Fei et al., 2020b). Generally, these methods inject either a standalone constituency tree or a dependency tree via tree encoders such as TreeLSTM (Socher et al., 2013; Tai et al., 2015a) or GCN (Kipf and Welling, 2017). Based on the assumption that the dependency and constituency representations can be disentangled and coexist in one shared model, existing efforts focus on joint constituent and dependency parsing, verifying the mutual benefit of these heterogeneous structures (Collins, 1997; Charniak, 2000; Charniak and Johnson, 2005; Farkas et al., 2011; Ren et al., 2013; Yoshikawa et al., 2017; Strzyz et al., 2019; Kato and Matsubara, 2019; Zhou and Zhao, 2019). However, little attention has been paid to facilitating syntax-dependent tasks by integrating heterogeneous syntactic trees. Although such integration can be achieved via widely employed approaches such as ensemble learning (Wolpert, 1992; Ju et al., 2019) and multi-task training (Liu et al., 2016; Chen et al., 2018; Fei et al., 2020a), these usually suffer from low efficiency and high computational complexity.

Knowledge Distillation
Our work is related to knowledge distillation techniques. It has been shown that KD is very effective and scalable for knowledge ensembling (Hinton et al., 2015; Furlanello et al., 2018). Existing methods are divided into two categories: 1) output distillation, which uses a teacher model's output logits as a student model's training objective (Kim and Rush, 2016; Vyas and Carpuat, 2019), and 2) feature distillation, which allows a student to learn from a teacher's intermediate feature representations (Zagoruyko and Komodakis, 2017; Sun et al., 2019). In this paper, we enhance the distillation of heterogeneous structures via both output and feature distillation by employing a sequential LSTM as the student. Our work is also closely related to Kuncoro et al. (2019), who distill syntactic structure knowledge into a student LSTM model. The difference is that they focus on transferring tree knowledge from a syntax-aware language model to achieve scalable unsupervised syntax induction, while we aim at integrating heterogeneous syntax to improve downstream tasks.

Method
As shown in Figure 2, the overall architecture consists of a sequential LSTM (Hochreiter and Schmidhuber, 1997) student, and several tree teachers for dependency and constituency structures.

Tree Encoder Teachers
Different tree models can encode the same tree structure, resulting in different heterogeneous tree representations. Following previous work (Tai et al., 2015b; Marcheggiani and Titov, 2017), we encode dependency trees with a Child-Sum TreeLSTM and constituency trees with an N-ary TreeLSTM. We also employ GCN to encode dependency and constituency structures separately, and use bidirectional tree encoders to fully capture the structural information interaction. Formally, we denote $X = \{x_1, \cdots, x_n\}$ as an input sentence, $X^{dep} = \{x^{dep}_1, \cdots, x^{dep}_n\}$ as the dependency tree and $X^{con} = \{x^{con}_1, \cdots, x^{con}_n\}$ as the constituency tree.
Encoding dependency structure. We first use the standard Child-Sum TreeLSTM to encode the dependency structure, where each node $j$ in the tree takes as input the embedding vector $x_j$ corresponding to the head word. The conventional bottom-up transition is:
$$\tilde{h}_j = \sum_{k \in C(j)} h_k, \quad i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}), \quad f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}),$$
$$o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}), \quad u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}),$$
$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k, \quad h_j = o_j \odot \tanh(c_j), \qquad (1)$$
where $W$, $U$ and $b$ are parameters, and $C(j)$ refers to the set of child nodes of $j$. $h_j$, $i_j$, $o_j$ and $c_j$ are the hidden state, input gate, output gate and memory cell of node $j$, respectively, and $f_{jk}$ is a forget gate for each child $k$ of $j$. $\sigma(\cdot)$ is an activation function and $\odot$ is element-wise multiplication. The top-down TreeLSTM has the same transition equations as the bottom-up one, except that the direction and the number of dependent nodes differ. We concatenate the tree representations of the two directions for each node: $r_j = [h^{\uparrow}_j ; h^{\downarrow}_j]$. Compared with TreeLSTM, GCN is more computationally efficient, performing the tree propagation for each node in parallel with $O(1)$ complexity. Consider the constructed dependency graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of bidirectional edges between heads and dependents. GCN can be viewed as a hierarchical node encoder, representing node $j$ at the $l$-th layer as:
$$\tilde{h}^{(l)}_j = \sum_{i \in N(j)} \big( W^{(l)} h^{(l-1)}_i + b^{(l)} \big), \qquad (2)$$
$$h^{(l)}_j = \mathrm{ReLU}\big(\tilde{h}^{(l)}_j\big), \qquad (3)$$
where $N(j)$ are the neighbors of node $j$, and $\mathrm{ReLU}$ is a non-linear activation function. For dependency encoding by TreeLSTM or GCN, we make use of all the node representations $R^{dep} = [r_1, \cdots, r_n]$ within the whole tree structure for the subsequent distillation.
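The bottom-up Child-Sum TreeLSTM step above can be sketched as follows. This is a minimal illustrative numpy version, not the paper's implementation; the function name, the dictionary layout of the gate parameters, and the per-node (non-batched) form are our own simplifications.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def childsum_treelstm_node(x_j, child_h, child_c, W, U, b):
    """One bottom-up step of a Child-Sum TreeLSTM node (after Tai et al., 2015).

    x_j:     (d,) embedding of the head word at node j
    child_h: (k, d) hidden states of the k children (k may be 0 for leaves)
    child_c: (k, d) memory cells of the k children
    W, U, b: dicts keyed by gate name 'i', 'f', 'o', 'u' (illustrative layout)
    Returns (h_j, c_j).
    """
    # sum the children's hidden states (zero vector for a leaf)
    h_tilde = child_h.sum(axis=0) if len(child_h) else np.zeros_like(x_j)
    i_j = sigmoid(W['i'] @ x_j + U['i'] @ h_tilde + b['i'])
    o_j = sigmoid(W['o'] @ x_j + U['o'] @ h_tilde + b['o'])
    u_j = np.tanh(W['u'] @ x_j + U['u'] @ h_tilde + b['u'])
    # one forget gate per child, conditioned on that child's own hidden state
    c_j = i_j * u_j
    for h_k, c_k in zip(child_h, child_c):
        f_jk = sigmoid(W['f'] @ x_j + U['f'] @ h_k + b['f'])
        c_j = c_j + f_jk * c_k
    h_j = o_j * np.tanh(c_j)
    return h_j, c_j
```

A full encoder would apply this function in topological order from the leaves to the root, and a top-down pass would mirror it with separate parameters.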
Encoding constituency structure. We employ the N-ary TreeLSTM to encode the constituent tree:
$$i_j = \sigma\big(W^{(i)} x_j + \textstyle\sum_{q=1}^{N} U^{(i)}_{q} h_{jq} + b^{(i)}\big), \quad f_{jk} = \sigma\big(W^{(f)} x_j + \textstyle\sum_{q=1}^{N} U^{(f)}_{kq} h_{jq} + b^{(f)}\big),$$
$$o_j = \sigma\big(W^{(o)} x_j + \textstyle\sum_{q=1}^{N} U^{(o)}_{q} h_{jq} + b^{(o)}\big), \quad u_j = \tanh\big(W^{(u)} x_j + \textstyle\sum_{q=1}^{N} U^{(u)}_{q} h_{jq} + b^{(u)}\big),$$
$$c_j = i_j \odot u_j + \textstyle\sum_{q=1}^{N} f_{jq} \odot c_{jq}, \quad h_j = o_j \odot \tanh(c_j), \qquad (4)$$
where $q$ is the index of the branch of $j$. Slightly different from the Child-Sum TreeLSTM, the separate parameter matrices for each child $k$ allow the model to learn more fine-grained and order-sensitive information about its children. We also concatenate the bottom-up and top-down directions of each node as the final representation.
Similarly, GCN is also used to encode the constituent graph $G = (V, E)$ via Eq. (2) and (3). Note that the node set $V$ contains both words and constituent labels. For constituency encoding by both TreeLSTM and GCN, we take the representations of the terminal nodes in the structure as the corresponding word representations $R^{con} = [r_1, \cdots, r_n]$.
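The GCN neighbor aggregation of Eq. (2) and (3) can be sketched as a single dense matrix operation. This is a minimal sketch assuming an unnormalized adjacency with self-loops; the paper's exact GCN variant (e.g., edge-wise gating or degree normalization) may differ.

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One GCN propagation layer (illustrative sketch of Eq. (2)-(3)).

    H: (n, d_in) node representations from layer l-1
    A: (n, n) adjacency with self-loops; A[j, i] = 1 if i is a neighbor of j
       (edges are bidirectional between heads and dependents)
    W: (d_in, d_out) layer weight, b: (d_out,) bias
    """
    # each node sums its neighbors' linearly transformed features,
    # then applies the ReLU non-linearity
    return np.maximum(A @ (H @ W) + b, 0.0)
```

Because the aggregation is one matrix product, all nodes are updated in parallel, which is the O(1)-depth advantage over the sequential child-to-parent recursion of TreeLSTM noted in the text.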

Heterogeneous Structure Distillation
Sequential models have been proven effective at encoding syntactic tree information (Shen et al., 2018; Kuncoro et al., 2019). We set the goal of KD as simultaneously distilling heterogeneous structures from the tree encoder teachers into an LSTM student model.
Output distillation. The output logits serve as soft targets, providing richer supervision than the hard one-hot gold labels during training (Hinton et al., 2015). Given an input sentence $X$ with the gold (one-hot) label $Y$, the output logits of the teachers are $P^t_{\Gamma(all)}$, and the output logits of the student are $P^s$. The output distillation can be denoted as:
$$\mathcal{L}_{out} = H\big(\alpha Y + (1-\alpha)\, P^t_{\Gamma(all)},\; P^s\big), \qquad (5)$$
where $H(\cdot, \cdot)$ refers to the cross-entropy, and $\alpha$ is a coupling factor, which increases from 0 to 1 during training, namely teacher annealing.
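One plausible form of the teacher-annealed output distillation is sketched below: the training target mixes the gold one-hot label (weight alpha) with the teachers' averaged output distribution (weight 1 - alpha), so supervision shifts from the teachers toward the gold labels as alpha is ramped up. The mixing form and function names are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def annealed_distill_loss(student_logits, teacher_probs, gold_onehot, alpha):
    """Teacher-annealed output distillation (illustrative sketch).

    student_logits: (k,) raw student scores over k classes
    teacher_probs:  (k,) averaged teacher output distribution
    gold_onehot:    (k,) one-hot gold label Y
    alpha:          coupling factor ramped from 0 to 1 over training
    """
    # soft target mixing gold supervision with the teachers' distribution
    target = alpha * gold_onehot + (1.0 - alpha) * teacher_probs
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    return -np.sum(target * log_p_s)  # cross-entropy H(target, P^s)
```

At alpha = 1 this reduces to the standard cross-entropy with the gold label, so the student gradually stops depending on its teachers.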
Syntactic tree feature distillation. To capture rich syntactic tree features, we allow the student to directly learn from the teachers' hidden feature representations. Specifically, we denote the hidden representations of the student LSTM as $R^s = [r_1, \cdots, r_n]$, and we expect $R^s$ to be able to predict the output of $R^{dep}$ or $R^{con}$ from the syntax-aware teachers. The target is thus to optimize the following regression loss:
$$\mathcal{L}^{(A)}_{dep} = \sum_j \big\| f^s(r^s_j) - f^t_{\Gamma(dep)}(r^{dep}_j) \big\|^2, \qquad (6)$$
$$\mathcal{L}^{(A)}_{con} = \sum_j \big\| f^s(r^s_j) - f^t_{\Gamma(con)}(r^{con}_j) \big\|^2, \qquad (7)$$
$$\mathcal{L}^{(A)}_{syn} = \eta \, \mathcal{L}^{(A)}_{dep} + (1-\eta) \, \mathcal{L}^{(A)}_{con}, \qquad (8)$$
where $\eta \in [0, 1]$ is a factor coordinating the dependency and constituency structure encoding, $f^t_{\Gamma(dep)}(\cdot)$, $f^t_{\Gamma(con)}(\cdot)$ and $f^s(\cdot)$ are feedforward layers for calculating the corresponding score vectors, and $j$ is the word index.
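The eta-weighted regression loss can be sketched as below. For brevity, this sketch assumes the projection layers f(.) are the identity and that all representations already share one dimensionality; the paper uses learned feedforward projections.

```python
import numpy as np

def syntax_feature_loss(R_s, R_dep, R_con, eta):
    """Latent syntax-feature mimicking loss (illustrative sketch).

    R_s:   (n, d) student LSTM hidden states
    R_dep: (n, d) dependency-teacher node representations
    R_con: (n, d) constituency-teacher terminal-node representations
    eta:   factor in [0, 1] trading off the two heterogeneous structures
    """
    # squared-error regression toward each teacher's features
    l_dep = np.sum((R_s - R_dep) ** 2)
    l_con = np.sum((R_s - R_con) ** 2)
    return eta * l_dep + (1.0 - eta) * l_con
```

Setting eta to 0 or 1 recovers standalone constituency or dependency injection, which is the knob used later in the analysis of per-task syntax distributions.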
Semantic learning. We randomly mask a target input word $Q_j$ and let the LSTM predict it based on the hidden representation of the prior words.
In consequence, we pose the following language modeling objective:
$$\mathcal{L}_{sem} = -\sum_j \log P\big(Q_j \mid r^s_{<j}\big), \qquad (9)$$
by which the LSTM can additionally improve its semantic learning ability.
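The per-position term of this objective is an ordinary negative log-likelihood under a softmax over the vocabulary; a minimal sketch follows. The output matrix `W_vocab` and the single-position form are illustrative simplifications, not the paper's architecture.

```python
import numpy as np

def masked_word_loss(hidden, W_vocab, target_id):
    """Semantic-learning (masked word prediction) loss for one position.

    hidden:    (d,) LSTM hidden state summarizing the prior words
    W_vocab:   (d, V) output projection onto the vocabulary (illustrative)
    target_id: index of the masked gold word in the vocabulary
    """
    logits = hidden @ W_vocab                      # (V,) vocabulary scores
    logits = logits - logits.max()                 # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]                   # negative log-likelihood
```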

Enhanced Structure Injection
We consider further enhancing the trees injection, by encouraging the student to mimic the dependency and constituency tree induction of teachers.
Dependency injection. We force the student to predict the distributions of dependency arcs and labels based on its hidden representations and the representations of the teachers:
$$\mathcal{L}^{(B)}_{dep} = -\sum_{i} \Big[ \sum_{j} P^t_{\Gamma(dep)}(r_j \mid x_i) \log P^s(r_j \mid x_i) + \sum_{k} P^t_{\Gamma(dep)}(l_k \mid r_j, x_i) \log P^s(l_k \mid r_j, x_i) \Big], \qquad (10)$$
where $P^t_{\Gamma(dep)}(r_j \mid x_i)$ is the arc probability of the parent node $r_j$ for $x_i$ in the dependency teacher, and $P^t_{\Gamma(dep)}(l_k \mid r_j, x_i)$ is the probability of the label $l_k$ for the arc $(r_j, x_i)$ in the teacher.
Constituency injection. Similarly, to enhance constituency injection, we mimic the distribution of each span $(i, j)$ with label $l$ in the teachers. Following Zhou et al. (2019), we adopt a feedforward layer as the span scorer:
$$s(i, j, l) = f\big([r_i; r_j]\big), \qquad s(T) = \sum_{(i, j, l) \in T} s(i, j, l). \qquad (11)$$
We use the CYK algorithm (Cocke, 1970; Younger, 1975; Kasami, 1965) to search for the highest-scoring tree $T^*$ in the teachers, and consider all possible trees $T$ in the student. We then optimize the following hinge loss between the structures of the student and the teachers:
$$\mathcal{L}^{(B)}_{con} = \max\Big(0, \; \max_{T}\big[s(T) + \Delta(T, T^*)\big] - s(T^*)\Big), \qquad (12)$$
where $\Delta$ is the Hamming distance. The syntax losses in Eq. (10) and (12) can substitute for those in Eq. (6) and (7), respectively. The overall objective of the enhanced structure injection is:
$$\mathcal{L}^{(B)}_{syn} = \eta \, \mathcal{L}^{(B)}_{dep} + (1-\eta) \, \mathcal{L}^{(B)}_{con}. \qquad (13)$$
Regularization. Under the independence assumption, the syntax feature distillations aim to learn representations that are as diversified and private to each heterogeneous structure as possible. In practice, however, there should be a latent shared structure in the parameter space, and fully separate distillations would squeeze out such shared features, weakening the expressiveness of the learnt representations. To avoid this, we additionally impose a regularization on Eq. (6), (7), (10) and (12):
$$\mathcal{L}_{reg} = \|\Theta\|^2_2, \qquad (14)$$
where $\Theta$ denotes the overall parameters of the student.
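The cost-augmented hinge objective over tree structures can be sketched as follows. To stay self-contained, this sketch assumes an enumerated list of candidate trees with precomputed scores rather than running the full CYK search; the function name and argument layout are our own.

```python
import numpy as np

def structured_hinge(score_gold, cand_scores, cand_hamming):
    """Margin-based structure loss (illustrative sketch of the hinge form).

    score_gold:   score s(T*) of the teacher's best tree
    cand_scores:  scores s(T) of candidate trees in the student
    cand_hamming: Hamming distances Delta(T, T*) for those candidates
    """
    # cost-augmented best violator: max_T [ s(T) + Delta(T, T*) ]
    augmented = np.asarray(cand_scores, dtype=float) + np.asarray(cand_hamming, dtype=float)
    return max(0.0, float(augmented.max()) - float(score_gold))
```

The loss is zero exactly when the teacher's tree beats every candidate by at least its Hamming distance, which is the usual max-margin condition for structured prediction.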

Training
Algorithm 1 gives the overall structure distillation process. In the early training stage (lines 2-19), semantic learning (Eq. 9) and output distillation (Eq. 5) are first executed for each teacher. As we have multiple teachers for one student on each task, for syntactic tree structure distillation we sequentially distill one teacher at a time: we take turns, with a turning gap $G_2$, processing the dependency or constituency injection from a tree teacher (lines 13-17), to keep training stable. After a certain number of training iterations $G_1$, we optimize the overall loss (line 20):
$$\mathcal{L} = \mathcal{L}_{out} + \lambda_1 \mathcal{L}_{syn} + \lambda_2 \mathcal{L}_{sem}, \qquad (15)$$
where $\lambda_1$ and $\lambda_2$ are coefficients regulating the corresponding objectives. $\mathcal{L}_{syn}$ can be either $\mathcal{L}^{(A)}_{syn}$ (Eq. 8) or $\mathcal{L}^{(B)}_{syn}$ (Eq. 13); that is, the syntax sources come from the two tree encoders (dependency and constituency) simultaneously. During inference, the well-trained student can make predictions alone, with only sequential word input.
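The alternating teacher schedule with turning gap $G_2$ can be sketched as a simple function of the iteration counter. The function name and the specific dependency-first ordering are our assumptions for illustration.

```python
def pick_syntax_teacher(iteration, turning_gap):
    """Alternating structure-injection schedule (illustrative sketch).

    Every `turning_gap` iterations, training switches between distilling
    the dependency teacher and the constituency teacher, which the paper
    reports helps keep optimization stable.
    """
    return 'dependency' if (iteration // turning_gap) % 2 == 0 else 'constituency'
```

A training loop would call this once per iteration and add only the selected teacher's injection loss for that step, while the output distillation and semantic losses are applied throughout.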

Experimental Setups
Hyperparameters. We use a 3-layer BiLSTM as our student, and a 2-layer architecture for all tree teachers. The default word embeddings are initialized randomly, with dimension 300. The hidden size is set to 350 in the student LSTM and 300 in the teacher models, respectively. We adopt the Adam optimizer with an initial learning rate of 1e-5. We use mini-batches of 32 within a total of 10k ($T$) iterations with early stopping, and apply a 0.4 dropout ratio to all embeddings.

Algorithm 1: Distill heterogeneous trees.
Tasks and evaluation. The experiments are conducted on four representative syntax-dependent tasks: 1) Rel, relation classification on SemEval10 (Hendrickx et al., 2010); 2) NLI, sentence-pair classification on the Stanford NLI corpus (Bowman et al., 2015); 3) SST, binary sentiment classification on the Stanford Sentiment Treebank (Socher et al., 2013); and 4) SRL, semantic role labeling on the CoNLL2012 OntoNotes data (Pradhan et al., 2013). For NLI, we take the element-wise product, subtraction, addition and concatenation of the two separate sentence representations as a whole pair representation. We mainly adopt the F1 score to evaluate the performance of different models. The data splits follow previous work.
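The NLI sentence-pair combination described above can be sketched in one line; the ordering of the concatenated parts is our assumption.

```python
import numpy as np

def combine_pair(u, v):
    """Sentence-pair feature for NLI (illustrative sketch).

    Concatenates the two sentence vectors with their element-wise
    product, difference, and sum, as described for the NLI setup.
    """
    return np.concatenate([u, v, u * v, u - v, u + v])
```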
Tree annotations and resources. The OntoNotes data offers annotations of both the dependency and constituency structures. For the remaining datasets, we parse sentences via the state-of-the-art BiAffine dependency parser (Dozat and Manning, 2017) and the Self-Attentive constituency parser (Kitaev and Klein, 2018), both trained on the PTB. The dependency parser achieves 93.4% LAS, and the constituency parser achieves a 92.6% F1 score. Besides, we evaluate different contextualized word representations, such as ELMo and BERT.

Main Results
Experimental results of different models are shown in Table 1. First, varying syntactic tree structures make different contributions to the tasks. For example, GCN with the dependency structure gives the best result on Rel, while TreeLSTM with the constituency tree achieves the best performance on SST. Second, when integrating heterogeneous tree structures via tree ensemble methods, competitive performance can be obtained, showing the importance of integrating heterogeneous tree information. Finally, our distilled student model significantly outperforms all the baseline systems, demonstrating its high effectiveness at integrating heterogeneous structure information.
Ablation results. We ablate each part of our distillation method in Table 2. First, we find that the enhanced structure injection strategy ($\mathcal{L}^{(B)}_{syn}$) achieves the best results on all tasks, compared with the latent syntax feature mimicking ($\mathcal{L}^{(A)}_{syn}$). By ablating each distillation objective, we learn that the syntax tree distillation ($\mathcal{L}_{syn}$) is the kernel of our knowledge distillation for these syntax-dependent tasks, compared with semantic feature learning ($\mathcal{L}_{sem}$). Besides, both the introduced teacher annealing factor $\alpha$ and the regularization $\mathcal{L}_{reg}$ benefit task performance. Finally, we explore recent contextualized word representations, including ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). Surprisingly, our distilled student model receives substantial performance improvements on all tasks, whereas removing the proposed syntax distillation from BERT causes the performance to drop, as shown in Table 1 (the vanilla BERT).

Heterogeneous Tree Structure
Upper-bound of heterogeneous structures.
We explore to what extent the distilled student manages to capture heterogeneous tree structure information. Following previous work (Conneau et al., 2018), we employ two syntactic probing tasks: 1) Constituent labeling, which assigns a non-terminal label to text spans within the phrase structure (e.g., Verb, Noun, etc.), and 2) Dependency labeling, which predicts the relationship (edge) between two tokens (e.g., subject-object, etc.). We take the last-layer output representation as the probing objective, and compare the student model with the four teacher tree encoders separately, based on the SRL task. As shown in Table 3, the student LSTM gives only a slightly lower score than the best tree models (i.e., GCN+dep. for dependency labeling, TreeLSTM+con. for constituency labeling), showing its effectiveness at capturing syntax. Besides, we find that the regularization $\mathcal{L}_{reg}$ plays a key role in improving the expressive capability of the learnt representations.
Distributions of heterogeneous syntax in different tasks. We also compare the distributions of dependency and constituency structures across tasks after fine-tuning. Technically, for each example in the test set, we measure the performance drop when the student LSTM is trained under only standalone dependency or constituency injection (TreeLSTM or GCN), by setting $\eta$ to 0 or 1, respectively. Intuitively, the more the results drop, the more the model benefits from the corresponding syntax. For each task, we collect the sensitivity values over all examples, linearly normalize them into [0, 1], and compute statistics. As plotted in Figure 3, the distributions of dependency and constituency syntax vary among tasks, verifying that different tasks depend on distinct types of structural knowledge, while integrating them altogether gives the best effects. For example, dependency structures support Rel and SRL, while NLI and SST benefit the most from constituency.

Robustness Analysis
Generalization ability to training data. Figure 4 shows the performance of different models on varying ratios of the full training dataset. The performance of all methods decreases as the training data is reduced, while our distilled student achieves better results than most of the baselines. The underlying reasons are two-fold. First, the heterogeneous syntactic features provide strong representations supporting better predictions. Second, the distilled student takes only sequential inputs, avoiding noise from parsed inputs to some extent.
We also see that TreeLSTM/GCN+dep. can counteract the data reduction (≤40%) on the Rel and SRL tasks, showing that these tasks rely more on dependency structures, while NLI and SST depend on constituency structures. In addition, the student starts to underperform the best baseline on small data (≤40%): without explicit tree annotations, the contribution of heterogeneous syntax can deteriorate. Still, it remains more robust to training-data shortage than most of the baselines, due to its noise resistance.
Reducing error accumulation from tree annotations. We investigate the effects of reducing noise from tree annotations by comparing performance under different input sources. Table 4 shows the results on SRL. With only word inputs, our model still outperforms the baselines that take gold syntax annotations as input. This partially shows that, without parsed tree annotations, the student model can avoid noise and error propagation. When we add gold annotations as an additional signal, the performance can be further improved.
Efficiency study. As shown in Figure 5, the student model has fewer parameters while keeping a faster decoding speed, compared with the other ensemble models. Our sequential model is about 3 times smaller than AdvT, and nearly 4 times faster than the tree ensemble methods. This observation coincides with previous studies (Kim and Rush, 2016; Sun et al., 2019).

Visualization on Heterogeneous Structure
The enhanced structure injection objectives (Eq. (10) and (12)) enable the student LSTM to induce tree structures in an unsupervised manner at test time.
To understand how the distilled model promotes the mutual learning of heterogeneous structures, we visualize the induced trees for a test example from SRL. As shown in Figure 6, the discovered dependency structure accurately matches the gold tree, and the constituents are highly correlated with the gold ones. Besides, the edges indicating the two elements are augmented by the learning of each other, which in return enhances the recognition of the element spans (yellow dotted boxes). For example, the constituent and dependency paths (green lines) linking the two minimal target spans, "the Focus Today program" and "by Wang Shilin", are enhanced and echo each other via the core predicate. This reveals that our method offers a deeper latent interaction between heterogeneous tree structures.

Figure 6: An SRL case where "hosted" is the predicate, "the Focus Today program" is A0, and "by Wang Shilin" is A1. Bold green lines indicate edges with higher scores.

Conclusion
We investigated knowledge distillation for integrating heterogeneous tree structures to facilitate NLP tasks, distilling syntactic knowledge into a sequential input encoder at both the output and feature levels. Results on four representative syntax-dependent tasks showed that the distilled student outperforms all standalone tree models, as well as the commonly used ensemble methods, indicating the effectiveness of the proposed method. Further analysis demonstrated that our method enjoys high robustness and efficiency.