Retrofitting Structure-aware Transformer Language Model for End Tasks

We consider retrofitting a structure-aware Transformer-based language model for facilitating end tasks, proposing to exploit syntactic distance to encode both phrasal constituency and dependency connections into the language model. A middle-layer structural learning strategy is leveraged for structure integration, accomplished alongside the main semantic task training under a multi-task learning scheme. Experimental results show that the retrofitted structure-aware Transformer language model achieves improved perplexity while inducing accurate syntactic phrases. By performing structure-aware fine-tuning, our model achieves significant improvements on both semantic- and syntactic-dependent tasks.


Introduction
Natural language models (LMs) can generate fluent text and encode factual knowledge (Mikolov et al., 2013; Pennington et al., 2014; Merity et al., 2017). Recently, pre-trained contextualized language models have brought remarkable improvements on various NLP tasks (Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). Among such methods, the Transformer-based (Vaswani et al., 2017) BERT has become the most popular encoder for obtaining state-of-the-art NLP task performance. It has been shown (Conneau et al., 2018; Tenney et al., 2019) that besides rich semantic information, implicit language structure knowledge can be captured by a deep BERT (Vig and Belinkov, 2019; Jawahar et al., 2019; Goldberg, 2019). However, the structural features learnt by a vanilla Transformer LM are insufficient for NLP tasks that heavily rely on syntactic or linguistic knowledge (Hao et al., 2019). Some efforts have been devoted to improving the structure-learning ability of Transformer LMs by installing novel syntax-attention mechanisms (Ahmed et al., 2019). Nevertheless, several limitations can be observed. First, according to recent findings from probing tasks (Conneau et al., 2018; Tenney et al., 2019; Goldberg, 2019), syntactic structure representations are best retained right at the middle layers (Vig and Belinkov, 2019; Jawahar et al., 2019). Yet existing tree Transformers employ traditional full-scale training over the whole deep Transformer architecture (as shown in Figure 1(a)), consequently weakening the upper-layer semantic learning that can be crucial for end tasks. Second, these tree Transformer methods encode either standalone constituency or dependency structure, while different tasks can depend on varying types of structural knowledge.
The constituency and dependency representations of syntactic structure share underlying linguistic characteristics, while the former focuses on disclosing phrasal continuity and the latter aims at indicating dependency relations among elements. For example, semantic parsing tasks depend more on dependency features (Rabinovich et al., 2017), while constituency information is much needed for sentiment classification (Socher et al., 2013).
In this paper, we aim to retrofit a structure-aware Transformer LM for facilitating end tasks.
• On the one hand, we propose a structure learning module for the Transformer LM, exploiting syntactic distance as the measurement for encoding both phrasal constituency and dependency connections.
• On the other hand, as illustrated in Figure 1, to better coordinate structural learning and semantic learning, we employ a middle-layer structural training strategy that integrates syntactic structures into the main language modeling task under a multi-task scheme, encouraging the induction of structural information to take place at the most suitable layer.
• Last but not least, we perform structure-aware fine-tuning with end-task training, allowing the learned syntactic knowledge to best match the needs of the end task.
We conduct experiments on language modeling and a wide range of NLP tasks. Results show that the structure-aware Transformer retrofitted via our proposed middle-layer training strategy achieves better language perplexity while inducing high-quality syntactic phrases. Besides, the LM after structure-aware fine-tuning gives significantly improved performance on various end tasks, including semantic-dependent and syntactic-dependent tasks. We also find that supervised structured pre-training brings more benefits to syntactic-dependent tasks, while unsupervised LM pre-training brings more benefits to semantic-dependent tasks. Further experimental results on unsupervised structure induction demonstrate that different NLP tasks rely on varying types of structural knowledge as well as distinct granularities of phrases, and that our retrofitting method helps to induce structural phrases that are best adapted to the needs of end tasks.

Related Work
Contextual language modeling. Contextual language models pre-trained on large-scale corpora have witnessed significant advances (Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019). In contrast to traditional static, context-independent word embeddings, contextual language models strengthen word representations by dynamically encoding the contextual sentence for each word during pre-training. By further fine-tuning with end tasks, the contextualized word representations from language models can help to provide the most task-related, context-sensitive features (Peters et al., 2018). In this work, we follow the line of Transformer-based (Vaswani et al., 2017) LMs (e.g., BERT), considering their prominence.
There has been much attention paid to the unsupervised grammar induction task (Williams et al., 2017; Shen et al., 2018a,b; Kuncoro et al., 2018; Kim et al., 2019a; Drozdov et al., 2019; Kim et al., 2019b). For example, PRPN (Shen et al., 2018a) computes the syntactic distance of word pairs. On-LSTM (Shen et al., 2018b) allows hidden neurons to learn long-term or short-term information via a gate mechanism. URNNG (Kim et al., 2019b) applies amortized variational inference, encouraging the decoder to generate reasonable tree structures. DIORA (Drozdov et al., 2019) uses inside-outside dynamic programming to compose latent representations from all possible binary trees. PCFG (Kim et al., 2019a) achieves grammar induction with a probabilistic context-free grammar. Unlike these recurrent-network-based structure-aware LMs, our work focuses on structure learning for a deep Transformer LM.
Structure-aware Transformer language model. Some efforts have analyzed Transformer-based pre-trained language models (e.g., BERT) by visualizing the attention (Vig and Belinkov, 2019; Kovaleva et al., 2019; Hao et al., 2019) or via probing tasks (Jawahar et al., 2019; Goldberg, 2019). They find that latent language structure knowledge is best retained at the middle layers of BERT (Vig and Belinkov, 2019; Jawahar et al., 2019; Goldberg, 2019).

Model
The proposed structure-aware Transformer language model mainly consists of two components: the Transformer encoder and the structure learning module, both illustrated in Figure 2.

Transformer Encoder
The language model is built on N-layer Transformer blocks. One Transformer layer applies multi-head self-attention in combination with a feedforward network, layer normalization and residual connections. Specifically, the attention weights are computed in parallel via:

Attn(Q, K, V) = softmax(Q K^T / √d_k) V,

where Q (query), K (key) and V (value) in the multi-head setting process the input x = {x_1, ..., x_n} in t parallel heads.
Given an input sentence x, the output contextual representation of the l-th layer Transformer block can be formulated as:

h^l = η(z^l + Φ(z^l)), with z^l = η(h^{l-1} + MultiHead(h^{l-1})),

where η is the layer normalization operation and Φ is a feedforward network. In this work, the output contextual representation h^l = {h^l_1, ..., h^l_n} of the middle layers is used to learn the structure y_struc, and the one at the final layer is used for the language modeling or end-task training y_task.
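As a rough illustration of the attention computation above, the following is a minimal pure-Python sketch of single-head scaled dot-product attention; the actual model uses t parallel heads and learned Q/K/V projections, which are omitted here:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q, K, V: lists of d_k-dimensional token vectors.
    Returns one output vector per query token.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # Weighted sum of value vectors.
        out.append([sum(wj * v[dim] for wj, v in zip(w, V))
                    for dim in range(len(V[0]))])
    return out
```

Because the softmax weights sum to one, each output row is a convex combination of the value vectors, with most mass on the keys closest to the query.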

Unsupervised Syntax Learning Module
The structure learning module is responsible for inducing phrases in an unsupervised manner, providing structure awareness to the host LM.
Syntactic context. We extract the context representations from the Transformer middle layers for the subsequent syntax learning. We optimize the structure-aware Transformer LM by focusing the injection of structural knowledge at three middle layers: the (l−1)-th, l-th and (l+1)-th. Note that although we only attach structural learning to the selected layers, it can still enhance the lower layers via back-propagation.
Specifically, we take the first of the three chosen layers as the word context, C^Ψ = h^{l−1}. For the phrasal context C^Ω = {c^Ω_1, ..., c^Ω_n}, we make use of the contextual representations from the three chosen layers by weighted sum:

c^Ω_i = α_{l−1} h^{l−1}_i + α_l h^l_i + α_{l+1} h^{l+1}_i,

where α_{l−1}, α_l and α_{l+1} are sum-to-one trainable coefficients. Rich syntactic representations are expected to be captured in C^Ω by the LM.
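The weighted sum above can be sketched as follows; `h_prev`, `h_mid` and `h_next` are hypothetical stand-ins for the token representations of layers l−1, l and l+1, and the coefficients are fixed here (at the paper's initial values) rather than trainable:

```python
def phrasal_context(h_prev, h_mid, h_next, alphas=(0.35, 0.4, 0.25)):
    """Per-token weighted sum of three middle-layer representations.

    h_*: lists of token vectors from layers l-1, l, l+1.
    alphas: sum-to-one coefficients (trainable in the real model).
    """
    a1, a2, a3 = alphas
    assert abs(a1 + a2 + a3 - 1.0) < 1e-6, "coefficients must sum to one"
    return [[a1 * p + a2 * m + a3 * n for p, m, n in zip(tp, tm, tn)]
            for tp, tm, tn in zip(h_prev, h_mid, h_next)]
```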
Structure measuring. In this study, we measure syntax with the syntactic distance. The general concept of the syntactic distance d_i can be reckoned as a metric from a certain word x_i to the root node within the dependency tree (Shen et al., 2018a). For instance, in Figure 3 the head word 'remembered' x_i and its dependent word 'James' x_j follow d_i < d_j. In this work, to maintain both the dependency and the phrasal constituents simultaneously, we add additional constraints on words and phrases. Given two words x_i and x_j (0 ≤ i < j ≤ n) in one phrase, we define d_i < d_j, as demonstrated by the word pair 'the' and 'story'. If they are instead in different phrases 1, e.g., S_u and S_v, the corresponding inner-phrasal head words follow d_i (in S_u) > d_j (in S_v), e.g., 'story' and 'party'.
In the structure learning module, we first compute the syntactic distances d = {d_1, ..., d_n} for each word based on the word context via a convolutional network:

d_i = Φ(Conv(C^Ψ)_i),

where d_i is a scalar and Φ is for linearization. With such syntactic distances, we expect both the dependency and the constituency syntax to be well captured in the LM.
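A toy version of this distance computation is sketched below; the real module convolves over vector-valued word contexts with learned filters, whereas this sketch uses scalar stand-ins for each word's context:

```python
def syntactic_distances(ctx, kernel, bias=0.0):
    """Toy 1-D convolution producing one scalar distance per word.

    ctx: per-word context scalars (stand-ins for C^Psi vectors);
    kernel: convolution weights over a centered window (odd length).
    The sequence is zero-padded so every word receives a distance.
    """
    k = len(kernel) // 2
    padded = [0.0] * k + list(ctx) + [0.0] * k
    return [sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias
            for i in range(len(ctx))]
```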
Syntactic phrase generating. Considering a word x_i opening an induced phrase S_m = [x_i, ..., x_{i+w}] in a sentence, where w is the phrase width, we need to decide the probability p*(x_j) that the word x_j (j = i + w + 1, i.e., the first word outside phrase S_m) belongs to S_m:

p*(x_j) = σ(c^Ω_j ⊤ s_m).

We set the initial width w = 1; if p*(x_j) is above the window threshold λ, x_j is considered inside the phrase, otherwise the phrase S_m is closed and a new one restarts at x_j. We incrementally conduct this phrasal searching procedure to segment all the phrases in a sentence. Given an induced phrase S_m = [x_i, ..., x_{i+w}], we obtain its embedding s_m via a phrasal attention:

s_m = Σ_{k=i}^{i+w} β_k h^l_k, with β = softmax(Φ(h^l_i), ..., Φ(h^l_{i+w})).

1 Note that we cannot explicitly define the granularity (width) of every phrase in the constituency tree; instead it is decided by the structure learning module heuristically.
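The incremental phrase-searching procedure above can be sketched as follows; `membership_prob` is a hypothetical scorer standing in for p*(x_j), which the model derives from its learned representations:

```python
def segment_phrases(words, membership_prob, lam=0.5):
    """Greedy left-to-right phrase segmentation.

    membership_prob(phrase, word) -> probability that `word` (the first
    word outside `phrase`) still belongs to it; lam is the window
    threshold. Returns a list of phrases (lists of words).
    """
    phrases, current = [], []
    for w in words:
        if not current:
            current = [w]                        # open a new phrase
        elif membership_prob(current, w) > lam:
            current.append(w)                    # extend current phrase
        else:
            phrases.append(current)              # close and restart
            current = [w]
    if current:
        phrases.append(current)
    return phrases
```

For example, with a toy scorer that groups consecutive words sharing a first letter, `segment_phrases(["the", "tall", "man", "met", "a", "dog"], scorer)` splits the sentence into four phrases.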

Structure-aware Learning
Multi-task training for language modeling and structure induction. Different from traditional language models, a Transformer-based LM employs masked language modeling (MLM), which can capture larger contexts. Likewise, we predict a masked word using the corresponding context representation at the top layer:

p(x_i | x_{\i}) = softmax(W h^N_i),

where W projects the top-layer representation onto the vocabulary. On the other hand, the purpose of unsupervised syntactic induction is to encourage the model to induce the s_m that is most likely entailed by the phrasal context c^Ω_i. The underlying logic is that, if the initial Transformer LM can capture linguistic syntax knowledge, then after iterations of learning with the structure learning module, the induced structure can be greatly amplified and enhanced. We thus define the following probability:

p(s_m | c^Ω_i) = exp(s_m ⊤ c^Ω_i) / Σ_{s'} exp(s' ⊤ c^Ω_i).

Additionally, to enhance the syntax learning, we employ negative sampling, where ŝ is a randomly selected negative phrase. The final objective for structure learning is:

L_struc = −log σ(s_m ⊤ c^Ω_i) − Σ_{ŝ} log σ(−ŝ ⊤ c^Ω_i).

We employ multi-task learning to simultaneously train our LM for both word prediction and structure induction. Thus, the overall target is to minimize the following multi-task loss objective:

L = L_LM + γ_pre · L_struc,

where γ_pre is a regulating coefficient.
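Assuming phrase and context embeddings are scored by dot products, the negative-sampling objective and the combined multi-task loss can be sketched as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def structure_loss(c, s_pos, s_negs):
    """Negative-sampling loss: pull the induced phrase embedding s_pos
    toward its phrasal context c, push random negative phrases away."""
    loss = -math.log(sigmoid(dot(c, s_pos)))
    loss += sum(-math.log(sigmoid(-dot(c, s))) for s in s_negs)
    return loss

def total_loss(lm_loss, struct_loss, gamma_pre=0.5):
    """Multi-task objective: masked-LM loss plus weighted structure loss."""
    return lm_loss + gamma_pre * struct_loss
```

A context aligned with its positive phrase and anti-aligned with its negatives yields a near-zero loss; the reverse configuration is heavily penalized.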
Supervised syntax injection. Our default structure-aware LM induces syntax unsupervisedly at the pre-training stage, as elaborated above. Alternatively, in Eq. (7), if we leverage gold (or a priori) syntactic distance information for phrases, we achieve supervised structure injection.
Unsupervised structure fine-tuning. We aim to improve the learnt structural information so that it better facilitates the end tasks. Therefore, during the fine-tuning stage of end tasks, we keep the structure learning module trainable:

L = L_task + γ_fine · L_struc,

where L_task refers to the loss function of the end task, and γ_fine is a regulating coefficient. Note that to achieve the best structural fine-tuning, supervised structure injection is unnecessary, and we do not allow supervised structure aggregation at the fine-tuning stage. Our approach is model-agnostic, as we realize the syntax induction via a standalone structure learning module disentangled from the host LM. Thus the method can be applied to various Transformer-based LM architectures.

Experimental Setups
We employ the same architecture as the BERT base model 2, a 12-layer Transformer with 12 attention heads and a 768-dimensional hidden size. To enrich our experiments, we also consider the Google pre-trained weights as initialization. We use Adam as our optimizer with an initial learning rate in [8e-6, 1e-5, 2e-5, 3e-5] and an L2 weight decay of 0.01. The batch size is selected from [16, 24, 32]. We set the initial values of the coefficients α_{l−1}, α_l and α_{l+1} to 0.35, 0.4 and 0.25, respectively. The pre-training coefficient γ_pre is set to 0.5, and the fine-tuning coefficient γ_fine to 0.23. These values give the best effects in our development experiments. Our implementation is based on the PyTorch library 3.
Besides, for supervised structure learning in our experiments, we use the state-of-the-art BiAffine dependency parser (Dozat and Manning, 2017) to parse sentences for all the relevant datasets, and the Self-Attentive parser (Kitaev and Klein, 2018) to obtain the constituency structure. Trained on the English Penn Treebank (PTB) corpus (Marcus et al., 1993), the dependency parser achieves 95.2% UAS and 93.4% LAS, and the constituency parser a 92.6% F1 score. With the auto-parsed annotations, we can calculate the syntactic distances (substituting the ones in Eq. 4) and obtain the corresponding phrasal embeddings (in Eq. 7).

Development Experiments
Structural learning layers. We first validate at which layer depth the structure-aware Transformer LM achieves the best performance when integrating our retrofitting method. We thus design probing experiments with the following two syntactic tasks. 1) Constituency phrase parsing seeks to generate grammar phrases based on the PTB dataset and evaluates whether the induced constituent spans also exist in the gold Treebank. 2) Dependency alignment computes the proportion of Transformer attention connecting tokens in a dependency relation (Vig and Belinkov, 2019):

DepAl = Σ_{i,j} α_{i,j} · dep(x_i, x_j) / Σ_{i,j} α_{i,j},

where α_{i,j} is the attention weight, and dep(x_i, x_j) is an indicator function (1 if x_i and x_j are in a dependency relation and 0 otherwise). The experiments are based on English Wikipedia, following Vig and Belinkov (2019).
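The dependency alignment proportion described above can be computed directly from an attention matrix and a set of gold dependency pairs, as in this sketch:

```python
def dependency_alignment(attn, dep_pairs):
    """Proportion of attention mass on token pairs in a dependency relation.

    attn: attn[i][j] is the attention weight from token i to token j;
    dep_pairs: set of (i, j) index pairs that are syntactically related.
    """
    total = sum(sum(row) for row in attn)
    aligned = sum(attn[i][j] for (i, j) in dep_pairs)
    return aligned / total if total else 0.0
```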
As shown in Figure 4, the results on both unsupervised and supervised phrase parsing are best at layer 6. The attention also aligns with dependency relations most strongly in the middle layers (5-6), consistent with findings from previous work (Tenney et al., 2019; Vig and Belinkov, 2019). Both probing tasks indicate that our proposed middle-layer structure training is practical. We thus inject the structure via the structure learning module at the 6-th layer (l = 6).

Phrase generation threshold. We introduce a hyper-parameter λ as a threshold to decide whether a word belongs to a given phrase during the phrasal generation step. We explore the best λ value based on the same parsing tasks. As shown in Figure 5, with λ = 0.5 for unsupervised induction and λ = 0.7 for supervised induction, the induced phrasal quality is highest. We therefore keep these λ values for all the remaining experiments.

Structure-aware Language Modeling
We evaluate the effectiveness of our retrofitted structure-aware LM after pre-training. We first compare the performance on language modeling 4. From the results shown in Table 2, our retrofitted Transformer yields better language perplexity in both the unsupervised (37.0) and supervised (29.2) settings. This proves that our middle-layer structure training strategy can effectively relieve the negative influence of structure learning on semantic learning, while inducing high-quality structural phrases. We can also conclude that language models with more successful structural knowledge better encode effective intrinsic language patterns, which is consistent with prior studies (Kim et al., 2019b; Drozdov et al., 2019). We also compare constituency parsing with state-of-the-art structure-aware models, including 1) the recurrent models described in §2: PRPN (Shen et al., 2018a), On-LSTM (Shen et al., 2018b), URNNG (Kim et al., 2019b), DIORA (Drozdov et al., 2019) and PCFG (Kim et al., 2019a), and 2) Transformer-based methods: Tree+Trm, RvTrm (Ahmed et al., 2019), PI+TrmXL, and the BERT model initialized with rich weights. As shown in Table 2, all the structure-aware models give good parsing results compared with non-structured models. Our retrofitted Transformer LM gives the best performance (60.3% F1) for unsupervised induction. Combined with the supervised auto-labeled parses, it gives the highest F1 score (68.8%).

Fine-tuning for End Tasks
We validate the effectiveness of our method for end tasks with structure-aware fine-tuning. All systems are first pre-trained for structure learning, and then fine-tuned with end-task training. The evaluation is performed on eight tasks, involving syntactic tasks and semantic tasks. Among the standard probing tasks, TreeDepth predicts the depth of the syntactic tree, TopConst tests the sequence of top-level constituents in the syntax tree, Tense detects the tense of the main-clause verb, and SOMO checks the sensitivity to random replacement of words. We follow the same datasets and settings as previous work (Conneau et al., 2018; Jawahar et al., 2019).
The results are summarized in Table 1. First, we find that structure-aware LMs bring improved performance on all tasks compared with the vanilla Transformer encoder. Second, the Transformer with our structure-aware fine-tuning achieves better results (70.74% on average) on all end tasks compared with the baseline tree Transformer LMs, which proves that our proposed middle-layer strategy benefits structural fine-tuning more than the full-layer structure training of the baselines. Third, with supervised structure learning, significant improvements are found across all tasks.
For the supervised setting, we also compare with a variant that replaces the syntax fusion in the structure learning module with an auto-labeled syntactic dependency embedding concatenated with the other input embeddings. Its results are not as prominent as those of the supervised syntax fusion, which reflects the advantage of our proposed structure learning module. Besides, based on the task improvements from the retrofitted Transformer, we can further infer that supervised structure benefits syntactic-dependent tasks more, while unsupervised structure benefits semantic-dependent tasks the most. Finally, the BERT model integrated with our method also gives improved effects 5.
Analysis

Induced Phrases after Pre-training
We take a further step, evaluating the fine-grained quality of phrasal structure induction after pre-training. Instead of checking whether the induced constituent spans are identical to the gold counterparts, we measure the deviation of the phrasal editing distances:

PhrDev(ŷ, y) = (1/n) Σ_i |∆(ŷ_i, y_i) − ∆̄|,

where ∆(ŷ_i, y_i) is the phrasal editing distance between the induced phrase length and the gold length within a sentence, and ∆̄ is the averaged editing distance. If all the predicted phrases are the same as the ground truth, or all differ from it equally, PhrDev(ŷ, y) = 0, which means that the phrases are induced with the maximum consistency, and vice versa. We report statistics over all sentences in Table 3. Our method can unsupervisedly generate higher-quality structural phrases, while the supervised manner achieves the best injection of constituency knowledge into the LM.
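A sketch of this deviation measure, under the assumption that it is the mean absolute deviation of per-phrase length editing distances (so that both a perfect match and a uniformly shifted prediction score 0):

```python
def phrase_deviation(pred_lens, gold_lens):
    """Deviation of phrase-length editing distances within a sentence.

    Returns 0 when every predicted phrase deviates from its gold
    counterpart by the same amount (maximally consistent induction).
    """
    deltas = [abs(p - g) for p, g in zip(pred_lens, gold_lens)]
    mean = sum(deltas) / len(deltas)
    return sum(abs(d - mean) for d in deltas) / len(deltas)
```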

Fine-tuned Structures with End Tasks
Interpreting fine-tuned syntax. To interpret the fine-tuned structures, we visualize the Transformer attention heads from the chosen l-th layer, together with the syntactic distances of the sentence. We exhibit three examples from SST, Rel and SRL, respectively, as shown in Figure 6. Overall, our method helps to induce clear structures of both dependency and constituency. Interestingly, different types of tasks rely on different granularities of phrases. Comparing the heat maps and syntactic distances with each other, the induced phrasal constituents on SST are longer than those on SRL. This is because the sentiment classification task demands more phrasal composition features, while the SRL task requires more fine-grained phrases.
In addition, we find that the syntactic distances in SRL and Rel have higher variance than those in SST. Intuitively, a larger deviation of syntactic distances within a sentence indicates a greater demand for dependency information between elements, while a smaller deviation points to phrasal constituency. This reveals that SRL and Rel rely more on dependency syntax, while SST is more relevant to constituents, which is consistent with previous studies (Socher et al., 2013; Rabinovich et al., 2017; Fei et al., 2020).
Distributions of heterogeneous syntax for different tasks. Based on the above analysis, we further analyze the distributions of dependency and constituency structures after fine-tuning on different tasks. Technically, we calculate the mean absolute difference of syntactic distances between the elements x_i and the sub-root node x_r in a sentence:

D = (1/n) Σ_i |d_i − d_r|.

We then linearly normalize these values into [0, 1] over all the sentences in the corpus of each task, with statistics plotted in Figure 7. Intuitively, the larger the value, the more the task depends on dependency syntax; otherwise, on constituency structure. Overall, the distributions of dependency structures and phrasal constituents in the fine-tuned LM vary among tasks, verifying that different tasks depend on distinct types of structural knowledge. For example, TreeDepth, Rel and SRL are most supported by dependency structure, while TopConst and SST benefit from constituency the most. SOMO and NER can gain from both types.
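This statistic can be sketched as follows; `distances` are the per-word syntactic distances of one sentence and `root_idx` indexes the sub-root, followed by the min-max normalization applied over a corpus of per-sentence scores:

```python
def dependency_tendency(distances, root_idx):
    """Mean absolute difference between each word's syntactic distance
    and the sub-root's; larger values suggest stronger reliance on
    dependency structure, smaller ones on phrasal constituency."""
    d_r = distances[root_idx]
    return sum(abs(d - d_r) for d in distances) / len(distances)

def min_max_normalize(values):
    """Linearly rescale per-sentence scores into [0, 1] over a corpus."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```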
Phrase types. Finally, we explore the diversity of phrasal syntax required by two representative end tasks, SST and SRL. We first look into the statistical proportions of different types of induced phrases 6. As shown in Table 4, our method tends to induce more task-relevant phrases, with phrase lengths that adapt to the task. Concretely, the fine-tuned structure-aware Transformer generates more, and longer, NPs for the SST task, and yields roughly equal numbers of NPs and VPs with shorter phrases for the SRL task. This evidently gives rise to the better task performance. In contrast, the average length of the syntactic phrases induced by the Tree+Trm model stays nearly unchanged between the SST (3.22) and SRL (3.36) tasks.

Conclusion
We presented a retrofitting method for structure-aware Transformer-based language models. We adopted the syntactic distance to encode both the constituency and the dependency structure. To relieve the conflict between structure learning and semantic learning in the Transformer LM, we proposed a middle-layer structure learning strategy under a multi-task scheme. Results showed that the structure-aware Transformer retrofitted via our method achieves better language perplexity while inducing high-quality syntactic phrases. Furthermore, our LM with structure-aware fine-tuning gives significantly improved performance on both semantic-dependent and syntactic-dependent tasks, also yielding the most task-related and interpretable syntactic structures.