Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach

It is commonly believed that knowledge of syntactic structure should improve language modeling. However, incorporating syntactic structure into neural language models both effectively and efficiently has remained a challenging topic. In this paper, we make use of a multi-task objective: the model simultaneously predicts words and ground-truth parse trees in a form called "syntactic distances", where the two separate objectives share the same intermediate representation. Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground-truth parse trees are provided as additional training signals, the model achieves lower perplexity and induces trees with better quality.


Introduction
It is widely believed in linguistics, cognitive science, and computational linguistics that the latent structure underlying how words combine to form sentences is best represented as a tree structure. The study of the computational mechanisms and systems of constraints that characterize such derivations or parse trees is a central question in these fields (Pollard and Sag, 1994;Steedman and Baldridge, 2011;Huddleston and Pullum, 2002;Adger, 2003;Bresnan, 2001;Chomsky, 1995;Sag et al., 2003).
Using syntactic information for the language modeling task has been a popular research topic since the 1990s. Early efforts included various approaches that attempted to incorporate shallow syntactic information such as POS tags (Heeman and Allen, 1997; Srinivas, 1996) as well as more complete structures (Wright et al., 1994; Jurafsky et al., 1995). Most of such work falls under the topic of structured language modeling (Chelba and Jelinek, 2000; Van Uytsel et al., 2001; Xu et al., 2002). With the resurgence of neural network approaches, sequential, large-scale neural language models have been shown to significantly outperform traditional language models (Merity et al., 2017; Yang et al., 2018) without using syntactic structural information. At the same time, recent analysis also reveals that state-of-the-art sequential neural language models still fail to learn certain long-range syntactic dependencies (Kuncoro et al., 2018). It is thus an interesting problem to explore the relation between language models and syntax and to investigate whether syntax can be integrated to enhance neural language models.
To this end, two main lines of work have been investigated, namely transition-based and distance-based methods. The former strand of work has sought to jointly train a transition-based parser (Nivre, 2008; Zhang and Nivre, 2011; Andor et al., 2016) with a language model using a linearized structured sentence. For example, recurrent neural network grammars (RNNGs) model the joint probability of both words and trees by training a generative, top-down parser (Dyer et al., 2016; Cheng et al., 2017). Subsequent work (Kim et al., 2019b) has developed an unsupervised variant of RNNGs based on an expectation maximization algorithm, which enables the system to be used as a language model without access to annotated parse trees.
The second strand of work designs language models that are constrained using syntactic constituents induced via the notion of syntactic distance (Shen et al., 2017, 2018). The distances are a sequence of scalars between consecutive words, which are higher when a higher-level constituent boundary lies between the corresponding pair of words. While aligning nicely with the sequential nature of language models, syntactic distances can be transformed into syntactic tree structures with simple principles (Shen et al., 2017).
The major difference between the above two strands of work is that the former focuses more on parsing performance while the latter aligns better with language model settings. There are three main benefits of the syntactic distance approach. First, typical engineering tricks for language modeling such as batching and regularization (Merity et al., 2017) can be used directly. Second, unlike transition-based methods, which require modeling each sentence independently, distance-based models allow direct comparison with mainstream prior work on language modeling (Gal and Ghahramani, 2016; Merity et al., 2017; Yang et al., 2018) on the same datasets, which carry information across sentence boundaries. Third, there is no risk of compounding errors, unlike in the transition-based approach. However, unlike for transition-based approaches (Kim et al., 2019b), for distance-based approaches there have been no studies examining the relationship between induced syntactic structure and human-labeled syntactic structure, or whether human-labeled syntactic trees can be used to improve language modeling (Dyer et al., 2016; Kim et al., 2019b).
To this end, we investigate distance-based language models with explicit supervision. In particular, we inject syntactic tree supervision into distance-based neural language models by breaking a syntactic tree into a label sequence, and extending a distance-based language model to include a multi-task objective that also learns to predict gold-standard labels. We choose the Ordered-Neuron LSTM (ON-LSTM) (Shen et al., 2018) as our baseline model, which gives the best results among distance-based models.
To make a fair comparison with the dominant methods on language modeling, we also manually extend the most commonly used dataset for evaluating language models, which we name PTB-Concat (Mikolov et al., 2010). It is a version of the Penn Treebank (PTB) (Marcus et al., 1993) dataset with syntactic trees removed, and with preprocessing of numbers, punctuation and singleton words. We add syntactic trees back, which allows us to directly compare distance-based methods with other language models.
Experimental results show that incorporating linguistically motivated structures can practically improve language modeling performance. To the best of our knowledge, this is the first work to successfully incorporate gold-standard syntactic trees into syntactic distance based language models. Additional experiments suggest that a similar level of improvement can also be achieved in other language models. Furthermore, analyses of the trees learned by the multi-task models demonstrate that they are different from both gold trees and unsupervisedly learned trees.

Related Work
Using syntactic information for language modeling dates back to the last century. Srinivas (1996) proposed using shallow syntactic structures, so-called "super-tags", which successfully reduced perplexity by 38% over a tri-gram based word-level language model. More complete parser integration has also been explored under the heading of "structured language modeling" (Chelba and Jelinek, 2000). This research covers a wide range of different parsers, albeit mostly with N-gram models (Van Uytsel et al., 2001; Xu et al., 2002). Wright et al. (1994) and Jurafsky et al. (1995) extend bi-gram language models with a context-free grammar. Feed-forward neural language models were also explored (Xu et al., 2003). However, their performance does not approach that of modern neural LMs. Dyer et al. (2016) first proposed RNNGs. Subsequent work extends the model with an encoder-decoder architecture (Cheng et al., 2017), unsupervised learning (Kim et al., 2019b), knowledge distillation (Kuncoro et al., 2019) and computational psycholinguistics (Hale et al., 2018). Shen et al. (2017) first used syntactic distance to constrain language modeling. Subsequent work (Shen et al., 2018) transfers the distance notion to the LSTM cell. Our work extends distance-based methods by introducing supervised syntax into these models. A very recent work uses attention over spans instead of syntactic distance to inject inductive bias into language models (Peng et al., 2019). However, the time complexity of injecting supervision is much higher than that of the distance-based approach (O(n^2) vs. O(n)).

Model
The overall structure of our model is shown in Figure 1. In particular, ON-LSTM is taken as the base language model, and syntactic trees are added by converting them to syntactic distances. The supervised distance values are taken as one additional output, resulting in a multi-view model.

Figure 1: Split-head approach of constructing the two master forget gates in the multi-task setting.

Ordered Neurons LSTM
Ordered Neurons LSTM (ON-LSTM) (Shen et al., 2018) is built upon a vanilla LSTM model (Hochreiter and Schmidhuber, 1997) with two additional gates, namely a master input gate \tilde{i}_t and a master forget gate \tilde{f}_t, each being a vector of the same shape as the LSTM forget and input gates:

\tilde{f}_t = \mathrm{cumax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}), \qquad \tilde{i}_t = 1 - \mathrm{cumax}(W_{\tilde{i}} x_t + U_{\tilde{i}} h_{t-1} + b_{\tilde{i}}),

where cumax is defined as the cumulative sum of softmax outputs, i.e., \mathrm{cumax}(\cdot) = \mathrm{cumsum}(\mathrm{softmax}(\cdot)). The cumax function provides an inductive bias to model hierarchical structures by enforcing the units in the master forget gate \tilde{f}_t to increase monotonically from 0 to 1 and those in the master input gate \tilde{i}_t to decrease monotonically from 1 to 0. The two gates are applied to the original input and forget gates as follows:

\omega_t = \tilde{f}_t \circ \tilde{i}_t, \qquad \hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t), \qquad \hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t), \qquad c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t.

ON-LSTM can learn the implicit structure of a language in the form of a binary tree in an unsupervised manner, through syntactic distances, which are calculated as:

d_t = D_m - \sum_{k=1}^{D_m} \tilde{f}_{tk},

where D_m is the size of the hidden state. The syntactic distance d_t between two consecutive words is a scalar value, which can be interpreted as reflecting the syntactic relatedness between the constituents before and after time point t. In terms of trees, it can be thought of as the height of the lowest tree node that encloses both words. For discrete trees, the height is given by the maximum path length from a leaf; in the more general case, it can be thought of as a scalar measuring a continuous notion of node height. Figure 2 depicts a sample sentence with its syntactic distances and the corresponding tree structure. More generally, the binary tree structure of a sequence with N tokens can be specified by a sequence of N − 1 syntactic distances. This definition makes the syntactic distance an ultrametric (Holly, 2001; Wu et al., 1999), a concept which is important in the theory of hierarchical agglomerative clustering (Johnson, 1967) and was first explored in a linguistic setting by Levelt (1974).

Figure 2: Binarized grammar tree and its corresponding syntactic distances. The heights of the bars stand for the values of the distances. To convert this tree to syntactic distances, we first assign all the words an initial value of 1, and then the non-leaf nodes are assigned distances in the order d3 → d2 → d1 → d4, according to the procedure described in the second part of the Model section. On the other hand, given the distances, the tree can be recovered in a top-down process by setting up the split boundaries in descending order of distances (i.e., d4 → d1 → d2 → d3). Syntactically, a shorter distance between a pair of words indicates a closer relationship between the constituents on the two sides of that position. Note that since only the relative order of the distances affects the structure of the tree, valid values of these distances are not unique.
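To make the gate-to-distance mapping concrete, the following minimal PyTorch sketch (our own illustration, not the authors' released code) computes the cumax activation and the per-timestep distance d_t = D_m − Σ_k \tilde{f}_{tk} from master-gate logits; tensor shapes and names are assumptions.

```python
# Minimal sketch: cumax activation and syntactic distance from the master
# forget gate. Shapes and names are illustrative, not the released ON-LSTM code.
import torch
import torch.nn.functional as F

def cumax(x, dim=-1):
    # Cumulative sum of a softmax: entries increase monotonically from 0 to 1.
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

def master_gates(forget_logits, input_logits):
    f_tilde = cumax(forget_logits)        # master forget gate: rises 0 -> 1
    i_tilde = 1.0 - cumax(input_logits)   # master input gate: falls 1 -> 0
    return f_tilde, i_tilde

def syntactic_distance(f_tilde):
    # d_t = D_m - sum_k f_tilde_tk: fewer "open" units => higher boundary at t.
    D_m = f_tilde.size(-1)
    return D_m - f_tilde.sum(dim=-1)

# Toy usage: 2 timesteps, hidden size 8, random logits.
f_tilde, _ = master_gates(torch.randn(2, 8), torch.randn(2, 8))
print(syntactic_distance(f_tilde))
```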

Converting Grammar Trees to Syntactic Distances
To integrate treebank trees into ON-LSTM, we need to first convert syntactic trees into a representation based on syntactic distances. Since the original grammar trees are not necessarily binary, we first split non-binary nodes by adding sentinel intermediate nodes to form a right-branching binary tree, following the steps in Stern et al. (2017). For a binary tree with N leaf nodes, we then have N − 1 non-leaf nodes that correspond to the N − 1 slots between adjacent word pairs, each of which is assigned a syntactic distance (Figure 2). The binary tree can thus be represented as a sequence of distances d_1, d_2, ..., d_{N−1}.
The conversion from a binary tree to syntactic distances thus amounts to assigning a distance value to each of the N − 1 non-leaf nodes in the tree. This is achieved in a bottom-up process. We first initialize a distance value of 1 at all of the leaf nodes, and then compute the syntactic distances of the parent nodes by recursively tracing back their parents. More specifically, for a given parent node, its syntactic distance d_P is computed from the syntactic distances of its children d_L and d_R as:

d_P = \max(d_L, d_R) + 1.

A more detailed algorithm flowchart of the tree-to-distance conversion is given in Appendix A.1.
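As a concrete illustration of this bottom-up rule, here is a small Python sketch, assuming a binarized tree represented as nested tuples with word strings at the leaves (our own representation, not the paper's code):

```python
# Sketch of the bottom-up tree-to-distance conversion: leaves have height 1,
# each internal node takes max(child heights) + 1, and that value is the
# syntactic distance of the slot between its left and right subtrees.
def tree_to_distances(node):
    if isinstance(node, str):              # leaf: a single word
        return [], 1
    left, right = node
    d_left, h_left = tree_to_distances(left)
    d_right, h_right = tree_to_distances(right)
    height = max(h_left, h_right) + 1
    return d_left + [height] + d_right, height

# Example: a left-deep tree over four tokens yields three increasing distances.
dists, _ = tree_to_distances(((("the", "cat"), "sat"), "."))
print(dists)   # [2, 3, 4]
```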

Auxiliary Syntactic Distance Outputs
In ON-LSTM, the distances d_t defined above are used to infer the structure of grammar trees. Consequently, a straightforward way to incorporate ground-truth parse trees is to use the ground-truth distances d^g_t to guide d_t, as depicted in Figure 1. Interestingly, directly forcing the structure inferred by the language model to be consistent with linguist-annotated ground-truth trees barely improves language model performance (see Section 6). Instead, we introduce a "split-head" setting, which practically improves LM performance by learning two sets of closely related syntactic distances.
In particular, we use another master forget gate \tilde{f}^w_t for inferring a set of distances that are trained to align with the gold-standard syntactic distances, while leaving the original distances d_t computed from \tilde{f}_t intact. To achieve this, we introduce an extra linear layer on top of the hidden states h^f_t, and from there infer a separate set of master forget gates. In this way, both master forget gates \tilde{f}_t and \tilde{f}^w_t share the same input h^f_t, but optimize two different sets of trees for the language modeling and parsing tasks, respectively:

\tilde{f}^w_t = \mathrm{cumax}(W_w h^f_t + b_w).

The syntactic distances for the auxiliary supervised targets are then calculated as follows:

d^w_t = D_m - \sum_{k=1}^{D_m} \tilde{f}^w_{tk},

where \tilde{f}^w_{tk} is the k-th element of the vector \tilde{f}^w_t.
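A hedged PyTorch sketch of the split-head idea follows; module and variable names are ours, and the exact wiring in the released model may differ. The point is that one shared pre-gate activation feeds two cumax heads, one driving the LM's implicit tree and one supervised against gold distances.

```python
# Split-head sketch: one shared pre-gate hidden state h_f, two master forget
# gates, two sets of distances (d_lm for the LM tree, d_syd for supervision).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitHeadDistances(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.extra = nn.Linear(hidden_size, hidden_size)  # extra head for the supervised gate

    def forward(self, h_f):                               # h_f: (batch, hidden)
        f_tilde = torch.cumsum(F.softmax(h_f, dim=-1), dim=-1)
        f_tilde_w = torch.cumsum(F.softmax(self.extra(h_f), dim=-1), dim=-1)
        D_m = h_f.size(-1)
        d_lm = D_m - f_tilde.sum(dim=-1)                  # left intact for language modeling
        d_syd = D_m - f_tilde_w.sum(dim=-1)               # aligned with gold-standard distances
        return d_lm, d_syd
```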

Grammar Trees as Auxiliary Supervised Targets for Language Modeling
With the additional master forget gate \tilde{f}^w_t, the model has two different sets of predictions. The first set is the language model output of ON-LSTM, predicting the next words. The second set is the auxiliary distances d^w_t calculated above. The original language modeling structure of the ON-LSTM model is left intact after the modification, so we continue to use the master forget gate \tilde{f}_t to update hidden states and calculate the softmax output in ON-LSTM for the language modeling part. We denote the negative log-likelihood loss of the language model part as L_lm. For brevity, we do not discuss the details of this loss.
For aligning the syntactic distances, we apply a ranking loss between the learned syntactic distances d^w_t and the ground-truth distances d^g_t, following the pairwise ranking objective first proposed by Burges et al. (2005). The goal is to encourage the model to produce distances that have the same ranking order as the ground-truth distances:

L_{syd} = \sum_{i \neq j} \max\big(0,\; 1 - \mathrm{sign}(d^g_i - d^g_j)\,(d^w_i - d^w_j)\big).

The joint objective is thus to minimize the following loss:

L = L_{lm} + \alpha L_{syd},

where α is the scaling parameter.
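The sketch below shows one standard way to implement such a pairwise ranking loss and the joint objective in PyTorch; the margin and the exact pair weighting are our assumptions, not necessarily the paper's precise choices.

```python
# Pairwise hinge ranking loss over distances plus the joint objective
# L = L_lm + alpha * L_syd (margin and pair weighting are illustrative).
import torch

def ranking_loss(d_pred, d_gold, margin=1.0):
    diff_pred = d_pred.unsqueeze(1) - d_pred.unsqueeze(0)   # all pairwise differences
    diff_gold = d_gold.unsqueeze(1) - d_gold.unsqueeze(0)
    sign = torch.sign(diff_gold)
    mask = sign != 0                                        # ignore tied gold distances
    return torch.clamp(margin - sign * diff_pred, min=0.0)[mask].mean()

def joint_loss(loss_lm, d_pred, d_gold, alpha=0.75):
    return loss_lm + alpha * ranking_loss(d_pred, d_gold)
```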

Datasets
We construct evaluation datasets in English and Chinese, both of which come with parse trees and established language modeling benchmarks. For English, we use the Penn Treebank (PTB) dataset (Marcus et al., 1993). Mikolov et al. (2010) provided a widely accepted version of PTB for language modeling, with several modifications to the original treebank: all punctuation symbols are removed, all characters are lower-cased, the vocabulary is truncated at 10,000 words and all sentences are concatenated. However, this version of PTB discards the parse tree structures, which makes it unsuitable for comparing sequential language models with those utilizing tree structures. We refer to this version as PTB-Concat.
Dyer et al. (2016) proposed a different version of PTB, which retains the parse tree structures. Sentences are modeled separately, punctuation is retained, and singleton words are replaced according to the Berkeley parser's mapping rules, resulting in a much larger vocabulary of 23,815 word types. Since it retains the parse trees, this dataset enables direct comparison between models that utilize parse trees and those that do not. Unfortunately, since the vocabulary is different from that of PTB-Concat and the sentences are processed separately, the results are not directly comparable with those on PTB-Concat, on which most existing work on language modeling reports results. We refer to this version as PTB-Sepsent.
As mentioned above, a salient limitation of PTB-Sepsent is that it does not allow fair comparison with existing LM work on PTB-Concat. To address this issue, we propose a different variation of the PTB dataset that uses the same vocabulary as PTB-Concat while retaining the ground-truth grammar trees. We pre-process the PTB dataset following the same steps as Mikolov et al. (2010) to obtain a modified treebank with the same vocabulary set as PTB-Concat. Sentences are concatenated, and we make sure that they are token-for-token identical to PTB-Concat in the training, validation, and test sets. This results in the same vocabulary as PTB-Concat, which allows us to directly compare models that utilize parse trees with the existing reports of performance on PTB-Concat. We refer to this version of PTB-Concat with syntax as PTB-Concat-Syn; preprocessing details are covered in Appendix A.3.
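For illustration, a rough sketch of the token-level normalization we have in mind when matching PTB-Concat (lower-casing, mapping digit-bearing tokens to N, truncating the vocabulary and replacing the rest with <unk>) is given below; it is our reading of the Mikolov-style pipeline, not the released preprocessing script.

```python
# Hypothetical sketch of Mikolov-style token normalization used to align the
# tree-carrying corpus with PTB-Concat; the actual script may differ in detail.
import re
from collections import Counter

def normalize_token(tok):
    tok = tok.lower()
    if re.search(r"\d", tok):        # digit-bearing tokens become the N placeholder
        return "N"
    return tok

def build_vocab(tokens, size=10000):
    counts = Counter(tokens)
    return {w for w, _ in counts.most_common(size - 1)} | {"<unk>"}

def map_unk(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]
```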
For Chinese, we use the Chinese Treebank 5.1 (Xue et al., 2005), with the same settings as Kim et al. (2019b). Sentences are modeled separately and singleton words are replaced with a single <UNK> token. It will be referred to as CTB-Sepsent in the rest of the paper.

Table 1: Language modeling results on PTB-Concat (columns: Model, Param, Dev, Test).

Experiments
We evaluate the influence of syntactic supervision on distance-based language models, especially in terms of language modeling performance. We also analyze the syntax induced after introducing structural supervision. In addition, extensive ablation tests are conducted to understand how syntactic supervision affects the language model.

Language Modeling
We first compare our models with existing sequential language models on PTB-Concat, and then we compare our model with transition-based language models on PTB-Sepsent and CTB-Sepsent, which have a larger vocabulary and also use additional grammatical structure. Dropout is applied to the word vectors, the LSTM weight matrices, the outputs between LSTM layers and the output of the last layer. The embedding dropout ratio is 0.125. The model is trained and fine-tuned for 1000 epochs in total and is switched to the fine-tuning phase at epoch 650. The ground-truth syntactic structures are used to supervise the syntactic distances in the third layer of ONLSTM-SYD, and the loss ratio α is set to 0.75. We use this setting as the default for all experiments.

Results on PTB-Concat
The results are shown in Table 1. After adding structural signals to the model, our model ONLSTM-SYD significantly outperforms the original ON-LSTM model (p-value < 0.05), indicating that incorporating linguist-annotated parse trees can contribute positively to language modeling.
Results on PTB-Sepsent and CTB-Sepsent
PTB-Sepsent and CTB-Sepsent offer a setting comparable with other structure-aware supervised (Dyer et al., 2016) and unsupervised (Kim et al., 2019b) baselines. The results are listed in Table 2. ONLSTM-SYD performs better than ON-LSTM, which indicates that supervised syntactic information can help improve language modeling.
The margin between our models and the baselines is rather large. We find that the set of regularization and optimization techniques proposed by Merity et al. (2017) contribute significantly to this margin. Because of the sequential and parallel nature of our model, it can directly inherit and benefit from this set of tricks. In contrast, it is non-trivial to use them for RNNG and URNNG. As a more rigorous analysis, we further conducted a set of experiments without those tricks (i.e. non-monotonically triggered ASGD, weight-dropped LSTM, finetuning). The performance (denoted as ONLSTM-SYD-noAWD) drops; however, the model still outperforms the other baselines by a significant margin.

Structure Analysis
In this subsection we analyze the model to see how the additional structural supervision affects the quality of inferred trees. Note that our goal here is to analyze the influence of ground truth syntactic information on the quality of the induced trees rather than to yield a better grammar induction performance, since our model is not strictly comparable to other models due to its extra structural supervision during training.
We follow the settings of Htut et al. (2018) to test our model on the WSJ10 and WSJ test sets, reporting the results in Table 3. The WSJ test set has 2416 sentences of arbitrary length, while WSJ10 consists of the 7422 sentences from the whole WSJ corpus that contain no more than 10 words. We use both biased and unbiased distance-to-tree conversion algorithms for both ON-LSTM and our proposed model (cf. Appendix A.1 and A.2 for formal descriptions of the biased and unbiased conversion algorithms). Since our model learns two sets of trees simultaneously, we list both in Table 3.

Grammar Induction
We can see that the trees learned with the joint loss show improved F1 scores and rely less on the branching bias of the tree construction algorithm (see Dyer et al. (2019)). The large gap in F1 scores on WSJ between the biased and unbiased trees changes substantially after introducing the structural loss, and the LM unbiased trees significantly outperform the ON-LSTM baseline. This indicates that the auxiliary supervised task not only lowers the perplexity, but also improves the quality of the trees induced for the LM task.
Looking more closely at the trees, we find that compared to ON-LSTM, ONLSTM-SYD improves the label prediction accuracy for NP (noun phrases), VP (verb phrases) and PP (prepositional phrases) but fails to improve ADJP (adjective phrases). This suggests that different types of human-annotated constituents may have different influences on language modeling, or that human-annotated trees are themselves biased to differing degrees between different constituent types.

Table 3 (caption, partial): ... results are the best compared to those computed from the other parts of the model (i.e., within the same section of the table). The Algorithm column indicates whether the biased or unbiased conversion algorithm is used. ONLSTM-SYD_syd and ONLSTM-SYD_lm denote the two sets of trees induced from the losses L_syd and L_lm, respectively. The Accuracy columns give the fraction of ground-truth constituents of a given type that correspond to constituents in the model parses. The R/L Ratio column gives the ratio between the number of words that are left children of their parent and those that are right children.
Branching Bias Syntactic trees of English naturally have a bias towards right-branching structures. As shown in the last section of Table 3, right-branching trees achieve a much higher F1 score than random, balanced or left-branching trees. As pointed out by Dyer et al. (2019), PRPN and ON-LSTM resort to a distance-to-tree algorithm with a right-branching bias (see Appendix A.2). For our model, the biased distance-to-tree algorithm yields worse results than its unbiased counterpart, whereas for unsupervised models such as ON-LSTM, the biased algorithm yields better results than the unbiased version. This observation indicates that syntactic supervision leads to better tree structures than fully unsupervised tree induction, which is intuitive.
Linguistic Analysis Our best parsing results are for trees decoded from the syntactic prediction objective using the unbiased algorithm. Interestingly, these trees tend to be deeper on average than the (binarized) gold-standard trees (see Table 3). This appears to be driven by a failure of the model to identify constituents centered on deeply embedded head words; instead, the model prefers right-branching structures. Some example trees are displayed in Figure 3. In the top part of the figure, we see the parse produced from the L_syd distances of our model, in the middle the tree produced from the L_lm distances and, on the bottom, the gold-standard tree. As can be seen in the figure, the L_syd-based tree is largely right-branching and misses constituents centered on several deeply embedded heads, such as the verb said. By contrast, the L_lm-based tree is considerably shallower than the gold standard and consists of a sequence of smaller chunks that often mis-bracket words with respect to the gold-standard constituent boundaries. Figure 4 illustrates these phenomena in further detail. The plot at the top of the figure shows the proportion of constituents produced from the L_syd distances whose boundaries correspond to a gold constituent, broken down by the height of the node in the predicted tree. As the plot illustrates, the model fares better on relatively small constituents lower in the tree, and makes more errors for constituents higher in the tree, reflecting mistakes on deeply embedded heads. The bottom of the figure shows the same breakdown for the L_lm-based induced trees. Overall, the effect is similar, although the L_lm-based trees are shallower than the L_syd-based trees. We believe the increased accuracy for the longest constituents is driven by the fact that, since the highest constituents cover long sentence spans and there are few possible long spans, these constituents have a higher baseline probability of being correct.
It appears that the L_syd objective has learned a strong right-branching bias, leading to very deep trees (even with the unbiased decoder), whereas the L_lm objective appears to be using a kind of predictive chunking of the sentence into small groups of words. It is tempting to speculate that these chunks may correspond to linguistic units used in prosodic planning or by the human sentence processor, while the deeper trees correspond more directly to the compositional structure underlying sentence meaning. We leave exploring this question to future work.

Figure 3: Parses of the sentence "the company which issued a statement on the agreement late friday said that N million of the payment was previously provided for in its financial statements and that NN will be recognized in its N third-quarter statement", produced from the L_syd distances (top), the L_lm distances (middle), and the gold standard (bottom).

Figure 4: Proportion of predicted constituents whose boundaries correspond to a gold constituent, broken down by the height of the node in the predicted tree, for L_syd-based trees (top) and L_lm-based trees (bottom). A constituent is considered correct if its boundaries correspond to a true constituent. Since constituents that represent the whole sentence always have correct boundaries, they are excluded from the calculation.
Parsing performance Our models give worse unlabeled parsing performance than transition-based methods. In particular, Kim et al. (2019a) report that an unsupervised URNNG achieves 45.4 WSJ F1 in a similar setting, while a URNNG that fine-tunes a supervised RNNG model gives a much better F1 of 72.8, a 27.4-point improvement. In contrast, the F1 of our structure-prediction trees is 61.3 with the unbiased algorithm. This indicates that our model brings more benefit on the LM side than on the parsing side.

Ablation Study
Layer used for supervision Table 4 (Top) shows the performance when the supervised signal is injected into different layers. Although injecting syntax into the last layer gives the best syntactic distances for grammar induction, it fails to achieve a similar improvement in perplexity. This suggests that a better syntactic structure may not always lead to a better language model. This observation is consistent with prior research (Williams et al., 2018).
Tree structure We study the influence of different types of supervised trees on the model. In addition to using the ground-truth parse trees, we also train the model with random trees, and without providing trees at all, in which case it degenerates to a vanilla ON-LSTM. From Table 4 (Middle) we find that, without supervision signals from gold-standard parse trees, the model performs worse than the full model. Random trees introduce noise into the model and degrade both parsing and LM performance, indicating the importance of injecting meaningful syntax.

Multitask variants
We also explored injecting the supervised syntactic information at different levels. One straightforward baseline is to add supervision signals directly on the syntactic distances in ON-LSTM, using one set of trees to guide both LM and parsing, as described in the Model section (Table 4 Bottom, one set of trees). Despite injecting stronger syntactic signals, this direct approach does not improve language model perplexity. This also reflects the fact that the syntactic structures most suitable for language modeling do not necessarily conform to human-labeled syntax. In addition, we also use the ON-LSTM hidden states for supervised syntactic distance prediction (Table 4 Bottom, vanilla multitasking). This approach fails to outperform its ON-LSTM baseline for the same reason. In summary, there are mutual benefits between induced and supervised syntactic information, although the two do not fully overlap.
Generalization to other LMs One practical question is whether the improvements found in our work generalize to other language models. To answer this question, we introduce the multitask scheme into PRPN (Shen et al., 2017), another model that is able to learn unsupervised structures through language modeling. Like ON-LSTM, PRPN is also a syntactic-distance method. We modify the PRPN model in the same spirit as ON-LSTM. In addition, we change the encoding layer and use its output as syntactic distance embeddings l_syd. We then map l_syd to two sets of syntactic distances, d_lm and d_syd, for language modeling and syntactic distance prediction, respectively. Syntactic supervision is applied to d_syd.
The model reaches a test perplexity of 60.5 on PTB-Concat, significantly outperforming the 62.0 of the original model (p-value < 0.05). We refer readers to Appendix A.4 for details of PRPN and our modified PRPN-SYD.

Conclusion
We investigated linguistic supervision for distance-based structure-aware language models, showing its strengths over transition-based counterparts in language modeling. Apart from achieving strong perplexity scores, our model reveals several interesting aspects of the quality of the trees it learns. As a by-product of our investigation, we release a version of PTB-Concat that contains syntactic structures while keeping the same pre-processing steps adopted by most previous work on neural language models.

A Appendices
A.1 Algorithms for transformation between parse trees and syntactic distances
The following tree-to-distance algorithm (Algorithm 1) produces a set of distances given a tree; node denotes the root node of the given tree.
Algorithm 1: Binary Parse Tree to Distance (∪ represents the concatenation operator of lists); the full pseudocode is omitted here. The following distance-to-tree conversion algorithm (Algorithm 2) provides an unbiased reconstruction of the tree given a set of distances.
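As a stand-in for the omitted pseudocode, here is a minimal Python sketch of the unbiased distance-to-tree reconstruction: split the span at the position of the maximum distance and recurse on both halves. Tie-breaking here takes the first maximum, which is one illustrative choice; the biased variant of Appendix A.2 would instead favor right-branching splits.

```python
# Unbiased distance-to-tree sketch: recursively split at the largest distance.
def distances_to_tree(words, dists):
    if len(words) == 1:
        return words[0]
    split = max(range(len(dists)), key=lambda i: dists[i])   # highest boundary
    left = distances_to_tree(words[:split + 1], dists[:split])
    right = distances_to_tree(words[split + 1:], dists[split + 1:])
    return (left, right)

print(distances_to_tree(["the", "cat", "sat", "."], [2, 3, 4]))
# ((('the', 'cat'), 'sat'), '.')
```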
additional grammar trees retained from the original PTB dataset. The resulting dataset then becomes PTB-Concat-Syn.

A.4.1 Parsing-Reading-Predict Network (PRPN)
PRPN builds upon the assumption that to predict a word x_i, we only need information about all of its preceding siblings in the constituent tree. The model consists of three components: (i) a parsing network that calculates the syntactic distances and parsing gates, (ii) a reading network that models the language, and (iii) a predict network that predicts the next word.
PRPN first uses a two-layer convolutional network to calculate the syntactic distance d_t at timestep t:

h_t = \mathrm{ReLU}(W_c [e_{t-L}; \ldots; e_t] + b_c), \qquad d_t = \mathrm{ReLU}(W_d h_t + b_d),

where e_{t-L}, ..., e_t are word embeddings and L is the look-back range.
The difference between distances is then fed through a hardtanh to model the degree α^t_j to which two words x_t and x_j are related:

\alpha^t_j = \frac{\mathrm{hardtanh}\big((d_t - d_j)\,\tau\big) + 1}{2},

where hardtanh(x) = max(−1, min(1, x)) and τ is the temperature parameter.
For word x_t, the first preceding word x_i with a small value of α^t_i indicates that x_i and all of its predecessors are unlikely to be siblings of x_t. The following parsing gate g^t_i models the probability of x_i and x_t being siblings:

g^t_i = \prod_{j=i+1}^{t-1} \alpha^t_j.

The reading network is a variant of the Long Short-Term Memory-Network (LSTMN) (Cheng et al., 2016) in which the attention scores are softly truncated by the parsing gates. The predict network utilizes the structure-aware hidden states of the reading network to predict the next word.
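A small sketch of these two quantities, as we read the description above, is given below; the index conventions and the loop-based gate computation are our assumptions rather than the original implementation.

```python
# Sketch of PRPN-style relatedness scores and parsing gates for timestep t.
import torch
import torch.nn.functional as F

def relatedness(d, t, tau=10.0):
    # alpha^t_j for all j < t, squashed into [0, 1]
    return (F.hardtanh((d[t] - d[:t]) * tau) + 1.0) / 2.0

def parsing_gates(d, t, tau=10.0):
    alpha = relatedness(d, t, tau)
    gates = torch.ones(t)
    for i in range(t - 1):
        # g^t_i = product of alpha^t_j for j strictly between i and t
        gates[i] = torch.prod(alpha[i + 1:t])
    return gates
```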

A.4.2 The PRPN-SYD model
We re-designed the parsing network. We use an LSTM to encode each embedding sequence s = (e_0, e_1, ..., e_n). Because the language modeling task prohibits seeing future words, we use a unidirectional LSTM. We stack a convolutional layer on top of the hidden states h_i of the LSTM, which helps gather local syntactic information. Next, the syntactic information learned both locally and globally is integrated using another unidirectional LSTM. We pass the resulting hidden states \hat{h} through two 2-layer fully-connected networks, which output two respective sets of distance scalars, d_lm and d_syd, where d_lm is the distance for language modeling and d_syd is for syntactic distance prediction. For the two sets of distances, we use the same objective functions as described for ONLSTM-SYD.
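A hedged PyTorch sketch of this re-designed parsing network (unidirectional LSTM, causal convolution, second LSTM, and two small MLP heads emitting d_lm and d_syd) is shown below; layer sizes and names are illustrative assumptions, not the released code.

```python
# Illustrative PRPN-SYD parsing network: LSTM -> causal conv -> LSTM -> two heads.
import torch
import torch.nn as nn

class PRPNSYDParser(nn.Module):
    def __init__(self, emb_size, hid_size, kernel=3):
        super().__init__()
        self.lstm1 = nn.LSTM(emb_size, hid_size, batch_first=True)
        self.conv = nn.Conv1d(hid_size, hid_size, kernel, padding=kernel - 1)
        self.lstm2 = nn.LSTM(hid_size, hid_size, batch_first=True)
        self.head_lm = nn.Sequential(nn.Linear(hid_size, hid_size), nn.ReLU(),
                                     nn.Linear(hid_size, 1))
        self.head_syd = nn.Sequential(nn.Linear(hid_size, hid_size), nn.ReLU(),
                                      nn.Linear(hid_size, 1))

    def forward(self, emb):                       # emb: (batch, seq, emb_size)
        h, _ = self.lstm1(emb)
        c = self.conv(h.transpose(1, 2))[..., :h.size(1)].transpose(1, 2)  # crop to keep causality
        h_hat, _ = self.lstm2(torch.relu(c))
        d_lm = self.head_lm(h_hat).squeeze(-1)    # distances for language modeling
        d_syd = self.head_syd(h_hat).squeeze(-1)  # distances supervised by gold trees
        return d_lm, d_syd
```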

A.5 Trees
We visualize a set of sentences (14 sentences in total) and their corresponding trees side by side to contrast the qualitative differences between the model-induced trees and the gold-standard trees. Sentences are selected randomly from the dataset. In each of the following figures, we provide three trees for the same sentence, corresponding to the trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).
Figure 5: Sentence 1 ("boeing is also supposed to send to america west another N twin-engine aircraft as well as a N by year 's end"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 6: Sentence 2 ("that discrepancy hurts quantum badly because its own plants cover only about half of its ethylene needs"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 9: Sentence 5 ("however as expected brazil waited for the crop estimate to come out and then cut the export price of its juice concentrate to about N.N a pound from around N.N"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 10: Sentence 6 ("total advertising linage was modestly lower as classified-ad volume increased while there was softer demand for retail and national ad linage said john curley gannett 's chief executive officer"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 11: Sentence 7 ("it 's turning out to be a real blockbuster mr. sweig said"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 12: Sentence 8 ("the fact that this happened two years ago and there was a recovery gives people some comfort that this wo n't be a problem"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 13: Sentence 9 ("ncnb will also acquire N million of freedom 's assets from the rtc which will require N million in assistance"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 14: Sentence 10 ("when you suggest otherwise you leave the realm of reporting and enter the orbit of speculation"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 15: Sentence 11 ("but not much money was spent on the shows either a situation that encouraged cheap-to-make talk and game shows while discouraging expensive-to-produce dramas"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).

Figure 16: Sentence 12 ("it also drops a provision that would have permitted corporations to use excess pension funds to pay health benefits for current retirees"). Trees induced from the syntactic task (top) and language model task (middle) sets of distances, as well as the gold-standard tree (bottom).