A Span-based Linearization for Constituent Trees

We propose a novel linearization of a constituent tree, together with a new locally normalized model. For each split point in a sentence, our model computes the normalizer over all spans ending with that split point, and then predicts a tree span from them. Compared with global models, our model is fast and parallelizable. Different from previous local models, our linearization method is tied directly to the spans and considers more local features when performing span prediction, which is more interpretable and effective. Experiments on PTB (95.8 F1) and CTB (92.4 F1) show that our model significantly outperforms existing local models and efficiently achieves competitive results with global models.


Introduction
Constituent parsers map natural language sentences to hierarchically organized spans (Cross and Huang, 2016). According to the complexity of their decoders, two types of parsers have been studied: globally normalized models, which normalize the probability of a constituent tree over the whole candidate tree space (e.g., the chart parser of Stern et al. (2017a)), and locally normalized models, which normalize tree probability over smaller subtrees or spans. It is generally believed that global models have better parsing performance (Gaddy et al., 2018). But with the fast development of neural-network-based feature representations (Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017), local models are able to achieve competitive parsing accuracy while enjoying fast training and testing speed, and have thus become an active research topic in constituent parsing.
Locally normalized parsers usually rely on tree decompositions or linearizations. From the perspective of decomposition, the probability of a tree can be factorized, for example, over individual spans. Teng and Zhang (2018) investigate such a model, which predicts a probability for each candidate span. It achieves quite promising parsing results, but the simple local probability factorization still leaves room for improvement. From the perspective of linearization, there are many ways to transform a structured tree into a shallow sequence. As a recent example, Shen et al. (2018) linearize a tree with a sequence of numbers, each of which indicates the syntactic distance of two adjacent words in the tree (i.e., the height of their lowest common ancestor). Similar ideas are also applied in Vinyals et al. (2015), Choe and Charniak (2016) and transition-based systems (Cross and Huang, 2016; Liu and Zhang, 2017a). With tree linearizations, the training time can be further accelerated to O(n), but these parsers often sacrifice a clear connection with the original spans in trees, which makes both features and supervision signals from spans hard to use.
In this work, we propose a novel linearization of constituent trees tied on their span representations. Given a sentence W and its parsing tree T, for each split point after w_i in the sentence, we assign it a parsing target d_i, where (d_i, i) is the longest span ending with i in T. We can show that, for a binary parsing tree, the set {(d_i, i)} includes all left child spans in T. Thus the linearization is actually sufficient to recover a parsing tree of the sentence.
Compared with prior work, the linearization is directly based on tree spans, which might make estimating model parameters easier. We also build a different local normalization from the simple per-span normalization in Teng and Zhang (2018). Specifically, the probability P(d_i|i) is normalized over all candidate split points to the left of i. The more powerful local model helps to further improve parsing performance while retaining fast learning and inference speed (with a greedy heuristic for handling illegal sequences, we achieve O(n log n) average inference complexity).

Figure 1: The process of generating the linearization of the sentence "She loves writing code .". Given an original parsing tree (a), we first convert it to a right binary tree by recursively combining the rightmost two children (b). Then, we represent the tree as a span table and divide it into five parts according to the right boundaries of the spans (c). Green and red circles represent left and right child spans respectively; gray circles represent spans which do not appear in the tree. In each part, there is only one longest span (green circle), so the corresponding value of that part is just the left boundary of that green circle.
We perform experiments on PTB and CTB. The proposed parser significantly outperforms existing locally normalized models, and achieves competitive results with state-of-the-art global models (95.8 F1 on PTB and 92.1 F1 on CTB). We also evaluate how the new linearization helps parse spans of different lengths and types. To summarize, our main contributions include: • Proposing a new linearization which has a clear interpretation (Section 2).
• Building a new locally normalized model with constraints on span scores (Section 3).
• Compared with previous local models, the proposed parser achieves better performance (competitive with global models) and has faster parsing speed (Section 4).

Tree Linearization
We first prepare some notation. Let W = (w_1, w_2, ..., w_n) be a sentence, T be its binary constituent tree, and A_ij → B_ik C_kj be a derivation in T. Denote by (i, j) (0 ≤ i < j ≤ n) the span from w_{i+1} to w_j (for simplicity, we ignore the label of a span). For each split point i, we define its parsing target d_i as the left boundary of the longest span in T ending with i, and call D = (d_1, d_2, ..., d_n) the linearization of T. Clearly, there is only one such linearization for a tree. We also have an equivalent characterization of D, which shows that each span (d_i, i) is a left child span.
Proposition 1. Given a tree T, the set S = {(d_i, i) | i = 1, 2, ..., n} is equal to the set of left child spans in T (together with the root span (0, n)).

Proof. First, for each j < n, there is only one left child span (i, j) ending with j: if (i′, j) were another left child span with i′ ≠ i (say i′ < i), then (i, j), being nested inside (i′, j) with the same right boundary, would have to be a right child span, a contradiction. Therefore |S| = n. Similarly, if i ≠ d_j, then (i, j) is nested inside (d_j, j) with the same right boundary and must be a right child span. Hence each (d_j, j) with j < n is exactly the left child span ending with j, and (d_n, n) = (0, n) is the root span.
Thus we can generate the linearization using Algorithm 1. For a span (i, j) with gold split k, we set d_k = i. Then we recursively calculate the linearizations of the subspans (i, k) and (k, j). Note that the returned linearization D does not contain d_n, so we append zero (d_n = 0 for the root node) to the end as the final linearization. Figure 1 shows the generation process for the sentence "She loves writing code .". From the span table, it is obvious that only one left child span (green circle) ends with each right boundary.
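As a concrete illustration of this generation procedure, here is a minimal Python sketch; the nested-tuple tree representation and function names are our own assumptions, not the paper's data structures.

```python
def linearize(node, d):
    """Fill d[k] = i for every internal span (i, j) whose gold split is k.

    A node is assumed to be (i, j, split, left_child, right_child) for
    internal spans and (i, j) for single-word spans."""
    if len(node) == 2:            # leaf span (i, i+1): nothing to record
        return
    i, j, k, left, right = node
    d[k] = i                      # (i, k) is the longest span ending at k
    linearize(left, d)
    linearize(right, d)


def tree_to_linearization(root, n):
    """Return (d_1, ..., d_n); d_n = 0 corresponds to the root span (0, n)."""
    d = {}
    linearize(root, d)
    d[n] = 0
    return [d[i] for i in range(1, n + 1)]
```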
In the following discussions, we will use D and S interchangeably. Next, we show two properties of a legal D.
Proposition 2. A linearization D = (d_1, ..., d_n) can recover a tree T iff it satisfies: (1) 0 ≤ d_i < i for every i, and d_n = 0; (2) the spans do not cross, i.e., for any i < j, either d_j ≥ i or d_j ≤ d_i.
Proof. The necessity is obvious. We show the sufficiency by induction on the sentence length. When n = 1, the conclusion holds. Assume that for all linearizations of length less than n, properties 1 and 2 lead to a well-formed tree, and consider a linearization of length n.
Define k = max{k′ | d_{k′} = 0, k′ < n}. Since d_1 = 0 (by property 1), k exists. We split the sentence into (0, k) and (k, n), and claim that, after removing (0, n), every span in D lies either in (0, k) or in (k, n); the conclusion then follows by induction. To validate the claim, for k′ < k, property 1 gives d_{k′} < k′ < k, so (d_{k′}, k′) lies in (0, k). For k < k′ < n, property 2 gives either d_{k′} ≥ k or d_{k′} = 0; since k is the largest index below n with target zero, we have d_{k′} ≠ 0, hence d_{k′} ≥ k and (d_{k′}, k′) lies in (k, n). Therefore a tree can be built from D. The tree is also unique, because if two trees T and T′ had the same linearization, then by Proposition 1 they would have the same set of left child spans, so T = T′.
Proposition 2 also suggests a top-down algorithm (Algorithm 2) for performing tree inference given a legal linearization. For a span (i, j) (together with its label), we find the rightmost split k satisfying d_k = i, and then recursively decode the two subtrees over spans (i, k) and (k, j), respectively. When D does not satisfy property 2 (our model can ensure property 1), one solution is to seek a minimum change of D that makes it legal. However, this reduces to a minimum vertex cover problem (regard each span (d_i, i) as a vertex and connect two spans with an edge if they violate property 2). We can also slightly modify Algorithm 2 to perform approximate inference (Section 3.4).
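A minimal Python sketch of this top-down reconstruction for a legal linearization (labels omitted; the tree representation and function name are ours):

```python
def build_tree(d, i, j):
    """Recover the (unlabeled) binary tree over span (i, j) from a legal
    linearization; d[k - 1] stores the parsing target of split point k."""
    if j - i == 1:                         # single word: leaf span
        return (i, j)
    # rightmost split k in (i, j) whose longest span starts at i
    candidates = [k for k in range(i + 1, j) if d[k - 1] == i]
    k = max(candidates)
    return (i, j, build_tree(d, i, k), build_tree(d, k, j))


# Example call for a legal linearization of a 5-word sentence:
# build_tree([0, 1, 2, 1, 0], 0, 5)
```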
Finally, we need to deal with the linearization of non-binary trees. For spans with more than two children, the middle children are neither left nor right child spans, so Proposition 1 might not hold. We therefore recursively combine two adjacent child spans from right to left using an empty label ∅; the tree can then be converted to a binary tree (Stern et al., 2017a). For a unary branch, we treat it as a single span with a new label that concatenates all the labels in the branch.
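A minimal sketch of this right binarization; the (label, children) tree representation is our assumption, and unary collapsing is omitted:

```python
def right_binarize(label, children):
    """Recursively combine the two rightmost children under the empty
    label '∅' so that every node has at most two children.
    A node is a (label, children) pair; leaves have an empty child list."""
    if not children:
        return (label, [])
    kids = [right_binarize(l, c) for l, c in children]
    while len(kids) > 2:
        # merge the two rightmost children under the empty label
        kids = kids[:-2] + [('∅', kids[-2:])]
    return (label, kids)
```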

The Parser
In this section, we introduce our encoder, decoder and inference algorithms in detail. Then we compare our normalization method with two alternatives: global normalization and existing local normalization methods.

Encoder
We represent each word w_i using three pieces of information: a randomly initialized word embedding e_i, a character-based embedding c_i obtained from a character-level LSTM, and a randomly initialized part-of-speech tag embedding p_i. We concatenate these three embeddings to obtain the word representation x_i. To get the representations of the split points, the word representation matrix X = [x_1, x_2, ..., x_n] is first fed into a bidirectional LSTM or a Transformer (Vaswani et al., 2017). Then we calculate the representation of the split point between w_i and w_{i+1} from the encoder outputs (Equation 1). Note that for the Transformer encoder, the forward representation →h_i is calculated in the same way as in Kitaev and Klein (2018a).
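Since Equation (1) itself is not reproduced in this text, the PyTorch sketch below only illustrates one common way of forming split point vectors from a BiLSTM, namely pairing the forward state of w_i with the backward state of w_{i+1}; the exact combination in Equation (1) may differ, and all module and variable names here are ours.

```python
import torch
import torch.nn as nn

class SplitPointEncoder(nn.Module):
    """Sketch: BiLSTM over word vectors x_1..x_n, then one vector per
    split point (including the two sentence boundaries 0 and n)."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, n, input_dim)
        out, _ = self.bilstm(x)                 # (batch, n, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)         # forward / backward halves
        zeros = torch.zeros_like(fwd[:, :1])
        # split point i sits between w_i and w_{i+1}: pair the forward
        # state of w_i with the backward state of w_{i+1}
        fwd_pad = torch.cat([zeros, fwd], dim=1)      # positions 0..n
        bwd_pad = torch.cat([bwd, zeros], dim=1)      # positions 0..n
        return torch.cat([fwd_pad, bwd_pad], dim=-1)  # (batch, n + 1, 2h)
```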

Decoder
Since a split point plays two different roles depending on whether it is the left or the right boundary of a span, we use two different vectors to represent these roles, inspired by Dozat and Manning (2017). Concretely, we use two multi-layer perceptrons to generate the two representations (Equation 2). We then define the score α_ij of span (i, j) using a biaffine attention function (Dozat and Manning, 2017; Li et al., 2019), where W, b_1 and b_2 are model parameters. α_ij measures how likely (i, j) is to be a left child span in the tree.
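The exact parameterization of Equation (2) and the biaffine function is not reproduced here, so the form below (α_ij = l_i^T W r_j + b_1^T l_i + b_2^T r_j) is an assumption consistent with the stated parameters W, b_1 and b_2; all module and variable names are ours.

```python
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    """Sketch: two MLPs give each split point a 'left boundary' and a
    'right boundary' role vector, and a biaffine function scores every
    candidate span (i, j)."""

    def __init__(self, in_dim, role_dim):
        super().__init__()
        self.left_mlp = nn.Sequential(nn.Linear(in_dim, role_dim), nn.ReLU())
        self.right_mlp = nn.Sequential(nn.Linear(in_dim, role_dim), nn.ReLU())
        self.W = nn.Parameter(torch.randn(role_dim, role_dim) * 0.01)
        self.b1 = nn.Parameter(torch.zeros(role_dim))
        self.b2 = nn.Parameter(torch.zeros(role_dim))

    def forward(self, h):                       # h: (batch, n + 1, in_dim)
        l = self.left_mlp(h)                    # left-boundary roles
        r = self.right_mlp(h)                   # right-boundary roles
        # alpha[b, i, j] = l_i^T W r_j + b1^T l_i + b2^T r_j
        bilinear = torch.einsum('bif,fg,bjg->bij', l, self.W, r)
        return bilinear + (l @ self.b1).unsqueeze(2) + (r @ self.b2).unsqueeze(1)
```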
Different from Stern et al. (2017a), which normalizes globally over the probability of the whole tree, and Teng and Zhang (2018), which normalizes locally over each candidate span, we normalize over all spans with the same right boundary j. Thus the probability of span (i, j) being a left child span is defined as

P(i|j) = exp(α_ij) / Σ_{0 ≤ i′ < j} exp(α_i′j).     (3)

Finally, we predict the linearization by taking d_j = argmax_{0 ≤ i < j} P(i|j). For label prediction, we first infer the tree structure from the linearization (Section 3.4). Then we use a multi-layer perceptron to calculate the label probability of each span (i, j).
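Concretely, Equation (3) can be implemented as a masked softmax over left boundaries. The following PyTorch sketch (function names ours) assumes a score matrix alpha with alpha[i, j] = α_ij for a single sentence.

```python
import torch

def left_boundary_log_probs(alpha):
    """alpha: (n + 1, n + 1) span scores.  Returns log P(i | j): for each
    right boundary j, a distribution over all left boundaries i < j."""
    n1 = alpha.size(0)
    i_idx = torch.arange(n1, device=alpha.device).unsqueeze(1)
    j_idx = torch.arange(n1, device=alpha.device).unsqueeze(0)
    mask = i_idx < j_idx                        # only spans with i < j
    masked = alpha.masked_fill(~mask, float('-inf'))
    # normalize over left boundaries i for each column j
    # (column j = 0 has no valid left boundary and is ignored by callers)
    return torch.log_softmax(masked, dim=0)

def predict_linearization(alpha):
    """Greedy prediction: d_j = argmax_{i < j} P(i | j) for j = 1..n."""
    log_p = left_boundary_log_probs(alpha)
    return log_p[:, 1:].argmax(dim=0).tolist()
```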

Training Objective
Given a gold parsing tree T and its linearization (d_1, d_2, ..., d_n), we calculate the loss as the negative log-likelihood

L = − Σ_{i=1}^{n} log P(d_i | i) − Σ_{(i,j) ∈ T} log P(ℓ_{ij} | (i, j)),

where ℓ_{ij} denotes the gold label of span (i, j). The loss function consists of two parts. One is the structure loss, which is defined only on the left child spans. The other is the label loss, which is defined on all the spans in T.
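A minimal sketch of this two-part objective under the definitions above; the label-scorer interface and all names are our assumptions.

```python
import torch
import torch.nn.functional as F

def parser_loss(span_log_probs, label_logits, gold_d, gold_labels):
    """Negative log-likelihood with a structure term and a label term.

    span_log_probs: (n + 1, n + 1) tensor of log P(i | j) (Equation 3).
    gold_d:         list of gold targets (d_1, ..., d_n).
    label_logits:   dict mapping each gold span (i, j) to its label logits.
    gold_labels:    dict mapping each gold span (i, j) to its gold label id.
    """
    # structure loss: only the left child spans (d_j, j) are supervised
    structure = -sum(span_log_probs[d, j]
                     for j, d in enumerate(gold_d, start=1))
    # label loss: every span in the gold tree is supervised
    label = sum(F.cross_entropy(label_logits[s].unsqueeze(0),
                                torch.tensor([gold_labels[s]]))
                for s in gold_labels)
    return structure + label
```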

Tree Inference
To reconstruct the tree structure from the predicted linearization (d_1, d_2, ..., d_n), we must deal with illegal sequences. One solution is to convert an illegal linearization into a legal one and then use Algorithm 2 to recover the tree; however, the optimal conversion is NP-hard, as discussed in Section 2. We therefore propose two approximate reconstruction methods, both of which replace line 5 of Algorithm 2. One is to find the largest k satisfying d_k ≤ i. The other is to find the index k of the smallest d_k (choosing the largest index when there are ties), i.e., k ← argmin_k d_k.
Both methods remain applicable when the linearization is legal, and they have similar performance in our empirical evaluations. The inference time complexity is O(n^2) in the worst case (for unbalanced trees) and O(n log n) on average, the same as Stern et al. (2017a). Finally, instead of reconstructing trees from the linearization sequence (d_1, d_2, ..., d_n), we could use an exact CKY-style decoding algorithm over the probabilities P(i|j) (Equation 3). Specifically, it maximizes the product of left child span probabilities,

G(i, j) = max_{i < k < j} P(i|k) · G(i, k) · G(k, j),

where G(i, j) represents the highest probability of a subtree with root node (i, j) and G(i, i+1) = 1. We can compute G(0, n) with dynamic programming and back-trace the tree accordingly. The complexity is O(n^3).
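To make the approximate reconstruction concrete, the sketch below replaces the split selection of the earlier build_tree sketch with the argmin rule (the second strategy above); the function name is ours, and the behaviour beyond what the text specifies is an assumption.

```python
def build_tree_approx(d, i, j):
    """Approximate top-down reconstruction for possibly illegal
    linearizations: pick the split k in (i, j) with the smallest d_k,
    breaking ties towards the largest index."""
    if j - i == 1:
        return (i, j)
    # argmin over d_k for i < k < j, ties broken by the largest k
    k = max(range(i + 1, j), key=lambda k: (-d[k - 1], k))
    return (i, j, build_tree_approx(d, i, k), build_tree_approx(d, k, j))
```

On a legal linearization the smallest target inside (i, j) equals i, so with ties broken towards the largest index this sketch recovers the same split as the exact rule.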

More Discussions on Normalization
We can compare our locally normalized model (Equation 3) with other probability factorizations of constituent trees (Figure 2). Global normalization (Figure 2(a)) marginalizes over all candidate trees, which requires dynamic programming for decoding. As a local model, our parser is a span-level factorization of the tree probability, and each factor only normalizes over a linear number of items (i.e., the probability of span (i, j) is normalized against the scores of all (i′, j) with i′ < j). It is easier to parallelize and enjoys a much faster parsing speed. We will show that its performance is also competitive with global models. Teng and Zhang (2018) study two locally normalized models over spans, namely the span model and the rule model. The span model simply considers individual spans independently (Figure 2(b)), which may be the finest factorization. Our model lies between it and the global model.
The rule model uses a normalization similar to ours. If it is combined with top-down decoding (Stern et al., 2017a), the two parsers look similar, so we discuss their differences. The rule model takes all ground-truth spans from the gold trees, and for each span (i, j) it computes a probability P((i, j) ← (i, k)(k, j)) for its ground-truth split k. Our parser, on the other hand, factorizes on each word. Therefore, for the same span (i, j), their normalization is constrained within (i, j), while ours is over all i′ < j. The main advantage of our parser is its simpler span representations (they do not depend on parent spans): this makes the parser easy to batch over sentences with different lengths and tree structures, since each d_i can be computed offline before training.

Data and Settings
Datasets and Preprocessing All models are trained on two standard benchmark treebanks: the English Penn Treebank (PTB) (Marcus et al., 1993) and the Chinese Penn Treebank (CTB) 5.1. POS tags are predicted using the Stanford Tagger (Toutanova et al., 2003). To clean the treebanks, we strip the leaf nodes with POS tag -NONE- and delete the root nodes with constituent type ROOT. For evaluation, we use the standard evaluation tool.
For words that appear in the test corpus but not in the training corpus, we replace them with a special token <UNK>. During training, we also replace word w with <UNK> with probability p_unk(w) = z / (z + c(w)), where c(w) is the number of times word w appears in the training corpus; we set z = 0.8375 following Cross and Huang (2016).
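As a small illustration, this replacement probability can be implemented as follows (function name ours):

```python
import random

def maybe_unk(word, counts, z=0.8375):
    """During training, replace word w with <UNK> with probability
    z / (z + c(w)), where counts[w] is its training-set frequency c(w)."""
    p_unk = z / (z + counts[word])
    return '<UNK>' if random.random() < p_unk else word
```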
Hyperparameters We use 100D GloVe (Pennington et al., 2014) embeddings for PTB and 80D structured-skipgram (Ling et al., 2015) embeddings for CTB. For character encoding, we randomly initialize the character embeddings with dimension 64. We use the Adam optimizer with an initial learning rate of 1.0 and epsilon of 10^-9. For the LSTM encoder, we use a hidden size of 1024, with 0.33 dropout on all feed-forward and recurrent connections. For the Transformer encoder, we use the same hyperparameters as Kitaev and Klein (2018a). For the split point representations, we apply two feed-forward networks with 1024-dimensional hidden layers. All dropout in the decoder layer is 0.33. We also use BERT (Devlin et al., 2019) (uncased, 24 layers, 16 attention heads per layer and 1024-dimensional hidden vectors) and take the output of its last layer as pre-trained word embeddings.

Training Details We use PyTorch as our neural network toolkit and run the code on an NVIDIA GeForce GTX Titan Xp GPU and an Intel Xeon E5-2603 v4 CPU. All models are trained for up to 150 epochs with batch size 150 (Zhou and Zhao, 2019). Our code is available at https://github.com/AntNLP/span-linearization-parser.

On the PTB test set, our models (with both LSTM and Transformer encoders) significantly outperform the single locally normalized models. Compared with globally normalized models, our models also outperform the parsers with LSTM encoders and achieve competitive results with the Transformer encoder parsers. With the help of BERT (Devlin et al., 2018), our models with the two encoders both achieve the same performance (95.8 F1) as the best parser (Zhou and Zhao, 2019). Table 3 shows the final results on the CTB test set: our models (92.1 F1) also significantly outperform local models and achieve competitive results among global models.

Main Results
Compared with Teng and Zhang (2018), which normalizes locally over single spans, our model improves by 0.2 F1 on PTB, which shows that normalizing over more spans is indeed better. Our model also significantly outperforms Shen et al. (2018), which predicts the syntactic distances of a tree. This indicates the superiority of our linearization method, which is tied directly to the spans.

Evaluation
To better understand the extent to which our model improves over the locally normalized model of Teng and Zhang (2018), which normalizes over single spans, we run several experiments comparing performance on spans of different lengths and different constituent types.
For a fair comparison, we re-implement their model using the same LSTM encoder as ours. In addition, we drop the LSTM for label prediction and the more complex span representations in their model and use simpler settings. Our re-implementation achieves the same result as they report (92.4 F1). For convenience, we refer to their model as the per-span-normalization (PSN) model in the following.
Influence of Span Length First, we analyse the influence of span length; the results are shown in Figure 3. We find that for spans of lengths within [11, 45], the PSN model only needs to consider a few spans, which is more local, and per-span normalization is enough to handle this situation. For long spans, our model needs to normalize over more spans and the candidate space grows linearly, so the accuracy decreases faster and there is no advantage over the PSN model, which uses the CKY algorithm for inference. For spans of other lengths, our locally normalized method can take all spans with the same right boundary into consideration and add sum-to-one constraints on their scores. As a result, our model outperforms the PSN model even without the help of exact inference.
Influence of Constituent Type Then we compare the accuracy on different constituent types. Our model performs better than the PSN model on most types, especially SBAR, ADJP and QP. When optimizing the representation of a split point, our model can consider all the words before it, which is helpful for predicting some types. For example, when we predict an adjective phrase (ADJP), its representation has fused the information of the preceding words (e.g., a linking verb like "is"), which can narrow the scope of prediction.

Ablation Study
We perform several ablation experiments by modifying the structure of the decoder layer. The results are shown in Table 4.
[Table 6 caption: "Cython" stands for using Cython to optimize the Python code; "w/o tree inference" stands for evaluating without tree inference. The model of Kitaev and Klein (2018a) is run by ourselves, and the other speeds are taken from the original papers.]

First, we delete the two different split point representations described in Equation (2). The resulting drop in performance indicates that distinguishing the representations of the left and right boundaries of a span is really helpful. Then we delete the local normalization over partial spans and only calculate the probability of each span being a left child; the inference algorithm is the same as in our full model. The final result decreases by 0.5 F1, despite an improvement in precision. This might be because our normalization method adds constraints on all the spans with the same right boundary, which is effective when only one of them is correct.
Finally, we try to predict the labels sequentially, which means assigning each split point i a tuple (left_i, right_i), where left_i and right_i are the labels of the longest spans ending and starting with i in the tree, respectively. This turns our model into a sequence labeling model similar to Gómez-Rodríguez and Vilares (2018). However, the performance is very poor, largely due to the loss of structural information in label prediction. Therefore, how to balance efficiency and label prediction accuracy may be a research problem for the future.

Inference Algorithms
We compare the three inference algorithms described in Section 3.4. The results are shown in Table 5. We find that the different inference algorithms have no obvious effect on performance, mainly due to the strong learning ability of our model. We therefore use the third method, which is the most convenient to implement.

Parsing Speed
The parsing speeds of our parser and other parsers are shown in Table 6. Although our inference complexity is O(n log n), our parser is faster than other local models, except Shen et al. (2018), which evaluates without tree inference, and Vilares et al. (2019), which uses a pure sequence tagging framework. This is mainly due to the simplicity of our model and the parallelism of the matrix operations used for structure prediction. Compared with globally normalized parsers like Zhou and Zhao (2019) and Kitaev and Klein (2018a), our model is also faster, even though they optimize their Python code (e.g., with Cython). Other global models like Stern et al. (2017a), which performs inference with O(n^3) complexity, are much slower than ours, which shows the superiority of our linearization in terms of speed.

Related Work
Globally normalized parsers often achieve high performance on constituent parsing due to their search over the global state space (Stern et al., 2017a; Kitaev and Klein, 2018a; Zhou and Zhao, 2019). However, they suffer from high time complexity and are difficult to parallelize. Thus many efforts have been made to improve their efficiency (Vieira and Eisner, 2017).
Recently, the rapid development of encoders (Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017) and pre-trained language models (Devlin et al., 2018) has enabled local models to achieve performance similar to global models. Teng and Zhang (2018) propose two local models, one normalizing over each candidate span and one over each grammar rule. Their models even outperform the global model of Stern et al. (2017a) thanks to better span representations. However, they still need an O(n^3) inference algorithm to reconstruct the final parsing tree.
Meanwhile, much work has studied faster sequential models. Transition-based models predict a sequence of actions and achieve O(n) complexity (Watanabe and Sumita, 2015; Cross and Huang, 2016; Liu and Zhang, 2017a). However, they suffer from error propagation and cannot be parallelized. Sequence labeling models treat tree prediction as a sequence prediction problem (Gómez-Rodríguez and Vilares, 2018; Shen et al., 2018). These models are highly efficient, but their linearizations have no direct relation to the spans, so their performance is much worse than that of span-based models.
We propose a novel linearization method closely tied to the spans and decode the tree with O(n log n) complexity. Compared with Teng and Zhang (2018), we normalize over more spans and thus achieve better performance.
In future work, we will apply graph neural networks (Velickovic et al., 2018; Ji et al., 2019; Sun et al., 2019) to enhance the span representations. Thanks to the properties of our linearization, we can jointly learn constituent parsing and dependency parsing in one graph-based model. In addition, there is also a symmetric right linearization defined on the set of right child spans; we can study how to combine the two linear representations to further improve the performance of the model.

Conclusion
In this work, we propose a novel linearization of constituent trees tied tightly to their spans. In addition, we build a new local normalization method, which adds constraints on all the spans with the same right boundary. Compared with previous local normalization methods, our method is more accurate because it considers more span information, and it retains fast parsing speed thanks to the parallelizable linearization model. Experiments show that our model significantly outperforms existing local models and achieves competitive results with global models.