Efficient Constituency Parsing by Pointing

We propose a novel constituency parsing model that casts the parsing problem into a series of pointing tasks. Specifically, our model estimates the likelihood of a span being a legitimate tree constituent via the pointing score corresponding to the boundary words of the span. Our parsing model supports efficient top-down decoding and our learning objective is able to enforce structural consistency without resorting to the expensive CKY inference. The experiments on the standard English Penn Treebank parsing task show that our method achieves 92.78 F1 without using pre-trained models, which is higher than all the existing methods with similar time complexity. Using pre-trained BERT, our model achieves 95.48 F1, which is competitive with the state-of-the-art while being faster. Our approach also establishes new state-of-the-art in Basque and Swedish in the SPMRL shared tasks on multilingual constituency parsing.


Introduction
Constituency or phrase structure parsing is a core task in natural language processing (NLP) with myriad downstream applications. Therefore, devising effective and efficient algorithms for parsing has been a key focus in NLP.
With the advancements in neural approaches, various neural architectures have been proposed for constituency parsing, as they can effectively encode the input tokens into dense vector representations while modeling the structural dependencies between tokens in a sentence. These include recurrent networks (Dyer et al., 2016; Stern et al., 2017b) and, more recently, self-attentive networks (Kitaev and Klein, 2018).
The parsing methods can be broadly distinguished based on whether they employ a greedy transition-based algorithm or a globally optimized chart parsing algorithm. The transition-based parsers (Dyer et al., 2016; Cross and Huang, 2016; Liu and Zhang, 2017) generate trees autoregressively as a sequence of shift-reduce decisions. Though computationally attractive, these parsers make local decisions at each step, so errors may propagate to subsequent steps, and training suffers from exposure bias. Chart parsing methods, on the other hand, learn scoring functions for subtrees and perform a global search over all possible trees to find the most probable tree for a sentence (Durrett and Klein, 2015; Gaddy et al., 2018; Kitaev and Klein, 2018). In this way, these methods can ensure consistency in predicting structured output. The limitation, however, is that they run slowly, at O(n^3) or higher time complexity.
In this paper, we propose a novel parsing approach that casts constituency parsing into a series of pointing problems (Figure 1). Specifically, our parsing model estimates the pointing score from one word to another in the input sentence, which represents the likelihood of the span covering those words being a legitimate phrase structure (i.e., a subtree in the constituency tree). During training, the likelihoods of legitimate spans are maximized using the cross-entropy loss. This enables our model to enforce structural consistency while avoiding a structured loss that requires expensive O(n^3) CKY inference (Gaddy et al., 2018; Kitaev and Klein, 2018). The training of our model can be fully parallelized without requiring structured inference as in Shen et al. (2018) and Gómez and Vilares (2018). Our pointing mechanism also allows efficient top-down decoding with best- and worst-case running times of O(n log n) and O(n^2), respectively.
In the experiments with English Penn Treebank parsing, our model without any pre-training achieves 92.78 F1, outperforming all existing methods with similar time complexity. With pre-trained BERT (Devlin et al., 2019), our model pushes the F1 score to 95.48, which is on par with the state-of-the-art while supporting faster decoding. Our model also performs competitively on the multilingual parsing tasks in the SPMRL 2013/2014 shared tasks and establishes new state-of-the-art results in Basque and Swedish. We will release our code at https://ntunlpsg.github.io/project/parser/ptrconstituency-parser


Model

Similar to Stern et al. (2017a), we view constituency parsing as the problem of finding a set of labeled spans over the input sentence. Let S(T) denote the set of labeled spans for a parse tree T. Formally, S(T) can be expressed as

S(T) = {((i_t, j_t), l_t)}_{t=1}^{|S(T)|}

where |S(T)| is the number of spans in the tree. Figure 1 shows an example constituency tree and its corresponding labeled span representation. Following the standard practice in parsing (Gaddy et al., 2018; Shen et al., 2018), we convert the n-ary tree into a binary form and introduce a dummy label ∅ for spans that are not constituents in the original tree but are created as a result of binarization. Similarly, the labels in unary chains, corresponding to nested labeled spans, are collapsed into unique atomic labels, such as S-VP in Fig. 1.
Although our method shares the same "span-based" view with that of Stern et al. (2017a), our approach diverges significantly from their framework in the way we treat the whole parsing problem and in the representation and modeling of the spans, as we describe below.

Parsing as Pointing
In contrast to previous approaches, we cast parsing as a series of pointing decisions. For each index i in the input sequence, the parsing model points it to another index p_i in order to identify the tree span (i, p_i), where i ≠ p_i. Similar to Pointer Networks (Vinyals et al., 2015a), each pointing decision is modeled as a multinomial distribution over the indices of the input tokens (or encoder states). However, unlike the original pointer network, where a decoder state points to an encoder state, in our approach every encoder state h_i points to another encoder state h_{p_i}.
In this paper, we generally use x → y to mean x points to y. We will refer to the pointing operation either as a function of the encoder states (e.g., h_i → h_{p_i}) or simply of the corresponding indices (e.g., i → p_i). Both denote the same operation, where the pointing function takes the encoder state h_i as the query vector and points to h_{p_i} by computing an attention distribution over all the encoder states.
Let P(T) denote the set of pointing decisions derived from a tree T by a transformation H, i.e., H : T → P(T). For the parsing process to be valid, the transformation H and its inverse H^{-1}, which transforms P(T) back to T, should both be one-to-one mappings. Otherwise, the parsing model may confuse two different parse trees that share the same pointing representation. In this paper, we propose a novel transformation that satisfies this property, as defined by the following proposition (proof provided in the Appendix).
Proposition 1 Given a binary constituency tree T for a sentence containing n tokens, the transformation H converts it into a set of pointing decisions P(T) = {(i → p_i, l_i) : i = 1, ..., n − 1; i ≠ p_i} such that (min(i, p_i), max(i, p_i)) is the largest span that starts or ends at i, and l_i is the label of the nonterminal associated with that span.
To elaborate further, each pointing decision in P(T) represents a specific span in S(T). The pointing i → p_i is directional, while the span that it represents (i′, j′) is non-directional; in other words, there may exist a position i such that i > p_i, while i′ < j′ for all i′, j′ ∈ [1, n]. In fact, it is easy to see that if the token at index i is a left-child of a subtree, the largest span involving i starts at i; in this case i < p_i and i′ = i, j′ = p_i. On the other hand, if the token is a right-child of a subtree, the respective largest span ends at position i, in which case i > p_i and i′ = p_i, j′ = i (e.g., see 4 → 2 in Figure 1). In addition, as the spans in S(T) are unique, it can be shown that the pointing decisions in P(T) are also distinct from one another (see Appendix for a proof by contradiction).

Algorithm 1 Convert binary tree to pointing representation
Input: Binary tree T and its span representation S(T)
Output: Pointing representation P(T)
for i = 1, ..., n − 1 do
    node ← leaf(i)
    while node is not the root and node.parent.span starts or ends with i do
        l_i ← node.parent.label        ▷ The span's label
        node ← node.parent
        (x, y) ← node.span             ▷ Span covered by node
    end while                          ▷ Until i is no longer start/end point
    p_i ← x + y − i                    ▷ The other endpoint of the span
    push(P(T), (i → p_i, l_i))
end for
return P(T)
Given such a pointing formulation, for every constituency tree there exists the trivial case (1 → n, l_1), where p_1 = n and l_1 is generally 'S'. Thus, to make our formulation more general, with n inputs and n outputs, and convenient for the method description later on, we add another trivial case (n → 1, l_1). With this generalization, we can represent the pointing decisions of any binary constituency tree T as:

P(T) = {(i → p_i, l_i) : i = 1, ..., n}

The pointing representation of the tree in Figure 1 is given at the bottom of the figure. To illustrate: in the parse tree, the largest phrase that starts or ends at token 2 ('enjoys') is the subtree rooted at '∅', which spans from 2 to 5; in this case, the span starts at token 2. Similarly, the largest phrase that starts or ends at token 4 ('tennis') is the span "enjoys playing tennis", which is rooted at 'VP'; in this case, the span ends at token 4.
Algorithm 1 describes the procedure to convert a binary tree into its corresponding pointing representation. Specifically, from each leaf token i, the algorithm traverses upward along the hierarchy until it reaches a node whose span does not start or end with i. In this way, the largest span starting or ending with i is identified.
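To make the conversion concrete, here is a small Python sketch of this procedure (a simplification for illustration, not the authors' released code): leaves are integer token indices, internal nodes are (label, left, right) tuples, and each p_i is recovered by scanning all spans rather than walking parent pointers.

```python
def tree_spans(tree):
    """Collect (start, end, label) for every internal node.
    A leaf is an int token index; an internal node is (label, left, right)."""
    out = []
    def walk(node):
        if isinstance(node, int):
            return node, node
        label, left, right = node
        start, _ = walk(left)
        _, end = walk(right)
        out.append((start, end, label))
        return start, end
    walk(tree)
    return out

def tree_to_pointing(tree, n):
    """P(T): point each i = 1..n-1 to the far end of the largest span
    that starts or ends at i, carrying that span's label (Proposition 1)."""
    spans = tree_spans(tree)
    pointing = {}
    for i in range(1, n):
        s, e, l = max((sp for sp in spans if sp[0] == i or sp[1] == i),
                      key=lambda sp: sp[1] - sp[0])
        pointing[i] = (e if s == i else s, l)
    return pointing

# The running example "She enjoys playing tennis ." (Figure 1, binarized);
# "@" stands in for the dummy label.
tree = ("S", 1, ("@", ("VP", 2, ("S-VP", 3, 4)), 5))
```

On this tree, the procedure recovers exactly the pointings discussed above, e.g., 2 → 5 with label ∅ and 4 → 2 with label VP.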

Top-Down Tree Inference
In the previous section, we described how to convert a constituency tree T into a set of pointing decisions P(T). We use this transformation to train the parsing model (described in detail in Sections 2.3-2.4). During inference, given a sentence to parse, our decoder, with the help of the parsing model, predicts P(T), from which we can construct the tree T. However, not every set of pointings P(T) guarantees the generation of a valid tree. For example, for a sentence with four (4) tokens, the pointing P(T) = {(1 → 4, l_1), (2 → 3, l_2), (3 → 4, l_3), (4 → 1, l_1)} does not generate a valid tree because token 3 cannot belong to both spans (2, 3) and (3, 4). In other words, simply taking the arg max over the pointing distributions may not generate a valid tree.
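The underlying constraint is that the spans of a tree must nest: any two spans are either disjoint or one contains the other. A small illustrative check (a hypothetical helper, not part of the model):

```python
def spans_nest(spans):
    """True iff every pair of spans is either disjoint or nested,
    i.e., the spans could all belong to one constituency tree."""
    for (a, b) in spans:
        for (c, d) in spans:
            overlap = max(a, c) <= min(b, d)
            nested = (a <= c and d <= b) or (c <= a and b <= d)
            if overlap and not nested:
                return False
    return True
```

The invalid example above fails this check because spans (2, 3) and (3, 4) cross at token 3.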
Our approach to decoding is inspired by the span-based approach of Stern et al. (2017a). In particular, to reduce the search space, we score for span identification (given by the pointing function) and label assignment separately.
Span Identification. We adopt a top-down greedy approach, formulated as follows:

k* = argmax_{i ≤ k < j} s_split(i, k, j)    (3)

where s_split(i, k, j) is the score of placing a split-point at position k (i ≤ k < j), defined as:

s_split(i, k, j) = ρ(k → i) + ρ(k+1 → j)    (4)

where ρ(k → i) and ρ(k+1 → j) are the pointing scores (probabilities) for the spans (i, k) and (k+1, j), respectively. Note that the pointing scores are asymmetric: ρ(i → j) may not equal ρ(j → i), because pointing from i to j is different from pointing from j to i. This differs from previous approaches, where the score of a span is defined to be symmetric. We build a tree for the input sentence by computing Eq. 3 recursively, starting from the full sentence span (1, n).
In the general case when i < k < j − 1, our pointing-based parsing model should learn to assign high scores to the two spans (i, k) and (k+1, j), or equivalently to the pointing decisions k → i and k+1 → j. However, the pointing formulation described so far omits the trivial self-pointing decisions, which represent the singleton spans. A singleton span is only created when the splitting decision splits a span into a single-token span (singleton span) and a sub-span containing the rest, i.e., when k = i or k = j − 1. For instance, in the parsing process of Figure 2a, the splitting decision at the root span (1, 5) results in a singleton span (1, 1) and a general span (2, 5). For this splitting decision, Eq. 3 requires the scores of (1, 1) and (2, 5). However, the set of pointing decisions P(T) does not cover the pointing for (1, 1). This discrepancy can be resolved by modeling the singleton spans separately. To achieve that, we redefine the split score s_split as:

s_split(i, k, j) =
    sp(k → k) + gp(k+1 → j),      if k = i
    gp(k → i) + sp(k+1 → k+1),    if k = j − 1
    gp(k → i) + gp(k+1 → j),      otherwise    (5)

where sp and gp respectively represent the scores of the singleton and general pointing functions (to be defined formally in Section 2.3).
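As a concrete (illustrative) rendering of Eqs. 3-5, the sketch below treats gp and sp as lookup tables of pointing probabilities keyed by (from, to) index pairs and combines the two sides additively; in the trained model these scores would come from the pointing heads.

```python
def split_score(gp, sp, i, k, j):
    """s_split(i, k, j): score of splitting span (i, j) into (i, k), (k+1, j).
    A side that becomes a singleton span uses the self-pointing score sp."""
    left = sp[(k, k)] if k == i else gp[(k, i)]
    right = sp[(k + 1, k + 1)] if k == j - 1 else gp[(k + 1, j)]
    return left + right

def best_split(gp, sp, i, j):
    """Eq. 3: argmax over candidate split points i <= k < j."""
    return max(range(i, j), key=lambda k: split_score(gp, sp, i, k, j))
```

For example, for span (1, 3) with sp = {(1, 1): 0.2, (3, 3): 0.9} and gp = {(2, 3): 0.3, (2, 1): 0.1}, splitting at k = 2 scores 0.1 + 0.9 = 1.0 and wins over k = 1.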
Remark on structural consistency. It is important to note that since the pointing functions are defined to have a global structural property (i.e., the largest span that starts/ends with i), our model inherently enforces structural consistency. The pointing formulation of the parsing problem also makes the training process simple and efficient; it allows us to train the model effectively with simple cross entropy loss (see Section 2.4).
Label Assignment. Label assignment of spans is performed after every split decision. Specifically, as we split a span (i, j) into two sub-spans (i, k) and (k+1, j), which correspond to the pointing functions k → i and k+1 → j, we perform the label assignments for the two new sub-spans as:

l_{(i,k)} = argmax_{l ∈ L} gc(l|k);    l_{(k+1,j)} = argmax_{l ∈ L} gc(l|k+1)    (6)

where gc is the label classifier for any general (non-unary) span and L is the set of possible non-terminal labels. Following Shen et al. (2018), we use a separate classifier uc for determining the labels of the unary spans, e.g., the first layer of labels (NP, ∅, ..., NP, ∅) in Figure 2. Also note that the label assignment is based only on the query vector (the encoder state that is used to point).

Algorithm 2 Pointing parsing algorithm
Input: Sentence length n; pointing scores gp(i, j), sp(i, j); label scores gc(l|i), uc(l|i), 1 ≤ i ≤ j ≤ n, l ∈ L_g/L_u
Output: Parse tree T
Q ← [(1, n)]                                         ▷ queue of spans
S ← [(1, n, argmax_l gc(l|1))]                       ▷ general spans with labels
U ← {((t, t), argmax_l uc(l|t))} for t = 1, ..., n   ▷ unary spans with labels
while Q ≠ ∅ do
    (i, j) ← pop(Q)
    k ← best split point of (i, j) using gp, sp      ▷ Eqs. 3-5
    if k > i then                                    ▷ left sub-span is general
        push(Q, (i, k)); push(S, (i, k, argmax_l gc(l|k)))
    end if
    if k + 1 < j then                                ▷ right sub-span is general
        push(Q, (k + 1, j)); push(S, (k + 1, j, argmax_l gc(l|k + 1)))
    end if
end while
T ← S ∪ U

Figure 2 illustrates the top-down parsing process for our running example. It consists of a sequence of pointing decisions (Figure 2a, top to bottom), which are then trivially converted to the parse tree (Figure 2b). We also provide the pseudocode in Algorithm 2. Specifically, the algorithm finds the best split for the current span (i, j) using the pointing scores and pushes the newly created sub-spans into the FIFO queue Q; singleton sub-spans are already labeled in U and need no further splitting. The process terminates when there are no more spans to be split. Similar to Stern et al. (2017a), our parsing algorithm has worst- and best-case time complexities of O(n^2) and O(n log n), respectively.
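The loop of Algorithm 2 can be sketched in Python as follows; `best_split`, `gc`, and `uc` are assumed to be callables wrapping the trained scoring functions (illustrative stand-ins, not the released implementation):

```python
from collections import deque

def parse(n, best_split, gc, uc):
    """Top-down decoding: repeatedly split general spans until only
    singletons remain. Returns the labeled spans T = S u U."""
    Q = deque([(1, n)])                              # FIFO queue of spans to split
    S = [(1, n, gc(1))]                              # general spans with labels
    U = [((t, t), uc(t)) for t in range(1, n + 1)]   # unary (singleton) spans
    while Q:
        i, j = Q.popleft()
        k = best_split(i, j)
        if k > i:                                    # left sub-span is general
            Q.append((i, k)); S.append((i, k, gc(k)))
        if k + 1 < j:                                # right sub-span is general
            Q.append((k + 1, j)); S.append((k + 1, j, gc(k + 1)))
    return S + U

# Decoding the running example with oracle split points and labels
# ("@" stands in for the dummy label):
splits = {(1, 5): 1, (2, 5): 4, (2, 4): 2, (3, 4): 3}
glabels = {1: "S", 2: "@", 3: "S-VP", 4: "VP"}
ulabels = {1: "NP", 2: "@", 3: "@", 4: "NP", 5: "@"}
T = parse(5, lambda i, j: splits[(i, j)], glabels.get, ulabels.get)
```

With these oracle scores the loop reproduces the spans of Figure 2: (1, 5) splits at 1, (2, 5) at 4, (2, 4) at 2, and (3, 4) at 3.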

Model Architecture
We now describe the architecture of our parsing model: the sentence encoder, the pointing model and the labeling model.
Sentence Encoder. Each word x_i of the input sentence X = (x_1, ..., x_n) is first mapped to a representation e_i that combines e_i^char, e_i^word and e_i^pos — respectively the character, word, and part-of-speech (POS) embeddings of the word x_i. Following Kitaev and Klein (2018), we use a character LSTM to compute the character embedding of a word. We experiment with both randomly initialized and pre-trained word embeddings. If pre-trained embeddings are used, the word embedding e_i^word is the sum of the word's randomly initialized embedding and the pre-trained embedding. The POS embeddings (e_i^pos) are randomly initialized. The word representations (e_i) are then passed to a neural sequence encoder to obtain their hidden representations. Since our method does not require any specific encoder, one may use any encoder model, such as a Bi-LSTM (Hochreiter and Schmidhuber, 1997) or a self-attentive encoder (Kitaev and Klein, 2018). In this paper, unless otherwise specified, we use the self-attentive encoder as our main sequence encoder because of its efficiency with parallel computation. The model factorizes content and position information in both the self-attention sub-layer and the feed-forward layer; details of this factorization are provided in Kitaev and Klein (2018).

Figure 2: Inferring the parse tree for a given sentence and its part-of-speech (POS) tags (predicted by an external POS tagger). (a) Execution of the pointing parsing algorithm; (b) output parse tree. Starting with the full sentence span (1, 5) and its label S, we predict split point 1 using the base (sp) and general (gp) pointing scores as per Eqs. 3-5. The left singleton span (1, 1) is assigned the label NP and the right span (2, 5) the label ∅ using the label classifier gc as per Eq. 6. The recursion of splitting and labeling continues until the process reaches a terminal node. The label assignment for the unary spans is done by the uc classifier.
Pointing and Labeling Models. The outputs of the aforementioned sequence encoder are used to compute the pointing and labeling scores. More formally, the encoder network produces a sequence of n latent vectors H = (h_1, ..., h_n) for the input sequence X = (x_1, ..., x_n). After that, we apply four (4) separate position-wise two-layer feed-forward networks (FFN), formulated as FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2, to transform H into task-specific latent representations for the respective pointing and labeling tasks.
Note that there is no parameter sharing between FFN_gp, FFN_sp, FFN_gc and FFN_uc. The pointing functions are then modeled as multinomial (or attention) distributions over the input indices for each input position i:

gp(·|i) = softmax(h_i^gp (H^gp)^T),    sp(·|i) = softmax(h_i^sp (H^sp)^T)

where H^gp = (h_1^gp, ..., h_n^gp) and H^sp = (h_1^sp, ..., h_n^sp) are the task-specific representations produced by FFN_gp and FFN_sp, respectively.
For the label assignment functions, we feed the label representations H^gc = (h_1^gc, ..., h_n^gc) and H^uc = (h_1^uc, ..., h_n^uc) into the respective softmax classification layers:

gc(l|i) = exp(w_l^gc · h_i^gc) / Σ_{l′ ∈ L_g} exp(w_{l′}^gc · h_i^gc),    uc(l|i) = exp(w_l^uc · h_i^uc) / Σ_{l′ ∈ L_u} exp(w_{l′}^uc · h_i^uc)

where L_g and L_u are the sets of possible labels for the general and unary spans, respectively, and w_l^gc and w_l^uc are the class-specific trainable weight vectors.
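As an illustrative NumPy sketch of these heads (toy dimensions, random untrained weights, and a dot-product scoring assumption — the paper's exact parameterization may differ), the code below shows the key property used during decoding: each position's pointing scores form a proper distribution over indices, and the distributions are asymmetric because each row is normalized separately.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_labels = 5, 8, 4
H = rng.standard_normal((n, d))            # encoder states h_1..h_n

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN(x) = ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# separate FFNs per head, no parameter sharing (gp and gc shown;
# sp and uc are analogous)
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
V1, V2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
H_gp = ffn(H, W1, np.zeros(d), W2, np.zeros(d))   # pointing representations
H_gc = ffn(H, V1, np.zeros(d), V2, np.zeros(d))   # labeling representations

# pointing: position i attends over all positions -> distribution over indices
gp = softmax(H_gp @ H_gp.T)                # gp[i, j] ~ rho(i -> j)

# labeling: class weights score the query state alone
W_gc = rng.standard_normal((n_labels, d))
gc = softmax(H_gc @ W_gc.T)                # gc[i, l] = gc(l | i)
```

Each row of `gp` and `gc` sums to 1, and `gp` is generally not symmetric, matching the asymmetry of ρ(i → j) vs. ρ(j → i) noted in Section 2.2.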

Training Objective
We train our parsing model by minimizing the total loss L_total(θ) defined as:

L_total(θ) = L_gp(θ_e, θ_gp) + L_sp(θ_e, θ_sp) + L_gc(θ_e, θ_gc) + L_uc(θ_e, θ_uc)    (14)

where each individual loss is a cross-entropy loss computed for the corresponding pointing or labeling task, and θ = {θ_e, θ_gp, θ_sp, θ_gc, θ_uc} represents the overall model parameters; specifically, θ_e denotes the encoder parameters shared by all components, while θ_gp, θ_sp, θ_gc and θ_uc denote the separate parameters of the four pointing and labeling functions gp, sp, gc and uc, respectively.
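Under the assumption that each head outputs a categorical distribution per position, the objective of Eq. 14 can be sketched as follows (illustrative helper names, not the released training code):

```python
import math

def cross_entropy(dist, gold):
    """Mean negative log-likelihood of the gold targets;
    dist[i] is a probability vector over candidates for position i."""
    return -sum(math.log(dist[i][g]) for i, g in enumerate(gold)) / len(gold)

def total_loss(preds, golds):
    """Eq. 14: sum of the four per-task cross-entropy losses."""
    return sum(cross_entropy(preds[t], golds[t])
               for t in ("gp", "sp", "gc", "uc"))
```

Because every term is a simple per-position cross entropy, the four losses can be computed for all positions in parallel, with no structured inference inside the training loop.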

Experiments
To show the effectiveness of our approach, we conduct experiments on English and multilingual parsing tasks. For English, we use the standard Wall Street Journal (WSJ) part of the Penn Treebank (PTB) (Marcus et al., 1993), whereas for the multilingual setting, we experiment with seven (7) different languages from the SPMRL 2013-2014 shared tasks (Seddah et al., 2013): Basque, French, German, Hungarian, Korean, Polish and Swedish. For evaluation on PTB, we report the standard labeled precision (LP), labeled recall (LR), and labeled F1 computed by evalb. For the SPMRL datasets, we report labeled F1 and use the same evalb setup as Kitaev and Klein (2018).

English (PTB) Experiments
Setup. We follow the standard train/valid/test split, which uses sections 2-21 for training, section 22 for development and section 23 for evaluation. This gives 45K sentences for training, 1,700 sentences for development, and 2,416 sentences for testing. Following previous studies, our model uses POS tags predicted by the Stanford tagger (Toutanova et al., 2003).
For our model, we adopt the self-attention encoder with hyperparameter details similar to those proposed by Kitaev and Klein (2018). For the general and unary label classifiers (gc and uc), the hidden dimension of the corresponding position-wise feed-forward networks is 250, while those for the pointing functions (gp and sp) have hidden dimensions of 1024. Our model is trained using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 100 sentences. Additionally, we use 100 warm-up steps, within which we linearly increase the learning rate from 0 to the base learning rate of 0.008. Model selection for testing is performed based on the labeled F1 score on the validation set.
Results for Single Models. The experimental results on PTB for models without pre-training are shown in Table 1. As can be seen, our model achieves an F1 of 92.78, the highest among the models using top-down inference strategies. Specifically, our method outperforms Stern et al. (2017a) and Shen et al. (2018) by about 1.0 point in F1. Notably, our model with an LSTM encoder achieves an F1 of 92.26, which is still better than all the top-down parsing methods. On the other hand, while Kitaev and Klein (2018) and Zhou and Zhao (2019) achieve higher F1 scores, their inference is significantly slower than ours because of their CKY-based algorithms, which run at O(n^3) time complexity for Kitaev and Klein (2018) and O(n^5) for Zhou and Zhao (2019). Furthermore, their training objectives involve a structural hinge loss, which requires online CKY inference during training. This makes their training considerably slower than that of our method, which is trained directly with a span-wise cross-entropy loss. In addition, Zhou and Zhao (2019) use external supervision (head information) from the dependency parsing task. Dependency parsing models, in fact, bear a strong resemblance to the pointing mechanism that our model employs (Ma et al., 2018). As such, integrating dependency parsing information into our model may also be beneficial. We leave this for future work.
Results with Pre-training. Similar to Kitaev and Klein (2018) and subsequent work, we also evaluate our model with BERT (Devlin et al., 2019) embeddings. Following them in the inclusion of contextualized token representations, we adjust the number of self-attentive layers to 2 and the base learning rate to 0.00005.
As shown in Table 2, our model achieves an F1 score of 95.48, which is on par with the state-of-the-art models, while being faster. Specifically, our model runs at O(n^2) worst-case time complexity, while the chart-based state-of-the-art models run at O(n^3). A comparison of parsing speed is given in the following section.
Parsing Speed Comparison. In addition to parsing performance in F1, we also compare our parser against previous neural approaches in terms of parsing speed. We record the parsing time over the 2,416 sentences of the PTB test set with a batch size of 1, on a machine with an NVIDIA GeForce GTX 1080Ti GPU and an Intel(R) Xeon(R) Gold 6152 CPU. This setup is comparable to that of Shen et al. (2018).
As shown in Table 3, our parser processes 19 more sentences per second than Shen et al. (2018), despite the fact that our parsing algorithm runs at O(n^2) worst-case time complexity while the one used by Shen et al. (2018) can theoretically run in O(n log n) time.

Model                      # sents/sec
Petrov and Klein (2007)        6.2
Zhu et al. (2013)             89.5
Liu and Zhang (2017)          79.2
Stern et al. (2017a)          75.5
Kitaev and Klein (2018)       94.40
Shen et al. (2018)           111.1
Our model                    130.2

Table 3: Parsing speed comparison.

To elaborate further, the algorithm presented in Shen et al. (2018), as implemented, runs at O(n^2) complexity: achieving O(n log n) requires sorting the list of syntactic distances, which the provided code does not implement. In addition, the speed-up of our method can be attributed to the fact that our algorithm (see Algorithm 2) uses a while loop, whereas the algorithm of Shen et al. (2018) makes many recursive function calls; recursive implementations tend to be empirically slower than equivalent iterative loops because of function-call and stack overhead.

SPMRL Multilingual Experiments
Setup. As in the English PTB experiments, we use predicted POS tags from external taggers (provided with the SPMRL datasets). The train/valid/test splits are reported in Table 6. For single-model evaluation, we use the same hyperparameters and optimizer setup as for English PTB. For experiments with pre-trained models, we use multilingual BERT (Devlin et al., 2019), which was trained jointly on 104 languages.
Results. The results for the single models are reported in Table 4. Our model achieves the highest F1 scores in Basque and Swedish, exceeding the baselines by 0.52 and 1.37 F1 points, respectively. Our method also performs competitively with the previous state-of-the-art methods on the other languages. Table 5 reports the performance of the models using pre-trained BERT. Evidently, our method achieves state-of-the-art results in Basque and Swedish, and performs on par with the previous best methods in the other five languages. Again, note that our method is considerably faster and easier to train than these methods.

Related Work
Prior to the neural tsunami in NLP, parsing methods typically modeled correlations in the output space through probabilistic context-free grammars (PCFGs) on top of sparse (and discrete) input representations, either in a generative regime, a discriminative regime (Finkel et al., 2008), or a combination of both (Charniak and Johnson, 2005). Besides the chart parser approach, there is also a long tradition of transition-based parsers (Sagae and Lavie, 2005). Recently, however, with the advent of powerful neural encoders such as LSTMs (Hochreiter and Schmidhuber, 1997), the focus has shifted toward effective modeling of correlations in the input's latent space, as the output structures are nothing but a function of the input (Gaddy et al., 2018). Various neural network models have been proposed to effectively encode the dense input representations and correlations, and have achieved state-of-the-art parsing results. To enforce structural consistency, existing neural parsing methods either employ a transition-based algorithm (Dyer et al., 2016; Liu and Zhang, 2017) or a globally optimized chart-parsing algorithm (Gaddy et al., 2018; Kitaev and Klein, 2018).
Meanwhile, researchers have also attempted to convert the constituency parsing problem into tasks that can be solved in alternative ways. For instance, Fernández-González and Martins (2015) transform the phrase structure into a special form of dependency structure; such a dependency structure, however, requires certain corrections when converting back to the corresponding constituency tree. Gómez and Vilares (2018) and Shen et al. (2018) propose to map the constituency tree of a sentence with n tokens into a sequence of n − 1 labels or scalars based on the depth or height of the lowest common ancestors of pairs of consecutive tokens. In addition, methods like those of Vinyals et al. (2015b) and Vaswani et al. (2017) apply the sequence-to-sequence framework to "translate" a sentence into the linearized form of its constituency tree. While simple, parsers of this type do not guarantee structural correctness, because the syntax of the linearized form is not constrained during tree decoding.
Our approach differs from previous work in that it represents the constituency structure as a series of pointing representations and has a relatively simpler cross entropy based learning objective. The pointing representations can be computed in parallel, and can be efficiently converted into a full constituency tree using a top-down algorithm. Our pointing mechanism shares certain similarities with the Pointer Network (Vinyals et al., 2015a), but is distinct from it in that our method points a word to another word within the same encoded sequence.

Conclusion
We have presented a novel constituency parsing method that is based on a pointing mechanism. Our method utilizes an efficient top-down decoding algorithm that uses pointing functions to score possible spans. The pointing formulation inherently captures global structural properties and allows efficient training with a cross-entropy loss. Our experiments show that our method outperforms all existing top-down methods on the English Penn Treebank parsing task. With pre-training, our method rivals the state-of-the-art while being faster. On multilingual constituency parsing, it also establishes new state-of-the-art results in Basque and Swedish.