Straight to the Tree: Constituency Parsing with Neural Syntactic Distance

In this work, we propose a novel constituency parsing scheme. The model first predicts a real-valued scalar, named the syntactic distance, for each split position in the sentence. The topology of the parse tree is then determined by the values of the syntactic distances. Compared to traditional shift-reduce parsing schemes, our approach is free from potentially disastrous compounding errors. It is also easier to parallelize and much faster. Our model achieves a single-model F1 score of 91.8 on the PTB dataset and a state-of-the-art single-model F1 score of 86.5 on the CTB dataset, the latter surpassing previous single-model results by a large margin.


Introduction
Devising fast and accurate constituency parsing algorithms is an important, long-standing problem in natural language processing. Parsing has been useful for incorporating linguistic priors in several related tasks, such as relation extraction, paraphrase detection (Callison-Burch, 2008), and more recently, natural language inference (Bowman et al., 2016) and machine translation (Eriguchi et al., 2017).
Neural-network-based approaches relying on dense input representations have recently achieved competitive results for constituency parsing (Vinyals et al., 2015; Cross and Huang, 2016; Liu and Zhang, 2017b; Stern et al., 2017a). Generally speaking, these approaches either produce the parse tree sequentially, by governing the sequence of transitions in a transition-based parser (Nivre, 2004; Zhu et al., 2013; Chen and Manning, 2014; Cross and Huang, 2016), or use a chart-based approach, estimating non-linear potentials and performing exact structured inference by dynamic programming (Finkel et al., 2008; Durrett and Klein, 2015; Stern et al., 2017a).

Figure 1: An example of how syntactic distances (d1 and d2) describe the structure of a parse tree: consecutive words with a larger predicted distance are split earlier than those with smaller distances, in a process akin to divisive clustering.
Transition-based models decompose the structured prediction problem into a sequence of local decisions. This enables fast greedy decoding, but also leads to compounding errors, because the model is never exposed to its own mistakes during training (Daumé et al., 2009). Solutions to this problem usually complexify the training procedure by using structured training through beam search (Weiss et al., 2015; Andor et al., 2016) and dynamic oracles (Goldberg and Nivre, 2012; Cross and Huang, 2016). On the other hand, chart-based models can incorporate structured loss functions during training and benefit from exact inference via the CYK algorithm, but suffer from higher computational cost during decoding (Durrett and Klein, 2015; Stern et al., 2017a).
In this paper, we propose a novel, fully parallel model for constituency parsing, based on the concept of "syntactic distance", recently introduced by Shen et al. (2017) for language modeling. To construct a parse tree from a sentence, one can proceed in a top-down manner, recursively splitting larger constituents into smaller ones, where the order of the splits defines the hierarchical structure. The syntactic distances are defined for each possible split point in the sentence, and the ordering they induce fully specifies the order in which the sentence is recursively split into smaller constituents (Figure 1): in the case of a binary tree, there exists a one-to-one correspondence between this ordering and the tree. Therefore, our model is trained to reproduce the ordering between split points induced by the ground-truth distances, by means of a margin rank loss (Weston et al., 2011).

Crucially, our model works in parallel: the estimated distance for each split point is produced independently from the others, which allows for easy parallelization on modern parallel computing architectures for deep learning, such as GPUs. Along with the distances, we also train the model to produce the constituent labels, which are used to build the fully labeled tree. Since the model is fully parallel, it does not require computationally expensive structured inference during training. Mapping from syntactic distances to a tree can be done efficiently in O(n log n), which makes decoding computationally attractive. Despite our strong conditional independence assumption on the output predictions, we achieve good performance for single-model discriminative parsing on PTB (91.8 F1) and CTB (86.5 F1), matching, and sometimes outperforming, recent chart-based and transition-based parsing models.

Syntactic Distances of a Parse Tree
In this section, we start from the concept of syntactic distance introduced in Shen et al. (2017) for unsupervised parsing via language modeling, and we extend it to the supervised setting. We propose two algorithms: one to convert a parse tree into a compact representation based on distances between consecutive words, and another to map the inferred representation back to a complete parse tree. The representation will later be used for supervised training.

Algorithm 1: Binary Parse Tree to Distance.

We formally define the syntactic distances of a parse tree as follows:

Definition 2.1. Let T be a parse tree that contains a set of leaves (w_0, ..., w_n). The height of the lowest common ancestor of two leaves (w_i, w_j) is denoted d̃_i^j. The syntactic distances of T can be any vector of scalars d = (d_1, ..., d_n) that satisfy

sign(d_i − d_j) = sign(d̃_{i−1}^i − d̃_{j−1}^j), for all i, j ∈ [1, n].

In other words, d induces the same ranking order as the quantities d̃_i^j computed between pairs of consecutive words in the sequence, i.e. (d̃_0^1, ..., d̃_{n−1}^n). Note that there are n − 1 syntactic distances for a sentence of length n.
Example 2.1. Consider the tree in Fig. 1, for which d̃_0^1 = 2 and d̃_1^2 = 1. An example of valid syntactic distances for this tree is any d = (d_1, d_2) such that d_1 > d_2.
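To make Definition 2.1 concrete, the following minimal Python sketch computes one valid distance vector (the subtree heights) from a binary tree, which is the essence of the tree-to-distance direction. The nested-tuple tree encoding and the function name are our own illustrative assumptions, not part of the paper.

```python
# A binary tree is a nested 2-tuple; a leaf is a plain string.
# tree_to_distances returns the in-order list of inner-node heights,
# which is one valid syntactic distance vector per Definition 2.1.

def tree_to_distances(tree):
    """Return (distances, height) for a binary tree of nested 2-tuples."""
    if isinstance(tree, str):          # leaf: no split point, height 0
        return [], 0
    left, right = tree
    d_l, h_l = tree_to_distances(left)
    d_r, h_r = tree_to_distances(right)
    height = max(h_l, h_r) + 1
    # the split point between the two children gets this node's height
    return d_l + [height] + d_r, height

# The tree of Example 2.1, (w0 (w1 w2)): d = [2, 1], so d1 > d2.
distances, _ = tree_to_distances(("w0", ("w1", "w2")))
```

Any strictly-increasing transformation of these heights would also be a valid distance vector, since only the ranking matters.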
Given this definition, the parsing model predicts a sequence of scalars, which is a more natural setting for models based on neural networks, rather than predicting a set of spans. For comparison, in most of the current neural parsing methods, the model needs to output a sequence of transitions (Cross and Huang, 2016;Chen and Manning, 2014).
Let us first consider the case of a binary parse tree. Algorithm 1 provides a way to convert it into a tuple (d, c, t), where d contains the heights of the inner nodes of the tree following a left-to-right (in-order) traversal, c the constituent labels for each node in the same order, and t the part-of-speech (POS) tags of each word in left-to-right order. d is a valid vector of syntactic distances satisfying Definition 2.1.

Figure 2: Inferring the parse tree with Algorithm 2 given distances, constituent labels, and POS tags. (a) Boxes at the bottom are words and their corresponding POS tags predicted by an external tagger. The vertical bars in the middle are the syntactic distances, and the brackets on top of them are the labels of constituents. The bottom brackets are the predicted unary labels for each word, and the upper brackets are the predicted labels for the other constituents. (b) The corresponding inferred parse tree. Starting with the full sentence, we pick split point 1 (as it is assigned the largest distance) and assign the label S to the span (0,5). The left child span (0,1) is assigned the tag PRP and the label NP, which produces a unary node and a terminal node. The right child span (1,5) is assigned the label ∅, arising from implicit binarization, which indicates that the span is not a real constituent and that all of its children are instead direct children of its parent. For the span (1,5), split point 4 is selected. The recursion of splitting and labeling continues until the process reaches a terminal node.

Algorithm 2: Distance to Binary Parse Tree.
Once a model has learned to predict these variables, Algorithm 2 can reconstruct a unique binary tree from the output of the model (d̂, ĉ, t̂). The idea in Algorithm 2 is similar to the top-down parsing method proposed by Stern et al. (2017a), but differs in one key aspect: at each recursive call, there is no need to estimate the confidence for every split point. The algorithm simply chooses the split point i with the maximum d̂_i, and assigns to the span the predicted label ĉ_i. This makes the running time of our algorithm O(n log n), compared to the O(n²) of the greedy top-down algorithm of Stern et al. (2017a). Figure 2 shows an example of the reconstruction of a parse tree. Alternatively, the tree reconstruction can also be done in a bottom-up manner, which requires the recursive composition of adjacent spans according to the ranking induced by their syntactic distances, a process akin to agglomerative clustering.
One potential issue is the existence of unary and n-ary nodes. We follow the method proposed by Stern et al. (2017a) and add a special empty label ∅ to spans that are not themselves full constituents but simply arise during the course of implicit binarization. For unary nodes that contain one nonterminal node, we take the common approach of treating these as additional atomic labels alongside all elementary nonterminals (Stern et al., 2017a). For each terminal node, we determine whether it belongs to a unary chain by predicting an additional label. If this predicted label differs from the empty label, we conclude that the node is the direct child of a unary constituent with that label. Otherwise, if it is predicted to have the empty label, we conclude that it is a child of a bigger constituent that has other constituents or words as siblings.
An n-ary node can be split into binary nodes in an arbitrary way; we choose to always use the leftmost split point. The split point could also be chosen based on model predictions during training. Recovering an n-ary parse tree from the predicted binary tree simply requires removing the empty nodes and splitting the combined labels corresponding to unary chains.
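The recovery step just described can be sketched in a few lines of Python. The use of None for the empty label ∅, "+" as the separator of collapsed unary labels, and the (label, children) node encoding are illustrative assumptions, not the paper's data structures.

```python
# Recover an n-ary tree from a binarized one: nodes labeled None (the empty
# label) are dissolved so their children attach to the parent, and collapsed
# unary labels such as "S+VP" are expanded back into a unary chain.

def debinarize(node):
    """node is (label, children) for internal nodes, or a plain string leaf.
    Returns a list of recovered nodes (a ∅-labeled node yields several)."""
    if isinstance(node, str):
        return [node]
    label, children = node
    kids = [sub for child in children for sub in debinarize(child)]
    if label is None:                        # empty label from binarization:
        return kids                          # splice children into the parent
    for part in reversed(label.split("+")):  # expand a collapsed unary chain
        kids = [(part, kids)]
    return kids

# A binarized tree whose ∅ node hides a 3-ary constituent under S -> VP.
binary = ("S+VP", [("NP", ["she"]),
                   (None, [("VBD", ["left"]), ("NP", ["home"])])])
nary = debinarize(binary)
```

Here the ∅ node disappears, so the VP ends up with three direct children, and the combined label "S+VP" becomes an S node dominating a unary VP.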
Algorithm 2 is a divide-and-conquer algorithm with a running time of O(n log n). Moreover, it is naturally suited to execution in a parallel environment, which can further reduce its running time to O(log n).

Learning Syntactic Distances
We use neural networks to estimate the vector of syntactic distances for a given sentence. We use a modified hinge loss, where the target distances are generated by the tree-to-distance conversion given by Algorithm 1. Section 3.1 describes the model architecture in detail, and Section 3.2 describes the loss we use in this setting.

Model Architecture
Given input words w = (w_0, w_1, ..., w_n), we predict the tuple (d, c, t). The POS tags t are given by an external part-of-speech (POS) tagger. The syntactic distances d and constituent labels c are predicted using a neural network architecture that stacks recurrent LSTM (Hochreiter and Schmidhuber, 1997) and convolutional layers.
Words and tags are first mapped to sequences of embeddings e^w_0, ..., e^w_n and e^t_0, ..., e^t_n. The word embeddings and the tag embeddings are then concatenated together as inputs for a stack of bidirectional LSTM layers:

h^w_0, ..., h^w_n = BiLSTM_w([e^w_0; e^t_0], ..., [e^w_n; e^t_n]),

where BiLSTM_w(·) is the word-level bidirectional layer, which gives the model enough capacity to capture long-term syntactic relations between words.
To predict the constituent label for each word, we pass the hidden state representations h^w_0, ..., h^w_n through a 2-layer network FF^w_c with softmax output:

p(c^w_i | w) = softmax(FF^w_c(h^w_i)).

To compose the information necessary for inferring the syntactic distances and the constituent labels, we perform an additional convolution over adjacent hidden states:

g^s_1, ..., g^s_n = CONV(h^w_0, ..., h^w_n),

where each g^s_i can be seen as a draft representation of split position i in Algorithm 2. Note that the subscripts of the g^s_i start at 1, since we have n − 1 positions serving as non-terminal constituents. Then, we stack a bidirectional LSTM layer on top of the g^s_i:

h^s_1, ..., h^s_n = BiLSTM_s(g^s_1, ..., g^s_n),

where BiLSTM_s fine-tunes each representation by conditioning it on the other split-position representations. Interleaving LSTM and convolutional layers empirically turned out to be the best choice among multiple variants of the model, including using self-attention (Vaswani et al., 2017) instead of LSTMs.
To calculate the syntactic distance for each position, the vectors h^s_1, ..., h^s_n are transformed through a 2-layer feed-forward network FF_d with a single output unit and no activation function at the output layer (this can be done in parallel with 1×1 convolutions):

d̂_i = FF_d(h^s_i).

For predicting the constituent labels of spans, we pass the same representations h^s_1, ..., h^s_n through another 2-layer network FF^s_c with softmax output:

p(c^s_i | w) = softmax(FF^s_c(h^s_i)).
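To make the shape bookkeeping concrete, here is a toy, pure-Python sketch of how n + 1 word-level states give rise to split-point representations, one per boundary between adjacent words. A fixed averaging weight stands in for the learned width-2 convolution filter, so this is an illustrative assumption rather than the actual trained layer.

```python
# A width-2 "convolution" with a fixed scalar weight: each split-point
# representation mixes the states of the two words adjacent to that split.

def split_representations(h, w=0.5):
    """h: list of word-state vectors (lists of floats).
    Returns one vector per split point, g_i = w * (h_{i-1} + h_i)."""
    return [[w * (a + b) for a, b in zip(h[i], h[i + 1])]
            for i in range(len(h) - 1)]
```

A sentence with two word states thus yields exactly one split-point representation, mirroring how the model produces one distance per boundary.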
The overall architecture is shown in Figure 2a. Since the output (d, c, t) can be unambiguously converted into a unique parse tree, the model implicitly makes all parsing decisions inside the recurrent and convolutional layers.

Objective

Given a set of training examples D = {(d^k, c^k, t^k, w^k)}, k = 1, ..., K, the training objective is the sum of the prediction losses of the syntactic distances d^k and the constituent labels c^k.
Due to the categorical nature of the variable c, we use a standard softmax classifier with a cross-entropy loss L_label for the constituent labels, using the estimated probabilities produced by the two softmax output layers described in Section 3.1.
A naïve loss function for estimating syntactic distances is the mean-squared error (MSE):

L_mse = Σ_i (d_i − d̂_i)².

The MSE loss forces the model to regress on the exact values of the true distances. Given that only the ranking induced by the ground-truth distances in d is important, as opposed to the absolute values themselves, an MSE loss over-penalizes the model by ignoring ranking equivalence between different predictions. Therefore, we propose to minimize a pair-wise learning-to-rank loss, similar to those proposed in Burges et al. (2005). We define our loss as a variant of the hinge loss:

L^rank_dist = Σ_{i, j>i} [1 − sign(d_i − d_j)(d̂_i − d̂_j)]_+,

where [x]_+ denotes max(0, x). This loss encourages the model to reproduce the full ranking order induced by the ground-truth distances. The final loss for the overall model is the sum of the individual losses, L = L_label + L^rank_dist.
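The pair-wise hinge rank loss just described can be sketched in a few lines of Python; the function names are illustrative, and a margin of 1 is assumed as in the hinge-loss formulation above.

```python
# Pairwise hinge rank loss: for every pair (i, j), the predicted gap
# d̂_i - d̂_j is pushed to agree in sign with the ground-truth gap
# by a margin of at least 1.

def sign(x):
    return (x > 0) - (x < 0)

def rank_loss(d_true, d_pred):
    n = len(d_true)
    return sum(max(0.0, 1.0 - sign(d_true[i] - d_true[j])
                             * (d_pred[i] - d_pred[j]))
               for i in range(n) for j in range(i + 1, n))

# Correctly ranked with a comfortable margin: zero loss.
loss_ok = rank_loss([2, 1], [3.0, 0.5])   # -> 0.0
# Reversed ranking: the pair is penalized.
loss_bad = rank_loss([2, 1], [0.5, 3.0])  # -> 3.5
```

Note that only the ordering of the predictions matters: any prediction vector inducing the correct ranking with margin at least 1 on every pair attains zero loss, regardless of its absolute values.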

Experiments
We evaluate the model described above on two datasets: the standard Wall Street Journal (WSJ) part of the Penn Treebank (PTB), and the Chinese Treebank (CTB). For evaluating the F1 score, we use the standard evalb tool (http://nlp.cs.nyu.edu/evalb/). We provide both labeled and unlabeled F1 scores, where the former takes into consideration the constituent label for each predicted constituent, while the latter only considers the positions of the constituents. In the tables below, we report labeled F1 scores for comparison with previous work, as this is the metric usually reported in the relevant literature.

Penn Treebank
For the PTB experiments, we follow the standard train/valid/test separation and use sections 2-21 for training, section 22 for development, and section 23 for the test set. Following this split, the dataset has 45K training sentences, and 1,700 and 2,416 sentences for the validation and test sets, respectively. The placeholders with the -NONE- tag are stripped from the dataset during preprocessing. The POS tags are predicted with the Stanford Tagger (Toutanova et al., 2003).
We use a hidden size of 1200 for each direction of all LSTMs, with dropout 0.3 on all feed-forward connections and recurrent-connection dropout 0.2 (Merity et al., 2017). The convolutional filter size is 2, and the number of convolutional channels is 1200. As is common practice for neural-network-based NLP models, the embedding layer that maps word indices to word embeddings is randomly initialized; the word embeddings have size 400. Following Merity et al. (2017), during training we randomly swap an input word embedding with the zero vector with probability 0.1; we found this helped the model generalize better. Training is conducted with the Adam algorithm with an L2 regularization decay of 1 × 10⁻⁶. We pick the model obtaining the highest labeled F1 on the validation set, and report the corresponding test F1 together with other statistics. We report our results in Table 1. Our best model obtains a labeled F1 score of 91.8 on the test set. Detailed dev/test set performance, including label accuracy, is reported in Table 3. Our model achieves good performance for single-model constituency parsing trained without external data. The best result of Stern et al. (2017b) is obtained by a generative model. Very recently, we became aware of Gaddy et al. (2018), which uses character-level LSTM features coupled with chart-based parsing to improve performance; similar sub-word features could also be used in our model, and we leave this investigation for future work. For comparison, other models obtaining better scores either use ensembles, benefit from semi-supervised learning, or resort to re-ranking a set of candidates.

Chinese Treebank
We use the Chinese Treebank 5.1 dataset, with articles 001-270 and 440-1151 for training; the development and test splits follow Liu and Zhang (2017b). The -NONE- tags are stripped as well. The hidden size for the LSTM networks is set to 1200. We use a dropout rate of 0.4 on the feed-forward connections, and 0.1 recurrent-connection dropout. The convolutional layer has 1200 channels, with a filter size of 2. We use 400-dimensional word embeddings. During training, input word embeddings are randomly swapped with the zero vector with probability 0.1. We also apply an L2 regularization weighted by 1 × 10⁻⁶ on the parameters of the network. Table 2 reports our results compared to other benchmarks.
To the best of our knowledge, we set a new state of the art for single-model parsing, achieving 86.5 F1 on the test set. Detailed statistics are shown in Table 3.

Ablation Study
We perform an ablation study by removing or adding components from our model and re-training each ablated version from scratch. This gives an idea of the relative contribution of each component of the model. Results are reported in Table 4. One variant uses character-level features to compute an extra word-level embedding vector as input to our model; it seems that character-level features give only marginal improvements in our model. We also experimented with 300D GloVe (Pennington et al., 2014) embeddings for the input layer, but this did not yield improvements over the model's best performance. Unsurprisingly, the model trained with the MSE loss considerably underperforms a model trained with the rank loss.

Parsing Speed
The prediction of syntactic distances can be batched on modern GPU architectures, and the distance-to-tree conversion is an O(n log n) divide-and-conquer algorithm (where n stands for the number of words in the input sentence). We compare the parsing speed of our parser with other state-of-the-art neural parsers in Table 5. As the syntactic distance computation can be performed in parallel on a GPU, we first compute the distances in a batch, then iteratively decode the tree with Algorithm 2. It is worth noting that this comparison may be unfair, since some of the reported results may use very different hardware settings; we could not find the source code to re-run them on our hardware for a fully fair comparison. In our setting, we use an NVIDIA TITAN Xp graphics card for the neural network part, while the distance-to-tree inference runs on an Intel Core i7-6850K CPU with a 3.60GHz clock speed.

Related Work
Parsing natural language with neural network models has recently received growing attention. These models have attained state-of-the-art results for dependency parsing (Chen and Manning, 2014) and constituency parsing (Dyer et al., 2016; Cross and Huang, 2016; Coavoux and Crabbé, 2016). Early work in neural-network-based parsing directly used a feed-forward network to predict parse trees (Chen and Manning, 2014). Vinyals et al. (2015) use a sequence-to-sequence framework in which the decoder outputs a linearized version of the parse tree given an input sentence. Generally, in these models, the correctness of the output tree is not strictly ensured (although empirically observed). Other parsing methods ensure structural consistency by operating in a transition-based setting (Chen and Manning, 2014), parsing either top-down (Dyer et al., 2016; Liu and Zhang, 2017b), bottom-up (Zhu et al., 2013; Watanabe and Sumita, 2015; Cross and Huang, 2016), or, more recently, in-order (Liu and Zhang, 2017a). Transition-based methods generally suffer from compounding errors due to exposure bias: during testing, the model is exposed to a very different regime (i.e. decisions sampled from the model itself) than what was encountered during training (i.e. the ground-truth decisions) (Daumé et al., 2009; Goldberg and Nivre, 2012). This can have catastrophic effects on test performance, but can be mitigated to some extent by using beam search instead of greedy decoding. Stern et al. (2017b) propose an effective inference method for generative parsing, which enables direct decoding in those models. More complex training methods have also been devised to alleviate this problem (Goldberg and Nivre, 2012; Cross and Huang, 2016). Other efforts have gone into neural chart-based parsing (Durrett and Klein, 2015; Stern et al., 2017a), which ensures structural consistency and offers exact inference with the CYK algorithm.
Gaddy et al. (2018) include a simplified CYK-style inference, but the complexity remains O(n³).
In this work, our model learns to produce a particular representation of a tree in parallel: the representation can be computed in parallel, and the conversion from representation to a full tree can be done efficiently with a divide-and-conquer algorithm. As our model outputs its decisions in parallel, it does not suffer from exposure bias. Interestingly, a series of recent works, both in machine translation (Gu et al., 2018) and speech synthesis (Oord et al., 2017), have similarly considered the sequence of output variables conditionally independent given the inputs.

Conclusion
We presented a novel constituency parsing scheme based on predicting real-valued scalars, named syntactic distances, whose ordering identifies the sequence of top-down split decisions. We employ a neural network model that predicts the distances d and the constituent labels c. Given the algorithms presented in Section 2, we can build an unambiguous mapping between each (d, c, t) and a parse tree. One peculiar aspect of our model is that it predicts split decisions in parallel. Our experiments show that our model achieves strong performance compared to previous models, while being significantly more efficient. Since the architecture of the model is no more than a stack of standard recurrent and convolutional layers, which are essential components of most academic and industrial deep learning frameworks, deployment of this method should be straightforward.