Stack-Pointer Networks for Dependency Parsing

We introduce a novel architecture for dependency parsing: stack-pointer networks (StackPtr). Combining pointer networks (Vinyals et al., 2015) with an internal stack, the proposed model first reads and encodes the whole sentence, then builds the dependency tree top-down (from root to leaf) in a depth-first fashion. The stack tracks the status of the depth-first search, and the pointer network selects one child for the word at the top of the stack at each step. The StackPtr parser benefits from information about the whole sentence and all previously derived subtree structures, and removes the left-to-right restriction of classical transition-based parsers. Yet the number of steps for building any (non-projective) parse tree is linear in the length of the sentence, just as in other transition-based parsers, yielding an efficient decoding algorithm with \(O(n^2)\) time complexity. We evaluate our model on 29 treebanks spanning 20 languages and different dependency annotation schemas, and achieve state-of-the-art performance on 21 of them.


Introduction
Dependency parsing, which predicts the existence and type of linguistic dependency relations between words, is a first step towards deep language understanding. Its importance is widely recognized in the natural language processing (NLP) community, as it benefits a wide range of NLP applications, such as coreference resolution (Ng, 2010; Durrett and Klein, 2013), sentiment analysis (Tai et al., 2015), machine translation (Bastings et al., 2017), information extraction (Nguyen et al., 2009; Angeli et al., 2015; Peng et al., 2017), word sense disambiguation (Fauceglia et al., 2015), and low-resource language processing (McDonald et al., 2013; Ma and Xia, 2014).
Footnote: Work done while at Carnegie Mellon University.
There are two dominant approaches to dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007): local and greedy transition-based algorithms (Yamada and Matsumoto, 2003; Nivre and Scholz, 2004; Zhang and Nivre, 2011; Chen and Manning, 2014), and globally optimized graph-based algorithms (Eisner, 1996; McDonald et al., 2005a,b).
Transition-based dependency parsers read words sequentially (commonly from left to right) and build dependency trees incrementally by making a series of multiple-choice decisions. The advantage of this formalism is that the number of operations required to build any projective parse tree is linear in the length of the sentence. The challenge, however, is that the decision made at each step is based on local information, leading to error propagation and worse performance than graph-based parsers on root and long dependencies (McDonald and Nivre, 2011). Previous studies have explored solutions to address this challenge. Stack LSTMs are capable of learning representations of the parser state that are sensitive to the complete contents of the parser's state. Andor et al. (2016) proposed a globally normalized transition model to replace the locally normalized classifier.
However, the parsing accuracy is still behind state-of-the-art graph-based parsers (Dozat and Manning, 2017).
Graph-based dependency parsers, on the other hand, learn scoring functions for parse trees and perform an exhaustive search over all possible trees for a sentence to find the globally highest-scoring tree. Combining this global search algorithm with distributed representations learned from neural networks, neural graph-based parsers (Kiperwasser and Goldberg, 2016; Wang and Chang, 2016; Kuncoro et al., 2016; Dozat and Manning, 2017) have achieved state-of-the-art accuracy on a number of treebanks in different languages. Nevertheless, these models, while accurate, are usually slow: decoding takes \(O(n^3)\) time for first-order models (McDonald et al., 2005a,b) and higher-degree polynomial time for higher-order models (McDonald and Pereira, 2006; Ma and Zhao, 2012b,a).
In this paper, we propose a novel neural network architecture for dependency parsing, stack-pointer networks (STACKPTR). STACKPTR is a transition-based architecture, with the corresponding asymptotic efficiency, but still maintains a global view of the sentence that proves essential for achieving competitive accuracy. Our STACKPTR parser has a pointer network (Vinyals et al., 2015) as its backbone, and is equipped with an internal stack to maintain the order of head words in tree structures. The STACKPTR parser performs parsing in an incremental, top-down, depth-first fashion; at each step, it generates an arc by assigning a child to the head word at the top of the internal stack. This architecture makes it possible to capture information from the whole sentence and all the previously derived subtrees, while keeping the number of parsing steps linear in the sentence length.
We evaluate our parser on 29 treebanks across 20 languages and different dependency annotation schemas, and achieve state-of-the-art performance on 21 of them. The contributions of this work are summarized as follows: (i) We propose a neural network architecture for dependency parsing that is simple, effective, and efficient. (ii) Empirical evaluations on benchmark datasets over 20 languages show that our method achieves state-of-the-art performance on 21 different treebanks. (iii) A comprehensive error analysis compares the proposed method to a strong graph-based baseline using biaffine attention (Dozat and Manning, 2017).

Background
We first briefly describe the task of dependency parsing, set up the notation, and review Pointer Networks (Vinyals et al., 2015).

Dependency Parsing and Notations
Dependency trees represent syntactic relationships between words in the sentences through labeled directed edges between head words and their dependents. Figure 1 (a) shows a dependency tree for the sentence, "But there were no buyers".
In this paper, we will use the following notation:
Input: \(x = \{w_1, \ldots, w_n\}\) represents a generic sentence, where \(w_i\) is the \(i\)th word.
Output: \(y = \{p_1, p_2, \cdots, p_k\}\) represents a generic (possibly non-projective) dependency tree, where each path \(p_i = \$, w_{i,1}, w_{i,2}, \cdots, w_{i,l_i}\) is a sequence of words from the root to a leaf. "$" is a universal virtual root that is added to each tree.
Stack: \(\sigma\) denotes a stack configuration, which is a sequence of words. We use \(\sigma|w\) to represent the stack configuration obtained by pushing word \(w\) onto the stack \(\sigma\).
Children: \(\mathrm{ch}(w_i)\) denotes the list of all the children (modifiers) of word \(w_i\).

Pointer Networks
Pointer Networks (PTR-NET) (Vinyals et al., 2015) are a variety of neural networks capable of learning the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence. This model cannot be trivially expressed by standard sequence-to-sequence networks due to the variable number of input positions in each sentence. PTR-NET solves the problem by using attention (Bahdanau et al., 2015; Luong et al., 2015) as a pointer to select a member of the input sequence as the output.
Formally, the words of the sentence \(x\) are fed one by one into the encoder (a multi-layer bidirectional RNN), producing a sequence of encoder hidden states \(s_i\). At each time step \(t\), the decoder (a uni-directional RNN) receives the input from the last step and outputs the decoder hidden state \(h_t\). The attention vector \(a^t\) is calculated as follows:
\[
e^t_i = \mathrm{score}(h_t, s_i), \qquad a^t = \mathrm{softmax}(e^t) \tag{1}
\]
where \(\mathrm{score}(\cdot, \cdot)\) is the attention scoring function, which has several variations such as dot-product, concatenation, and biaffine (Luong et al., 2015). PTR-NET regards the attention vector \(a^t\) as a probability distribution over the source words, i.e. it uses \(a^t_i\) as pointers to select the input elements.
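The pointer step can be illustrated with a minimal pure-Python sketch. It assumes the dot-product variant of the scoring function and uses toy two-dimensional hidden states; the function names (`softmax`, `dot_score`, `pointer_attention`) are illustrative and do not come from any released implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(h, s):
    """Dot-product scoring, one of the variants named in the text (assumed here)."""
    return sum(hj * sj for hj, sj in zip(h, s))

def pointer_attention(h_t, encoder_states, score=dot_score):
    """Return a^t, a distribution over input positions, for decoder state h_t."""
    e_t = [score(h_t, s_i) for s_i in encoder_states]
    return softmax(e_t)

# Toy example: three input positions with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a_t = pointer_attention([2.0, 1.0], encoder_states)
# The "pointer" is the position with the highest attention weight.
pointed = max(range(len(a_t)), key=a_t.__getitem__)
```

Because the output distribution has one entry per input position, its size adapts to the sentence length without any change to the model parameters.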

Overview
Similarly to PTR-NET, STACKPTR first reads the whole sentence and encodes each word into the encoder hidden state \(s_i\). The internal stack \(\sigma\) is always initialized with the root symbol "$". At each time step \(t\), the decoder receives the input vector corresponding to the top element of the stack \(\sigma\) (the head word \(w_h\), where \(h\) is the word index), generates the hidden state \(h_t\), and computes the attention vector \(a^t\) using Eq. (1). The parser chooses a specific position \(c\) according to the attention scores in \(a^t\) to generate a new dependency arc \((w_h, w_c)\) by selecting \(w_c\) as a child of \(w_h\). Then the parser pushes \(w_c\) onto the stack, i.e. \(\sigma \rightarrow \sigma|w_c\), and goes to the next step. If at some step the parser points \(w_h\) to itself, i.e. \(c = h\), this indicates that all children of the head word \(w_h\) have already been selected. The parser then pops \(w_h\) off \(\sigma\) and goes to the next step.
At test time, in order to guarantee a valid dependency tree containing every word of the input sentence exactly once, the decoder maintains a list of "available" words. At each decoding step, the parser selects a child for the current head word, and removes the child from the list of available words to make sure that it cannot be selected as a child of other head words.
For head words with multiple children, there may be more than one valid selection at each time step. In order to define a deterministic decoding process in which there is only one ground-truth choice at each step (which is necessary for simple maximum likelihood estimation), a predefined order for each \(\mathrm{ch}(w_i)\) needs to be introduced. The predefined order of children can have different alternatives, such as left-to-right or inside-out. In this paper, we adopt the inside-out order since it enables us to utilize second-order sibling information, which has been proven beneficial for parsing performance (McDonald and Pereira, 2006) (see § 3.4 for details). Figure 1 (b) depicts the architecture of STACKPTR and the decoding procedure for the example sentence in Figure 1 (a).
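The stack-driven decoding loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: `choose_child` is a hypothetical stand-in for the pointer network, the rule that the root may only finish once every word is attached is an added assumption to make the tree complete, and the toy gold tree for "But there were no buyers" is chosen for illustration and is not taken from Figure 1.

```python
def greedy_decode(n_words, choose_child):
    """Build head indices top-down with an internal stack.

    n_words: number of real words; index 0 is the virtual root "$".
    choose_child(head, candidates): hypothetical stand-in for the pointer
        network. It returns one index; returning `head` itself means
        "no more children for this head".
    Returns heads, where heads[c] is the head index of word c.
    """
    heads = [None] * (n_words + 1)
    available = set(range(1, n_words + 1))  # words not yet attached
    stack = [0]                             # initialized with the root
    while stack:
        head = stack[-1]
        candidates = sorted(available)
        # Self-pointing (c == h) signals that this head is finished.
        # Assumption: the root may only finish once all words are attached.
        if head != 0 or not available:
            candidates.append(head)
        child = choose_child(head, candidates)
        if child == head:
            stack.pop()                     # all children selected: pop
        else:
            heads[child] = head             # new arc (head -> child)
            available.remove(child)         # child cannot be chosen again
            stack.append(child)             # descend depth-first
    return heads

# Toy gold tree (illustrative): 1=But 2=there 3=were 4=no 5=buyers.
gold = [None, 3, 3, 0, 5, 3]

def oracle_chooser(head, candidates):
    """Pick the first remaining gold child of `head`, else point to itself."""
    for c in candidates:
        if c != head and gold[c] == head:
            return c
    return head

decoded = greedy_decode(5, oracle_chooser)
```

Removing each selected child from `available` is what prevents a word from receiving two heads.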
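To make the role of the child order concrete, the sketch below derives the single gold pointer sequence for a tree. The definition of inside-out used here (left children from nearest to farthest, then right children from nearest to farthest) is an assumption, since the text above does not spell it out; the function names and the toy tree are likewise illustrative.

```python
def inside_out(head, children):
    """Assumed inside-out order: nearest-first on the left, then on the right."""
    left = sorted((c for c in children if c < head), reverse=True)
    right = sorted(c for c in children if c > head)
    return left + right

def oracle_pointers(heads, order=inside_out):
    """Return the gold (head, pointed) pairs for one tree.

    heads[c] is the head of word c; index 0 is the root "$" and heads[0]
    is None. A pair (h, h) is the self-pointing step that pops h.
    """
    n = len(heads)
    children = {h: [] for h in range(n)}
    for c in range(1, n):
        children[heads[c]].append(c)
    steps = []

    def visit(h):
        for c in order(h, children[h]):
            steps.append((h, c))  # h selects c; c is pushed onto the stack
            visit(c)              # c is processed depth-first before returning
        steps.append((h, h))      # h points to itself and is popped

    visit(0)
    return steps

# Toy tree (illustrative): 1=But 2=there 3=were 4=no 5=buyers.
steps = oracle_pointers([None, 3, 3, 0, 5, 3])
```

With a fixed order, every tree maps to exactly one pointer sequence, which is what maximum likelihood training requires.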

Encoder
The encoder of our parsing model is based on the bi-directional LSTM-CNN architecture (BLSTM-CNNs) (Chiu and Nichols, 2016), where CNNs encode character-level information of a word into its character-level representation and the BLSTM models the context information of each word. Formally, for each word, the CNN, with character embeddings as inputs, encodes the character-level representation. The character-level representation vector is then concatenated with the word embedding vector and fed into the BLSTM network. To enrich word-level information, we also use POS embeddings. Finally, the encoder outputs a sequence of hidden states \(s_i\).

Decoder
The decoder for our parser is a uni-directional LSTM. Unlike previous work (Bahdanau et al., 2015; Vinyals et al., 2015), which uses the word embedding of the previous word as the input to the decoder, our decoder receives the encoder hidden state vector (\(s_i\)) of the top element in the stack \(\sigma\) (see Figure 1 (b)). Compared to word embeddings, the encoder hidden states contain more contextual information, benefiting both the training and decoding procedures. The decoder produces a sequence of decoder hidden states \(h_i\), one for each decoding step.

Higher-order Information
As mentioned before, our parser is capable of utilizing higher-order information. In this paper, we incorporate two kinds of higher-order structures: grandparent and sibling. A sibling structure is a head word with two successive modifiers, and a grandparent structure is a pair of dependencies connected head-to-tail. To utilize higher-order information, the decoder's input at each step is the sum of the encoder hidden states of three words:
\[
\beta_t = s_h + s_g + s_s
\]
where \(\beta_t\) is the input vector of the decoder at time \(t\) and \(h, g, s\) are the indices of the head word and its grandparent and sibling, respectively. Figure 1 (b) illustrates the details. Here we use the element-wise sum operation instead of concatenation because it does not increase the dimension of the input vector \(\beta_t\), thus introducing no additional model parameters.
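A small sketch of how the decoder input might be assembled from the encoder states. The handling of a missing grandparent or sibling (it simply contributes nothing, i.e. a zero vector) is an assumption, and `decoder_input` is a hypothetical helper name.

```python
def decoder_input(encoder_states, head, grandparent=None, sibling=None):
    """Element-wise sum of the encoder states of head, grandparent and sibling.

    Assumption: a missing grandparent or sibling contributes a zero vector.
    """
    beta = list(encoder_states[head])
    for idx in (grandparent, sibling):
        if idx is not None:
            for k in range(len(beta)):
                beta[k] += encoder_states[idx][k]
    return beta

# Toy encoder states for three positions, dimension 2.
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
full = decoder_input(states, head=2, grandparent=0, sibling=1)
head_only = decoder_input(states, head=2)
```

The output has the same dimension as a single encoder state regardless of how many structures are included, which is the property the text relies on when preferring summation to concatenation.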

Biaffine Attention Mechanism
For the attention score function (Eq. (1)), we adopt the biaffine attention mechanism (Luong et al., 2015; Dozat and Manning, 2017):
\[
e^t_i = h_t^\top W s_i + U^\top h_t + V^\top s_i + b
\]
where \(W, U, V, b\) are parameters, denoting the weight matrix of the bi-linear term, the two weight vectors of the linear terms, and the bias vector. As discussed in Dozat and Manning (2017), applying a multilayer perceptron (MLP) to the output vectors of the BLSTM before the score function can reduce both the dimensionality and the overfitting of the model. We follow this work by applying a one-layer perceptron to \(s_i\) and \(h_i\) with elu (Clevert et al., 2015) as its activation function.
Similarly, the dependency label classifier also uses a biaffine function to score each label, given the head word vector \(h_t\) and child vector \(s_i\) as inputs. Again, we use MLPs to transform \(h_t\) and \(s_i\) before feeding them into the classifier.
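The biaffine score for one (decoder state, encoder state) pair can be written out term by term. The sketch below is pure Python on toy vectors; it treats `b` as a scalar for a single score, which is an assumption, and omits the MLP transformations applied beforehand.

```python
def biaffine_score(h_t, s_i, W, U, V, b):
    """Compute h_t^T W s_i + U^T h_t + V^T s_i + b for one pair of vectors."""
    bilinear = sum(h_t[p] * W[p][q] * s_i[q]
                   for p in range(len(h_t)) for q in range(len(s_i)))
    linear_h = sum(u * h for u, h in zip(U, h_t))   # U^T h_t
    linear_s = sum(v * s for v, s in zip(V, s_i))   # V^T s_i
    return bilinear + linear_h + linear_s + b

# Toy parameters: identity bilinear weights, simple linear weights.
score = biaffine_score(
    h_t=[1.0, 2.0], s_i=[3.0, 4.0],
    W=[[1.0, 0.0], [0.0, 1.0]],
    U=[1.0, 1.0], V=[0.5, 0.5], b=0.1,
)
```

Here the bilinear term contributes 11, the two linear terms 3 and 3.5, and the bias 0.1. Scoring every encoder position this way and applying a softmax yields the attention vector of Eq. (1).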

Training Objectives
The STACKPTR parser is trained to optimize the probability of the dependency trees given sentences, \(P_\theta(y|x)\), which can be factorized as:
\[
P_\theta(y \mid x) = \prod_{i=1}^{k} P_\theta(p_i \mid p_{<i}, x) = \prod_{i=1}^{k} \prod_{j=1}^{l_i} P_\theta(c_{i,j} \mid c_{i,<j}, p_{<i}, x) \tag{2}
\]
where \(\theta\) represents the model parameters, \(p_{<i}\) denotes the preceding paths that have already been generated, \(c_{i,j}\) represents the \(j\)th word in \(p_i\), and \(c_{i,<j}\) denotes all the preceding words on the path \(p_i\). Thus, the STACKPTR parser is an autoregressive model, like sequence-to-sequence models, but it factors the distribution according to a top-down tree structure as opposed to a left-to-right chain. We define \(P_\theta(c_{i,j} \mid c_{i,<j}, p_{<i}, x) = a^t\), where the attention vector \(a^t\) (of dimension \(n\)) is used as the distribution over the indices of words in a sentence.

Arc Prediction
Our parser is trained by optimizing the conditional likelihood in Eq. (2), which is implemented as the cross-entropy loss.
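As a minimal illustration of the arc objective, the cross-entropy loss over one decoding sequence is the negative log-probability that each attention vector assigns to the gold position. The function name `arc_loss` and the toy numbers are illustrative only.

```python
import math

def arc_loss(attention_rows, gold_indices):
    """Negative log-likelihood of the gold pointer at every decoding step.

    attention_rows[t] is the attention distribution at step t;
    gold_indices[t] is the position the parser should point to at step t.
    """
    return -sum(math.log(a_t[c]) for a_t, c in zip(attention_rows, gold_indices))

# Two decoding steps over a three-position input.
rows = [[0.5, 0.25, 0.25],
        [0.1, 0.8, 0.1]]
loss = arc_loss(rows, gold_indices=[0, 1])
```

Minimizing this sum over all steps is equivalent to maximizing the product of per-step probabilities in the factorization.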

Label Prediction
We train a separate multiclass classifier in parallel to predict the dependency labels. Following Dozat and Manning (2017), the classifier takes the information of the head word and its child as features. The label classifier is trained simultaneously with the parser by optimizing the sum of their objectives.

Discussion
Time Complexity. The number of decoding steps to build a parse tree for a sentence of length \(n\) is \(2n-1\), linear in \(n\). Together with the attention mechanism (at each step, we need to compute the attention vector \(a^t\), whose runtime is \(O(n)\)), the time complexity of the decoding algorithm is \(O(n^2)\), which is more efficient than graph-based parsers that have \(O(n^3)\) or worse complexity when using dynamic programming or maximum spanning tree (MST) decoding algorithms.
Top-down Parsing. When humans comprehend a natural language sentence, they arguably do it in an incremental, left-to-right manner. However, when humans consciously annotate a sentence with syntactic structure, they rarely process it in a fixed left-to-right order. Rather, they start by reading the whole sentence, then seek the main predicates, jumping back and forth over the sentence and recursively proceeding to the subtree structures governed by certain head words. Our parser follows a similar kind of annotation process: it starts by reading the whole sentence, then proceeds in a top-down manner, finding the main predicates first and only then searching for the sub-trees governed by them. When making later decisions, the parser has access to the entire structure built in earlier steps.
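One way to arrive at the step count, under the assumption that \(n\) counts the virtual root "$" along with the words: every non-root word is selected as a child exactly once, and every element (root included) is popped exactly once by pointing to itself.

```latex
\[
\underbrace{(n-1)}_{\text{child selections}}
\;+\;
\underbrace{n}_{\text{self-pointing pops}}
\;=\; 2n-1,
\qquad
(2n-1)\cdot O(n) \;=\; O(n^2).
\]
```

For example, a five-word sentence plus the root gives \(n = 6\) and \(2 \cdot 6 - 1 = 11\) decoding steps, each of which computes an attention vector over \(O(n)\) positions.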

Implementation Details
Pre-trained Word Embeddings. For all the parsing models in different languages, we initialize word vectors with pretrained word embeddings. For Chinese, Dutch, English, German and Spanish, we use the structured-skipgram embeddings. For other languages we use Polyglot embeddings (Al-Rfou et al., 2013).
Optimization. Parameter optimization is performed with the Adam optimizer (Kingma and Ba, 2014) with \(\beta_1 = \beta_2 = 0.9\). We choose an initial learning rate of \(\eta_0 = 0.001\). The learning rate \(\eta\) is annealed by multiplying it by a fixed decay rate \(\rho = 0.75\) when parsing performance stops increasing on the validation sets. To reduce the effects of "gradient exploding", we use gradient clipping of 5.0 (Pascanu et al., 2013).
Dropout Training. To mitigate overfitting, we apply dropout (Srivastava et al., 2014). For the BLSTM, we use recurrent dropout (Gal and Ghahramani, 2016) with a drop rate of 0.33 between hidden states and 0.33 between layers. Following Dozat and Manning (2017), we also use embedding dropout with a rate of 0.33 on all word, character, and POS embeddings.
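The learning-rate annealing and gradient clipping described above might look roughly like the sketch below. Two details are assumptions rather than facts from the text: "stops increasing" is read as "the newest validation score does not beat the best one seen so far", and the clipping threshold of 5.0 is applied to the gradient norm. Both helper names are hypothetical.

```python
import math

def anneal(lr, dev_scores, decay=0.75):
    """Return the learning rate after a sequence of validation scores.

    Assumption: the rate is multiplied by `decay` whenever the newest score
    fails to improve on the best score seen so far.
    """
    best = float("-inf")
    for score in dev_scores:
        if score > best:
            best = score
        else:
            lr *= decay
    return lr

def clip_by_norm(grad, max_norm=5.0):
    """Rescale `grad` so its L2 norm is at most `max_norm` (assumed norm clipping)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

# Two of the five evaluations fail to improve, so the rate decays twice.
final_lr = anneal(0.001, [90.0, 91.0, 90.5, 91.2, 91.1])
clipped = clip_by_norm([6.0, 8.0])  # norm 10 is rescaled to norm 5
```

Starting from 0.001 and decaying twice by 0.75 gives a final rate of 0.0005625 in this toy run.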
Hyper-Parameters. Some parameters are chosen from those reported in Dozat and Manning (2017). We use the same hyper-parameters across the models on different treebanks and languages, due to time constraints. The details of the chosen hyper-parameters for all experiments are summarized in Appendix A.
To make a thorough empirical comparison with previous studies, we also evaluate our system on treebanks from the CoNLL shared tasks and the Universal Dependency (UD) Treebanks. For the CoNLL Treebanks, we use the English treebank from the CoNLL-2008 shared task (Surdeanu et al., 2008) and all 13 treebanks from the CoNLL-2006 shared task (Buchholz and Marsi, 2006). The experimental settings are the same as . For UD Treebanks, we select 12 languages. The details of the treebanks and experimental settings are in § 4.5 and Appendix B.
Evaluation Metrics. Parsing performance is measured with five metrics: unlabeled attachment score (UAS), labeled attachment score (LAS), unlabeled complete match (UCM), labeled complete match (LCM), and root accuracy (RA). Following previous work (Kuncoro et al., 2016; Dozat and Manning, 2017), we report results excluding punctuation for Chinese and English. For each experiment, we report the mean values with corresponding standard deviations over 5 repetitions.
Baseline. For a fair comparison of parsing performance, we re-implemented the graph-based Deep Biaffine (BIAF) parser (Dozat and Manning, 2017), which achieved state-of-the-art results on a wide range of languages. Our re-implementation adds character-level information to the original BIAF model using the same LSTM-CNN encoder as our model (§ 3.2), which boosts its performance on all languages.
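The five metrics can be computed from gold and predicted (head, label) pairs as in the sketch below. It is a simplified illustration: punctuation filtering is omitted, each sentence is assumed to contain at least one gold root, and root accuracy is taken to be the fraction of gold root words that are also predicted as roots, which is an assumed definition.

```python
def evaluate(gold, pred):
    """Compute UAS, LAS, UCM, LCM and RA.

    gold, pred: lists of sentences; each sentence is a list of
    (head, label) pairs, one per word, where head 0 denotes the root.
    """
    words = uas = las = ucm = lcm = roots = root_ok = 0
    for g_sent, p_sent in zip(gold, pred):
        u = sum(g[0] == p[0] for g, p in zip(g_sent, p_sent))  # correct heads
        l = sum(g == p for g, p in zip(g_sent, p_sent))        # head and label
        words += len(g_sent)
        uas += u
        las += l
        ucm += (u == len(g_sent))   # whole sentence correct, unlabeled
        lcm += (l == len(g_sent))   # whole sentence correct, labeled
        for g, p in zip(g_sent, p_sent):
            if g[0] == 0:
                roots += 1
                root_ok += (p[0] == 0)
    n = len(gold)
    return {"UAS": uas / words, "LAS": las / words,
            "UCM": ucm / n, "LCM": lcm / n, "RA": root_ok / roots}

gold = [[(2, "nsubj"), (0, "root"), (2, "obj")],
        [(0, "root"), (1, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "iobj")],
        [(0, "root"), (1, "obj")]]
metrics = evaluate(gold, pred)
```

In this toy example all heads are correct but one label is wrong, so UAS, UCM and RA are 1.0 while LAS is 0.8 and LCM is 0.5.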

Main Results
We first conduct experiments to demonstrate the effectiveness of our neural architecture by comparing it with the strong baseline BIAF. We compare the performance of four variations of our model with different decoder inputs (Org, +gpar, +sib and Full), where the Org model utilizes only the encoder hidden states of head words, while the +gpar and +sib models augment the original one with grandparent and sibling information, respectively. The Full model includes all three types of information as inputs. Figure 2 illustrates the performance (five metrics) of the different variations of our STACKPTR parser, together with the results of the baseline BIAF re-implemented by us, on the test sets of the three languages. On UAS and LAS, the Full variation of STACKPTR with decoding beam size 10 outperforms BIAF on Chinese, and obtains competitive performance on English and German. An interesting observation is that the Full model achieves the best accuracy on English and Chinese, while performing slightly worse than +sib on German. This shows that the importance of higher-order information varies across languages. On LCM and UCM, STACKPTR significantly outperforms BIAF on all languages, showing the superiority of our parser on complete sentence parsing. The results of our parser on RA are slightly worse than those of BIAF. More details of the results are provided in Appendix C.
Table 1: UAS and LAS of four versions of our model on test sets for three languages, together with top-performing parsing systems. "T" and "G" indicate transition- and graph-based models, respectively. For BIAF, we provide the original results reported in Dozat and Manning (2017) and our re-implementation. For STACKPTR and our re-implementation of BIAF, we report the average over 5 runs.
Our re-implementation of BIAF obtains better performance than the original one in Dozat and Manning (2017), demonstrating the effectiveness of the character-level information.
Our model achieves state-of-the-art performance on both UAS and LAS on Chinese, and best UAS on English. On German, the performance is competitive with BIAF, and significantly better than other models.

Error Analysis
In this section, we characterize the errors made by BIAF and STACKPTR by presenting a number of experiments that relate parsing errors to a set of linguistic and structural properties. For simplicity, we follow McDonald and Nivre (2011) and report labeled parsing metrics (either accuracy, precision, or recall) for all experiments.

Length and Graph Factors
Following McDonald and Nivre (2011), we analyze parsing errors related to structural factors.
Sentence Length. Figure 3 (a) shows the accuracy of both parsing models relative to sentence lengths. Consistent with the analysis in McDonald and Nivre (2011), STACKPTR tends to perform better on shorter sentences, which require fewer parsing decisions, significantly reducing the chance of error propagation.
Dependency Length. Figure 3 (b) measures the precision and recall relative to dependency lengths. While the graph-based BIAF parser still performs better for longer dependency arcs and the transition-based STACKPTR parser does better for shorter ones, the gap between the two systems is marginal, much smaller than that shown in McDonald and Nivre (2011). One possible reason is that, unlike traditional transition-based parsers that scan the sentence from left to right, STACKPTR processes in a top-down manner, thus sometimes unnecessarily creating shorter dependency arcs first.
Table 3: UAS and LAS on 14 treebanks from CoNLL shared tasks, together with several state-of-the-art parsers. Bi-Att is the bi-directional attention based parser (Cheng et al., 2016), and NeuroMST is the neural MST parser.
Root Distance. Figure 3 (c) plots the precision and recall of each system for arcs of varying distance to the root. Different from the observation in McDonald and Nivre (2011), STACKPTR does not show an obvious advantage on the precision for arcs further away from the root. Furthermore, the STACKPTR parser does not have the tendency to over-predict root modifiers reported in McDonald and Nivre (2011). This behavior can be explained using the same reasoning as above: the fact that arcs further away from the root are usually constructed early in the parsing algorithm of traditional transition-based parsers is not true for the STACKPTR parser.

Effect of POS Embedding
The only prerequisite information that our parsing model relies on is POS tags. With the goal of achieving an end-to-end parser, we explore the effect of POS tags on parsing performance. We run experiments on PTB using our STACKPTR parser with gold-standard and predicted POS tags, and without tags, respectively. STACKPTR in these experiments is the Full model with beam=10. Table 2 gives results of the parsers with different versions of POS tags on the test data of PTB.
The parser with gold-standard POS tags significantly outperforms the other two parsers, showing that dependency parsers can still benefit from accurate POS information. The parser with predicted (imperfect) POS tags, however, performs even slightly worse than the parser without POS tags. This illustrates that an end-to-end parser that does not rely on POS information can obtain competitive (or even better) performance than parsers using imperfect predicted POS tags, even if the POS tagger has relatively high accuracy (accuracy > 97% in this experiment on PTB).
Table 3 summarizes the parsing results of our model on the test sets of 14 treebanks from the CoNLL shared task, along with the state-of-the-art baselines. Along with BIAF, we also list the performance of the bi-directional attention based parser (Bi-Att) (Cheng et al., 2016) and the neural MST parser (NeuroMST) for comparison. Our parser achieves state-of-the-art performance on both UAS and LAS on eight languages: Arabic, Czech, English, German, Portuguese, Slovene, Spanish, and Swedish. On Bulgarian and Dutch, our parser obtains the best UAS. On other languages, the performance of our parser is competitive with BIAF, and significantly better than others. The only exception is Japanese, on which NeuroMST obtains the best scores.

UD Treebanks
For UD Treebanks, we select 12 languages: Bulgarian, Catalan, Czech, Dutch, English, French, German, Italian, Norwegian, Romanian, Russian and Spanish. For all the languages, we adopt the standard training/dev/test splits, and use the universal POS tags (Petrov et al., 2012) provided in each treebank. The statistics of these corpora are provided in Appendix B. Table 4 summarizes the results of the STACKPTR parser, along with BIAF for comparison, on both the development and test datasets for each language. First, both the BIAF and STACKPTR parsers achieve relatively high parsing accuracies on all 12 languages, all with UAS higher than 90%. On nine languages (Catalan, Czech, Dutch, English, French, German, Norwegian, Russian and Spanish), STACKPTR outperforms BIAF on both UAS and LAS. On Bulgarian, STACKPTR achieves slightly better UAS while its LAS is slightly worse than that of BIAF. On Italian and Romanian, BIAF obtains marginally better parsing performance than STACKPTR.

Conclusion
In this paper, we proposed STACKPTR, a transition-based neural network architecture for dependency parsing. Combining pointer networks with an internal stack to track the status of the top-down, depth-first search in the decoding procedure, the STACKPTR parser is able to capture information from the whole sentence and all the previously derived subtrees, removing the left-to-right restriction in classical transition-based parsers, while keeping the number of parsing steps linear in the length of the sentence. Experimental results on 29 treebanks show the effectiveness of our parser across 20 languages, achieving state-of-the-art performance on 21 corpora.
There are several potential directions for future work. First, we intend to consider how to conduct experiments to improve the analysis of parsing errors qualitatively and quantitatively. Another interesting direction is to further improve our model by exploring reinforcement learning approaches to learn an optimal order for the children of head words, instead of using a predefined fixed order.
Table 5 summarizes the chosen hyper-parameters used for all the experiments in this paper. Some parameters are chosen directly or similarly from those reported in Dozat and Manning (2017). We use the same hyper-parameters across the models on different treebanks and languages, due to time constraints.

Layer
Hyper