Dependency Parsing as Head Selection

Conventional graph-based dependency parsers guarantee a tree structure both during training and at inference time. Instead, we formalize dependency parsing as the problem of independently selecting the head of each word in a sentence. Our model, which we call DeNSe (as shorthand for Dependency Neural Selection), produces a distribution over possible heads for each word using features obtained from a bidirectional recurrent neural network. Without enforcing structural constraints during training, DeNSe generates (at inference time) trees for the overwhelming majority of sentences, while non-tree outputs can be adjusted with a maximum spanning tree algorithm. We evaluate DeNSe on four languages (English, Chinese, Czech, and German) with varying degrees of non-projectivity. Despite the simplicity of the approach, our parsers are on par with the state of the art.


Introduction
Dependency parsing plays an important role in many natural language applications, such as relation extraction (Fundel et al., 2007), machine translation (Carreras and Collins, 2009), and ontology construction (Snow et al., 2004). A dependency parser represents syntactic information as a set of head-dependent relational arcs, typically constrained to be a tree structure. Practically all models proposed for dependency parsing in recent years can be described as graph-based (McDonald et al., 2005a) or transition-based (Yamada and Matsumoto, 2003; Nivre et al., 2006b).
Graph-based dependency parsers are typically arc-factored, where the score of a tree is defined as the sum of the scores of all its arcs. An arc is scored with a set of local features and a linear model, the parameters of which can be effectively learned with online algorithms (Crammer and Singer, 2001; Crammer and Singer, 2003; Freund and Schapire, 1999; Collins, 2002). In order to efficiently find the best scoring tree during training and decoding, various maximization algorithms have been developed (Eisner, 1996; Eisner, 2000; McDonald et al., 2005b). In general, graph-based methods are optimized globally, using features of single arcs in order to make learning and inference tractable. Transition-based algorithms factorize a tree into a set of parsing actions. At each transition state, the parser scores a candidate action conditioned on the state of the transition system and the parsing history, and greedily selects the highest-scoring action to execute. This score is typically obtained with a classifier based on non-local features defined over a rich history of parsing decisions (Yamada and Matsumoto, 2003; Zhang and Nivre, 2011).
Regardless of the algorithm used, most well-known dependency parsers, such as the MSTParser (McDonald et al., 2005b) and the MaltParser (Nivre et al., 2006a), rely on extensive feature engineering. Feature templates are typically designed manually and aim at capturing head-dependent relationships, which are notoriously sparse and difficult to estimate. More recently, a few approaches (Chen and Manning, 2014; Pei et al., 2015; Kiperwasser and Goldberg, 2016) apply neural networks to learn dense feature representations. The learned features are subsequently used in a conventional graph- or transition-based parser, or in better designed variants (Dyer et al., 2015).
In this work, we propose a simple neural network-based model which learns to select the head of each word in a sentence without enforcing tree-structured output; as a result, no transition system or graph-based algorithm is needed during training. Our model, which we call DeNSe (as shorthand for Dependency Neural Selection), employs bidirectional recurrent neural networks to learn featural representations for the words in a sentence. These features are subsequently used to predict the head of each word. We recast the task of dependency parsing as a classification problem, with each head-dependent decision being locally optimized. Although there is nothing inherent in the model to enforce tree-structured output, when tested on an English dataset, DeNSe is able to generate trees for 95% of the sentences, 87% of which are projective. Since the model is decoupled from the underlying inference procedure, it can easily transfer between a projective and a non-projective parser without much modification. Specifically, post-processing the output of DeNSe with the Chu-Liu-Edmonds algorithm results in a non-projective parser, whereas post-processing the output with the Eisner algorithm yields a projective parser.
We evaluate our model on benchmark dependency parsing corpora representing four languages (English, Chinese, Czech, and German) with varying degrees of non-projectivity. Despite the simplicity of our approach, experiments show that the resulting parsers are on par with the state of the art.

Related Work
In this section we briefly review prior work on dependency parsing, focusing on graph-based and transition-based models. We also discuss how our model relates to previously proposed neural network-based parsers.
Graph-based Parsing Graph-based dependency parsers employ a model for scoring possible dependency graphs for a given sentence. The graphs are typically factored into their component arcs, and the score of a tree is defined as the sum over all its arcs. This factorization enables tractable search for the highest scoring graph structure. Dependency-tree parsing is commonly formulated as the search for the maximum spanning tree (MST) in a graph, which yields efficient algorithms for both non-projective and projective dependency trees. For instance, the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967; McDonald et al., 2005b) is often used to extract the MST in the case of non-projective trees, and the Eisner algorithm (Eisner, 1996; Eisner, 2000) in the case of projective trees. During training, the weight parameters of the scoring function can be learned with margin-based algorithms (Crammer and Singer, 2001; Crammer and Singer, 2003) or the structured perceptron (Freund and Schapire, 1999; Collins, 2002). Beyond basic first-order models, the literature offers a few examples of higher-order models involving sibling and grandparent relations (Carreras, 2007; Koo et al., 2010; Zhang and McDonald, 2012). Although more expressive, such models render both training and inference more challenging.
Transition-based Parsing As the term implies, transition-based parsers conceptualize the process of transforming a sentence into a dependency tree as a sequence of transitions. A transition system typically includes a stack for storing partially processed tokens, a buffer containing the remaining input, and a set of arcs containing all dependencies between tokens that have been added so far (Nivre, 2003; Nivre et al., 2006b). A dependency tree is constructed by manipulating the stack and buffer, and appending arcs with predetermined operations.
In an arc-standard system (Yamada and Matsumoto, 2003; Nivre, 2004), the transitions include a SHIFT operation which removes the first word in the buffer and pushes it onto the stack; a LEFT-ARC operation which adds an arc from the word at the beginning of the buffer to the word on top of the stack; and a RIGHT-ARC operation which adds an arc from the word on top of the stack to the word at the beginning of the buffer. During parsing, the transition from one configuration to the next is greedily scored with a linear classifier whose features are defined over the stack and buffer. The arc-standard system builds a projective dependency tree bottom up, under the assumption that an arc is only added when the dependent node has already found all its dependents. Extensions include the arc-eager system (Nivre, 2008), which always adds an arc at the earliest possible stage, and a more elaborate (reduce) action space to handle non-projective parsing (Attardi, 2006). Dyer et al. (2015) redesign the components of a transition-based system so that the buffer, stack, and action sequence are modeled separately with stack long short-term memory networks. The hidden states of these LSTMs are concatenated and used as input features to a final transition classifier. Kiperwasser and Goldberg (2016) use a bidirectional LSTM to extract features for both a transition- and a graph-based parser.
Compared to previous work, we formalize dependency parsing as the task of finding for each word in a sentence its most probable head. Both head selection and the features it is based on are learned using neural networks. The model locally optimizes a set of head-dependent decisions without attempting to enforce any global consistency during training. As a result, DeNSe predicts dependency arcs greedily following a simple training procedure without predicting a parse tree, i.e., without performing a sequence of transition actions or employing a graph algorithm during training. Nevertheless, it can be seamlessly integrated with a graph-based decoder to ensure tree-structured output. In common with recent neural network-based dependency parsers, we aim to alleviate the need for hand-crafting feature combinations. Beyond feature learning, we further show that it is possible to simplify inference and training with bidirectional recurrent neural networks.

Dependency Parsing as Head Selection
In this section we present our parsing model, DeNSe, which tries to predict the head of each word in a sentence. Specifically, the model takes as input a sentence of length N and outputs N ⟨head, dependent⟩ arcs. We describe the model focusing on unlabeled dependencies and then discuss how it can be straightforwardly extended to the labeled setting. We begin by explaining how words are represented in our model and then give details on how DeNSe makes predictions based on these learned representations. Since there is no guarantee that the outputs of DeNSe are trees (although they mostly are), we also discuss how to extend DeNSe in order to enforce projective and non-projective tree outputs. Throughout this paper, lowercase boldface letters denote vectors (e.g., $\mathbf{v}$ or $\mathbf{v}_i$), uppercase boldface letters denote matrices (e.g., $\mathbf{M}$ or $\mathbf{M}_b$), and lowercase letters denote scalars (e.g., $w$ or $w_i$).

Word Representation
Let $S = (w_0, w_1, \ldots, w_N)$ denote a sentence of length $N$; following common practice in the dependency parsing literature (Kübler et al., 2009), we add an artificial ROOT token represented by $w_0$. Analogously, let $A = (\mathbf{a}_0, \mathbf{a}_1, \ldots, \mathbf{a}_N)$ denote the representation of sentence $S$, with $\mathbf{a}_i$ representing word $w_i$ ($0 \le i \le N$). Our model reads through the sentence and then produces a representation for each word based on the entire sentence. Besides encoding information about each $w_i$ in isolation (e.g., its lexical meaning or POS tag), $\mathbf{a}_i$ must also encode $w_i$'s positional information within the sentence. Such information has previously been shown to be important in dependency parsing (McDonald et al., 2005a). For example, in a sentence such as "a dog is chasing a cat", the head of the first a is dog, whereas the head of the second a is cat. Without taking positional information into account, a model cannot easily decide which a (the nearer or the farther) to assign to dog.
Long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997), a type of recurrent neural network with a more complex computational unit, have proven effective at capturing long-term dependencies. In our case, LSTMs allow us to represent each word both on its own and within its sequence, leveraging long-range contextual information.
As shown in Figure 1, we first use a forward LSTM (LSTM$^F$) to read the sentence from left to right and then a backward LSTM (LSTM$^B$) to read the sentence from right to left, so that the entire sentence serves as context for each word:¹

$$[\mathbf{h}^F_i, \mathbf{c}^F_i] = \mathrm{LSTM}^F(\mathbf{x}_i, \mathbf{h}^F_{i-1}, \mathbf{c}^F_{i-1}) \quad (1)$$
$$[\mathbf{h}^B_i, \mathbf{c}^B_i] = \mathrm{LSTM}^B(\mathbf{x}_i, \mathbf{h}^B_{i+1}, \mathbf{c}^B_{i+1}) \quad (2)$$

where $\mathbf{x}_i$ is the feature vector of word $w_i$, $\mathbf{h}^F_i \in \mathbb{R}^d$ and $\mathbf{c}^F_i \in \mathbb{R}^d$ are the hidden state and memory cell for the $i$th word $w_i$ in LSTM$^F$, and $d$ is the hidden unit size. $\mathbf{h}^F_i$ is also the representation of $w_{0:i}$ ($w_i$ and its left neighboring words), and $\mathbf{c}^F_i$ is an internal state maintained by LSTM$^F$. $\mathbf{h}^B_i \in \mathbb{R}^d$ and $\mathbf{c}^B_i \in \mathbb{R}^d$ are the hidden state and memory cell of the backward LSTM$^B$. Each token $w_i$ is additionally represented by $\mathbf{x}_i$, the concatenation of two vectors corresponding to $w_i$'s lexical and POS tag embeddings:

$$\mathbf{x}_i = [\mathbf{W}_e\, e(w_i); \mathbf{W}_t\, e(t_i)] \quad (3)$$

where $e(w_i)$ and $e(t_i)$ are one-hot vector representations of token $w_i$ and its POS tag $t_i$; $\mathbf{W}_e \in \mathbb{R}^{s \times |V|}$ and $\mathbf{W}_t \in \mathbb{R}^{q \times |T|}$ are the word and POS tag embedding matrices, where $|V|$ is the vocabulary size, $s$ is the word embedding size, $|T|$ is the POS tag set size, and $q$ is the tag embedding size. The hidden states of the forward and backward LSTMs are concatenated to obtain $\mathbf{a}_i$, the final representation of $w_i$:

$$\mathbf{a}_i = [\mathbf{h}^F_i; \mathbf{h}^B_i] \quad (4)$$

¹ For details on LSTM networks, see, e.g., Graves (2012) or Goldberg (2015).
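For concreteness, the sketch below shows one way of computing these representations in PyTorch; the class name, default hyper-parameters, and batching conventions are illustrative assumptions rather than the implementation used in our experiments.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Bidirectional LSTM encoder producing a_i = [h^F_i ; h^B_i] for every token."""

    def __init__(self, vocab_size, tag_size, word_dim=300, tag_dim=30,
                 hidden_dim=300, num_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # W_e
        self.tag_emb = nn.Embedding(tag_size, tag_dim)      # W_t
        # One module runs both the forward and the backward pass (Equations (1)-(2)).
        self.lstm = nn.LSTM(word_dim + tag_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)

    def forward(self, word_ids, tag_ids):
        # word_ids, tag_ids: (batch, N + 1) token/tag indices, position 0 is ROOT.
        x = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)  # Eq. (3)
        a, _ = self.lstm(x)   # (batch, N + 1, 2 * hidden_dim), Eq. (4)
        return a
```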

Head Selection
We now move on to discuss our formalization of dependency parsing as head selection. We first focus on unlabeled dependencies and then explain how the model can be extended to predict labeled ones.
In a dependency tree, a head can have multiple dependents, whereas a dependent can have only one head. Based on this fact, dependency parsing can be formalized as follows. Given a sentence $S = (w_0, w_1, \ldots, w_N)$, we aim to find for each word $w_i \in \{w_1, w_2, \ldots, w_N\}$ the most probable head $w_j \in \{w_0, w_1, \ldots, w_N\}$. For example, in Figure 1, to find the head of the token love, we calculate the probabilities $P_{head}(\mathrm{ROOT}\,|\,love, S)$, $P_{head}(kids\,|\,love, S)$, and $P_{head}(candy\,|\,love, S)$, and select the highest. More formally, we estimate the probability of token $w_j$ being the head of token $w_i$ as:

$$P_{head}(w_j \,|\, w_i, S) = \frac{\exp(g(\mathbf{a}_j, \mathbf{a}_i))}{\sum_{k=0}^{N} \exp(g(\mathbf{a}_k, \mathbf{a}_i))} \quad (5)$$

where $\mathbf{a}_i$ and $\mathbf{a}_j$ are vector-based representations of $w_i$ and $w_j$, respectively (Section 3.1 describes how these are learned); $g(\mathbf{a}_j, \mathbf{a}_i)$ is a neural network with a single hidden layer that computes the associative score between representations $\mathbf{a}_i$ and $\mathbf{a}_j$:

$$g(\mathbf{a}_j, \mathbf{a}_i) = \mathbf{v}_a^\top \tanh(\mathbf{U}_a \mathbf{a}_j + \mathbf{W}_a \mathbf{a}_i) \quad (6)$$

where $\mathbf{v}_a \in \mathbb{R}^{2d}$, $\mathbf{U}_a \in \mathbb{R}^{2d \times 2d}$, and $\mathbf{W}_a \in \mathbb{R}^{2d \times 2d}$ are the weight parameters of $g$. Note that the candidate head $w_j$ can be the ROOT, while the dependent $w_i$ cannot. Equations (5) and (6) compute the probability of adding an arc between two words, in a fashion similar to the neural attention mechanism in sequence-to-sequence models (Bahdanau et al., 2014). We train our model by minimizing the negative log likelihood of the gold standard ⟨head, dependent⟩ arcs in all training sentences:

$$\mathcal{L}(\theta) = -\sum_{S \in \mathcal{T}} \sum_{i=1}^{N_S} \log P_{head}(h(w_i) \,|\, w_i, S) \quad (7)$$

where $\mathcal{T}$ is the training set, $h(w_i)$ is $w_i$'s gold standard head within sentence $S$, and $N_S$ is the number of words in $S$ (excluding ROOT). During inference, for each word $w_i$ ($i \in [1, N_S]$) in $S$, we greedily choose the most likely head $w_j$ ($j \in [0, N_S]$):

$$\hat{h}(w_i) = \operatorname*{arg\,max}_{0 \le j \le N_S} P_{head}(w_j \,|\, w_i, S) \quad (8)$$

Note that the prediction for each word $w_i$ is made independently of the other words in the sentence.
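A minimal PyTorch sketch of the head-selection layer (Equations (5)-(8)) on top of the encoder above is given below; the parameter names mirror $\mathbf{v}_a$, $\mathbf{U}_a$, and $\mathbf{W}_a$, while the self-arc masking and batching details are simplifying assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadSelector(nn.Module):
    """Scores every candidate head for every dependent and picks heads independently."""

    def __init__(self, repr_dim):  # repr_dim = 2d, the size of a_i
        super().__init__()
        self.U_a = nn.Linear(repr_dim, repr_dim, bias=False)
        self.W_a = nn.Linear(repr_dim, repr_dim, bias=False)
        self.v_a = nn.Linear(repr_dim, 1, bias=False)

    def scores(self, a):
        # a: (batch, N + 1, 2d); returns s with s[b, i, j] = g(a_j, a_i), Eq. (6).
        heads = self.U_a(a).unsqueeze(1)   # candidate heads a_j: (batch, 1, N + 1, 2d)
        deps = self.W_a(a).unsqueeze(2)    # dependents a_i:      (batch, N + 1, 1, 2d)
        s = self.v_a(torch.tanh(heads + deps)).squeeze(-1)       # (batch, N + 1, N + 1)
        # Mask self-arcs so a word cannot select itself as head (simplifying assumption).
        eye = torch.eye(s.size(-1), dtype=torch.bool, device=s.device)
        return s.masked_fill(eye, float("-inf"))

    def loss(self, a, gold_heads):
        # gold_heads: (batch, N + 1) gold head indices; ROOT (position 0) is skipped.
        log_p = F.log_softmax(self.scores(a), dim=-1)             # Eq. (5)
        return F.nll_loss(log_p[:, 1:].reshape(-1, log_p.size(-1)),
                          gold_heads[:, 1:].reshape(-1))          # Eq. (7), padding ignored

    def greedy_heads(self, a):
        # Pick the most probable head for each word independently, Eq. (8).
        return self.scores(a).argmax(dim=-1)[:, 1:]               # (batch, N)
```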
Given our greedy inference method, there is no guarantee that the ⟨head, dependent⟩ arcs predicted for a sentence form a tree (they may contain cycles). However, we empirically observed that most outputs produced at inference time are indeed trees. For instance, on an English dataset, the arcs predicted for 95% of development sentences form trees, 87% of which are projective, whereas on a Chinese dataset, 87% of the outputs form trees, 73% of which are projective. This indicates that although the model does not explicitly model tree structure during training, it is able to figure out from the data (which consists of trees) that it should predict them.
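Checking whether the independently predicted heads form a tree amounts to following head pointers from every word and verifying that ROOT is always reached without revisiting a node; the helper below is an illustrative check rather than part of the model.

```python
def is_tree(heads):
    """heads[i] is the predicted head of word i + 1, with ROOT encoded as 0.
    Every word has exactly one head, so the arcs form a tree iff following
    head pointers from any word reaches ROOT without revisiting a node."""
    n = len(heads)
    for start in range(1, n + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:          # a cycle: the output is not a tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

print(is_tree([0, 1, 2]))  # True: ROOT -> 1 -> 2 -> 3 is a chain
print(is_tree([2, 1, 2]))  # False: words 1 and 2 head each other
```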
So far we have focused on unlabeled dependencies; however, it is relatively straightforward to extend DeNSe to produce labeled dependencies. We simply train an additional classifier to predict labels for the arcs which have already been identified. The classifier takes as input features $[\mathbf{a}_i; \mathbf{a}_j; \mathbf{x}_i; \mathbf{x}_j]$ representing properties of the arc $\langle w_j, w_i \rangle$. These consist of $\mathbf{a}_i$ and $\mathbf{a}_j$, the LSTM-based representations of $w_i$ and $w_j$ (see Equation (4)), and their word and part-of-speech embeddings, $\mathbf{x}_i$ and $\mathbf{x}_j$ (see Equation (3)). Specifically, we use a trained DeNSe model to go through the training corpus and generate features and corresponding dependency labels as training data. We employ a two-layer rectifier network (Glorot et al., 2011) for the classification task.
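A sketch of such a classifier is shown below; the hidden layer size and module structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ArcLabeler(nn.Module):
    """Two-layer rectifier network over [a_i; a_j; x_i; x_j] for a predicted arc <w_j, w_i>."""

    def __init__(self, repr_dim, emb_dim, num_labels, hidden_dim=200):
        super().__init__()
        in_dim = 2 * repr_dim + 2 * emb_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),   # scores over dependency labels
        )

    def forward(self, a_i, a_j, x_i, x_j):
        return self.net(torch.cat([a_i, a_j, x_i, x_j], dim=-1))
```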

Maximum Spanning Tree Algorithms
As mentioned earlier, the greedy inference strategy may fail to produce well-formed trees. In this case, the output of DeNSe can be adjusted with a maximum spanning tree algorithm. We use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) for building non-projective trees and the Eisner algorithm (Eisner, 1996) for projective ones.
Following McDonald et al. (2005b), we view a sentence $S = (w_0 = \mathrm{ROOT}, w_1, \ldots, w_N)$ as a graph $G_S = \langle V_S, E_S \rangle$ with the sentence words and the dummy ROOT symbol as vertices and a directed edge between every pair of distinct words as well as from the ROOT symbol to every word. The directed graph $G_S$ is defined as:

$$V_S = \{w_0 = \mathrm{ROOT}, w_1, \ldots, w_N\}$$
$$E_S = \{\langle i, j \rangle : i \ne j,\ \langle i, j \rangle \in [0, N] \times [1, N]\}$$
$$s(i, j) = P_{head}(w_i \,|\, w_j, S)$$

where $s(i, j)$ is the weight of edge $\langle i, j \rangle$ and $P_{head}(w_i \,|\, w_j, S)$ is known (Equation (5)). The problem of dependency parsing now boils down to finding the tree with the highest score, which is equivalent to finding a maximum spanning tree (MST) in $G_S$ (McDonald et al., 2005b).
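The sketch below illustrates how the model's probabilities can be arranged into such a weight matrix; the array layout is an assumption made for illustration.

```python
import numpy as np

def edge_weights(p_head):
    """p_head[dep, head] = P_head(w_head | w_dep, S), e.g., the softmax output of the scorer.
    Returns s with s[head, dep] = weight of the directed edge <head, dep> in G_S."""
    n = p_head.shape[0]                  # n = N + 1 tokens, position 0 is ROOT
    s = np.full((n, n), -np.inf)         # impossible edges keep weight -inf
    for head in range(n):
        for dep in range(1, n):          # ROOT never acts as a dependent
            if head != dep:
                s[head, dep] = p_head[dep, head]
    return s
```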
Non-projective Parsing To build a non-projective parser, we solve the MST problem with the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). The algorithm selects for each vertex (excluding ROOT) the incoming edge with the highest weight. If a tree results, it must be the maximum spanning tree and the algorithm terminates. Otherwise, there must be a cycle, which the algorithm identifies, contracts into a single vertex, and recalculates edge weights going into and out of the cycle. The greedy inference strategy described in Equation (8) is essentially a sub-procedure of the Chu-Liu-Edmonds algorithm, with the algorithm terminating after the first iteration. In practice, we only run the Chu-Liu-Edmonds algorithm on graphs with cycles, i.e., non-tree outputs.
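The sketch below illustrates this first iteration, i.e., best-incoming-edge selection followed by cycle detection; contracting the cycle and recalculating weights are left to a full implementation.

```python
def best_incoming(s):
    """s[head, dep]: edge weights from edge_weights(); returns heads[dep - 1] for dep = 1..N."""
    n = s.shape[0]
    return [max(range(n), key=lambda h: s[h, dep]) for dep in range(1, n)]

def find_cycle(heads):
    """Return the vertices of one cycle among the chosen edges, or None if they form a tree."""
    n = len(heads)
    for start in range(1, n + 1):
        path, node = [], start
        while node != 0:
            if node in path:
                return path[path.index(node):]   # the cycle that would be contracted
            path.append(node)
            node = heads[node - 1]
    return None
```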
Projective Parsing For projective parsing, we solve the MST problem with the Eisner algorithm (Eisner, 1996). The time complexity of the Eisner algorithm is $O(N^3)$, while checking whether a tree is projective can be done considerably faster, with an $O(N \log N)$ algorithm. Therefore, we only apply the Eisner algorithm to the non-projective outputs of the greedy strategy. Finally, it should be noted that training our model does not rely on the Chu-Liu-Edmonds or Eisner algorithm, or any other graph-based algorithm. MST algorithms are only used at test time to correct non-tree outputs, which are in the minority; DeNSe acquires the underlying tree-structure constraints from the data without an explicit learning algorithm.
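A naive projectivity check simply tests whether any two arcs cross; the $O(N^2)$ helper below is a simplified stand-in for the faster check mentioned above.

```python
def is_projective(heads):
    """heads[i] is the head of word i + 1 (ROOT = 0). A tree is projective iff no two arcs cross."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:   # arc (l2, r2) starts inside (l1, r1) but ends outside it
                return False
    return True

print(is_projective([2, 0, 2]))     # True:  2 -> 1, ROOT -> 2, 2 -> 3
print(is_projective([0, 3, 1, 2]))  # False: the arcs 1 -> 3 and 2 -> 4 cross
```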

Experiments
We evaluated our parser in projective and non-projective settings. In the following, we describe the datasets we used and provide training details for our models. We also present comparisons against multiple previous systems and analyze the parser's output.

Datasets
In the projective setting, we assessed the performance of our parser on the English Penn Treebank (PTB) and the Chinese Treebank 5.1 (CTB). Our experimental setup closely follows Chen and Manning (2014) and Dyer et al. (2015).
For English, we adopted the Stanford basic dependencies (SD) representation (De Marneffe et al., 2006). We followed the standard splits of the PTB: sections 2-21 were used for training, section 22 for development, and section 23 for testing. POS tags were assigned using the Stanford tagger (Toutanova et al., 2003) with an accuracy of 97.3%. For Chinese, we followed the same split of CTB5 introduced in Zhang and Clark (2008). In particular, we used sections 001-815 and 1001-1136 for training, sections 886-931 and 1148-1151 for development, and sections 816-885 and 1137-1147 for testing. The original constituency trees in CTB were converted to dependency trees with the Penn2Malt tool. We used gold segmentation and gold POS tags as in Chen and Manning (2014) and Dyer et al. (2015).
In the non-projective setting, we assessed the performance of our parser on Czech and German, the largest non-projective datasets released as part of the CoNLL 2006 multilingual dependency parsing shared task. Since there is no official development set in either dataset, we used the last 374/367 sentences of the Czech/German training set as development data. Projective statistics of the four datasets are summarized in Table 1.

Training Details
We trained our models on an Nvidia GPU card; training takes one to two hours. Model parameters were uniformly initialized in [−0.1, 0.1]. We used Adam (Kingma and Ba, 2014) to optimize our models with the hyper-parameters recommended by the authors (i.e., learning rate 0.001, first momentum coefficient 0.9, and second momentum coefficient 0.999). To alleviate the exploding gradient problem, we rescaled the gradient when its norm exceeded 5 (Pascanu et al., 2013). Dropout (Srivastava et al., 2014) was applied to our model with the strategy recommended in Zaremba et al. (2014). On all datasets, we used two-layer LSTMs and set d = s = 300, where d is the hidden unit size and s is the word embedding size.
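Schematically, a training step then looks as follows; `model` and `batch_loss_fn` are placeholders for the modules sketched earlier rather than a released implementation.

```python
import torch

def make_optimizer(model):
    # Adam with the recommended hyper-parameters (lr 0.001, betas 0.9 / 0.999).
    return torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

def training_step(model, optimizer, batch_loss_fn, batch):
    optimizer.zero_grad()
    loss = batch_loss_fn(model, batch)   # negative log likelihood of gold heads, Eq. (7)
    loss.backward()
    # Rescale the gradient when its norm exceeds 5 (Pascanu et al., 2013).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```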
As in previous neural dependency parsing work (Chen and Manning, 2014; Dyer et al., 2015), we used pre-trained word vectors to initialize our word embedding matrix $\mathbf{W}_e$. For the PTB experiments, we used 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014). For the CTB experiments, we trained 300-dimensional GloVe vectors on the Chinese Gigaword corpus, which we segmented with the Stanford Chinese Segmenter (Tseng et al., 2005). For Czech and German, we did not use pre-trained word vectors to initialize the word embedding matrix. We set the POS tag embedding size to q = 30 in the English experiments, q = 50 in the Chinese experiments, and q = 40 in both the Czech and German experiments.

Results
For both the English and Chinese experiments, we report unlabeled (UAS) and labeled attachment scores (LAS) on the development and test sets; following Chen and Manning (2014), punctuation is excluded from the evaluation.
Experimental results on the PTB are shown in Table 2. We compared our model with several recent papers following the same evaluation protocol and experimental settings. The first block in the table contains parsers which do not use neural networks, such as Bohnet10 (Bohnet, 2010). Results on the Chinese dataset are shown in Table 3; the best-performing comparison system relies on a complex higher-order decoding algorithm that involves cube pruning and strategies for encouraging diversity. Post-processing the output of our parser with the Eisner algorithm generally improves performance (by 0.21%; see the last row in Table 3). In Figure 2 we analyze the performance of our parser on sentences of different length. On both PTB and CTB, DeNSe has an advantage on long sentences compared to C&M14 and Dyer15. Finally, we report unlabeled sentence-level exact match (UEM) in Table 4. Interestingly, even when using the greedy inference strategy, DeNSe yields a UEM comparable to Dyer15 on the PTB. Exact match results using labeled dependencies are similar; we omit them due to lack of space.
For Czech and German, we closely follow the evaluation setup of CoNLL 2006. We report both UAS and LAS, although most previous work has focused on UAS. Our results are summarized in Table 5. As can be seen, DeNSe outperforms all other first- (and second-) order parsers on both German and Czech. As in the projective experiments, we observe slight improvements in both UAS and LAS when using an MST algorithm. On German, DeNSe is comparable with the best third-order parser (Turbo-3rd), while on Czech it lags behind Turbo-3rd and RBG-3rd. This is not surprising considering that DeNSe is a first-order parser and only uses words and POS tags as features. Comparison systems use a plethora of hand-crafted features and more sophisticated higher-order decoding algorithms.
Our experimental results demonstrate that using an MST algorithm during inference can slightly improve the model's performance. We further examined the extent to which the MST algorithm is necessary for producing dependency trees. Table 6 shows the percentage of trees before and after the application of the MST algorithm across the four languages. In the majority of cases DeNSe outputs trees (ranging from 87.0% to 96.7%), and a significant proportion of them are projective (ranging from 65.5% to 86.6%). Therefore, only a small proportion of outputs (14.0% on average) need to be post-processed with the Eisner or Chu-Liu-Edmonds algorithm.

Conclusions
In this work we presented DeNSe, a neural dependency parser which we train without a transition system or a graph-based algorithm. Experimental results show that DeNSe achieves competitive performance across four different languages and can seamlessly transfer from a projective to a non-projective parser simply by changing the post-processing MST algorithm used at inference time. In the future, we would like to increase the coverage of our parser by using tri-training techniques (Li et al., 2014) and multi-task learning (Luong et al., 2015).

Figure 1 :
The architecture of our dependency parsing model. DeNSe estimates the probability of a word being the head of another word based on the bidirectional LSTM representations of the two words. $P_{head}(\mathrm{ROOT}\,|\,love, S)$ stands for the probability of ROOT being the head of love (arcs denote candidate heads; the solid arc corresponds to the gold standard).

Table 1 :
Projective statistics of the four datasets. The number of sentences and the percentage of projective trees are calculated on the training set.

Table 2 :
Results on the English dataset (PTB with Stanford Dependencies). +E means that we post-process non-projective outputs with the Eisner algorithm.

Table 3 :
Results on the Chinese dataset (CTB). +E means that we post-process non-projective outputs with the Eisner algorithm.

Figure 2 :
UAS against sentence length on PTB and CTB (development set). We sort all sentences by length in ascending order and divide them equally into 10 bins. The horizontal axis shows the length of the last sentence in each bin.

Table 4 :
Unlabeled exact match results on PTB and CTB.

Table 5 :
Non-projective results on the CoNLL 2006 dataset. +CLE means that we post-process non-tree outputs with the Chu-Liu-Edmonds algorithm.

We compare DeNSe against three non-projective graph-based dependency parsers: the MST parser (McDonald et al., 2005b), the Turbo parser (Martins et al., 2013), and the RBG parser (Lei et al., 2014). We show the performance of these parsers in the first-order setting (e.g., MST-1st) and in higher-order settings (e.g., Turbo-3rd). The results of MST-1st, MST-2nd, RBG-1st, and RBG-3rd are reported in Lei et al. (2014), and the results of Turbo-1st and Turbo-3rd are reported in Martins et al. (2013). We show results for our parser with greedy inference (DeNSe in the tables) and when we use the Chu-Liu-Edmonds algorithm to post-process non-tree outputs (DeNSe+CLE).

Table 6 :
Percentage of trees and projective trees on the development set before and after DeNSe applies an MST algorithm. On PTB and CTB we use the Eisner algorithm, and on Czech and German the Chu-Liu-Edmonds algorithm.