Incremental Graph-based Neural Dependency Parsing

Recent studies on neural dependency parsers have shown advantages over traditional ones on a wide variety of languages. However, graph-based neural dependency parsing systems either count on long-term memory and attention mechanisms to implicitly capture high-order features, or give up global exhaustive inference algorithms in order to harness features over a rich history of parsing decisions. The former may miss important features for specific headword predictions without the help of explicit structural information, and the latter may suffer from error propagation as false early structural constraints are used to create features for future predictions. We explore the feasibility of explicitly taking high-order features into account while retaining the main advantage of global inference and learning for graph-based parsing. The proposed parser first forms an initial parse tree by head-modifier predictions based on a first-order factorization. High-order features (such as grandparent, sibling, and uncle) can then be defined over the initial tree and used to refine the parse tree in an iterative fashion. Experimental results show that our model (called INDP) achieved performance competitive with existing benchmark parsers on both English and Chinese datasets.


Introduction and Motivation
The rise of machine learning methods in natural language processing (NLP), coupled with the availability of treebanks (Buchholz and Marsi, 2006) for a wide variety of languages, has led to a rapid increase in research on data-driven dependency parsing. The two predominant paradigms for data-driven dependency parsing are often called graph-based and transition-based dependency parsing (McDonald and Nivre, 2007, 2011). The first category learns parameters to score correct dependency subgraphs over incorrect ones, typically by factoring the graphs into their component directed arcs, and performs parsing by searching for the highest-scoring graph for a given sentence. The second category of parsing systems instead learns to predict a transition from one parse state to the next given the parse history, and performs parsing by taking the predicted transition at each parse state until a complete dependency graph is derived.
Empirical studies show that graph-based and transition-based models exhibit no statistically significant difference in accuracy on a variety of languages, although they are very different theoretically (McDonald and Nivre, 2011). Graph-based models are usually trained by maximizing the difference in score between the entire correct dependency graph and all incorrect ones for every training sentence. However, exhaustive inference is generally NP-hard when the score is factored over any extended scope of the dependency subgraph beyond a single arc (McDonald and Satta, 2007), which is the primary shortcoming of graph-based systems. In transition-based parsing, the feature representations are not restricted to a small number of arcs in the graph but can be derived from all the dependency subgraphs built so far; the main disadvantage of these models is that the locally greedy parsing strategy may lead to error propagation, because false early predictions can eliminate valid parse trees.
With a few exceptions (Zeman and Žabokrtský, 2005; Zhang and Clark, 2008; Zhang et al., 2014), graph-based parsers usually perform global learning and inference but define features over a limited scope of the dependency graph, while transition-based ones typically use local, greedy training and inference but introduce a rich feature space based on the history of parsing decisions.
Many approaches have been proposed to overcome the weaknesses of traditional graph-based or transition-based models. There are at least three avenues for potential improvement: ensembles (weighting the predictions of multiple parsing systems (Sagae and Lavie, 2006; Hall et al., 2007)), feature integration (combining the two models by allowing the output of one model to define features for the other (Martins et al., 2008; Nivre and McDonald, 2008; McDonald and Nivre, 2011)), and novel approaches (changing the underlying model structure directly, by constructing globally trained transition-based parsers (Zhang and Clark, 2008; Huang and Sagae, 2010) or graph-based parsers with rich features (Riedel and Clarke, 2006; Nakagawa, 2007; Smith and Eisner, 2008; Martins et al., 2009)).
Very recently, studies on deep architectures have shown advantages over shallow ones on a wide variety of dependency parsing benchmarks. Deep neural networks have been used to replace the classifiers that predict optimal transitions in transition-based parsers (Chen and Manning, 2014) or the scoring functions that rank subgraphs in their graph-based rivals (Kiperwasser and Goldberg, 2016a,b). Several recent developments in neural dependency parsing (Weiss et al., 2015; Zhou et al., 2015; Dyer et al., 2015) can be viewed as targeting the weaknesses of locally greedy algorithms in transition-based models by using beam search and a conditional random field loss objective; using beam search instead of strictly deterministic parsing can alleviate the error propagation problem to some extent, but does not eliminate it.
Graph-based neural dependency parsing systems either count on long-term memory and neural attention to implicitly capture high-order features (Kiperwasser and Goldberg, 2016b; Cheng et al., 2016; Dozat and Manning, 2017), or give up global inference algorithms in order to introduce features over a rich history of parsing decisions via a greedy, bottom-up method (Kiperwasser and Goldberg, 2016a). The former may miss important information for specific headword predictions without the help of structural features derived from the entire parse tree, while the latter may suffer from error propagation as false structural constraints are used to create features for future predictions. In this study, we explore the feasibility of explicitly taking advantage of high-order features while retaining the strength of global exhaustive inference and learning as a graph-based parser.
The proposed parser first encodes each word in a sentence by distributed embeddings using a convolutional neural network, and constructs an initial parse graph by head-modifier predictions with a maximum directed spanning tree algorithm based on first-order features (i.e. the score is factored over the arcs in a graph). Once an initial parse graph is built, high-order features (such as grandparent, sibling, and uncle) can be defined and used to refine the structure of the parse tree in an iterative way. In theory, the refinement continues until no change is made in an iteration. Experimental results demonstrate, however, that good performance can be achieved with no more than two rounds of updates, because many dependencies are already determined by independent arc prediction and only a few head-modifier pairs need to be re-estimated after one update (i.e. only a few changes above and beyond the dominant first-order scores). We call the proposed model incremental neural dependency parsing (INDP) 1 .

Incremental Neural Dependency Parser
Given an input sentence x, we denote the set of all valid dependency parse trees that can be constructed from x as Y(x). Assuming there exists a graph scoring function s, the dependency parsing problem can be formulated as finding the highest scoring directed spanning tree for the sentence x:

y*(x) = argmax_{ŷ ∈ Y(x)} s(x, ŷ; θ)    (1)
where y*(x) is the parse tree with the highest score, and θ is the set of parameters used to compute the scores. To make the search tractable, the score of a graph is usually factorized into the sum of its arc (head-modifier) scores (McDonald et al., 2005a):

s(x, ŷ; θ) = Σ_{(h,m) ∈ A(ŷ)} s(h, m)    (2)
where A(ŷ) denotes the set of directed arcs in the parse tree ŷ. The score of an arc (h, m) represents the likelihood of creating a dependency from head h to modifier (or dependent) m in a dependency tree. If each arc score is estimated independently, we call it a first-order factorization. When the scoring is based on two or more arcs, second- or higher-order factorizations are applied.
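As a concrete illustration, the first-order factorization reduces scoring a whole tree to summing independent arc scores. The sketch below is illustrative only: the matrix `S` and the head map stand in for the parser's scoring network output and a decoded tree.

```python
def tree_score(S, heads):
    # First-order factorization: the score of a parse tree is the sum
    # of its arc scores. heads[m] = h encodes the arc (h, m), and
    # S[h][m] is that arc's score.
    return sum(S[h][m] for m, h in heads.items())
```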
In traditional approaches, this score is commonly defined as the product of a high-dimensional feature representation of the arc and a learned weight vector. The performance of those systems is heavily dependent on the choice of features. For that reason, much effort in designing such systems goes into feature engineering, which is important but labor-intensive, based first on human ingenuity and linguistic intuition, and then confirmed or refined by empirical analyses. In this study, a neural network is instead designed to estimate the arc scores using high-order features. In the following, we first describe how the word representations are produced. Then, the key components of the INDP, direction-specific scoring with special normalization and incremental refinement with high-order features, are discussed in detail. Finally, we present the entire parsing algorithm of the INDP.

Word Feature Representations
In graph-based neural dependency parsing work, such as (Kiperwasser and Goldberg, 2016a,b; Dozat and Manning, 2017), the recurrent neural network (RNN) is a popular statistical learner for producing continuous vector representations of each word in a sentence, due to its ability to bridge long time lags between relevant inputs. We chose to use one-dimensional convolution instead as a building block, because it captures the interactions of word feature representations in a context window well enough at lower computational cost. This design allows the parameters of our first-order parser to be optimized efficiently; the parser is then augmented with high-order features (i.e. long-distance dependencies) at the incremental refinement stages.
The words are fed into the network as indices that are used by a lookup operation to transform words into their feature vectors. We consider a fixed-size word dictionary D 2 . The vector representations are stored in a word embedding matrix E_word ∈ R^{d×|D|}, where d is the dimensionality of the vector space (a hyper-parameter to be chosen) and |D| is the size of the dictionary. Like (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Cheng et al., 2016), we also map part-of-speech (POS) tags to another q-dimensional vector space, and provide POS type features for words. Formally, assume we are given a sentence x_[1:n] that is a sequence of n words x_i, 1 ≤ i ≤ n. For each word x_i ∈ D that has an associated index k_i into the columns of the matrix E_word, and is labeled with a POS tag of type l_i, its feature representation is obtained by concatenating the word and POS tag embeddings:

f_i = [E_word e_{k_i} ; E_pos e_{l_i}]    (3)

where E_pos ∈ R^{q×|P|} is a POS tag embedding matrix and |P| is the size of the POS tag set P (fine-grained POS tags are used if available). The binary vectors e_{k_i} and e_{l_i} are one-hot encodings for the i-th word in the sentence.

The lookup table layer extracts features for each single word, but the meaning of a word is strongly related to its surrounding words. Given a word, we consider a fixed-size window of w words (another hyper-parameter) around it. More precisely, given an input sentence x_[1:n], the feature window produced by the first lookup table layer at position x_i can be written as:

f_win = [f_{i-(w-1)/2}, ..., f_i, ..., f_{i+(w-1)/2}]    (4)

where the word feature window is a matrix f_win ∈ R^{(d+q)×w}, and each column of the matrix is the word feature vector in the context window. A one-dimensional convolution is used to yield another feature vector by taking the dot product of the filter vectors with the rows of the matrix f_win at the same dimension.
After each row of f_win is convolved with the corresponding column of a filter matrix W_1, a non-linear function φ(·) is applied:

f_con[j] = φ( Σ_{t=1}^{w} W_1[t, j] · f_win[j, t] ),  j = 1, ..., d+q    (5)

where the weights in the matrix W_1 ∈ R^{w×(d+q)} are parameters to be trained, and the output f_con ∈ R^{d+q} is a vector. We choose the hyperbolic tangent as the non-linear function φ. The word feature vectors from a window of text can be computed efficiently thanks to the speed advantage of the one-dimensional convolution (Kalchbrenner et al., 2014).
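The per-dimension convolution described above can be sketched as follows. This is a minimal NumPy sketch under the shapes stated in the text, with φ fixed to tanh; it is illustrative, not the parser's actual implementation.

```python
import numpy as np

def conv_window(f_win, W1):
    # f_win: (d+q) x w matrix whose columns are the word feature
    #        vectors in the context window around the current word.
    # W1:    w x (d+q) filter matrix; column j filters feature dim j.
    # Output: f_con[j] = tanh(sum_t W1[t, j] * f_win[j, t]).
    return np.tanh(np.einsum('jt,tj->j', f_win, W1))
```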

Direction-Specific Scoring
For the same head-modifier arc (h, m), the head word h may occur on the left side of m (i.e. a left-arc) in some sentences while it can also appear on the right side of m (i.e. a right-arc) in others. Consider two English sentences excerpted from the Penn Treebank (Buchholz and Marsi, 2006): "A group of workers exposed to it." and "Mr. Vinken is chairman of Elsevier, the Dutch publishing group." They share the same (group, of) head-modifier arc, but the two words occur in different orders. This would not be a problem in traditional models, such as (McDonald et al., 2005a; Nivre and McDonald, 2008), in which the arc directions are directly used as features by their structured learning algorithms. However, it is hard to train a single neural network that gives a higher score to the left-arc case than to the right-arc one in some situations while reversing the preference in others, because of the symmetries in weight space (note that we cannot tell which case is correct in advance, and both cases need to be scored). The problem would be more serious when the first-order factorization is applied, due to the lack of context information. Based on the above observations, we use one multi-layer perceptron (MLP) to score the left-arc cases, and another MLP to score the right-arc ones. The two MLPs share the word and POS tag embeddings, and can update them when necessary during the training process. Formally, if an MLP with one hidden layer is used, the score of each possible head-modifier arc is computed as:

s(h, m) = W_3 φ( W_2 [f_con(h) ; f_con(m) ; f_dis_{h,m}] + b_2 )    (6)

where the convolutional outputs of the head and dependent words are concatenated with a bucketed distance between the head and modifier, denoted by f_dis_{h,m}, in buckets of 0 (root), 1, 2, 3-5, and 6+, and fed into the MLP for scoring. The weights of the hidden and output layers are denoted by W_2 and W_3 respectively, and the corresponding bias by b_2.
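One of the two direction-specific scoring MLPs can be sketched as follows. Assumptions are flagged explicitly: the one-hot encoding of the distance bucket and all weight shapes are hypothetical illustration choices, not details given in the text.

```python
import numpy as np

def bucket_distance(h, m):
    # Distance buckets from the text: 0 (root), 1, 2, 3-5, 6+.
    if h == 0:
        return 0
    d = abs(h - m)
    return d if d <= 2 else (3 if d <= 5 else 4)

def arc_score(fcon_h, fcon_m, h, m, W2, b2, W3):
    # One hidden-layer MLP: the head and modifier convolution outputs
    # are concatenated with the bucketed-distance feature and scored.
    f_dis = np.zeros(5)
    f_dis[bucket_distance(h, m)] = 1.0   # hypothetical one-hot encoding
    x = np.concatenate([fcon_h, fcon_m, f_dis])
    return float(W3 @ np.tanh(W2 @ x + b2))
```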
Once every possible arc is scored, we obtain a matrix like Figure 1, in which the element at row i and column j is the score for the (x_i, x_j) arc, denoted by s(i, j). An artificial word, x_0, is inserted at the beginning of a sentence; it always serves as the single root of the graph and is primarily a means to simplify computation. The scores in the lower (or upper) triangle are computed by the left-arc (or right-arc) MLP, and the shaded elements do not need to be calculated.
We can treat s(i, j) as the score of the corresponding arc and then search for the highest scoring directed spanning tree to form a dependency parse tree, as proposed in (McDonald et al., 2005b). This problem can be solved using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967), which can be implemented in O(n^2).
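For reference, the decoder can be sketched as a compact recursive Chu-Liu-Edmonds implementation over a dense score matrix. This is an illustrative sketch (the recursive contraction shown here is not the optimized O(n^2) variant that production parsers use):

```python
def find_cycle(head, root):
    # Return a cycle in the greedy head assignment, or None.
    for start in head:
        path, pos = [], {}
        v = start
        while v != root and v not in pos:
            pos[v] = len(path)
            path.append(v)
            v = head[v]
        if v != root and v in pos:
            return path[pos[v]:]
    return None

def chu_liu_edmonds(score, root=0):
    # score[h][m]: score of arc h -> m. Returns {modifier: head}.
    n = len(score)
    head = {m: max((h for h in range(n) if h != m), key=lambda h: score[h][m])
            for m in range(n) if m != root}
    cycle = find_cycle(head, root)
    if cycle is None:
        return head
    C = set(cycle)
    cyc_score = sum(score[head[v]][v] for v in cycle)
    keep = [v for v in range(n) if v not in C]
    idx = {v: i for i, v in enumerate(keep)}
    c = len(keep)                          # index of the contracted node
    m2 = [[float('-inf')] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}
    for u in keep:
        for v in keep:
            if u != v:
                m2[idx[u]][idx[v]] = score[u][v]
        best_in = max(C, key=lambda v: score[u][v] - score[head[v]][v])
        m2[idx[u]][c] = score[u][best_in] - score[head[best_in]][best_in] + cyc_score
        enter[u] = best_in
        best_out = max(C, key=lambda v: score[v][u])
        m2[c][idx[u]] = score[best_out][u]
        leave[u] = best_out
    sub = chu_liu_edmonds(m2, root=idx[root])   # solve the contracted graph
    inv = {i: v for v, i in idx.items()}
    result = {}
    for m_, h_ in sub.items():
        if m_ == c:                        # arc entering the contracted cycle
            u = inv[h_]
            result[enter[u]] = u           # break the cycle at the entry point
            for w in cycle:
                if w != enter[u]:
                    result[w] = head[w]
        elif h_ == c:                      # arc leaving the cycle
            result[inv[m_]] = leave[inv[m_]]
        else:
            result[inv[m_]] = inv[h_]
    return result
```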
Figure 1. The arc scoring matrix, in which the element at row i and column j is the score for the (x_i, x_j) arc, denoted by s(i, j). A dependency tree can be formed by finding the highest scoring directed spanning tree over the scoring matrix.

The left-arc and right-arc MLPs should carefully collaborate with each other; otherwise, one MLP may be overwhelmed by the other (i.e. the maximum score produced by one MLP is less than the minimum produced by the other). To overcome this bias problem, we use a partition function obtained by summing over the elements in each row of the scoring matrix, so that the scores/probabilities are normalized across the two MLPs. The conditional probability of arc (x_i, x_j) given a sentence x_[1:n] is defined as:

p((x_i, x_j) | x_[1:n]; θ) = exp(s(i, j)) / Z_i(x_[1:n]; θ),  Z_i(x_[1:n]; θ) = Σ_j exp(s(i, j))    (7)

Each Z_i(x_[1:n]; θ) is a normalization term used to predict x_i's head word.
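The normalization can be sketched as follows. This is a NumPy sketch of the row-wise softmax implied by the text, where each row of the raw matrix collects candidate-arc scores from both direction-specific MLPs; the max-subtraction is a standard numerical-stability trick, not part of the model.

```python
import numpy as np

def normalize_scores(S):
    # S[i, j]: raw MLP score of the candidate arc between x_i and x_j.
    # Summing over each row yields the partition function Z_i, so the
    # left-arc and right-arc scores compete on a common scale.
    S = S - S.max(axis=1, keepdims=True)       # for numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)
```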

Incremental Refinement with High-order Features
Given an input sentence, once the initial dependency tree is built using the first-order factorization, we can define high-order features over the resulting tree. For each head-modifier arc, the modifier's left sibling, right sibling, leftmost child, and rightmost child vector representations are concatenated with the inputs of Equation (6), and then fed into two new left-arc and right-arc MLPs to update the scoring matrix. Like the head and modifier, these additional feature representations are taken from the results produced by the convolution layer. As shown in Figure 2, commonly-used high-order features are taken into account, such as consecutive siblings (H, B, S), tri-siblings (B, M, S), and grandparent (H, M, R). Missing feature vectors are replaced by one of four special vectors, namely "left-sibling", "right-sibling", "leftmost-child", and "rightmost-child", according to their relations to the modifier word. Although high-order features are used, the highest scoring parse tree can still be found efficiently in O(n^2) by the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). The main rationale is that, even in the presence of high-order features, the resulting scores remain based on single head-modifier arcs. The high-order features are derived from the parse tree obtained with first-order inference, and because that tree is already reasonably accurate, these features end up being a good approximation, one that can be further improved by incremental refinements of the parse tree. Thus, the high-order features used by the scoring MLPs can offer deliberate refinement above and beyond the first-order results. In theory, the refinement can continue until there is no update to the scoring matrix. However, experimental results show that comparable performance can be achieved with no more than two high-order refinements (see Section 3).
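Extracting the high-order feature slots from an initial tree can be sketched as follows. In this illustrative sketch, word positions stand in for their convolutional feature vectors, and `None` marks a slot that would be filled by one of the four special vectors.

```python
def high_order_slots(heads):
    # heads[m] = h gives the head of each modifier; word positions are
    # integers, with position 0 as the artificial root.
    children = {}
    for m in sorted(heads):                 # children listed left to right
        children.setdefault(heads[m], []).append(m)
    slots = {}
    for m, h in heads.items():
        sibs = children[h]
        i = sibs.index(m)
        kids = children.get(m, [])
        slots[m] = {
            'left_sibling':    sibs[i - 1] if i > 0 else None,
            'right_sibling':   sibs[i + 1] if i + 1 < len(sibs) else None,
            'leftmost_child':  kids[0] if kids else None,
            'rightmost_child': kids[-1] if kids else None,
        }
    return slots
```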
We add a softmax layer to the network (after removing the last scoring layer) to predict syntactic labels for each arc. Labeling is trained by minimizing the cross-entropy error of the softmax layer using backpropagation. The network performs structure prediction and labeling jointly; the two tasks share several layers (from the input to the convolutional layers) of the network. When minimizing the cross-entropy error of the softmax layer, the error also backpropagates and influences both the network parameters and the embeddings. We list our incremental neural dependency parsing algorithm in Figure 3. Starting with an initial tree formed using the first-order features, the algorithm makes changes to the parse tree with high-order refinements in an attempt to climb the objective function.

Figure 3. The INDP parsing algorithm.
Input: x, an input sentence; T, maximum number of iterations.
Output: optimal dependency tree y*.
1: form an initial tree using the first-order features;
2: t = 0;
3: repeat
4:   update the scoring matrix using the high-order features;
5:   find the highest scoring tree y by the Chu-Liu-Edmonds algorithm;
6:   t = t + 1;
7: until no change in this iteration or t ≥ T;
8: predict syntactic labels based on the parse tree y;
9: return y* = y;
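The loop in Figure 3 can be sketched as follows. The three callables are hypothetical wrappers around the trained first-order network, the high-order network, and a maximum spanning tree decoder; their names are illustration choices, not the paper's API.

```python
def indp_parse(x, score_first, score_high, decode, T=5):
    # score_first(x)      -> initial arc-score matrix (first-order)
    # score_high(x, tree) -> refreshed matrix using features over `tree`
    # decode(S)           -> highest scoring spanning tree for matrix S
    tree = decode(score_first(x))           # step 1: initial tree
    for _ in range(T):                      # steps 3-7: refinement loop
        new_tree = decode(score_high(x, tree))
        if new_tree == tree:                # converged: no change made
            break
        tree = new_tree
    return tree                  # syntactic labels are predicted afterwards
```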

Training
Given a training example (x, y), we define a structured margin loss Δ(x, y, ŷ) for proposing a parse ŷ for sentence x when y is the true parse. This penalty is proportional to the number of unlabeled arcs on which the two parse trees do not agree; in particular, Δ(x, y, ŷ) is equal to 0 if y = ŷ. The loss function is defined as a penalization of incorrect arcs:

Δ(x, y, ŷ) = Σ_{(h,m) ∈ A(ŷ)} κ · 1{(h, m) ∉ A(y)}    (8)

where κ is a penalization term for each incorrect arc, and A(y) is the set of arcs in the true parse y. For a training set, we seek a function with small expected loss on unseen sentences. The function we consider takes the form of Equation (1). The score of a tree ŷ is higher if the algorithm is more confident that the structure of the tree is correct. In the max-margin estimation framework, we want to ensure that the highest scoring tree is the true parse for all training instances (x_i, y_i), i = 1, ..., h, and that its score is larger up to a margin defined by the loss. For all i in the training data:

s(x_i, y_i; θ) ≥ s(x_i, ŷ; θ) + Δ(x_i, y_i, ŷ),  ∀ŷ ∈ Y(x_i)    (9)

This leads us to minimize the following regularized objective for h training instances:

J(θ) = (1/h) Σ_{i=1}^{h} r_i(θ) + (λ/2) ||θ||²,  r_i(θ) = max_{ŷ ∈ Y(x_i)} ( s(x_i, ŷ; θ) + Δ(x_i, y_i, ŷ) ) − s(x_i, y_i; θ)    (10)

where the coefficient λ governs the relative importance of the regularization term compared with the error. Trees are penalized more by the loss the further they deviate from the correct one. Minimizing this objective maximizes the score of the correct tree and minimizes that of the highest scoring incorrect parse. The objective is not differentiable due to the hinge loss; we use the subgradient method to compute a gradient-like direction for minimizing it.
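The structured hinge term for one training sentence can be sketched as follows. Note a simplification: the loss-augmented maximizer is approximated here by a given predicted tree rather than by exact loss-augmented decoding, so this is an illustrative sketch of the quantity being minimized, not the full training procedure.

```python
def hinge_term(S, gold, pred, kappa=1.0):
    # Structured hinge under the arc-factored score:
    # max(0, score(pred) + margin(pred) - score(gold)), where the margin
    # adds kappa for every predicted arc absent from the gold tree.
    def tree_score(heads):
        return sum(S[h][m] for m, h in heads.items())
    margin = kappa * sum(gold[m] != pred[m] for m in gold)
    return max(0.0, tree_score(pred) + margin - tree_score(gold))
```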

Experiments
We conducted three sets of experiments. The first tests several variants of the INDP on the development set, to gain some understanding of how the choice of hyper-parameters impacts performance. The goal of the second is to see how well the incremental approach, enhanced with high-order features, improves the first-order results, by analysing parsing errors relative to sentence length. In the third set, we compare the performance of the INDP with existing state-of-the-art models on both English and Chinese datasets. We show test results for the proposed model on the English Penn Treebank (PTB), converted into Stanford dependencies using version 3.3.0 of the Stanford dependency converter, and on the Chinese Penn Treebank (CTB). We follow the standard splits of the PTB, using sections 2-21 for training, section 22 as the development set, and section 23 as the test set. We use POS tags generated by the Stanford POS tagger (Toutanova et al., 2003); for the Chinese CTB dataset, we use gold word segmentation and POS tags.

Training Strategy
Previous work has demonstrated that performance can be improved by using word embeddings learned from large-scale unlabeled data in many NLP tasks, both in English (Collobert et al., 2011; Socher et al., 2011) and in Chinese (Zheng et al., 2013). Unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization (Erhan et al., 2010). We leveraged a large unlabeled corpus to learn word embeddings, and then used these embeddings to initialize the word embedding matrices of the neural networks. English and Chinese Wikipedia documents were used to train the word embeddings with the Word2Vec tool 3 proposed in (Mikolov et al., 2013). Previous studies show that a joint solution (i.e., performing several tasks at the same time) usually leads to improved accuracy over pipelined systems, because error propagation is avoided and the various kinds of information normally used in the different steps of pipelined systems can be integrated. The INDP networks are also trained in a joint way, but adopting a three-step strategy. The parameters of the parsing network using the first-order factorization are learned first; when its unlabeled parsing accuracy exceeds a given threshold (e.g. 85%), we start to train the high-order parsing network. The weights already trained in the first step remain unchanged for the first several epochs, as they are used to generate the high-order features. After the parsing accuracy reaches another threshold (e.g. 90%), all the parameters for the first-order and high-order predictions, as well as labeling, are trained jointly.

Hyper-parameter Choices
Hyper-parameters were tuned on the PTB 3.3.0 development set by trying only a few different networks. Generally, the dimensionality of the embeddings and the number of hidden units, provided they are large enough, have a limited impact on generalization performance. In the following experiments, the window size was set to 5, the learning rate to 0.02, and the number of hidden units to 300. The embedding size of words was set to 50, and that of tags to 30, which achieved a good trade-off between speed and performance. All experiments were run on a computer equipped with an Intel Xeon processor working at 2.2GHz, with 16GB RAM and an NVIDIA Titan GPU. The parsing speed of the INDP is around 250-300 sents/sec on average on the PTB dataset.

Sentence Length Factors
It is well known that dependency parsers tend to have lower accuracy on longer sentences because of the increased presence of complex syntactic structures. In order to get a better understanding of how much the incremental strategy and high-order features benefit the models, Figure 4 shows the accuracy of our neural dependency parser using the first-order features only (indicated as "NDP + First-order") and the INDP with at most two high-order refinements (indicated as "INDP + High-order + M2") on the English PTB development set. For simplicity, the experiments report unlabeled parsing accuracy; identical experiments using labeled parsing accuracy did not reveal any additional information. The INDP with high-order refinements is more accurate than the parser using only the first-order features. Because longer dependencies are typically harder to parse, there is still a degradation in performance for our INDP. However, the accuracy curve for the INDP is slightly flatter than that of its reduced version, in which the high-order features and incremental recipe are not applied, for sentence lengths within 11-50. This behavior can be explained by the fact that the feature representations are not restricted to a limited number of graph arcs, but can take into account the (almost) entire dependency graph built so far at the refinement stages of the INDP, and they do offer substantial refinements.

Results
We report the experimental results on the English PTB and Chinese CTB datasets in Tables 1 and 2 respectively, in which our networks are denoted by "INDP". "M1" indicates that the results are obtained by the INDP with just one refinement over the parse graphs built using the first-order features; similarly, "M2" indicates results achieved by the INDP with at most two high-order refinements, while "UNC" in the last row indicates that the refinements continue until no change is made in the structure predictions (see the algorithm listed in Figure 3). All compared transition-based parsing systems are indicated by " ‡", and graph-based ones by " §". From these numbers, a handful of trends are readily apparent. Firstly, we note that the "full-fledged" INDP (indicated by "UNC") is superior to the version without high-order refinements by a fairly significant margin (5.01% for English and 6.55% for Chinese in LAS). Another striking result of these experiments is that comparable performance can be obtained with no more than two refinements with high-order features, and "INDP + High-order + M2" achieves a good trade-off between performance and parsing complexity. Our INDP gets nearly the same performance on the English PTB as the current models of (Kuncoro et al., 2016) and (Dozat and Manning, 2017) in spite of its simpler architecture, and achieves state-of-the-art UAS accuracy on the Chinese CTB. The INDP lags behind in LAS, which suggests a few possibilities. Firstly, we tried only a few different network configurations, and there are many ways (such as using deeper architectures, or recruiting bi-directional recurrent neural networks to produce word feature representations) in which we could improve it further.
Secondly, the model of (Kuncoro et al., 2016) is particularly designed to capture phrase compositionality; thus, another possible improvement is to capture such compositionality by optimizing the network architectures, which may also lead to a better label score.

Related Work
Dependency-based syntactic representations of sentences have been found to be useful for various NLP tasks, especially for those involving natural language understanding in some way. We briefly review prior work both on graph-based and transition-based neural dependency parsers.
In transition-based parsing, we learn a model for scoring transitions from one state to the next, conditioned on the parse history, and parse a sentence by taking the highest-scoring transition out of every state until a complete dependency graph has been derived. Chen and Manning (2014) made the first successful attempt at introducing deep learning into a transition-based dependency parser. At each step, a feed-forward neural network assigns a probability to every action the parser can take from a given state (words on the stack and buffer). Several researchers have attempted to address the limitations of (Chen and Manning, 2014) by augmenting it with additional complexity.
A beam search and a conditional random field loss function were incorporated into transition-based neural network models (Weiss et al., 2015; Zhou et al., 2015; Andor et al., 2016), allowing the parsers to keep the top-k partial parse trees and revoke previous actions once there is evidence that earlier locally greedy choices may have been incorrect. Dyer et al. (2015) used three LSTMs to represent the buffer, stack, and parsing history, obtaining state-of-the-art results on Chinese and English dependency parsing tasks.
Graph-based parsers use machine learning to score each possible edge for a given sentence, typically by factoring the graphs into their component arcs, and construct the parse tree with the highest score from these weighted edges. Kiperwasser and Goldberg (2016b) presented a neural graph-based parser in which the bi-directional LSTM's recurrent output vector for each word is concatenated with each possible head's vector (also produced by the same biLSTM), and the result is used as input to a multi-layer perceptron (MLP) that scores this modifier-head pair. Given the scores of the arcs, the highest scoring tree is constructed using Eisner's decoding algorithm (Eisner, 1996). Labels are predicted similarly, with each word's recurrent output vector and its head's vector being used in a multi-class MLP. Kiperwasser and Goldberg (2016a) also proposed a hierarchical tree LSTM to model dependency tree structures, in which each word is represented by the concatenation of its left and right modifier (child) vectors, and the modifier vectors are generated by two (leftward and rightward) recurrent neural networks. The tree representations are produced in a bottom-up recursive way with the (greedy) easy-first parsing algorithm (Goldberg and Elhadad, 2010). Similarly, Cheng et al. (2016) proposed a graph-based neural dependency parser that predicts the score of the next arc conditioned on previous parsing decisions. In addition to one bi-directional recurrent network that produces a recurrent vector for each word, they also use uni-directional recurrent neural networks (left-to-right and right-to-left) that keep track of the probabilities of each previous parsing action.
In their many-task neural model, Hashimoto et al. (2016) included a graph-based dependency parser in which the MLP-based method of Kiperwasser and Goldberg (2016b) was replaced with a bilinear one. Dozat and Manning (2017) modified the neural graph-based approach of (Kiperwasser and Goldberg, 2016b) in a few ways to improve performance. In addition to building a network that is larger and more heavily regularized, they replaced the traditional MLP-based attention mechanism and affine label classifier with biaffine ones.
This work is most closely related to graph-based parsing approaches with multiple high-order refinements (Rush and Petrov, 2012; Zhang et al., 2014), although neural networks were not used in their parsers. Rush and Petrov (2012) proposed a multi-pass coarse-to-fine approach in which a coarse model is used to prune the search space, making inference with up to third-order features practical. They start with a linear-time vine pruning pass and build up to high-order models. Zhang et al. (2014) introduced a randomized greedy algorithm for dependency parsing in which they begin with a tree drawn from the uniform distribution and use a hill-climbing strategy to find the optimal parse tree. Although they reported that drawing the initial tree randomly results in the same performance as initializing from a trained first-order distribution, multiple random restarts are required to avoid getting stuck in a locally optimal solution. Their greedy algorithm breaks parsing into a sequence of local steps, which correspond to choosing the head for each modifier word (one arc at a time) in bottom-up order relative to the current tree. In contrast, we employ a global inference algorithm that can change the entire tree (all arcs at once) in each refinement step, which makes the improvement more efficient.

Conclusion
Graph-based parsers cannot easily condition on any extended scope of the dependency parse tree beyond a single arc, which is their primary shortcoming relative to transition-based competitors. We have shown that a simple, generally applicable incremental neural dependency parsing algorithm can deliver close to state-of-the-art parsing performance, allowing high-order features to be taken into account without sacrificing the advantage of global exhaustive inference and learning enjoyed by graph-based parsing systems. Future work will involve exploring ways of augmenting the parser with a more innovative architecture than the relatively simple one used in current neural graph-based parsers.