Graph-based Dependency Parsing with Graph Neural Networks

We investigate the problem of efficiently incorporating high-order features into neural graph-based dependency parsing. Instead of explicitly extracting high-order features from intermediate parse trees, we develop a more powerful dependency tree node representation which captures high-order information concisely and efficiently. We use graph neural networks (GNNs) to learn the representations and discuss several new configurations of the GNN's updating and aggregation functions. Experiments on PTB show that our parser achieves the best UAS and LAS (96.0%, 94.3%) among systems that do not use any external resources.


Introduction
In the recent development of dependency parsers, learning representations has been gaining importance. From observed features (words, positions, POS tags) to latent parsing states, building expressive representations has been shown to be crucial for accurate and robust parsing performance.
Here we focus on graph-based dependency parsers. Given a sentence, a parser first scores every word pair according to how likely the pair is to hold a valid dependency relation, and then uses a decoder (e.g., greedy, maximum spanning tree) to generate a full parse tree from the scores. The score function is a key component of graph-based parsers. Commonly, a neural network is used to learn low-dimensional vectors for words (i.e., nodes of parse trees), and the score function depends on the vectors of the word pair (e.g., through inner products). The main task of this paper is to explore effective encoding systems for dependency tree nodes.
Two remarkable prior works on node representation are recurrent neural networks (RNNs) (Kiperwasser and Goldberg, 2016b) and biaffine mappings (Dozat and Manning, 2017). RNNs are powerful tools for collecting sentence-level information, but the resulting representations ignore features related to dependency structures. Biaffine mappings improve on vanilla RNNs via a key observation: the representation of a word should differ depending on whether it is a head or a dependent (i.e., dependency tree edges are directional). Therefore, Dozat and Manning (2017) suggest distinguishing the head and dependent vectors of a word. Following this line of thought, it is natural to ask whether we can introduce more structured knowledge into node representations. In other words, if biaffine mappings encode the first-order parent-child relations, can we incorporate other high-order relations (such as grandparents and siblings)?
In this work, we propose to use graph neural networks (GNNs) for learning dependency tree node representations. Given a weighted graph, a GNN embeds a node by recursively aggregating the node representations of its neighbours. For the parsing task, we build GNNs on weighted complete graphs, which are readily obtained in graph-based parsers. The graphs can be fixed in advance or revised during the parsing process. By stacking multiple GNN layers, the representation of a node gradually collects various high-order information and brings global evidence into the decoder's final decision.
Compared with recent approximate high-order parsers (Kiperwasser and Goldberg, 2016b; Zheng, 2017; Ma et al., 2018), GNNs extract high-order information in a similar incremental manner: node representations of a GNN layer are computed from the outputs of former layers. The main difference is that, instead of extracting high-order features from only one intermediate tree, the update of GNN node vectors is able to inspect all intermediate trees. Thus, it may reduce the influence of a suboptimal intermediate parsing result.
Compared with syntactic graph networks (Bastings et al., 2017; Zhang et al., 2018b), which run GNNs on dependency trees given by external parsers, we use GNNs to build the parsing model itself. Moreover, instead of using different weight matrices for outgoing and incoming edges, our way of handling directional edges is based on the separation of head and dependent representations, which requires new protocols for updating nodes. We discuss various configurations of GNNs, including strategies for aggregating neighbour vectors, synchronized or asynchronized node vector updates, and graphs with different edge weights. Experiments on the benchmark English Penn Treebank 3.0 and the CoNLL 2018 multilingual parsing shared task show the effectiveness of the proposed node representations, and the resulting parser achieves state-of-the-art performance.
To summarize, our major contributions include:
1. introducing graph neural networks to dependency parsing, with the aim of efficiently encoding high-order information in dependency tree node representations.
2. investigating new configurations of GNNs for handling directed edges and nodes with multiple representations.

Basic Node Representations
In this section, we review word encoding systems used in recurrent neural networks and biaffine mappings. Our GNN encoder (Section 3) builds on these two prior works. Following the convention of Dozat and Manning (2017), we use lowercase italic letters for scalars and indices, lowercase bold letters for vectors, and uppercase italic letters for matrices.
Given a sentence $s = w_1, \ldots, w_n$, we denote a dependency tree of $s$ by $T = (V, E)$, where the node set $V$ contains all words and a synthetic root node 0, and the edge set $E$ contains triples $(i, j, r)$, each representing a dependency relation $r$ between $w_i$ (the head) and $w_j$ (the dependent). Following the general graph-based dependency parsing framework, for every word pair $(i, j)$, a function $\sigma(i, j)$ assigns a score which measures how likely $w_i$ is to be the head of $w_j$. We denote by $G$ the directed complete graph in which all nodes in $V$ are connected with weights given by $\sigma$. The correct tree $T$ is obtained from $G$ using a decoder (e.g., dynamic programming (Eisner, 1996), maximum spanning tree (McDonald et al., 2005), or a greedy algorithm (Zhang et al., 2017)).
In neural-network-based models, the score function σ(i, j) usually relies on vector representations of nodes (words) i and j. How to get informative encodings of tree nodes is important for training the parser. Basically, we want the tree node encoder to explore both the surface form and deep structure of the sentence.
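To encode the surface form of $s$, we can use recurrent neural networks (Kiperwasser and Goldberg, 2016b). Specifically, we apply a bidirectional long short-term memory network (biLSTM; Hochreiter and Schmidhuber, 1997). At each sentence position $i$, a forward LSTM chain (with parameters $\overrightarrow{\theta}$) computes a hidden state $\overrightarrow{\mathbf{c}}_i$ from the beginning of $s$ up to position $i$, and a backward LSTM chain (with parameters $\overleftarrow{\theta}$) computes a hidden state $\overleftarrow{\mathbf{c}}_i$ from the end of $s$ back to position $i$,

$$\overrightarrow{\mathbf{c}}_i = \mathrm{LSTM}(\mathbf{x}_i, \overrightarrow{\mathbf{c}}_{i-1}; \overrightarrow{\theta}), \qquad \overleftarrow{\mathbf{c}}_i = \mathrm{LSTM}(\mathbf{x}_i, \overleftarrow{\mathbf{c}}_{i+1}; \overleftarrow{\theta}),$$

where $\mathbf{x}_i$ is the input of an LSTM cell, which includes a randomly initialized word embedding $e(w_i)$, a pre-trained word embedding $e'(w_i)$ from GloVe (Pennington et al., 2014) and a trainable embedding of $w_i$'s part-of-speech tag $e(pos_i)$. Then, a context-dependent node representation of word $i$ is the concatenation of the two hidden vectors,

$$\mathbf{c}_i = \overrightarrow{\mathbf{c}}_i \oplus \overleftarrow{\mathbf{c}}_i.$$

With the node representations, we can define the score function $\sigma$ using a multi-layer perceptron, $\sigma(i, j) = \mathrm{MLP}(\mathbf{c}_i \oplus \mathbf{c}_j)$ (Pei et al., 2015), or using a normalized bilinear function ($A$, $\mathbf{b}_1$, $\mathbf{b}_2$ are parameters),

$$\sigma(i, j) = \mathrm{Softmax}_i\left(\mathbf{c}_i^{\top} A\, \mathbf{c}_j + \mathbf{b}_1^{\top} \mathbf{c}_i + \mathbf{b}_2^{\top} \mathbf{c}_j\right) \triangleq P(i \mid j), \qquad (2)$$

which is actually a distribution over the candidate head words of $j$.

Figure 1: The GNN architecture. "RNN Encoder"+"Decoder" is equivalent to the Biaffine parser. For the "GNN Layers", each layer is based on a complete weighted graph, and the weights are supervised by the layer-wise loss.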
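The normalized bilinear score can be illustrated with a small sketch. It assumes the score of a pair is a bilinear term plus two linear bias terms, normalized with a softmax over candidate heads, as the text describes; the helper names (`head_distributions`, `matvec`) and all numeric values are made up for illustration and are not taken from the authors' implementation.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [dot(row, v) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_distributions(C, A, b1, b2):
    """Return P with P[i][j] = P(i | j): probability that node i heads node j.

    C[i] is the context vector of node i (index 0 is the synthetic root).
    Self-attachment is not masked here; that is a simplification.
    """
    n = len(C)
    P = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A_cj = matvec(A, C[j])
        scores = [dot(C[i], A_cj) + dot(b1, C[i]) + dot(b2, C[j])
                  for i in range(n)]
        # Normalize over candidate heads i for this dependent j.
        for i, p in enumerate(softmax(scores)):
            P[i][j] = p
    return P

# Toy vectors and parameters (hypothetical values).
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.5, -0.2], [0.1, 0.3]]
b1 = [0.1, 0.0]
b2 = [0.0, 0.1]
P = head_distributions(C, A, b1, b2)
for i, row in enumerate(P):
    print(i, [round(p, 3) for p in row])
```

Each column of `P` sums to one, which is what makes the score readable as a distribution over the heads of a given dependent. The term involving `b2` is constant across candidate heads, so it cancels inside the softmax in this sketch.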
We note that from the RNN encoder, a node only obtains one vector representation. But as dependency tree edges are directed, a word plays a different role depending on whether it is the head or the dependent in an edge. Thus, instead of using one vector representation, we employ two vectors to distinguish the two roles (Dozat and Manning, 2017). Concretely, based on $\mathbf{c}_i$, we use two multi-layer perceptrons to generate two different vectors,

$$\mathbf{h}_i = \mathrm{MLP}_h(\mathbf{c}_i), \qquad \mathbf{d}_i = \mathrm{MLP}_d(\mathbf{c}_i).$$

The score function in Equation 2 now becomes

$$\sigma(i, j) = \mathrm{Softmax}_i\left(\mathbf{h}_i^{\top} A\, \mathbf{d}_j + \mathbf{b}_1^{\top} \mathbf{h}_i + \mathbf{b}_2^{\top} \mathbf{d}_j\right) \triangleq P(i \mid j). \qquad (3)$$

The main task we focus on in the following sections is to further encode the deep structure of $s$ into the node vectors $\mathbf{h}_i$ and $\mathbf{d}_i$. Specifically, besides the parent-child relation, we would like to consider high-order dependency relations such as grandparents and siblings in the score function $\sigma$.

The GNN Framework
We first introduce the general framework of graph neural networks. The setting mainly follows the graph attention network (Veličković et al., 2018). Given an (undirected) graph $G$, a GNN is a multi-layer network. At each layer $t$, it maintains a set of node representations $\{\mathbf{v}_i^t\}$ by aggregating information from their neighbours $\mathcal{N}(i)$,

$$\mathbf{v}_i^t = g\Big(W \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^t \mathbf{v}_j^{t-1} + B\, \mathbf{v}_i^{t-1}\Big), \qquad (4)$$

where $g$ is a non-linear activation function (we use LeakyReLU with negative input slope 0.1), and $W$ and $B$ are parameter matrices. We use different edge weights $\alpha_{ij}^t$, each a function of $\mathbf{v}_i^{t-1}$ and $\mathbf{v}_j^{t-1}$, to indicate different contributions of node $j$ in building $\mathbf{v}_i^t$. Equation 4 states that the new representation $\mathbf{v}_i^t$ contains both the previous layer vector $\mathbf{v}_i^{t-1}$ and a weighted aggregation of neighbour vectors $\mathbf{v}_j^{t-1}$.
We can see that the GNN naturally captures multi-hop (i.e., high-order) relations. Taking the first two layers as an example, for every node $i$ at the second layer, $\mathbf{v}_i^2$ contains information of its 1-hop neighbours $\mathbf{v}_j^1$. Since $\mathbf{v}_j^1$ has already encoded its own 1-hop neighbours at the first layer, $\mathbf{v}_i^2$ actually encodes information of its 2-hop neighbours. Inspired by this observation, we think GNNs may help parsing with high-order features.
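The multi-hop argument can be checked on a toy graph. The sketch below assumes the layer update has the form described in the text (a weighted sum of neighbour vectors passed through $W$, plus the node's own previous vector passed through $B$, followed by LeakyReLU with slope 0.1). The path graph, the one-hot inputs and the identity parameter matrices are illustrative choices, not settings from the paper.

```python
def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(x, slope=0.1):
    """LeakyReLU with the negative slope mentioned in the text."""
    return x if x >= 0.0 else slope * x

def gnn_layer(V, alpha, W, B):
    """One layer: new_V[i] = g(W * sum_j alpha[i][j] * V[j] + B * V[i])."""
    n, dim = len(V), len(V[0])
    new_V = []
    for i in range(n):
        agg = [0.0] * dim
        for j in range(n):
            if j != i:
                for d in range(dim):
                    agg[d] += alpha[i][j] * V[j][d]
        pre = [a + b for a, b in zip(matvec(W, agg), matvec(B, V[i]))]
        new_V.append([leaky_relu(x) for x in pre])
    return new_V

# Path graph 0 - 1 - 2 with unit weights; node 2 is two hops away from node 0.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
V0 = [row[:] for row in identity]  # one-hot vector per node

V1 = gnn_layer(V0, alpha, identity, identity)
V2 = gnn_layer(V1, alpha, identity, identity)
print(V1[0])  # third component is 0.0: node 2 is not visible after one layer
print(V2[0])  # third component is positive: node 2 is visible after two layers
```

With one-hot inputs, the third component of node 0's vector acts as a marker for node 2, so it stays at zero after one layer and becomes non-zero only after the second layer.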
On the other hand, to parse with GNNs, instead of encoding one vector for each node, we need to handle the head representation $\mathbf{h}_i$ and the dependent representation $\mathbf{d}_i$ simultaneously on a directed graph $G$.
Furthermore, to approximate exact high-order parsing (Eisner, 1996; McDonald and Pereira, 2006), we need each GNN layer to have a concrete meaning with respect to parsing the sentence. For example, we could consider complete graphs (i.e., all nodes are connected) and set edge weights using conditional probabilities,

$$\alpha_{ij}^t = \sigma^t(i, j) = P^t(i \mid j),$$

which is Equation 3 evaluated at layer $t$. Thus, the graph at each layer appears as a "soft" parse tree, and the aggregated information approximates high-order features on that tree. Compared with existing incremental parsers, which maintain only one intermediate tree ("hard"), the "soft" trees represented by GNN layers contain more information. In fact, the graphs keep all the information needed to derive any intermediate parse tree. Therefore, this may reduce the risk of extracting high-order features from suboptimal intermediates.

Figure 2: Three types of high-order information integrated in the parent-child pair $(j, i)$. The grey shadows indicate which node representations already exist in the first-order feature. The orange shadows indicate which node representations should be included for each high-order feature. Notice that $k$ is actually a weighted sum of all applicable nodes (soft). Subfigure (a) helps to understand Equation 6: since $k$ acts as parent of $j$, to capture the grandparent feature, $\mathbf{h}_j$ should additionally contain information of $\mathbf{h}_k$. Subfigure (c) helps to understand Equation 7: since $k$ acts as child of $j$, to capture the sibling feature, $\mathbf{h}_j$ should additionally contain information of $\mathbf{d}_k$.
We detail the GNN model in the following.

High-order Information
Given a node i, we mainly focus on three types of high-order information, namely, grandparents, grandchildren and siblings. We need to adapt the general GNN update formula to properly encode them into node representations.
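First, for incorporating grandparent information (Figure 2.a), we expect $\sigma^t(j, i)$, which depends on the head vector of $j$ and the dependent vector of $i$, to consider not only the parent-child pair $(j, i)$ but also the ("soft") parent of $j$ suggested by the previous layer (denoted by $k$). Specifically, the new head representation of node $j$ should examine representations of its neighbors when they act as parents of $j$. In other words, we will update $\mathbf{h}_j^t$ using $\mathbf{h}_k^{t-1}$. Similarly, for encoding grandchildren of $j$ in $\sigma^t(j, i)$ (also denoted by $k$), we need the new dependent representation of node $i$ to examine its neighbors when they act as children of $i$. Thus, we will update $\mathbf{d}_i^t$ using $\mathbf{d}_k^{t-1}$. This suggests the following protocol,

$$\mathbf{h}_i^t = g\Big(W_1 \sum_{j \in \mathcal{N}(i)} \alpha_{ji}^t \mathbf{h}_j^{t-1} + B_1 \mathbf{h}_i^{t-1}\Big), \qquad \mathbf{d}_i^t = g\Big(W_2 \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^t \mathbf{d}_j^{t-1} + B_2 \mathbf{d}_i^{t-1}\Big). \qquad (6)$$

Note that we use $\alpha_{ji}^t$ in updating $\mathbf{h}_i^t$ and $\alpha_{ij}^t$ in updating $\mathbf{d}_i^t$, which follows from the probabilistic meaning of the weights: $\alpha_{ji}^t = P^t(j \mid i)$ is the probability that $j$ is the parent of $i$, while $\alpha_{ij}^t = P^t(i \mid j)$ is the probability that $j$ is a child of $i$.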
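On the other hand, for extracting siblings of $i$ (again denoted by $k$) in $(j, i)$ (Figure 2.c), the new head representation of node $j$ should examine representations of its neighbors when they act as dependents of $j$. We expect the update of $\mathbf{h}_j^t$ to involve $\mathbf{d}_k^{t-1}$. This suggests our second update protocol,

$$\mathbf{h}_i^t = g\Big(W_1 \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^t \mathbf{d}_j^{t-1} + B_1 \mathbf{h}_i^{t-1}\Big), \qquad \mathbf{d}_i^t = g\Big(W_2 \sum_{j \in \mathcal{N}(i)} \alpha_{ji}^t \mathbf{h}_j^{t-1} + B_2 \mathbf{d}_i^{t-1}\Big). \qquad (7)$$

We can integrate Equations 6 and 7 in a single update which handles grandparents, grandchildren and siblings in a uniform way,

$$\mathbf{h}_i^t = g\Big(W_1 \sum_{j \in \mathcal{N}(i)} \big(\alpha_{ji}^t \mathbf{h}_j^{t-1} + \alpha_{ij}^t \mathbf{d}_j^{t-1}\big) + B_1 \mathbf{h}_i^{t-1}\Big), \qquad \mathbf{d}_i^t = g\Big(W_2 \sum_{j \in \mathcal{N}(i)} \big(\alpha_{ji}^t \mathbf{h}_j^{t-1} + \alpha_{ij}^t \mathbf{d}_j^{t-1}\big) + B_2 \mathbf{d}_i^{t-1}\Big). \qquad (8)$$

Compared with general GNNs, the above node vector updates are tailored to the parsing task using high-order feature rules. We think exploring the semantics of representations and graph weights would provide useful guidance in the design of GNNs for specific tasks.
Finally, besides the default synchronized setting, we also investigate an asynchronized version of Equation 8, where we first update $\mathbf{h}$, and then use the updated $\mathbf{h}$ to update $\mathbf{d}$,

$$\mathbf{h}_i^t = g\Big(W_1 \sum_{j \in \mathcal{N}(i)} \big(\alpha_{ji}^t \mathbf{h}_j^{t-1} + \alpha_{ij}^t \mathbf{d}_j^{t-1}\big) + B_1 \mathbf{h}_i^{t-1}\Big), \qquad \mathbf{d}_i^t = g\Big(W_2 \sum_{j \in \mathcal{N}(i)} \big(\alpha_{ji}^t \mathbf{h}_j^{t} + \alpha_{ij}^t \mathbf{d}_j^{t-1}\big) + B_2 \mathbf{d}_i^{t-1}\Big). \qquad (9)$$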
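The difference between the synchronized and the head-first asynchronized update can be made concrete with a small sketch. It rests on an assumption about the combined update: both the head and the dependent vector of node $i$ aggregate the head vectors of likely parents (weight `alpha[j][i]`) and the dependent vectors of likely children (weight `alpha[i][j]`), with `alpha[i][j]` standing for $P(i|j)$. The function names, the 1-dimensional vectors and the identity parameters are hypothetical.

```python
def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def leaky_relu(x, slope=0.1):
    return x if x >= 0.0 else slope * x

def aggregate(H, D, alpha, i):
    """Sum over j of alpha[j][i] * H[j] (j as parent of i)
    plus alpha[i][j] * D[j] (j as child of i)."""
    agg = [0.0] * len(H[i])
    for j in range(len(H)):
        if j == i:
            continue
        for d in range(len(agg)):
            agg[d] += alpha[j][i] * H[j][d] + alpha[i][j] * D[j][d]
    return agg

def update(X, H, D, alpha, W, B):
    """new_X[i] = g(W * aggregate(H, D, i) + B * X[i]) for every node i."""
    out = []
    for i in range(len(X)):
        mixed = matvec(W, aggregate(H, D, alpha, i))
        kept = matvec(B, X[i])
        out.append([leaky_relu(a + b) for a, b in zip(mixed, kept)])
    return out

def synchronized_step(H, D, alpha, W1, B1, W2, B2):
    """Both updates read the previous layer's H and D."""
    return update(H, H, D, alpha, W1, B1), update(D, H, D, alpha, W2, B2)

def h_first_step(H, D, alpha, W1, B1, W2, B2):
    """Update H first, then let the D update read the new H."""
    new_H = update(H, H, D, alpha, W1, B1)
    return new_H, update(D, new_H, D, alpha, W2, B2)

# Two nodes, 1-dimensional vectors, identity parameters (toy values).
alpha = [[0.0, 1.0], [1.0, 0.0]]
H, D = [[1.0], [2.0]], [[3.0], [4.0]]
one = [[1.0]]
H_sync, D_sync = synchronized_step(H, D, alpha, one, one, one, one)
H_async, D_async = h_first_step(H, D, alpha, one, one, one, one)
print(H_sync, D_sync)
print(H_async, D_async)
```

The head vectors agree in both variants, while the dependent vectors differ because the head-first variant already sees one extra round of aggregation. No new parameters are needed for the asynchronized version.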

Graph Weights
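In graph-based parsing, the topology of $G$ is mainly determined by the edge weights $\alpha_{ij}^t$. In fact, we usually work on a complete graph to obtain a parse tree. Thus, how to design $\alpha_{ij}^t$ is important when applying GNNs. As mentioned above, we can set $\alpha_{ij}^t$ equal to the probability $P^t(i \mid j)$. In this section, we explore more settings of $\alpha_{ij}^t$.
First, instead of using the "soft" tree setting, we can assign $\{0, 1\}$ values to $\alpha_{ij}^t$ to obtain a sparse graph,

$$\alpha_{ij}^t = \begin{cases} 1, & i = \arg\max_{i'} P^t(i' \mid j), \\ 0, & \text{otherwise}. \end{cases} \qquad (10)$$

In this setting, a node only looks at the head node with the highest probability. An extension of Equation 10 is to consider the top-$k$ head nodes, which could include more neighbourhood information. Defining $\mathcal{N}_k^t(j)$ to be the set of nodes with the top-$k$ values of $P^t(i \mid j)$ for node $j$, we renormalize Equation 3 on this set and assign the result to $\alpha_{ij}^t$,

$$\alpha_{ij}^t = \begin{cases} \dfrac{P^t(i \mid j)}{\sum_{i' \in \mathcal{N}_k^t(j)} P^t(i' \mid j)}, & i \in \mathcal{N}_k^t(j), \\ 0, & \text{otherwise}. \end{cases} \qquad (11)$$

Finally, for comparison, one can ignore $P^t(i \mid j)$ and treat every neighbour equally at each layer, so that $\alpha_{ij}^t$ takes the same constant value for all node pairs.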
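The weight settings discussed above can be sketched as follows. The sketch assumes `P[i][j]` stores $P(i|j)$ and that the top-$k$ variant keeps the $k$ most probable heads of each dependent and renormalizes their probabilities to sum to one. The function names and the toy matrix are hypothetical, and the default constant of 1.0 for the unweighted graph follows the "all set to 1" description in the experiments.

```python
def hard_topk_weights(P, k):
    """Keep the k most probable heads of every dependent j and renormalize.

    P[i][j] is read as P(i | j). With k = 1 this yields the {0, 1} graph;
    with k = len(P) it recovers the soft weights.
    """
    n = len(P)
    alpha = [[0.0] * n for _ in range(n)]
    for j in range(n):
        ranked = sorted(range(n), key=lambda i: P[i][j], reverse=True)
        top = ranked[:k]
        z = sum(P[i][j] for i in top)
        for i in top:
            alpha[i][j] = P[i][j] / z
    return alpha

def uniform_weights(n, value=1.0):
    """Unweighted graph: every edge gets the same constant weight."""
    return [[value] * n for _ in range(n)]

# Toy head distributions: each column sums to one (made-up values).
P = [[0.1, 0.6, 0.3],
     [0.7, 0.1, 0.2],
     [0.2, 0.3, 0.5]]

alpha_top1 = hard_topk_weights(P, 1)
alpha_top2 = hard_topk_weights(P, 2)
alpha_soft = hard_topk_weights(P, 3)
print(alpha_top1)
print([[round(a, 3) for a in row] for row in alpha_top2])
```

Setting `k` to 1 gives a graph built from the greedy head choice, and setting `k` to the number of nodes returns the soft weights, so `k` interpolates between the "hard" and "soft" settings.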

Decoding
Given node representations and $P(i \mid j)$, to build the final parse tree, we can either greedily set the head of $w_j$ to $\arg\max_i P(i \mid j)$, which is fast to decode but may output an ill-formed tree, or run an MST algorithm on all word pairs with weights $P(i \mid j)$, which always yields a valid tree but could be slower.
To predict labels of dependency edges, we introduce $P(r \mid i, j)$, computed by another MLP, which measures how likely an edge $(i, j)$ is to hold a dependency relation $r$. The setting is identical to the biaffine parser (Dozat and Manning, 2017).
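To see why greedy decoding can return an ill-formed tree, the following sketch picks the most probable head of every word independently and then checks whether each word reaches the root by following head links. It assumes `P[i][j]` stores $P(i|j)$ with node 0 as the synthetic root; the function names and probability values are hypothetical, and the MST decoder itself is not implemented here.

```python
def greedy_heads(P):
    """heads[j] = argmax over i != j of P[i][j]; node 0 (root) has no head."""
    n = len(P)
    heads = [None] * n
    for j in range(1, n):
        candidates = [i for i in range(n) if i != j]
        heads[j] = max(candidates, key=lambda i: P[i][j])
    return heads

def is_well_formed(heads):
    """True if every word reaches the root by following head links (no cycle)."""
    for start in range(1, len(heads)):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # we came back to a visited word: a cycle
            seen.add(node)
            node = heads[node]
    return True

# Words 1 and 2 each prefer the other as head, so greedy decoding forms a cycle.
P_cyclic = [[0.0, 0.1, 0.2],
            [0.0, 0.0, 0.8],
            [0.0, 0.9, 0.0]]
# Word 1 attaches to the root and word 2 attaches to word 1: a valid tree.
P_tree = [[0.0, 0.9, 0.2],
          [0.0, 0.0, 0.8],
          [0.0, 0.1, 0.0]]

print(greedy_heads(P_cyclic), is_well_formed(greedy_heads(P_cyclic)))
print(greedy_heads(P_tree), is_well_formed(greedy_heads(P_tree)))
```

One possible use of such a check is to run the fast greedy decoder first and fall back to an MST algorithm only when the greedy output is not a tree; whether the authors do this is not stated in the text.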

Training
Given the gold standard tree $T$, the training objective consists of two parts. First, we have a decoder behind the final GNN layer (denoted by $\tau$) which performs decoding on both tree structures (using $P^{\tau}(i \mid j)$) and edge labels (using $P(r \mid i, j)$). The loss from the final classifier is the negative log-likelihood of $T$,

$$L_0 = -\sum_{(i, j, r) \in T} \big(\log P^{\tau}(i \mid j) + \log P(r \mid i, j)\big).$$

Second, as mentioned in Section 3.1, we can provide supervision on $P^t(i \mid j)$ from each GNN layer (only on the tree structure; intermediate losses on labels are ignored). The layer-wise loss is

$$L' = -\sum_{t=1}^{\tau} \sum_{(i, j, r) \in T} \log P^t(i \mid j).$$

The objective is to minimize a weighted combination of them, $L = \lambda_1 L_0 + \lambda_2 L'$.
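The two-part objective can be sketched numerically. The sketch assumes plain (unnormalized) sums of negative log-probabilities over gold edges, and that the layer-wise term covers every GNN layer including the last one; both are assumptions, as are the function names and probability values. The default weights follow the $\lambda_1 = 1$, $\lambda_2 = 0.5$ setting reported in the experiments.

```python
import math

def final_loss(P_final, P_label, gold):
    """Negative log-likelihood of gold arcs and labels under the final layer."""
    return -sum(math.log(P_final[i][j]) + math.log(P_label[(i, j, r)])
                for i, j, r in gold)

def layerwise_loss(P_layers, gold):
    """Negative log-likelihood of gold arcs under every layer (labels ignored)."""
    return -sum(math.log(P_t[i][j]) for P_t in P_layers for i, j, _ in gold)

def total_loss(P_layers, P_label, gold, lambda1=1.0, lambda2=0.5):
    """Weighted combination lambda1 * L0 + lambda2 * L'."""
    return (lambda1 * final_loss(P_layers[-1], P_label, gold)
            + lambda2 * layerwise_loss(P_layers, gold))

# Gold tree as (head, dependent, relation) triples; node 0 is the root.
gold = [(0, 1, "root"), (1, 2, "obj")]

# P_t[i][j] stands for P^t(i | j); only the gold entries matter here.
P_layer1 = [[0.0, 0.5, 0.3], [0.0, 0.0, 0.5], [0.0, 0.5, 0.2]]
P_layer2 = [[0.0, 0.8, 0.1], [0.0, 0.0, 0.9], [0.0, 0.2, 0.0]]
P_label = {(0, 1, "root"): 0.9, (1, 2, "obj"): 0.7}

loss = total_loss([P_layer1, P_layer2], P_label, gold)
print(round(loss, 4))
```

Because the layer-wise term rewards every layer for placing probability mass on gold arcs, lower layers are pushed towards useful "soft" trees rather than arbitrary attention patterns.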

Experiments
We evaluate the proposed framework on the Stanford Dependency (SD) conversion of the English Penn Treebank (PTB 3.0) and the Universal Dependencies (UD 2.2) treebanks used in the CoNLL 2018 shared task (Zeman et al., 2018). For English, we use the standard train/dev/test splits of PTB (train = §2-21, dev = §22, test = §23). POS tags were assigned using the Stanford tagger with 10-way jackknifing of the training corpus (accuracy ≈ 97.3%). For 12 languages selected from UD 2.2, we use the CoNLL 2018 shared task's official train/dev/test splits; POS tags were assigned by UDPipe (Straka et al., 2016).
Parsing performance is measured with five metrics. We report unlabeled (UAS) and labeled attachment scores (LAS), unlabeled (UCM) and labeled complete match (LCM), and label accuracy score (LA). For evaluations on PTB, following Chen and Manning (2014), five punctuation symbols (" " : , .) are excluded from the evaluation. For the CoNLL 2018 shared task, we use the official evaluation script.
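As a rough illustration of the five metrics, the sketch below computes them on two toy sentences. It assumes punctuation is filtered by token form using PTB-style quote tokens, that complete match is judged on the remaining tokens, and that LA counts correct labels regardless of the head; the function name, the data layout and the example sentences are hypothetical, and this is not the official evaluation script.

```python
PUNCT = {"``", "''", ":", ",", "."}  # assumed token forms of the excluded symbols

def evaluate(sentences):
    """sentences: list of (tokens, gold, pred); gold and pred hold (head, label)."""
    total = uas = las = la = ucm = lcm = 0
    for tokens, gold, pred in sentences:
        all_heads_ok = all_arcs_ok = True
        for tok, (g_head, g_label), (p_head, p_label) in zip(tokens, gold, pred):
            if tok in PUNCT:
                continue  # punctuation is excluded from scoring
            total += 1
            head_ok = g_head == p_head
            label_ok = g_label == p_label
            uas += head_ok
            las += head_ok and label_ok
            la += label_ok
            all_heads_ok = all_heads_ok and head_ok
            all_arcs_ok = all_arcs_ok and head_ok and label_ok
        ucm += all_heads_ok
        lcm += all_arcs_ok
    n = len(sentences)
    return {"UAS": uas / total, "LAS": las / total, "LA": la / total,
            "UCM": ucm / n, "LCM": lcm / n}

data = [
    (["She", "sleeps", "."],
     [(2, "nsubj"), (0, "root"), (2, "punct")],
     [(2, "nsubj"), (0, "root"), (1, "punct")]),   # only the punctuation arc is wrong
    (["Dogs", "chase", "cats"],
     [(2, "nsubj"), (0, "root"), (2, "obj")],
     [(2, "nsubj"), (0, "root"), (2, "nsubj")]),   # one wrong label
]
scores = evaluate(data)
print(scores)
```

The first sentence counts as a complete match because its only error falls on an excluded punctuation token, while the second sentence matches on heads but not on labels.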
All basic hyper-parameters are the same as those reported in Dozat and Manning (2017), which means that our baseline system without GNN layers is a re-implementation of the Biaffine parser. For GNN models, the only new parameters are the matrices in $P^t(i \mid j)$ and the matrices in the GNN units. The weights in the objective $L$ are set to $\lambda_1 = 1$ and $\lambda_2 = 0.5$. The hyper-parameters of our default settings are summarized in Appendix A.

Main Results
Firstly, we compare our method with previous work (Table 1). The first part contains transition-based models, the second part contains graph-based models, and the last part includes three models with integrated hard high-order features. In general, our proposed method achieves significant improvements over our baseline biaffine parser and matches state-of-the-art models. In particular, it achieves a 0.29 percent UAS and 0.35 percent LAS improvement over the baseline parser, and a 0.1 percent UAS and 0.12 percent LAS improvement over the strong transition-based parser (Ma et al., 2018). This shows that our method can boost the performance of a graph-based dependency parser by using the global and soft high-order information provided by the GNN architecture.
Secondly, we analyze different aggregating functions when capturing high-order information.
Thirdly, we analyze the contributions and effects of the number of GNN layers (Figure 3 (a)). From the computation of GNNs, the more layers there are, the higher the order of information that is captured. The experimental results show that the 1-layer model significantly outperforms the 0-layer model on all five scoring metrics, but continuing to increase the number of layers does not significantly improve performance. Previous work (Zheng, 2017) has shown that introducing more than second-order information does not significantly improve parsing performance, and our results are consistent with this conclusion. Specifically, on UAS, LAS and LA, the 2-layer model has the highest sum of scores. On UCM and LCM, performance increases as the number of layers increases, showing the benefit of using high-order information in complete sentence parsing.
In addition to parsing performance, we also consider speed. We observe that adding one GNN layer slows down prediction by about 2.1%. The 2-layer model can process 415.9 sentences per second on a single GPU. Its impact on the training process is also slight, increasing the time per epoch from 3 minutes to 3.5 minutes.
We further examine the performance of each layer in a 3-layer model (Figure 3 (b)). We observe that, as we move to a higher layer, the average loss decreases during the training process ($L_3 < L_2 < L_1$). The figure shows that the introduction of high-order information leads to more accurate graph weights. We also run MST decoding directly on the graph weights of each layer and compare their development set UAS. From the layer-wise UAS results, we observe that the difference between the 2-layer and the 3-layer model is not obvious, but both are higher than the 1-layer model.

Table 3: Impact of different GNN update methods on the PTB dataset. "Synch" is our default synchronized setting (Equation 8). "H-first" is an asynchronous update method that first updates the head word representation (Equation 9). Similarly, the "D-first" model first updates the dependent word representation.
Fourthly, we present the influence of synchronized and asynchronized GNN update methods (Table 3). We first compare the synchronous and asynchronous update methods. The results show that the latter works better without adding extra parameters. The reason may be that asynchronous methods aggregate high-order information earlier.
The H-first model (Equation 9) is slightly better than the D-first model. This may indicate that the dependent representation is more important than the head representation, since the representation updated first improves the representation updated later.
Fifthly, we experiment with an unweighted graph (all weights set to 1) and a hard weight graph (renormalized at top-k). The results for the unweighted graph show that this approach hurts the performance of the parser. For the Hard-k model (Equation 11), when k is equal to 1, it is equivalent to a GNN based on greedy decoding results; when k is equal to the sentence length, it is equivalent to our soft method. Experiments show that as k increases from 1 to 3, the performance of the Hard-k model gradually improves. We also observe that hard weights affect the training stability of the parser.
Finally, we report the results of our model on part of the UD treebanks from the CoNLL 2018 shared task.

Error Analysis
Following McDonald and Nivre (2011) and Ma et al. (2018), we characterize the errors made by the baseline biaffine parser and our GNN parser. The analysis shows that most of the gains come from the difficult cases (e.g., long sentences or long-range dependencies), which is an encouraging sign of the proposed method's benefits.
Sentence Length. Figure 4 (a) shows the accuracy relative to sentence length. Our parser significantly improves on the baseline parser for long sentences, but is slightly worse on short sentences (length ≤ 10).
Dependency Length. Figure 4 (b) shows the precision and recall relative to dependency length. Our parser consistently and significantly improves on the baseline parser in both precision and recall.
Root Distance. Figure 4 (c) shows the precision and recall relative to the distance to the root. Our parser consistently and significantly improves the baseline parser's recall. For precision, however, the baseline parser performs better than our parser over long distances (≥ 6).

Related Work
Graph structures have been extended to model text representation, giving competitive results on a number of NLP tasks. By introducing context neighbors, graph structure is added to the sequence modeling tool LSTM, which improves performance on text classification, POS tagging and NER tasks (Zhang et al., 2018a). Based on syntactic dependency trees, DAG LSTMs (Peng et al., 2017) and GCNs (Zhang et al., 2018b) are used to improve the performance of relation extraction. Based on the AMR semantic graph representation, graph state LSTMs, GCNs (Bastings et al., 2017) and gated GNNs (Beck et al., 2018) are used as encoders for graph-to-sequence learning. To our knowledge, we are the first to investigate GNNs for the dependency parsing task.
The design of the node representation network is a key problem in neural graph-based parsers. Kiperwasser and Goldberg (2016b) use BiRNNs to obtain node representations with sentence-level information. To better characterize the direction of edges, Dozat and Manning (2017) feed BiRNN outputs to two MLPs to distinguish a word's roles as head and as dependent, and then construct a biaffine mapping for prediction. This approach also performs well on multilingual UD datasets (Che et al., 2018).
Given a graph, a GNN can embed a node by recursively aggregating the node representations of its neighbors (Battaglia et al., 2018). Based on a biaffine mapping, GNNs can enhance the node representation by recursively integrating neighbors' information. The message passing neural network (MPNN) (Gilmer et al., 2017) and the non-local neural network (NLNN) are two popular GNN methods. Due to the convenience of self-attention in handling variable sentence lengths, we use a GAT-like network (Veličković et al., 2018), which belongs to the NLNN family. We then further explore its aggregating functions and update methods for our specific task.
Applying the GAT to a directed complete graph is similar to the Transformer encoder (Vaswani et al., 2017). However, the Transformer framework focuses only on head-dependent-like relations, whereas we extend it to capture high-order information for dependency parsing.
Several works have investigated high-order features in neural parsing. Kiperwasser and Goldberg (2016b) use a bottom-up tree encoding to extract hard high-order features from an intermediate predicted tree. Zheng (2017) uses an incremental refinement framework to extract hard high-order features from a whole predicted tree. Ma et al. (2018) use greedy decoding in place of MST decoding and extract local second-order features at the current decoding step. Compared with this previous work, GNNs can efficiently capture global and soft high-order features.

Conclusions
We propose a novel and efficient dependency parser using graph neural networks. By recursively aggregating neighbors' information, our parser obtains node representations that incorporate high-order features to improve performance. Experiments on the PTB and UD 2.2 datasets show the effectiveness of the proposed method.