Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling

Semantic role labeling (SRL) is the task of identifying the predicate-argument structure of a sentence. It is typically regarded as an important step in the standard NLP pipeline. As the semantic representations are closely related to syntactic ones, we exploit syntactic information in our model. We propose a version of graph convolutional networks (GCNs), a recent class of neural networks operating on graphs, suited to model syntactic dependency graphs. GCNs over syntactic dependency trees are used as sentence encoders, producing latent feature representations of words in a sentence. We observe that GCN layers are complementary to LSTM ones: when we stack both GCN and LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009) both for Chinese and English.


Introduction
Semantic role labeling (SRL) (Gildea and Jurafsky, 2002) can be informally described as the task of discovering who did what to whom. For example, consider the SRL dependency graph shown above the sentence in Figure 1. Formally, the task includes (1) detection of predicates (e.g., makes); (2) labeling the predicates with a sense from a sense inventory (e.g., make.01); (3) identifying and assigning arguments to semantic roles (e.g., Sequa is A0, i.e., an agent / 'doer' for the corresponding predicate, and engines is A1, i.e., a patient / 'an affected entity'). SRL is often regarded as an important step in the standard NLP pipeline, providing information to downstream tasks such as information extraction and question answering. The semantic representations are closely related to syntactic ones, even though the syntax-semantics interface is far from trivial (Levin, 1993). For example, one can observe that many arcs in the syntactic dependency graph (shown in black below the sentence in Figure 1) are mirrored in the semantic dependency graph. Given these similarities and also because of the availability of accurate syntactic parsers for many languages, it seems natural to exploit syntactic information when predicting semantics. Though historically most SRL approaches did rely on syntax (Thompson et al., 2003; Pradhan et al., 2005; Punyakanok et al., 2008; Johansson and Nugues, 2008), the last generation of SRL models put syntax aside in favor of neural sequence models, namely LSTMs (Zhou and Xu, 2015; Marcheggiani et al., 2017), and outperformed syntactically-driven methods on standard benchmarks. We believe that one of the reasons for this radical choice is the lack of simple and effective methods for incorporating syntactic information into sequential neural networks (namely, at the level of words). In this paper we propose one way to address this limitation.
Specifically, we rely on graph convolutional networks (GCNs) (Duvenaud et al., 2015; Kipf and Welling, 2017; Kearnes et al., 2016), a recent class of multilayer neural networks operating on graphs. For every node in the graph (in our case a word in a sentence), a GCN encodes relevant information about its neighborhood as a real-valued feature vector. GCNs have been studied largely in the context of undirected unlabeled graphs. We introduce a version of GCNs for modeling syntactic dependency structures and generally applicable to labeled directed graphs.
A one-layer GCN encodes information only about immediate neighbors, and K layers are needed to encode K-order neighborhoods (i.e., information about nodes at most K hops away). This contrasts with recurrent and recursive neural networks (Elman, 1990; Socher et al., 2013) which, at least in theory, can capture statistical dependencies across unbounded paths in a tree or in a sequence. However, as we will further discuss in Section 3.3, this is not a serious limitation when GCNs are used in combination with encoders based on recurrent networks (LSTMs). When we stack GCNs on top of LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for English and Chinese. Interestingly, again unlike recursive neural networks, GCNs do not constrain the graph to be a tree. We believe that there are many applications in NLP where GCN-based encoders of sentences or even documents can be used to incorporate knowledge about linguistic structures (e.g., representations of syntax, semantics or discourse). For example, GCNs can take as input combined syntactic-semantic graphs (e.g., the entire graph from Figure 1) and be used within downstream tasks such as machine translation or question answering. However, we leave this for future work and here solely focus on SRL.
The contributions of this paper can be summarized as follows:
• we are the first to show that GCNs are effective for NLP;
• we propose a generalization of GCNs suited to encoding syntactic information at word level;
• we propose a GCN-based SRL model and obtain state-of-the-art results on English and Chinese portions of the CoNLL-2009 dataset;
• we show that bidirectional LSTMs and syntax-based GCNs have complementary modeling power.
The code is available at https://github.com/diegma/neural-dep-srl.

Graph Convolutional Networks
In this section we describe GCNs of Kipf and Welling (2017). Please refer to Gilmer et al. (2017) for a comprehensive overview of GCN versions.
GCNs are neural networks operating on graphs and inducing features of nodes (i.e., real-valued vectors / embeddings) based on properties of their neighborhoods. In Kipf and Welling (2017), they were shown to be very effective for the node classification task: the classifier was estimated jointly with a GCN, so that the induced node features were informative for the node classification problem. Depending on how many layers of convolution are used, GCNs can capture information only about immediate neighbors (with one layer of convolution) or about any nodes at most K hops away (if K layers are stacked on top of each other).
More formally, consider an undirected graph $G = (V, E)$, where $V$ ($|V| = n$) and $E$ are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that edges contain all the self-loops, i.e., $(v, v) \in E$ for any $v$. We can define a matrix $X \in \mathbb{R}^{m \times n}$ with each of its columns $x_v \in \mathbb{R}^m$ ($v \in V$) encoding node features. The vectors can either encode genuine features (e.g., this vector can encode the title of a paper if citation graphs are considered) or be a one-hot vector. The node representation, encoding information about its immediate neighbors, is computed as
$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} (W x_u + b)\Big), \tag{1}$$
where $W \in \mathbb{R}^{m \times m}$ and $b \in \mathbb{R}^m$ are a weight matrix and a bias, respectively; $\mathcal{N}(v)$ are neighbors of $v$; ReLU is the rectified linear unit activation function. Note that $v \in \mathcal{N}(v)$ (because of self-loops), so the input feature representation of $v$ (i.e., $x_v$) affects its induced representation $h_v$.
Figure 2 (example sentence: "Lane disputed those estimates"): Parameter matrices are sub-indexed with syntactic functions, and apostrophes (e.g., subj') signify that information flows in the direction opposite to the dependency arcs (i.e., from dependents to heads).
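As in standard convolutional networks (LeCun et al., 2001), by stacking GCN layers one can incorporate higher degree neighborhoods:
$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^{(k)} h_u^{(k)} + b^{(k)}\big)\Big), \tag{2}$$
where $k$ denotes the layer number and $h_v^{(1)} = x_v$.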
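The layer-wise computation described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the released implementation: the function names (`gcn_layer`, `gcn`), the toy chain graph, and the identity weights are assumptions made for the example, and node features are stored as one list per node rather than as columns of $X$.

```python
def relu(vec):
    return [max(0.0, a) for a in vec]

def matvec(W, x):
    # Matrix-vector product with W given as a list of rows.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def gcn_layer(H, neighbors, W, b):
    """One layer: h_v = ReLU(sum over u in N(v) of (W h_u + b)).

    H[v] is the current vector of node v; neighbors[v] lists N(v)
    and is expected to contain v itself (the self-loop).
    """
    out = []
    for v in range(len(H)):
        total = [0.0] * len(b)
        for u in neighbors[v]:
            msg = matvec(W, H[u])
            total = [t + m + c for t, m, c in zip(total, msg, b)]
        out.append(relu(total))
    return out

def gcn(X, neighbors, layers):
    """Stack layers; `layers` is a list of (W, b) pairs, one per layer."""
    H = X
    for W, b in layers:
        H = gcn_layer(H, neighbors, W, b)
    return H

# Toy chain graph 0 - 1 - 2 with self-loops (hypothetical example).
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
X = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

H1 = gcn(X, neighbors, [(identity, zero_bias)])
H2 = gcn(X, neighbors, [(identity, zero_bias), (identity, zero_bias)])
```

With one layer, node 0 receives nothing from node 2 (its second feature stays at zero); with two layers, the feature of node 2 reaches node 0 through node 1, which is the K-hop behavior discussed above.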

Syntactic GCNs
As syntactic dependency trees are directed and labeled (we refer to the dependency labels as syntactic functions), we first need to modify the computation in order to incorporate label information (Section 3.1). In the subsequent section, we incorporate gates in GCNs, so that the model can decide which edges are more relevant to the task in question. Having gates is also important as we rely on automatically predicted syntactic representations, and the gates can detect and downweight potentially erroneous edges.

Incorporating directions and labels
Now, we introduce a generalization of GCNs appropriate for syntactic dependency trees, and in general, for directed labeled graphs. First note that there is no reason to assume that information flows only along the syntactic dependency arcs (e.g., from makes to Sequa), so we allow it to flow in the opposite direction as well (i.e., from dependents to heads). We use a graph $G = (V, E)$, where the edge set contains all pairs of nodes (i.e., words) adjacent in the dependency tree. In our example, both (Sequa, makes) and (makes, Sequa) belong to the edge set. The graph is labeled, and the label $L(u, v)$ for $(u, v) \in E$ contains both information about the syntactic function and indicates whether the edge is in the same or opposite direction as the syntactic dependency arc. For example, the label for (makes, Sequa) is subj, whereas the label for (Sequa, makes) is subj', with the apostrophe indicating that the edge is in the direction opposite to the corresponding syntactic arc. Similarly, self-loops will have label self. Consequently, we can simply assume that the GCN parameters are label-specific, resulting in the following computation, also illustrated in Figure 2:
$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W^{(k)}_{L(u,v)} h_u^{(k)} + b^{(k)}_{L(u,v)}\big)\Big). \tag{3}$$
This model is over-parameterized, especially given that SRL datasets are moderately sized by deep learning standards. So instead of learning the GCN parameters directly, we define them as
$$W^{(k)}_{L(u,v)} = V^{(k)}_{dir(u,v)},$$
where $dir(u, v)$ indicates whether the edge $(u, v)$ is directed (1) along, (2) in the opposite direction to the syntactic dependency arc, or (3) is a self-loop. Our simplification captures the intuition that information should be propagated differently along edges depending on whether this is a head-to-dependent or dependent-to-head edge (i.e., along or opposite the corresponding syntactic arc) and whether it is a self-loop. So we do not share any parameters between these three very different edge types. Syntactic functions are important, but perhaps less crucial, so they are encoded only in the feature vectors $b_{L(u,v)}$.
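To make the parameterization concrete, the following plain-Python sketch builds the labeled, bidirectional edge set from a dependency tree and applies one layer with direction-specific matrices and label-specific biases. It is a sketch under stated assumptions rather than the authors' code: the helper names (`build_neighbors`, `syntactic_gcn_layer`), the head indices and labels assigned to "Lane disputed those estimates", and all parameter values are hypothetical.

```python
ALONG, OPPOSITE, SELF = "along", "opposite", "self"

def relu(vec):
    return [max(0.0, a) for a in vec]

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def build_neighbors(heads, labels):
    """incoming[v] lists (u, dir(u, v), L(u, v)) for every u in N(v).

    heads[i] is the index of the head of word i (None for the root);
    labels[i] is the syntactic function of the arc head -> i.
    """
    incoming = [[(v, SELF, "self")] for v in range(len(heads))]
    for dep, head in enumerate(heads):
        if head is None:
            continue
        # Head-to-dependent edge: same direction as the syntactic arc.
        incoming[dep].append((head, ALONG, labels[dep]))
        # Dependent-to-head edge: opposite direction, label marked with '.
        incoming[head].append((dep, OPPOSITE, labels[dep] + "'"))
    return incoming

def syntactic_gcn_layer(H, incoming, V, b):
    """h_v = ReLU(sum_u (V[dir(u,v)] h_u + b[L(u,v)]))."""
    out = []
    for v in range(len(H)):
        total = [0.0] * len(H[v])
        for u, direction, label in incoming[v]:
            msg = matvec(V[direction], H[u])
            total = [t + m + c for t, m, c in zip(total, msg, b[label])]
        out.append(relu(total))
    return out

# Hypothetical parse of "Lane disputed those estimates".
words = ["Lane", "disputed", "those", "estimates"]
heads = [1, None, 3, 1]
labels = ["subj", "root", "nmod", "obj"]
incoming = build_neighbors(heads, labels)

identity = [[1.0, 0.0], [0.0, 1.0]]
V = {ALONG: identity, OPPOSITE: [[2.0, 0.0], [0.0, 2.0]], SELF: identity}
edge_labels = {label for edges in incoming for _, _, label in edges}
b = {label: [0.0, 0.0] for label in edge_labels}

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
H_next = syntactic_gcn_layer(H, incoming, V, b)
```

Only three matrices are needed per layer regardless of the number of syntactic functions, while each label (and its primed counterpart) contributes a single bias vector.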

Edge-wise gating
Uniformly accepting information from all neighboring nodes may not be appropriate for the SRL setting. For example, we see in Figure 1 that many semantic arcs just mirror their syntactic counterparts, so they may need to be up-weighted. Moreover, we rely on automatically predicted syntactic structures, and, even for English, syntactic parsers are far from being perfect, especially when used out-of-domain. It is risky for a downstream application to rely on a potentially wrong syntactic edge, so the corresponding message in the neural network may need to be down-weighted.
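In order to address the above issues, inspired by recent literature (van den Oord et al., 2016; Dauphin et al., 2016), we calculate for each edge (i.e., each node pair) a scalar gate of the form
$$g^{(k)}_{u,v} = \sigma\big(h_u^{(k)} \cdot \hat{v}^{(k)}_{dir(u,v)} + \hat{b}^{(k)}_{L(u,v)}\big),$$
where $\sigma$ is the logistic sigmoid function, and $\hat{v}^{(k)}_{dir(u,v)} \in \mathbb{R}^m$ and $\hat{b}^{(k)}_{L(u,v)} \in \mathbb{R}$ are weights and a bias for the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as
$$h_v^{(k+1)} = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} g^{(k)}_{u,v} \big(V^{(k)}_{dir(u,v)} h_u^{(k)} + b^{(k)}_{L(u,v)}\big)\Big). \tag{4}$$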
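The effect of the gate can be illustrated with a small plain-Python sketch. It assumes that neighbors are given as (u, direction, label) triples, a hypothetical format chosen for this example, and it uses hand-picked gate biases so that one edge is almost fully open and another almost fully closed; none of these values come from the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def gated_gcn_layer(H, incoming, V, b, v_hat, b_hat):
    """h_v = ReLU(sum_u g_uv * (V[dir] h_u + b[label])),
    with g_uv = sigmoid(h_u . v_hat[dir] + b_hat[label])."""
    out = []
    for v in range(len(H)):
        total = [0.0] * len(H[v])
        for u, direction, label in incoming[v]:
            gate = sigmoid(dot(H[u], v_hat[direction]) + b_hat[label])
            msg = [m + c for m, c in zip(matvec(V[direction], H[u]), b[label])]
            total = [t + gate * m for t, m in zip(total, msg)]
        out.append([max(0.0, t) for t in total])
    return out

# Two-word toy graph: word 0 is the head of word 1 (hypothetical label "obj").
incoming = [
    [(0, "self", "self"), (1, "opposite", "obj'")],
    [(1, "self", "self"), (0, "along", "obj")],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
directions = ("along", "opposite", "self")
V = {d: identity for d in directions}
b = {label: [0.0, 0.0] for label in ("self", "obj", "obj'")}
v_hat = {d: [0.0, 0.0] for d in directions}
# Hand-set gate biases: "obj" nearly open, "obj'" nearly closed, self at 0.5.
b_hat = {"self": 0.0, "obj": 50.0, "obj'": -50.0}

H = [[1.0, 0.0], [0.0, 1.0]]
H_next = gated_gcn_layer(H, incoming, V, b, v_hat, b_hat)
```

Because the gate for the "obj'" edge is close to zero, word 0 receives essentially nothing from word 1, which is how the model can down-weight an edge it does not trust.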

Complementarity of GCNs and LSTMs
The inability of GCNs to capture dependencies between nodes far away from each other in the graph may seem like a serious problem, especially in the context of SRL: paths between predicates and arguments often include many dependency arcs (Roth and Lapata, 2016). However, when graph convolution is performed on top of LSTM states (i.e., LSTM states serve as input to GCN) rather than static word embeddings, GCN may not need to capture more than a couple of hops.
To elaborate on this, let us speculate what role GCNs would play when used in combination with LSTMs, given that LSTMs have already been shown to be very effective for SRL (Zhou and Xu, 2015; Marcheggiani et al., 2017). Though LSTMs are capable of capturing at least some degree of syntax (Linzen et al., 2016) without explicit syntactic supervision, SRL datasets are moderately sized, so LSTM models may still struggle with harder cases. Typically, harder cases for SRL involve arguments far away from their predicates. In fact, 20% and 30% of arguments are more than 5 tokens away from their predicate, in our English and Chinese collections, respectively. However, if we imagine that we can 'teleport' even over a single (longest) syntactic dependency edge, the 'distance' would shrink: only 9% and 13% of arguments will now be more than 5 LSTM steps away (again for English and Chinese, respectively). GCNs provide this 'teleportation' capability. These observations suggest that LSTMs and GCNs may be complementary, and we will see that empirical results support this intuition.

Syntax-Aware Neural SRL Encoder
In this work, we build our semantic role labeler on top of the syntax-agnostic LSTM-based SRL model of Marcheggiani et al. (2017), which already achieves state-of-the-art results on the CoNLL-2009 English dataset. Following their approach, we employ the same bidirectional LSTM (BiLSTM) encoder and enrich it with a syntactic GCN.
The CoNLL-2009 benchmark assumes that predicate positions are already marked in the test set (e.g., we would know that makes, repairs and engines in Figure 1 are predicates), so no predicate identification is needed. Also, as we focus here solely on identifying arguments and labeling them with semantic roles, for predicate disambiguation (i.e., marking makes as make.01) we use an off-the-shelf disambiguation model (Roth and Lapata, 2016; Björkelund et al., 2009). As in Marcheggiani et al. (2017) and in most previous work, we process individual predicates in isolation, so for each predicate, our task reduces to a sequence labeling problem. That is, given a predicate (e.g., disputed in Figure 3) one needs to identify and label all its arguments (e.g., label estimates as A1 and label those as 'NULL', indicating that those is not an argument of disputed).
The semantic role labeler we propose is composed of four components (see Figure 3):
• look-ups of word embeddings;
• a BiLSTM encoder that takes as input the word representation of each word in a sentence;
• a syntax-based GCN encoder that re-encodes the BiLSTM representation based on the automatically predicted syntactic structure of the sentence;
• a role classifier that takes as input the GCN representation of the candidate argument and the representation of the predicate to predict the role associated with the candidate word.

Word representations
For each word $w_i$ in the considered sentence, we create a sentence-specific word representation $x_i$. We represent each word $w$ as the concatenation of four vectors: a randomly initialized word embedding $x^{re} \in \mathbb{R}^{d_w}$, a pre-trained word embedding $x^{pe} \in \mathbb{R}^{d_w}$ estimated on an external text collection, a randomly initialized part-of-speech tag embedding $x^{pos} \in \mathbb{R}^{d_p}$ and a randomly initialized lemma embedding $x^{le} \in \mathbb{R}^{d_l}$ (active only if the word is a predicate). The randomly initialized embeddings $x^{re}$, $x^{pos}$, and $x^{le}$ are fine-tuned during training, while the pre-trained ones are kept fixed. The final word representation is given by $x = x^{re} \circ x^{pe} \circ x^{pos} \circ x^{le}$, where $\circ$ represents the concatenation operator.
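The concatenation above can be sketched as follows. The embedding sizes, the made-up pre-trained vector, and the choice to use zero vectors both for words missing from the pre-trained table and for the lemma slot of non-predicates are assumptions of this example; the text only states that the lemma embedding is active for predicates.

```python
import random
from collections import defaultdict

D_W, D_P, D_L = 4, 2, 3  # hypothetical embedding sizes d_w, d_p, d_l

rng = random.Random(0)

def new_vector(dim):
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]

# Randomly initialized tables (these would be fine-tuned during training).
word_emb = defaultdict(lambda: new_vector(D_W))
pos_emb = defaultdict(lambda: new_vector(D_P))
lemma_emb = defaultdict(lambda: new_vector(D_L))

# Stand-in for fixed pre-trained embeddings (made-up values).
pretrained = {"disputed": [0.3, -0.2, 0.1, 0.0]}

def word_representation(word, pos, lemma, is_predicate):
    """x = x_re o x_pe o x_pos o x_le, with 'o' being concatenation."""
    x_re = word_emb[word]
    x_pe = pretrained.get(word, [0.0] * D_W)
    x_pos = pos_emb[pos]
    # The lemma embedding is switched off (zeros) for non-predicates.
    x_le = lemma_emb[lemma] if is_predicate else [0.0] * D_L
    return x_re + x_pe + x_pos + x_le

x_pred = word_representation("disputed", "VBD", "dispute", True)
x_arg = word_representation("estimates", "NNS", "estimate", False)
```

Each word is therefore mapped to a vector of size $2 d_w + d_p + d_l$, here 13.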

Bidirectional LSTM layer
One of the most popular and effective ways to represent sequences, such as sentences (Mikolov et al., 2010), is to use recurrent neural networks (RNN) (Elman, 1990). In particular their gated versions, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Cho et al., 2014), have proven effective in modeling long sequences (Chiu and Nichols, 2016; Sutskever et al., 2014). Formally, an LSTM can be defined as a function $\mathrm{LSTM}_\theta(x_{1:i})$ that takes as input the sequence $x_{1:i}$ and returns a hidden state $h_i \in \mathbb{R}^{d_h}$. This state can be regarded as a representation of the sentence from the start to the position $i$, or, in other words, it encodes the word at position $i$ along with its left context. However, the right context is also important, so Bidirectional LSTMs (Graves, 2008) use two LSTMs: one for the forward pass, and another for the backward pass, $\mathrm{LSTM}_F$ and $\mathrm{LSTM}_B$, respectively. By concatenating the states of both LSTMs, we create a complete context-aware representation of a word: $\mathrm{BiLSTM}(x_{1:n}, i) = \mathrm{LSTM}_F(x_{1:i}) \circ \mathrm{LSTM}_B(x_{n:i})$. We follow Marcheggiani et al. (2017) and stack J layers of bidirectional LSTMs, where each layer takes the lower layer as its input.
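The bidirectional construction can be sketched independently of the recurrent cell. To keep the example short, the sketch below swaps in a plain Elman-style tanh recurrence in place of the LSTM cell; only the forward/backward passes and the concatenation correspond to the definition above. Function names and parameter values are hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def recurrent_states(xs, W_x, W_h, b):
    """Simple tanh recurrence (NOT an LSTM): h_t = tanh(W_x x_t + W_h h_{t-1} + b).
    Returns the hidden state at every position."""
    h = [0.0] * len(b)
    states = []
    for x in xs:
        pre = [p + q + r for p, q, r in zip(matvec(W_x, x), matvec(W_h, h), b)]
        h = [math.tanh(a) for a in pre]
        states.append(h)
    return states

def bidirectional_encode(xs, fwd_params, bwd_params):
    """State at position i = forward state over x_1..x_i concatenated
    with backward state over x_n..x_i."""
    forward = recurrent_states(xs, *fwd_params)
    backward = recurrent_states(xs[::-1], *bwd_params)[::-1]
    return [f + bk for f, bk in zip(forward, backward)]

# Hypothetical parameters (W_x, W_h, b), shared by both directions for brevity.
params = ([[0.5, -0.5], [0.25, 0.75]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0])
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bidirectional_encode(xs, params, params)

# Changing the last word only affects the backward half of the first state.
changed = bidirectional_encode(xs[:2] + [[-1.0, 2.0]], params, params)
```

Stacking J such layers amounts to feeding `states` back in as the input sequence of the next layer, with input-side weights of matching size.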

Graph convolutional layer
The representation calculated with the BiLSTM encoder is fed as input to a GCN of the form defined in Equation (4). The neighboring nodes of a node $v$, namely $\mathcal{N}(v)$, and their relations to $v$ are predicted by an external syntactic parser.

Semantic role classifier
The classifier predicts semantic roles of words given the predicate while relying on word representations provided by GCN; we concatenate hidden states of the candidate argument word and the predicate word and use them as input to a classifier (Figure 3, top). The softmax classifier computes the probability of the role (including the special 'NULL' role):
$$p(r \mid t_i, t_p, l) \propto \exp\big(W_{l,r}(t_i \circ t_p)\big),$$
where $t_i$ and $t_p$ are representations produced by the graph convolutional encoder, $l$ is the lemma of predicate $p$, and the symbol $\propto$ signifies proportionality.
Figure 4: $F_1$ as a function of word distance. The distance starts from zero, since nominal predicates can be arguments of themselves.
As in FitzGerald et al. (2015) and Marcheggiani et al. (2017), instead of using a fixed matrix $W_{l,r}$ or simply assuming that $W_{l,r} = W_r$, we jointly embed the role $r$ and predicate lemma $l$ using a non-linear transformation:
$$W_{l,r} = \mathrm{ReLU}\big(U(q_l \circ q_r)\big),$$
where $U$ is a parameter matrix, whereas $q_l \in \mathbb{R}^{d'_l}$ and $q_r \in \mathbb{R}^{d_r}$ are randomly initialized embeddings of predicate lemmas and roles. In this way each role prediction is predicate-specific, and, at the same time, we expect to learn a good representation for roles associated with infrequent predicates. As our training objective we use the categorical cross-entropy.
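The classifier can be sketched as follows, treating $W_{l,r}$ as a weight vector so that each role receives a scalar score; this reading, the function names, and the toy one-dimensional embeddings are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def relu(vec):
    return [max(0.0, a) for a in vec]

def role_distribution(t_i, t_p, q_l, role_embeddings, U):
    """p(r | t_i, t_p, l) proportional to exp(W_lr . (t_i o t_p)),
    where W_lr = ReLU(U (q_l o q_r))."""
    features = t_i + t_p  # concatenation of argument and predicate states
    scores = {}
    for role, q_r in role_embeddings.items():
        w_lr = relu(matvec(U, q_l + q_r))
        scores[role] = dot(w_lr, features)
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {role: math.exp(s - top) for role, s in scores.items()}
    norm = sum(exps.values())
    return {role: e / norm for role, e in exps.items()}

def cross_entropy(probs, gold_role):
    """Categorical cross-entropy for a single candidate word."""
    return -math.log(probs[gold_role])

# Toy values (hypothetical): 1-dimensional states and embeddings.
t_i, t_p = [1.0], [1.0]
q_l = [1.0]
role_embeddings = {"A0": [1.0], "A1": [0.0], "NULL": [-1.0]}
U = [[1.0, 0.0], [0.0, 1.0]]

probs = role_distribution(t_i, t_p, q_l, role_embeddings, U)
loss = cross_entropy(probs, "A0")
```

Since the role weights are computed from $q_l$ and $q_r$ through the shared matrix $U$, a rare predicate lemma still benefits from what was learned about each role on other predicates.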

Datasets and parameters
We tested the proposed SRL model on the English and Chinese CoNLL-2009 datasets with standard splits into training, test and development sets. The predicted POS tags for both languages were provided by the CoNLL-2009 shared-task organizers.
For the predicate disambiguator we used the ones from Roth and Lapata (2016) for English and from Björkelund et al. (2009) for Chinese. We parsed English sentences with the BIST Parser (Kiperwasser and Goldberg, 2016), whereas for Chinese we used automatically predicted parses provided by the CoNLL-2009 shared-task organizers. For English, we used external embeddings learned using the structured skip n-gram approach. For Chinese we used external embeddings produced with the neural language model of Bengio et al. (2003). We used edge dropout in GCN: when computing the representation of a node $v$, we ignore each of its neighbors in $\mathcal{N}(v)$ with a fixed probability. Adam (Kingma and Ba, 2015) was used as an optimizer. The hyperparameter tuning and all model selection were performed on the English development set; the chosen values are shown in the Appendix.
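Edge dropout, as described above, can be sketched in a few lines. The (u, direction, label) triple format for neighbors and the function name `drop_edges` are assumptions of this example, as is the choice to treat self-loops like any other neighbor.

```python
import random

def drop_edges(incoming, rate, rng):
    """Independently ignore each neighbor with probability `rate`.

    incoming[v] is a list of (u, direction, label) triples (hypothetical
    format). Intended for training time only; at test time the full
    neighborhood is used.
    """
    return [
        [edge for edge in edges if rng.random() >= rate]
        for edges in incoming
    ]

incoming = [
    [(0, "self", "self"), (1, "along", "subj")],
    [(1, "self", "self"), (0, "opposite", "subj'")],
]
rng = random.Random(0)
kept_all = drop_edges(incoming, 0.0, rng)
kept_none = drop_edges(incoming, 1.0, rng)
kept_some = drop_edges(incoming, 0.3, rng)
```

The returned structure has the same shape as the input, so it can be passed directly to a GCN layer that iterates over per-node neighbor lists.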

Results and discussion
In order to show that GCN layers are effective, we first compare our model against its version which lacks GCN layers (i.e., essentially the model of Marcheggiani et al. (2017)). Importantly, to measure the genuine contribution of GCNs, we first tuned this syntax-agnostic model (e.g., the number of LSTM layers) to get the best possible performance on the development set. We compare the syntax-agnostic model with 3 syntax-aware versions: one GCN layer over syntax (K = 1), one GCN layer without gates, and two GCN layers (K = 2). As we rely on the same off-the-shelf disambiguator for all versions of the model, in Tables 1 and 2 we report SRL-only scores (i.e., predicate disambiguation is not evaluated) on the English and Chinese development sets. For both datasets, the syntax-aware model with one GCN layer (K = 1) performs the best, outperforming the LSTM version by 1.9% and 0.6% for Chinese and English, respectively. Although the reasons why the improvements on Chinese are much larger are not entirely clear (e.g., both languages have relatively fixed word order, and the syntactic parses for Chinese are considerably less accurate), this may be attributed to a higher proportion of long-distance dependencies between predicates and arguments in Chinese (see Section 3.3). Edge-wise gating (Section 3.2) also appears important: removing gates leads to a drop of 0.3% $F_1$ for English and 0.6% $F_1$ for Chinese. Stacking two GCN layers does not give any benefit. When BiLSTM layers are dropped altogether, stacking two layers (K = 2) of GCNs greatly improves the performance, resulting in a 3.8% jump in $F_1$ for English and a 3.0% jump in $F_1$ for Chinese.
In Figure 4, we show the $F_1$ scores on the English development set as a function of the distance, in terms of tokens, between a candidate argument and its predicate. As expected, GCNs appear to be more beneficial for long-distance dependencies, as shorter ones are already accurately captured by the LSTM encoder.
We looked more closely at the contribution of specific dependency relations for Chinese. In order to assess this without retraining the model multiple times, we drop all dependencies of a given type at test time (one type at a time, only for types appearing over 300 times in the development set) and observe changes in performance. In Figure 5, we see that the most informative dependency is COMP (complement). Relative clauses in Chinese are very frequent and typically marked with the particle 的 (de). The relative clause will syntactically depend on 的 as COMP, so COMP encodes important information about predicate-argument structure. These are often long-distance dependencies and may not be accurately captured by LSTMs. Although TMP (temporal) dependencies are not as frequent (~2% of all dependencies), they are also important: temporal information is mirrored in semantic roles.
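The ablation described above only requires filtering the graph that is passed to the trained model. A minimal sketch, assuming neighbors are stored as (u, direction, label) triples with dependent-to-head labels marked by an apostrophe (a hypothetical format, and a hypothetical function name), could look like this:

```python
def drop_dependency_type(incoming, label):
    """Remove every edge of one syntactic function, in both directions.

    incoming[v] is a list of (u, direction, label) triples; the
    dependent-to-head copy of an arc carries the label followed by "'".
    Self-loops are never removed.
    """
    removed = {label, label + "'"}
    return [[e for e in edges if e[2] not in removed] for edges in incoming]

# Toy graph with one COMP arc (0 -> 1) and one TMP arc (0 -> 2).
incoming = [
    [(0, "self", "self"), (1, "opposite", "COMP'"), (2, "opposite", "TMP'")],
    [(1, "self", "self"), (0, "along", "COMP")],
    [(2, "self", "self"), (0, "along", "TMP")],
]
without_comp = drop_dependency_type(incoming, "COMP")
```

Evaluating the unchanged model once per filtered graph then gives the per-relation performance changes without any retraining.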
In order to compare to previous work, in Table 3 we report test results on the English in-domain (WSJ) evaluation data. Our model is local, as all the argument detection and labeling decisions are conditionally independent: their interaction is captured solely by the LSTM+GCN encoder. This makes our model fast and simple, though, as shown in previous work, global modeling of the structured output is beneficial. We leave this extension for future work. Interestingly, we outperform even the best global model and the best ensemble of global models, without using global modeling or ensembles. When we create an ensemble of 3 models with the product-of-experts combination rule, we improve by 1.2% over the best previous result, achieving 89.1% $F_1$. For Chinese (Table 4), our best model outperforms the state-of-the-art model of Roth and Lapata (2016) by an even larger margin of 3.1%.
Note that GCN layers are computationally cheaper than LSTM ones, even in our non-optimized implementation.
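The product-of-experts rule used for the ensemble multiplies the role distributions of the individual models and renormalizes. The sketch below shows the generic rule for a single candidate word; the three toy distributions are made up, and the function assumes all probabilities are strictly positive (as softmax outputs are).

```python
import math

def product_of_experts(distributions):
    """Combine role distributions by multiplying them and renormalizing.

    Each distribution maps role -> probability (> 0) over the same roles.
    Log-space sums are used to avoid underflow.
    """
    roles = list(distributions[0])
    log_scores = {r: sum(math.log(d[r]) for d in distributions) for r in roles}
    top = max(log_scores.values())
    unnorm = {r: math.exp(s - top) for r, s in log_scores.items()}
    norm = sum(unnorm.values())
    return {r: u / norm for r, u in unnorm.items()}

# Made-up predictions of three models for one candidate argument.
model_outputs = [
    {"A0": 0.5, "A1": 0.4, "NULL": 0.1},
    {"A0": 0.3, "A1": 0.6, "NULL": 0.1},
    {"A0": 0.4, "A1": 0.5, "NULL": 0.1},
]
combined = product_of_experts(model_outputs)
best_role = max(combined, key=combined.get)
```

Compared with averaging, the product sharply penalizes a role that any single model considers unlikely.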
For English, in the CoNLL shared task, systems are also evaluated on the out-of-domain dataset. Statistical models are typically less accurate when they are applied to out-of-domain data. Consequently, the predicted syntax for the out-of-domain test set is of lower quality, which negatively affects the quality of GCN embeddings. However, our model works surprisingly well on out-of-domain data (Table 5), substantially outperforming all the previous syntax-aware models. This suggests that our model is fairly robust to mistakes in syntax. As expected though, our model does not outperform the syntax-agnostic model of Marcheggiani et al. (2017).

Related Work
Perhaps the earliest methods modeling the syntax-semantics interface with RNNs are due to Henderson et al. (2008) and Gesmundo et al. (2009), who used shift-reduce parsers for joint SRL and syntactic parsing, and relied on RNNs to model statistical dependencies across syntactic and semantic parsing actions. A more modern (e.g., based on LSTMs) and effective reincarnation of this line of research has been proposed in Swayamdipta et al. (2016). Other recent work which considered incorporation of syntactic information in neural SRL models includes: FitzGerald et al. (2015), who use standard syntactic features within an MLP calculating potentials of a CRF model; Roth and Lapata (2016), who enriched standard features for SRL with LSTM representations of syntactic paths between arguments and predicates; and Lei et al. (2015), who relied on low-rank tensor factorizations for modeling syntax. Also, Foland and Martin (2015) used (non-graph) convolutional networks and provided syntactic features as input. A very different line of research, but with similar goals to ours (i.e., integrating syntax with minimal feature engineering), used tree kernels (Moschitti et al., 2008).
Beyond SRL, there have been many proposals on how to incorporate syntactic information in RNN models, for example, in the context of neural machine translation (Eriguchi et al., 2017; Sennrich and Haddow, 2016). One of the most popular and attractive approaches is to use tree-structured recursive neural networks (Socher et al., 2013; Le and Zuidema, 2014), including stacking them on top of a sequential BiLSTM (Miwa and Bansal, 2016). An approach of Mou et al. (2015) to sentiment analysis and question classification, introduced even before GCNs became popular in the machine learning community, is related to graph convolution. However, it is inherently single-layer and tree-specific, uses bottom-up computations, does not share parameters across syntactic functions and does not use gates. Gates have been previously used in GCNs (Li et al., 2016) but between GCN layers rather than for individual edges.
Previous approaches to integrating syntactic information in neural models are mainly designed to induce representations of sentences or syntactic constituents. In contrast, the approach we presented incorporates syntactic information at word level. This may be attractive from the engineering perspective, as it can be used, as we have shown, instead of or along with RNN models.

Conclusions and Future Work
We demonstrated how GCNs can be used to incorporate syntactic information in neural models and specifically to construct a syntax-aware SRL model, resulting in state-of-the-art results for Chinese and English. There are relatively straightforward steps which can further improve the SRL results. For example, we relied on labeling arguments independently, whereas using a joint model is likely to significantly improve the performance. Also, in this paper we consider the dependency version of the SRL task; however, the model can be generalized to the span-based version of the task (i.e., labeling argument spans with roles rather than syntactic heads of arguments) in a relatively straightforward fashion.
More generally, given the simplicity of GCNs and their applicability to general graph structures (not necessarily trees), we believe that there are many NLP tasks where GCNs can be used to incorporate linguistic structures (e.g., syntactic and semantic representations of sentences and discourse parses or co-reference graphs for documents).