Graph Convolutions over Constituent Trees for Syntax-Aware Semantic Role Labeling

Semantic role labeling (SRL) is the task of identifying predicates and labeling argument spans with semantic roles. Even though most semantic-role formalisms are built upon constituent syntax and only syntactic constituents can be labeled as arguments (e.g., FrameNet and PropBank), all the recent work on syntax-aware SRL relies on dependency representations of syntax. In contrast, we show how graph convolutional networks (GCNs) can be used to encode constituent structures and inform an SRL system. Nodes in our SpanGCN correspond to constituents. The computation is done in 3 stages. First, initial node representations are produced by `composing' word representations of the first and the last word in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are `decomposed' back into word representations which in turn are used as input to the SRL classifier. We show the effectiveness of our syntax-aware model on standard CoNLL-2005, CoNLL-2012, and FrameNet benchmarks.


Introduction
The task of semantic role labeling (SRL) consists of predicting the predicate-argument structure of a sentence. More formally, for every predicate, the SRL model has to identify all argument spans and label them with their semantic roles (see Figure 1).
The most popular resources for estimating SRL models are PropBank (Palmer et al., 2005) and FrameNet (Baker et al., 1998). In both cases annotations are made on top of syntactic constituent structures.

* Research conducted when the author was at the University of Amsterdam.
Figure 1 example: "Investors appeal to the CEO not to limit their access to sales data."

Earlier work on semantic role labeling hinged on constituent syntactic structure, using the trees to derive features and constraints on role assignments (Gildea and Jurafsky, 2002; Pradhan et al., 2005; Punyakanok et al., 2008). In contrast, modern SRL systems largely ignore treebank syntax (He et al., 2017, 2018; Zhou and Xu, 2015) and instead use powerful feature extractors, for example, LSTM sentence encoders.
There have been recent successful attempts to improve neural SRL models using syntax (Roth and Lapata, 2016;Strubell et al., 2018). Nevertheless, they have relied on syntactic dependency representations rather than constituent trees.
In these methods, information from dependency trees is injected into word representations using graph convolutional networks (GCNs) (Kipf and Welling, 2017) or self-attention mechanisms (Vaswani et al., 2017). Since SRL annotations are done on top of syntactic constituents, we argue that exploiting constituency syntax, rather than dependency syntax, is more natural and may yield more predictive features for semantic roles. For example, even though constituent boundaries could be derived from dependency structures, this would require an unbounded number of hops over the dependency structure in GCNs or self-attention. This would be impractical: for example, Strubell et al. (2018) use only one hop in their best system.
Neural models typically treat SRL as a sequence labeling problem, and hence predictions are done for individual words. Though injecting dependency syntax into word representations is relatively straightforward, it is less clear how to incorporate constituency syntax into them. In this work, we show how this can be achieved with GCNs.
Nodes in our SpanGCN correspond to constituents. The computation is done in 3 stages. First, initial span representations are produced by 'composing' word representations of the first and the last word in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are 'decomposed' back into word representations which in turn are used as input to the SRL classifier. This approach directly encodes into word representation information about boundaries and syntactic labels of constituents and also provides information about their neighbourhood in the constituent structure.
SpanGCNs may be beneficial in other NLP tasks, where neural sentence encoders are already effective and syntactic structure can provide a useful inductive bias. For example, consider logical semantic parsing (Dong and Lapata, 2016) or sentence simplification (Chopra et al., 2016). Moreover, SpanGCN can be in principle applied to other forms of span-based linguistic representations (e.g., co-reference graphs). However, we leave this for future work.

Constituency Tree Encoding
The architecture for encoding constituency trees makes use of two building blocks: a bidirectional LSTM for encoding sequences and a graph convolutional network for encoding graph structures.

Figure 2: SpanGCN encoder. First, for each constituent, an initial representation is produced by composing the start and end tokens' BiLSTM states (purple and black dashed arrows, respectively). This is followed with a constituent GCN: red and black arrows represent parent-to-children and children-to-parent messages, respectively. Finally, the constituent is decomposed back: each constituent sends messages to its start and end tokens.

BiLSTM encoder
A bidirectional LSTM (BiLSTM) (Graves, 2013) consists of two LSTMs (Hochreiter and Schmidhuber, 1997): one encodes the left context of a word and one encodes the right context. In this paper we use alternating-stack BiLSTMs as introduced by Zhou and Xu (2015), where the output of the forward LSTM is fed as input to the backward LSTM. As in He et al. (2017), we also employ highway connections (Srivastava et al., 2015) between layers and recurrent dropout (Gal and Ghahramani, 2016) to avoid overfitting.

GCN
The second building block we use is a graph convolutional network (GCN) (Kipf and Welling, 2017). GCNs are neural networks that, given a graph, compute the representation of a node conditioned on its neighboring nodes. They can be seen as running a message-passing algorithm in which the representation of a node is updated based on 'messages' sent by its neighboring nodes (Gilmer et al., 2017).
The input to a GCN is an undirected graph G = (V, E), where V (|V| = n) and E are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the set of edges E also contains a self-loop for every node, i.e., (v, v) ∈ E for any v. We refer to the initial representations of nodes with a matrix X ∈ R^{m×n}, each of its columns x_v ∈ R^m (v ∈ V) encoding node features. The new representation of node v is computed as

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(U x_u + b\big)\Big), \qquad (1)$$

where U ∈ R^{m×m} and b ∈ R^m are a weight matrix and a bias, respectively; N(v) denotes the neighbors of v; ReLU is the rectified linear unit activation function.
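As an illustration, the GCN update above can be sketched in plain Python (toy dimensions; not the paper's actual PyTorch implementation). Each node sums U x_u + b over its neighbourhood, which includes the self-loop, and applies ReLU:

```python
def relu(vec):
    return [max(0.0, x) for x in vec]

def matvec(U, x):
    return [sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]

def gcn_update(U, b, X, neighbours, v):
    """New representation of node v from its neighbours (self-loop included)."""
    m = len(b)
    acc = [0.0] * m
    for u in neighbours[v]:
        msg = matvec(U, X[u])
        acc = [a + m_i + b_i for a, m_i, b_i in zip(acc, msg, b)]
    return relu(acc)

# Toy 2-dimensional example: two nodes connected by a single edge.
U = [[1.0, 0.0], [0.0, 1.0]]           # identity weight matrix
b = [0.0, 0.0]
X = {0: [1.0, -2.0], 1: [3.0, 4.0]}
neighbours = {0: [0, 1], 1: [0, 1]}    # self-loops plus the edge
print(gcn_update(U, b, X, neighbours, 0))  # → [4.0, 2.0]
```

With the identity weight matrix, the update reduces to summing the two node vectors and clipping negatives, which makes the aggregation easy to check by hand.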
The original GCN definition assumes that edges are undirected and unlabeled. We take inspiration from syntactic GCNs, introduced for dependency syntactic structures. Our update function is defined as

$$h_v = \mathrm{LayerNorm}\Big(\sum_{u \in \mathcal{N}(v)} g_{u,v}\,\big(U_{T_c(u,v)}\, x_u + b_{T_f(u,v)}\big)\Big), \qquad (2)$$

where LayerNorm refers to layer normalization (Ba et al., 2016), applied after summing the messages. T_f(u, v) and T_c(u, v) are fine-grained and coarse-grained versions of edge labels. For example, T_c(u, v) may simply return the direction of the arc (i.e., whether the message flows along the graph edge or in the opposite direction), whereas the fine-grained bias can provide some additional syntactic information. The typing decides how many parameters the GCN has. It is crucial to keep the number of coarse-grained types low, as the model has to estimate one R^{m×m} matrix per coarse-grained type. We formally define the types in the next section. We also use scalar gates g_{u,v} to weight the contribution of each node in the neighborhood and potentially ignore irrelevant edges:

$$g_{u,v} = \sigma\big(\hat{u}_{T_c(u,v)} \cdot x_u + \hat{b}_{T_f(u,v)}\big),$$

where σ is the logistic sigmoid function, and û_{T_c(u,v)} ∈ R^m and b̂_{T_f(u,v)} ∈ R are edge-type-specific parameters. We now show how to compose GCN and LSTM layers to produce a syntactically-informed encoder.
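The typed, gated update can be sketched as follows. The dictionaries `U_c`, `b_f`, `gate_u` and `gate_b` are hypothetical stand-ins for the coarse-type matrices, fine-type biases, and gate parameters; real layer normalization also has learned scale and shift, omitted here:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def typed_gcn_update(X, in_edges, U_c, b_f, gate_u, gate_b):
    """in_edges: (u, coarse_type, fine_type) arcs incoming to the node."""
    m = len(next(iter(X.values())))
    acc = [0.0] * m
    for u, tc, tf in in_edges:
        g = sigmoid(dot(gate_u[tc], X[u]) + gate_b[tf])  # scalar gate g_{u,v}
        msg = [sum(U_c[tc][i][j] * X[u][j] for j in range(m)) + b_f[tf][i]
               for i in range(m)]
        acc = [a + g * s for a, s in zip(acc, msg)]
    return layer_norm(acc)

# Toy example: one incoming edge of coarse type 0 and fine type 0.
X = {0: [1.0, 0.0], 1: [0.0, 1.0]}
U_c = {0: [[2.0, 0.0], [0.0, 1.0]]}
gate_u = {0: [10.0, 10.0]}
out = typed_gcn_update(X, [(1, 0, 0)], U_c, {0: [0.0, 0.0]}, gate_u, {0: 0.0})
```

Note that the parameter matrices are looked up per coarse edge type, which is why keeping the number of coarse types small matters for the parameter count.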

From words to constituents and back
The model we propose for encoding constituency structure is shown in Figure 2. It is composed of three modules: constituent composition, constituent GCN and constituent decomposition. Note that there is no parameter sharing across these components.

Constituent composition
The model takes as input word representations, which can either be static word embeddings or contextual word vectors (Peters et al., 2018a). The sentence is first encoded with a BiLSTM to obtain a context-aware representation of each word. A constituency tree is composed of words (V_w) and constituents (V_c). We add representations (initially zero vectors) for each constituent in the tree, i.e., the green blocks in Figure 2. Each constituent representation is computed using GCN updates (Equation 1) from the word representations corresponding to the beginning and the end of its span. The coarse-grained types T_c(u, v) here are binary, distinguishing messages from start tokens vs. from end tokens. The fine-grained edge types T_f(u, v) additionally encode the constituent label (e.g., NP or VP).
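A minimal sketch of the composition step under these assumptions: `W_start` and `W_end` play the role of the per-direction (coarse-type) matrices, and `label_bias` the label-specific (fine-type) bias; gating and the exact parameterization of the paper are omitted:

```python
def relu(vec):
    return [max(0.0, x) for x in vec]

def matvec(W, x):
    return [sum(w * xv for w, xv in zip(row, x)) for row in W]

def compose(span, label, H, W_start, W_end, label_bias):
    """Initial constituent vector from its first/last token BiLSTM states."""
    i, j = span
    start_msg = matvec(W_start, H[i])   # message from the start token
    end_msg = matvec(W_end, H[j])       # message from the end token
    return relu([s + e + b for s, e, b in
                 zip(start_msg, end_msg, label_bias[label])])

# Toy 2-d BiLSTM states for a 3-token sentence; an NP spanning tokens 0..2.
H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W_start = [[1.0, 0.0], [0.0, 1.0]]
W_end = [[1.0, 0.0], [0.0, 1.0]]
label_bias = {"NP": [0.1, -5.0]}
print(compose((0, 2), "NP", H, W_start, W_end, label_bias))  # → [1.1, 0.0]
```

Only the boundary tokens feed the constituent vector, which is what makes the later decomposition step necessary to spread syntactic information back over the whole span.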
Constituent GCN Constituent composition is followed by a layer where constituent nodes exchange messages. This layer makes sure that information about children gets incorporated into the representations of immediate parents and vice versa. The GCN operates on the graph with nodes corresponding to all constituents (V_c) in the trees. Edges connect constituents and their immediate children in the syntactic tree, in both directions. Again, the updates are defined as in Equation 2. As before, T_c(u, v) is binary, now distinguishing parent-to-children messages from children-to-parent messages. T_f(u, v) additionally encodes the label of the constituent sending the message. For example, consider the computation of the VP constituent in Figure 2. It receives a message from the S constituent; this is a parent-to-child message and the 'sender' is S. The parameters corresponding to these edge types are used in computing this message.
Constituent decomposition At this point, we want to 'infuse' words with information coming from constituents. The graph here is the inverse of that used in the composition stage: the constituents pass the information to the first and the last words in their spans. As in the composition stage, T c (u, v) is binary, distinguishing messages to start and end tokens. The fine-grained edge types, also as before, additionally encode the constituent label. In order to spread syntactic information across the sentence, a further BiLSTM layer is used.
Note that residual connections, indicated in blue in Figure 2, let the model bypass the GCN if/where needed.

Semantic Role Labeling
SRL can be cast as a sequence labeling problem where, given an input sentence x of length T and the position p of the predicate in the sentence, the goal is to predict a BIO sequence of semantic roles y (see Figure 1). We test our model on two different semantic role labeling formalisms, PropBank and FrameNet.
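For concreteness, a small helper (illustrative, not part of the model) that recovers labeled argument spans from a BIO role sequence:

```python
def bio_to_spans(tags):
    """Convert a BIO role sequence into (role, start, end) argument spans."""
    spans, start, role = [], None, None
    for t, tag in enumerate(tags + ["O"]):        # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if role is not None:                  # close the open span
                spans.append((role, start, t - 1))
                role = None
            if tag.startswith("B-"):              # open a new span
                role, start = tag[2:], t
        # "I-" tags simply extend the open span
    return spans

tags = ["B-A0", "O", "O", "B-A1", "I-A1", "I-A1"]
print(bio_to_spans(tags))  # → [('A0', 0, 0), ('A1', 3, 5)]
```

This sketch assumes well-formed BIO output, which the CRF decoding described below is responsible for producing.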
PropBank In PropBank conventions, a frame is specific to a predicate sense. For example, for the predicate make, it distinguishes the 'make.01' ('create') frame from the 'make.02' ('cause to be') frame. Though roles are formally frame-specific (e.g., A0 is the 'creator' for the frame 'make.01' and the 'writer' for the frame 'write.01'), there are certain cross-frame regularities. For example, A0 and A1 tend to correspond to proto-agents and proto-patients, respectively.
FrameNet In FrameNet, every frame has its own set of role labels (frame elements in FrameNet terminology). This makes the problem of predicting role labels harder. Unlike in PropBank, lexically distinct predicates (lexical units or targets in FrameNet terms) may evoke the same frame. For example, need and require can both trigger the frame 'Needing'.
As in previous work we compare to, we assume to have access to gold frames (Swayamdipta et al., 2018;Yang and Mitchell, 2017).
Word representation We represent words with 100-dimensional GloVe embeddings (Pennington et al., 2014), kept fixed during training. Word embeddings are concatenated with 100-dimensional embeddings of a predicate binary feature (indicating whether the word is the target predicate or not). Before concatenation, the GloVe embeddings are passed through layer normalization (Ba et al., 2016) and dropout (Srivastava et al., 2014). Formally,

$$x_t = \big[\mathrm{Dropout}\big(\mathrm{LayerNorm}(\mathrm{emb}(w_t))\big);\ \mathrm{predemb}(t)\big],$$

where predemb(t) is a function that returns the embedding for the presence or absence of the predicate at position t. The obtained embedding x_t is then fed to the sentence encoder.
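The input construction can be sketched as follows (layer normalization and dropout omitted; `predemb` is a hypothetical two-row lookup table for the binary predicate feature):

```python
def word_input(word_vec, is_predicate, predemb):
    """Concatenate a word embedding with a predicate-indicator embedding."""
    return list(word_vec) + list(predemb[1 if is_predicate else 0])

# Toy table: row 0 = predicate absent, row 1 = predicate present (2-d here,
# 100-d in the paper).
predemb = [[0.0, 0.0], [1.0, 1.0]]
x_t = word_input([0.3, -0.2, 0.5], True, predemb)
print(x_t)  # → [0.3, -0.2, 0.5, 1.0, 1.0]
```

The same sentence is thus re-encoded once per predicate, with only the indicator embedding changing between passes.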
Sentence encoder As the sentence encoder we use SpanGCN, introduced in Section 2. SpanGCN is fed with the word representations x_t. Its output is a sequence of hidden vectors h_t that encode syntactic information for each candidate argument. As a baseline we also use a syntax-agnostic sentence encoder, a reimplementation of the encoder of He et al. (2017) with stacked alternating LSTMs, i.e., our model with the three GCN layers stripped off.

Bilinear scorer Following Strubell et al. (2018), we used a bilinear scorer:

$$s_{pt} = \big(h^{pred}_p\big)^{\top} U\, h^{role}_t,$$

where h^{pred}_p and h^{role}_t are non-linear projections of the predicate representation h_p at position p in the sentence and of the candidate argument representation h_t, respectively. The scores s_{pt} are passed through the softmax function and fed to the conditional random field (CRF) layer.
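A sketch of the bilinear form (the non-linear projections are omitted and `U` is a toy weight matrix, not the model's learned parameters):

```python
def bilinear_score(h_pred, U, h_role):
    """s_pt = h_pred^T U h_role, written out as an explicit double sum."""
    return sum(h_pred[i] * U[i][j] * h_role[j]
               for i in range(len(h_pred)) for j in range(len(h_role)))

U = [[1.0, 0.0], [0.0, 2.0]]
print(bilinear_score([1.0, 2.0], U, [3.0, 1.0]))  # → 7.0
```

One such score is produced for every (predicate position, token, label) combination; in practice this is a batched matrix product rather than a Python loop.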
Conditional random field As the output layer we use a first-order Markov CRF (Lafferty et al., 2001). The Viterbi algorithm is used to predict the most likely label assignment at test time.
At training time we learn the scores for transitions between BIO labels. The entire model is trained to minimize the negative conditional log-likelihood:

$$\mathcal{L} = -\sum_{j} \log P\big(y^{(j)} \mid x^{(j)}, p^{(j)}\big),$$

where p^{(j)} is the predicate position for training example j.
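The Viterbi decoding used at test time can be sketched as follows (toy scores; a real CRF implementation works with the full log-potentials):

```python
def viterbi(emissions, transitions):
    """emissions: [T][L] per-token label scores; transitions: [L][L] scores.
    Returns the highest-scoring label sequence as a list of label indices."""
    T, L = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(L):
            best_i = max(range(L), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            ptr.append(best_i)
        back.append(ptr)
        score = new_score
    path = [max(range(L), key=lambda j: score[j])]
    for ptr in reversed(back):          # follow backpointers
        path.append(ptr[path[-1]])
    return path[::-1]

# Labels: 0 = B-A0, 1 = I-A0, 2 = O; the O -> I transition is forbidden.
NEG = -1e9
transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, NEG, 0.0]]
emissions = [[5.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 3.0]]
print(viterbi(emissions, transitions))  # → [0, 1, 2]  (B-A0, I-A0, O)
```

Setting forbidden transitions to a large negative score is also how well-formedness of BIO output can be enforced at decoding time.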

Data and setting
We experimented on the CoNLL-2005 and CoNLL-2012 (OntoNotes) datasets, and used the CoNLL 2005 evaluation script for evaluation. We also applied our approach to FrameNet 1.5 with the data split of Das et al. (2014) and followed the official evaluation set-up from the SemEval07 Task 19 on frame-semantic parsing (Baker et al., 2007).
We trained the self-attentive constituency parser of Kitaev and Klein (2018) to obtain predicted constituency trees. We used 100-dimensional GloVe embeddings for all our experiments, unless otherwise specified. The hyperparameters were tuned on the CoNLL-2005 development set. The LSTM hidden state dimension was set to 300 for the CoNLL experiments and to 200 for the FrameNet ones. In our model, we used a four-layer BiLSTM below the GCN layers and a two-layer BiLSTM on top. We used an eight-layer BiLSTM in our syntax-agnostic baseline; the number of layers was independently tuned on the CoNLL-2005 development set. For the ELMo experiments, we learned the mixing coefficients of ELMo, projected the weighted sum of the ELMo layers to a 100-dimensional vector, and applied layer normalization, ReLU, and dropout.
For the FrameNet experiments, we constrained the CRF layer to accept only BIO tags compatible with the selected frame. We used Adam (Kingma and Ba, 2015) as the optimizer with an initial learning rate of 0.001, halving the learning rate if we did not see an improvement on the development set for two epochs. We trained the model for a maximum of 100 epochs.
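The frame constraint can be sketched as masking the scores of BIO tags whose roles the selected frame does not license before decoding (frame and role names below are made up for illustration):

```python
NEG_INF = float("-inf")

def mask_scores(scores, tagset, allowed_roles):
    """Push scores of tags with frame-incompatible roles to -inf."""
    masked = []
    for tag, s in zip(tagset, scores):
        role = tag.split("-", 1)[1] if "-" in tag else None  # None for "O"
        masked.append(s if role is None or role in allowed_roles else NEG_INF)
    return masked

tagset = ["O", "B-Cognizer", "I-Cognizer", "B-Topic", "I-Topic"]
allowed = {"Cognizer"}                 # roles licensed by the gold frame
print(mask_scores([0.1, 2.0, 1.0, 3.0, 0.5], tagset, allowed))
```

After masking, Viterbi decoding can never select a tag for a role the frame does not define, regardless of how high its unconstrained score was.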
All models were implemented with PyTorch.  We used some modules from AllenNLP 7 and the reimplementation of the FrameNet evaluation scripts by Swayamdipta et al. (2018). 8

Importance of syntax and ablations
Before comparing our full model to state-of-the-art SRL systems, we show that our model genuinely benefits from incorporating syntactic information and motivate other modeling decisions (e.g., the presence of BiLSTM layers at the top). We perform this analysis on the CoNLL-2005 dataset. We also experiment with gold-standard syntax, as this provides an upper bound on what SpanGCN can gain from using syntactic information.
From Table 1, we can see that SpanGCN improves over the syntax-agnostic baseline by 1.2% F1, a substantial boost from using predicted syntax. We can also observe that it is important to have the top BiLSTM layer: when we remove it, the performance drops by 1% F1. It is interesting that without this last layer, SpanGCN's performance is roughly the same as that of the baseline. This shows the importance of spreading syntactic information from constituent boundaries to the rest of the sentence. When we compare SpanGCN relying on predicted syntax with the version using gold-standard syntax, we can see that SRL scores improve greatly. 9 This suggests that, despite its simplicity (e.g., the somewhat impoverished parameterization of constituent GCNs), SpanGCN is capable of extracting predictive features from syntactic structures.

7 https://github.com/allenai/allennlp
8 https://github.com/swabhs/scaffolding
We also measured the performance of the models above as a function of sentence length (Figure 3), and as a function of the distance between a predicate and its arguments (Figure 4). Not surprisingly, the performance of every model degrades with length. For the model using gold syntax, the difference between F1 scores on short and long sentences is smaller (2.2% F1) than for the models using predicted syntax (6.9% F1). This is also expected: in the gold-syntax set-up SpanGCN can rely on perfect syntactic parses even for long sentences, while in the realistic set-up syntactic features start to become unreliable. SpanGCN performs on par with the baseline for very short and very long sentences. Intuitively, for short sentences BiLSTMs may already encode enough syntactic information, while for longer sentences the quality of predicted syntax is not good enough to yield gains over the BiLSTM baseline.

9 The syntactic parser we use scores 92.5% F1 on the development set.
When considering the performance of each model as a function of the distance between a predicate and its arguments, we observe that all models struggle with more 'remote' arguments. Evaluated in this setting, SpanGCN is slightly better than the baseline.
We also checked what kind of errors these models make by using an oracle to correct one error type at a time and measuring the influence on performance (He et al., 2017). Figure 5 shows the results. We can see that all the models make the same fraction of mistakes in labeling arguments, even with gold syntax. It is also clear that using gold syntax and, to a lesser extent, predicted syntax helps the model to figure out the exact boundaries of argument spans. The improvement of SpanGCN with gold syntax after fixing the span-related errors (merge two spans, split into two spans, fix both boundaries) is 1.4% F1, while for SpanGCN with predicted syntax it is 6.1% F1. Correcting the same errors for the BiLSTM baseline results in a difference of 6.8% F1.

Comparing to the state of the art
We compare SpanGCN with state-of-the-art models on both CoNLL-2005 and CoNLL-2012. 10

CoNLL-2005 In Table 2 (Single) we show results on the CoNLL-2005 dataset. We compare the model with state-of-the-art approaches that use syntax (Strubell et al., 2018) and with syntax-agnostic models (He et al., 2017, 2018; Tan et al., 2018; Ouchi et al., 2018). SpanGCN obtains state-of-the-art results, also outperforming the multi-task self-attention model of Strubell et al. (2018). 11 The performance on the out-of-domain data shows that SpanGCN is quite robust to noisier syntax. This may be surprising, given that a GCN-based dependency-SRL model did not benefit from using dependency syntax on out-of-domain data.

CoNLL-2012 In Table 3 (Single) we report results on the CoNLL-2012 dataset. SpanGCN obtains 84.4% F1, outperforming all previous models evaluated on this data.

10 We only considered single, non-ensemble models.
11 We compared with the LISA model where no ELMo information is used, neither in the syntactic parser nor in the SRL components.

ELMo Experiments
We also tested SpanGCN with contextualized word embeddings, ELMo (Peters et al., 2018a), using them both to train the syntactic parser of Kitaev and Klein (2018) and as input to our model.
In Table 4, we show the impact of ELMo used in different ways: as word embeddings (EMB), as predicted syntax obtained with the ELMo-based parser (SYN), and both (EMB-SYN). As expected, using ELMo always results in an improvement. Using ELMo as input word embeddings (EMB) is more effective than using it indirectly through predicted syntax (SYN): 85.9% vs. 85.7% F1. When using both ELMo embeddings and the ELMo-based parser, we obtain an even better score, 86.6% F1. This result is 2.2% better than SpanGCN without ELMo and 0.65% better than the EMB model. This suggests that although contextualized word embeddings contain information about syntax (Tenney et al., 2019; Hewitt and Manning, 2019; Peters et al., 2018b), explicitly encoding high-quality syntax is still useful.
In Table 2 we also compare our model against Strubell et al. (2018).

FrameNet On FrameNet data, we compare SpanGCN with the sequential and sequential-span ensemble models of Yang and Mitchell (2017), and with the multi-task learning model of Swayamdipta et al. (2018). Swayamdipta et al. (2018) use a multi-task learning objective where the syntactic scaffolding model and the semantic role labeler share the same sentence encoder and are trained together on disjoint data. Like our method, this approach injects syntactic information (though dependency rather than constituent syntax) into word representations which are then used by the SRL model. We show results obtained on the FrameNet test set in Table 5. SpanGCN obtains a 69.3% F1 score. It performs better than the syntax-agnostic baseline (a 2.9% improvement) and better than the syntax-agnostic ensemble model (ALL) of Yang and Mitchell (2017) (a 3.8% improvement). SpanGCN slightly outperforms (by 0.2% F1) the multi-task syntactic model of Swayamdipta et al. (2018), obtaining state-of-the-art results.

Related Work
Among earlier approaches to incorporating syntax in SRL, Socher et al. (2013) and Tai et al. (2015) proposed recursive neural networks that encode constituency trees by recursively creating representations of constituents. There are two important differences with our approach. First, in our model the syntactic information in the constituents flows back to word representations; with recursive networks, this may be achieved with their inside-outside versions (Le and Zuidema, 2014; Teng and Zhang, 2017). Second, these previous models perform a global pass over the tree, whereas GCNs take into account only small fragments of the graph. This may make GCNs more robust when using noisy predicted syntactic structures.

More recently, dependency syntax has gained a lot of attention. Similarly to this work, GCNs have been used to encode dependency structure for SRL. Strubell et al. (2018) used a multi-task objective to force one of the heads of a self-attention model to predict syntactic edges. Roth and Lapata (2016) encoded dependency paths between predicates and arguments using an LSTM. Also, Swayamdipta et al. (2018) used a multi-task learning objective to produce syntactically-informed word representations, with a sentence encoder shared between two tasks: a main task (SRL) and an auxiliary syntax-related task.

In earlier work, syntax has been incorporated in a number of different ways. Naradowsky et al. (2012) used graphical models to encode syntactic structures, while Moschitti et al. (2008) applied tree kernels to encode constituency trees for SRL.

Many approaches cast SRL as a span classification problem instead of treating it as sequence labeling. FitzGerald et al. (2015) used hand-crafted features to represent spans, while He et al. (2018) and Ouchi et al. (2018) adopted a BiLSTM feature extractor. In principle, SpanGCN can also be used as a syntactic feature extractor within this class of models.

Conclusions
In this paper we introduced SpanGCN, a novel neural architecture for encoding constituency syntax at the word level. We applied SpanGCN to semantic role labeling, on PropBank and FrameNet. We observed substantial improvements from using constituent syntax on both datasets, including in the realistic out-of-domain setting. Given that GCNs over dependency and constituency structures have access to very different information, it would be interesting to see in future work whether combining the two types of representations can lead to further improvements. While we experimented only with constituency syntax, SpanGCN may in principle be able to encode any kind of span structure, for example coreference graphs, and can also be used to produce linguistically-informed encoders for NLP tasks other than SRL.