A Unified Syntax-aware Framework for Semantic Role Labeling

Semantic role labeling (SRL) aims to recognize the predicate-argument structure of a sentence. Considerable attention has been paid to the role of syntactic information in enhancing SRL. However, the latest advances suggest that syntax may not be so important for SRL, as the gap between syntax-aware and syntax-agnostic SRL has become much smaller. To comprehensively explore the role of syntax in the SRL task, we extend existing models and propose a unified framework to investigate more effective and more diverse ways of incorporating syntax into sequential neural networks. By examining the effect of syntactic input quality on SRL performance, we confirm that a high-quality syntactic parse can still effectively enhance syntactically-driven SRL. Using an empirically optimized integration strategy, we even enlarge the gap between syntax-aware and syntax-agnostic SRL. Our framework achieves state-of-the-art results on the CoNLL-2009 benchmarks for both English and Chinese, substantially outperforming all previous models.


Introduction
The purpose of semantic role labeling (SRL) is to derive the predicate-argument structure of each predicate in a sentence. A popular formalism for representing the semantic predicate-argument structure is based on dependencies, namely dependency SRL, which annotates the heads of arguments rather than phrasal arguments. Given a sentence (as in Figure 1), SRL is generally decomposed into multiple subtasks in a pipeline framework, consisting of predicate identification (makes), predicate disambiguation (make.02), argument identification (e.g., Someone) and argument classification (Someone is A0 for the predicate makes). SRL is beneficial to a wide range of natural language processing (NLP) tasks, including machine translation (Shi et al., 2016) and question answering (Berant et al., 2013; Yih et al., 2016). Most traditional SRL methods rely heavily on feature templates that struggle to capture sufficient discriminative information, while neural models are capable of extracting features automatically. In particular, recent works (Zhou and Xu, 2015; He et al., 2017) propose syntax-agnostic models for SRL and achieve favorable results, which seems to conflict with the belief that syntactic information is an absolutely necessary prerequisite for high-performance SRL (Gildea and Palmer, 2002). Despite the success of these models, the main reasons for putting syntax aside are two-fold. First, it is still challenging to effectively incorporate syntactic information into neural SRL models, due to the sophisticated tree structure of syntactic relations. Second, syntactic parsers are unreliable, and the risk of erroneous syntactic input may lead to error propagation and unsatisfactory SRL performance.
However, syntactic information is considered closely related to semantic relations and plays an essential role in the SRL task (Punyakanok et al., 2008). Recently, an SRL model based on syntactic graph convolutional networks (GCNs) was proposed and further improved SRL performance when given a relatively better syntactic parser as input. Since syntax can provide rich structure and information for SRL, we seek to effectively model the complex syntactic tree structure so as to incorporate syntax into neural SRL.
In this paper, we present a general framework 1 for SRL, which enables us to integrate syntax into SRL in diverse ways. Following previous work, we focus on argument labeling and formulate SRL as a sequence labeling problem. However, we differ by (1) leveraging an enhanced word representation, (2) applying recent advances in recurrent neural networks (RNNs), such as highway connections (Srivastava et al., 2015), (3) using a deep encoder with residual connections (He et al., 2016), (4) further extending the Syntax Aware Long Short-Term Memory (SA-LSTM) (Qian et al., 2017) for SRL, and (5) introducing the Tree-Structured Long Short-Term Memory (Tree-LSTM) (Tai et al., 2015) to model syntactic information for SRL.
In addition, as pointed out by He et al. (2017) for span SRL, poor syntactic input will hurt performance if the syntactically-driven SRL model trusts the syntactic information too much, while high-quality syntax can still make a large impact on SRL, which motivates us to investigate the effect of syntactic quality on dependency SRL. In summary, our major contributions are as follows:
• We propose a unified neural framework for dependency SRL to more effectively integrate syntactic information with multiple methods.
• Our SRL framework incorporating syntax achieves new state-of-the-art results on both the English and Chinese CoNLL-2009 benchmarks.
• We explore the impact of syntactic inputs of different quality on SRL performance, showing that a high-quality syntactic parse may indeed improve syntax-aware SRL.

A Unified SRL Framework
In order to explore the effectiveness of the syntactic feature from various perspectives, we propose a unified neural framework that is capable of optionally accommodating various types of syntactic encoders for syntax-based SRL.
Since the CoNLL-2009 shared task (Hajič et al., 2009) indicates the predicate positions beforehand, we only need to identify and label all arguments for each predicate, which is a typical sequence tagging problem. In this work, we construct a general SRL framework for argument labeling. As shown in Figure 2, our SRL framework includes three main modules: (1) a BiLSTM encoder that directly takes the sequential inputs, (2) an MLP with highway connections before the softmax output layer, and (3) an optional syntactic encoder that receives the outputs of the BiLSTM encoder and integrates its own outputs with the BiLSTM outputs through residual connections. Note that when the syntactic encoder is completely removed, the MLP takes its inputs directly from the BiLSTM encoder, which makes our framework a syntax-agnostic labeler.

Sentence Encoder
Word representation Given a sentence and a known predicate, we consider a predicate-specific word representation, following previous work. Specifically, the embedding representation e_i of each word in the input sentence is the concatenation of several features: a randomly initialized word embedding e^r_i, a pretrained word embedding e^p_i, a randomly initialized lemma embedding e^l_i, a randomly initialized POS tag embedding e^pos_i, and a predicate-specific feature e^f_i, a binary flag set to 0 or 1 indicating whether the current word is the given predicate.
To further enhance the word representation, we leverage an external embedding, ELMo (Embeddings from Language Models), proposed by Peters et al. (2018). ELMo is obtained from a deep bidirectional language model that takes characters as input, enriching subword and contextual information, and has expressive representational power. Eventually, the resulting word representation is the concatenation e_i = [e^r_i; e^p_i; e^l_i; e^pos_i; e^f_i; ELMo_i]. BiLSTM encoder We use a bi-directional Long Short-Term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) as the sentence encoder to model the sequential inputs. Given an input sequence (e_1, . . . , e_n), the BiLSTM processes the embedding vectors sequentially from both directions to obtain two separate hidden states, →h_i and ←h_i. By concatenating the two states, we get a contextual representation h_i = [→h_i; ←h_i], which is fed to the next BiLSTM layer as input. In this work, we stack four layers of BiLSTM.
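To make the word-representation step concrete, here is a minimal NumPy sketch of the concatenation; all table names, id conventions, and dimensions are illustrative, not taken from the paper's code:

```python
import numpy as np

def word_representation(word_ids, lemma_ids, pos_ids, pred_index, tables, elmo):
    """Build e_i = [e^r_i; e^p_i; e^l_i; e^pos_i; e^f_i; ELMo_i] for each word.
    `tables` maps an (illustrative) feature name to its lookup matrix."""
    reps = []
    for i, (w, l, p) in enumerate(zip(word_ids, lemma_ids, pos_ids)):
        flag = np.array([1.0 if i == pred_index else 0.0])  # predicate flag e^f_i
        reps.append(np.concatenate([
            tables["rand"][w],        # randomly initialized word embedding e^r_i
            tables["pretrained"][w],  # pretrained (GloVe/Word2Vec) embedding e^p_i
            tables["lemma"][l],       # lemma embedding e^l_i
            tables["pos"][p],         # POS tag embedding e^pos_i
            flag,                     # predicate-specific binary feature e^f_i
            elmo[i],                  # contextual ELMo vector for position i
        ]))
    return np.stack(reps)
```

The stacked matrix is what the first BiLSTM layer would consume, one row per token.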

Role Labeler
We adopt a Multi-Layer Perceptron (MLP) with highway connections (Srivastava et al., 2015) on top of our deep encoder, which takes the concatenated representation as input. The MLP consists of 10 layers, and we employ ReLU activations for the hidden layers. To get the final predicted semantic roles, we apply a softmax layer over the outputs to maximize the likelihood of the labels. The MLP takes inputs from both the BiLSTM encoder and the syntactic encoder, which are joined through a residual connection (He et al., 2016) as shown in Figure 2. It is worth noting that our deep encoder differs from that of Marcheggiani and Titov (2017), which directly applies a softmax transformation over the syntactic representation and predicts the role label for each word; that is, their syntactic encoder outputs are taken directly as the input of the hidden layer.
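The residual combination and the highway MLP can be sketched as follows; this is a simplified illustration under our own parameter naming, not the paper's implementation:

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: a gate t mixes a ReLU transform with the identity."""
    t = 1.0 / (1.0 + np.exp(-(x @ W_t + b_t)))   # transform gate
    h = np.maximum(0.0, x @ W_h + b_h)           # candidate transform (ReLU)
    return t * h + (1.0 - t) * x                 # carry the rest of x through

def role_distribution(h_bilstm, v_syntactic, highway_params, W_out, b_out):
    """Residual combination of BiLSTM and syntactic encoder outputs, a stack
    of highway layers, then a softmax over the role labels."""
    x = h_bilstm + v_syntactic                   # residual connection
    for W_h, b_h, W_t, b_t in highway_params:
        x = highway_layer(x, W_h, b_h, W_t, b_t)
    logits = x @ W_out + b_out
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)     # per-token role probabilities
```

When the syntactic encoder is removed, `v_syntactic` simply disappears from the sum, which is the syntax-agnostic variant described above.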

Syntactic Encoder
To integrate syntactic information into sequential neural networks, we employ a syntactic encoder on top of the BiLSTM encoder. Specifically, given a syntactic dependency tree T, for each node n_k in T, let C(k) denote the set of syntactic children of n_k, H(k) the syntactic head of n_k, and L(k, ·) the dependency relation between n_k and the nodes that have a direct arc from or to n_k. We then formulate the syntactic encoder as a transformation f_τ over the node n_k, which may take some of C(k), H(k), or L(k, ·) as input and computes a syntactic representation v_k for node n_k. When not otherwise specified, x_k denotes the input feature representation of n_k, which may be either the word representation e_k or the BiLSTM output h_k, σ denotes the logistic sigmoid function, and ⊙ denotes element-wise multiplication.
In practice, the transformation f_τ can be any syntax encoding method. In this paper, we consider three types of syntactic encoders: the syntactic graph convolutional network (syntactic GCN, Section 3.1), the syntax-aware LSTM (SA-LSTM, Section 3.2), and the tree-structured LSTM (Tree-LSTM, Section 3.3), each of which is briefly introduced in the corresponding subsection.

Syntactic GCN
GCN (Kipf and Welling, 2017) was proposed to induce representations of nodes in a graph based on the properties of their neighbors. Given its effectiveness, a generalized version, namely the syntactic GCN, was introduced for the SRL task and shown to be effective in incorporating syntactic information into neural models.
Syntactic GCN captures syntactic information flowing in two directions: from heads to dependents (along) and from dependents to heads (opposite). Besides, it also models the information flow from a node to itself; that is, it assumes the syntactic graph contains a self-loop for each node. Thus, the syntactic GCN transformation of a node n_k is defined over its neighborhood N(k). For each edge connecting n_k and a neighbor n_j, we compute a vector representation

u_{k,j} = W_{dir(k,j)} x_j + b_{L(k,j)},

where dir(k, j) denotes the direction type (along, opposite or self-loop) of the edge from n_k to n_j, W_{dir(k,j)} is a direction-type-specific parameter, and b_{L(k,j)} is a label-specific parameter. Considering that syntactic information from the neighboring nodes may make different contributions to semantic role labeling, the syntactic GCN introduces an additional edge-wise gate for each node pair (n_k, n_j):

g_{k,j} = σ(x_j · ŵ_{dir(k,j)} + b̂_{L(k,j)}).

The syntactic representation v_k for a node n_k can then be computed as:

v_k = ReLU(Σ_{j∈N(k)} g_{k,j} u_{k,j}).
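A minimal sketch of one syntactic GCN layer with edge-wise gating, as described above; the edge-list format and parameter shapes are our own illustrative choices:

```python
import numpy as np

def syntactic_gcn_layer(x, edges, W, b, w_gate, b_gate):
    """x: (n, d) node inputs; edges: (k, j, direction, label) tuples, where
    direction is "along", "opposite", or "self" (self-loops included).
    W[direction]: (d, d) matrix; b[label]: (d,) vector; w_gate[direction]:
    (d,) vector; b_gate[label]: scalar. Returns v: (n, d)."""
    acc = np.zeros_like(x)
    for k, j, direction, label in edges:
        u = x[j] @ W[direction] + b[label]  # message u_{k,j} along the edge
        g = 1.0 / (1.0 + np.exp(-(x[j] @ w_gate[direction] + b_gate[label])))
        acc[k] += g * u                     # scalar edge gate g_{k,j} scales u
    return np.maximum(0.0, acc)             # ReLU over the gated sums
```

A node with only a self-loop recovers (a gated version of) its own transformed input, which is why the self-loop edges matter.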

SA-LSTM
SA-LSTM (Qian et al., 2017) is an extension of the standard BiLSTM architecture that aims to simultaneously encode the syntactic and contextual information of a given word, as shown in Figure 2. On the one hand, the SA-LSTM calculates the hidden state in sequential timestep order like the standard LSTM; on the other hand, it further incorporates syntactic information into the representation of each word by introducing an additional gate over a syntactic context vector h̃_k = Σ_j α_j h_j, the weighted sum of the hidden state vectors h_j of preceding nodes (words) n_j, where the weight α_j is a trainable weight related to the dependency relation L(k, ·) when there exists a directed edge from n_j to n_k, and 0 otherwise.
Note that h̃_k is always the hidden state vector of the syntactic head of n_k according to the definition of α_j. Since a word is assigned a single syntactic head, such a strict constraint prevents the SA-LSTM from incorporating complex syntactic structures. Inspired by the idea of GCN, we relax the directedness constraint, letting α_j be nonzero whenever there is an edge between n_j and n_k.
After the SA-LSTM transformation, the outputs of the SA-LSTM layer from both directions are concatenated and taken as the syntactic representation v_k. Different from the syntactic GCN, the SA-LSTM encodes both syntactic and contextual information in a single vector v_k.
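The relaxed syntactic context above can be sketched as follows. This is a deliberate simplification that omits the LSTM recurrence and the extra gate, and `alpha` (a map from dependency label to a trainable scalar) is our illustrative parameterization:

```python
import numpy as np

def syntactic_context(h, arcs, alpha):
    """h: (n, d) hidden states; arcs: (head j, dependent k, label) tuples.
    Returns h_tilde with h_tilde[k] accumulating alpha[label] * h[j] over
    every arc touching n_k, i.e. the relaxed (undirected) variant."""
    h_tilde = np.zeros_like(h)
    for j, k, label in arcs:
        h_tilde[k] += alpha[label] * h[j]   # original head-to-dependent term
        h_tilde[j] += alpha[label] * h[k]   # added by relaxing the direction
    return h_tilde
```

Dropping the second accumulation line recovers the strict, single-head behavior of the original SA-LSTM.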

Tree-LSTM
Tree-LSTM (Tai et al., 2015) can be considered an extension of the standard LSTM that aims to model tree-structured topologies. At each timestep, it composes an input vector and the hidden states of arbitrarily many child units. Specifically, the main difference between a Tree-LSTM unit and a standard one is that the memory cell update and the calculation of the gating vectors depend on multiple child units. A Tree-LSTM unit can be connected to an arbitrary number of child units and assigns a separate forget gate to each of them, giving the Tree-LSTM the flexibility to incorporate or drop the information from each child unit.
Given a syntactic tree, the Tree-LSTM transformation is defined on a node n_k and its children set C(k). Following Tai et al. (2015), it can be formulated as:

h̃_k = Σ_{j∈C(k)} h_j,
i_k = σ(W^(i) x_k + U^(i) h̃_k + b^(i)),
f_{k,j} = σ(W^(f) x_k + U^(f) h_j + b^(f)), j ∈ C(k),
o_k = σ(W^(o) x_k + U^(o) h̃_k + b^(o)),
u_k = tanh(W^(u) x_k + U^(u) h̃_k + b^(u)),
c_k = i_k ⊙ u_k + Σ_{j∈C(k)} f_{k,j} ⊙ c_j,
h_k = o_k ⊙ tanh(c_k),

where h_j is the hidden state of the j-th child node, c_k is the memory cell of the head node n_k, and h_k is the hidden state of node n_k. Note that a separate forget gate f_{k,j} is computed for each child hidden state h_j.
However, the primitive form of the Tree-LSTM does not take dependency relations into consideration. Given the importance of dependency relations in the SRL task, we further extend the Tree-LSTM by adding an additional relation gate r_{k,j} = σ(W^(r) x_k + U^(r) h_j + b_{L(k,j)}), where b_{L(k,j)} is a relation-label-specific bias term, and reformulate the memory cell update as c_k = i_k ⊙ u_k + Σ_{j∈C(k)} r_{k,j} ⊙ f_{k,j} ⊙ c_j. After the Tree-LSTM transformation, the hidden state of each node in the dependency tree is taken as its syntactic representation, i.e., v_k = h_k.
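One plausible reading of the relation-gated, child-sum Tree-LSTM cell is sketched below; the exact placement of the relation gate and all parameter names are our assumptions for illustration, not the paper's published equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x_k, children, P, b_rel):
    """Child-sum Tree-LSTM step for node n_k with a relation gate.
    children: (h_j, c_j, label) triples; P: dict of weight matrices;
    b_rel: dependency label -> bias vector for the relation gate."""
    h_sum = sum((h for h, _, _ in children), np.zeros_like(x_k))
    i = sigmoid(x_k @ P["Wi"] + h_sum @ P["Ui"])   # input gate
    o = sigmoid(x_k @ P["Wo"] + h_sum @ P["Uo"])   # output gate
    u = np.tanh(x_k @ P["Wu"] + h_sum @ P["Uu"])   # candidate update
    c = i * u
    for h_j, c_j, label in children:
        f = sigmoid(x_k @ P["Wf"] + h_j @ P["Uf"])                 # per-child forget gate
        r = sigmoid(x_k @ P["Wr"] + h_j @ P["Ur"] + b_rel[label])  # relation gate
        c = c + r * f * c_j                                        # gated child memory
    h = o * np.tanh(c)
    return h, c   # h doubles as the syntactic representation v_k
```

Processing a dependency tree then amounts to calling `tree_lstm_node` bottom-up, from leaves to the root.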

Experiments
We evaluate the performance of our syntactic GCN (henceforth Syn-GCN), SA-LSTM and Tree-LSTM models on the CoNLL-2009 datasets for both English and Chinese, with the standard training, development and test splits. For predicate disambiguation, we follow previous work, using the off-the-shelf disambiguator from Roth and Lapata (2016). For the syntactic dependency trees, we parse the corpus with the Biaffine Parser (Dozat and Manning, 2017).

Experimental Settings
In our experiments, the pre-trained word embeddings for English are 100-dimensional GloVe vectors (Pennington et al., 2014). For Chinese, we exploit Wikipedia documents to train Word2Vec embeddings (Mikolov et al., 2013) of the same dimension. All other vectors are randomly initialized; the dimension of the lemma embeddings is 100, and the dimension of the POS tag embeddings is 32. In addition, we use 300-dimensional ELMo embeddings for English 2 .
During training, we use categorical cross-entropy as the objective, with the Adam optimizer (Kingma and Ba, 2015), a learning rate of 0.001, and a batch size of 64. The BiLSTM encoder consists of 4 BiLSTM layers with 512-dimensional hidden units. We apply dropout to the BiLSTM with a 90% keep probability between time-steps and layers. We train models for a maximum of 20 epochs and select the best model based on the English development results.

Results
We compare our Syn-GCN, SA-LSTM and Tree-LSTM models with previous approaches to dependency SRL on both English and Chinese. Noteworthily, our model is local (argument identification and classification decisions are conditionally independent) and single, without reranking: it includes neither global inference nor model combination. The results for English and Chinese are reported in Tables 1 and 2, respectively. For English, our Syn-GCN, SA-LSTM and Tree-LSTM models surpass most previously published single models, achieving state-of-the-art results of 89.8%, 89.7% and 89.4% F 1 , respectively. In comparison to ensemble models, our Syn-GCN even outperforms the previous best model by a margin of 0.7% F 1 .
From Table 1, we also see that our Syn-GCN model provides the best recall and F 1 score, while our SA-LSTM model yields competitive performance with higher precision at the expense of recall, which shows that SA-LSTM is better at classifying arguments. Overall, the Tree-LSTM gives slightly weaker performance, which may be attributed to its tree-structured network topology. More specifically, the Tree-LSTM only considers information from child units, so each node lacks information from its parent, whereas our Syn-GCN and SA-LSTM combine bidirectional information, both head-to-dependent and dependent-to-head.
For Chinese (Table 2), even though we use the same hyperparameters as for English, our models are still comparable with the best reported results. Table 3 presents the results on the English out-of-domain test set. Our models outperform the previous highest records, with absolute improvements of 0.2-0.5% F 1 . These favorable results on both in-domain and out-of-domain data demonstrate the effectiveness and robustness of our proposed unified framework.

Ablation and Analysis
To investigate the contributions of word representation and deep encoder in our method, we conduct a series of ablation studies on the English development set, unless otherwise stated.
Effect of word representation In order to better understand how the enhanced word representation influences model performance, we train our Syn-GCN model with different settings of the input word embeddings. Table 4 shows the results when we remove the POS tag and ELMo embeddings, respectively. Interestingly, the impact of the POS tag embedding (about 0.4% F 1 ) is smaller than in previous works, which allows us to build an accurate model even when POS tags are unavailable. We also observe that the effect of the ELMo embedding is considerable (1.2% F 1 performance degradation when it is removed). These results indicate that combining these features enhances the word representation, leading to SRL performance improvement.
Table 5 reports F 1 scores of our Syn-GCN model and previous systems on the English test set in both syntax-agnostic and syntax-aware settings; ∆ F 1 shows the absolute performance gap between the two settings. The comparison shows that our framework is more effective at incorporating syntactic information, gaining more from introducing syntax over the syntax-agnostic baseline than previous state-of-the-art systems did.

Effect of deep encoder
To further investigate the impact of the deep encoder, we run our Syn-GCN, SA-LSTM and Tree-LSTM models with an alternative configuration, using the same encoder as Marcheggiani and Titov (2017) (M&T encoder for short), which removes the residual connections from our framework. The corresponding results are summarized in Table 6 for comparison; note that the first row gives the results of our syntax-agnostic model. Surprisingly, we observe a dramatic performance decline of 1.2% F 1 for our Syn-GCN model with the M&T encoder. The less significant performance losses for our SA-LSTM (−0.4%) and Tree-LSTM (−0.5%) models show that the Syn-GCN is more sensitive to contextual information. Nevertheless, the overall results show that applying the deep encoder yields higher gains.

Syntactic Role
As mentioned before, syntactic parsers are unreliable due to the risk of erroneous syntactic input, especially on out-of-domain data. This section thus explores the impact of syntactic inputs of different quality on SRL performance. To this end, we carry out experiments on the English test data with different syntactic inputs based on our Syn-GCN model.
Table 6: Comparison of models with the deep encoder and the M&T encoder on the English test set.

Syntactic Input
Four types of syntactic input are used to explore the role of syntax in our unified framework: (1) the automatically predicted parses provided by the CoNLL-2009 shared task; (2) the parsing results on the CoNLL-2009 data from a state-of-the-art syntactic parser, the Biaffine Parser (used in our previous experiments); (3) the corresponding results from another parser, the BIST Parser (Kiperwasser and Goldberg, 2016), which has also been adopted in previous work; (4) the gold syntax available from the official data set.

Evaluation Metric
It is worth noting that for the SRL task, the standard evaluation metric is the semantic labeled F 1 score (Sem-F 1 ), while we use the labeled attachment score (LAS) to quantify the quality of the syntactic input. In addition, the ratio between the labeled F 1 score for semantic dependencies and the LAS for syntactic dependencies (Sem-F 1 /LAS), proposed by the CoNLL-2008 shared task 3 (Surdeanu et al., 2008), is also given for reference. To a certain extent, the Sem-F 1 /LAS ratio normalizes the semantic score relative to the syntactic parse, estimating the true performance of the SRL component independently of the performance of the input syntactic parser.
Table 7: Results on the English test set, in terms of labeled attachment score for syntactic dependencies (LAS), semantic precision (P), semantic recall (R), semantic labeled F 1 score (Sem-F 1 ), and the ratio Sem-F 1 /LAS. All numbers are in percent. A superscript * indicates LAS results from our personal communication with the authors.
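The two metrics combine as follows; the numbers in the usage lines are purely illustrative and are not taken from Table 7:

```python
def semantic_f1(p, r):
    """Semantic labeled F1: harmonic mean of precision and recall (percent)."""
    return 2.0 * p * r / (p + r)

def sem_f1_over_las(sem_f1, las):
    """Normalize Sem-F1 by the LAS of the syntactic input (as a percentage),
    estimating SRL strength independently of parser quality."""
    return 100.0 * sem_f1 / las

# Illustrative numbers only: a stronger parse (higher LAS) typically raises
# Sem-F1 while the Sem-F1/LAS ratio can still shrink.
f1 = semantic_f1(90.0, 89.0)
ratio = sem_f1_over_las(f1, 90.2)
```

A ratio near (or above) 100% indicates an SRL component that extracts nearly as much as the syntactic input quality would allow.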

Comparison and Discussion
Several observations can be drawn from Table 7. First, our model performs better with every type of syntactic input, obtaining overall higher scores than previous state-of-the-art systems. Second, it is interesting to note that the Sem-F 1 /LAS score of our model becomes relatively smaller as the syntactic input becomes better. Though not surprising, these results show that our SRL component is relatively strong. Third, when we adopt a syntactic parser with higher parsing accuracy, our SRL system achieves better performance. Notably, our model yields a Sem-F 1 of 90.5% when taking gold syntax as input. This suggests that a high-quality syntactic parse may indeed enhance SRL, which is consistent with the conclusion of He et al. (2017).

Related Work
Semantic role labeling, also known as shallow semantic parsing, was pioneered by Gildea and Jurafsky (2002). In early works on SRL, considerable attention was paid to feature engineering (Pradhan et al., 2005; Zhao and Kit, 2008; Zhao et al., 2009a,b,c; Li et al., 2009; Björkelund et al., 2009; Zhao et al., 2013). Along with the impressive success of deep neural networks (Cai and Zhao, 2016; Wang et al., 2016a,b), a series of neural SRL systems have been proposed. For instance, Foland and Martin (2015) presented a semantic role labeler using convolutional and time-domain neural networks. FitzGerald et al. (2015) exploited neural networks to jointly embed arguments and semantic roles, akin to the work of Lei et al. (2015), which induced a compact feature representation using a tensor-based approach.
Recently, people have attempted to build end-to-end systems for span SRL without syntactic input (Zhou and Xu, 2015; He et al., 2017; Tan et al., 2018). Similarly, a syntax-agnostic model for dependency SRL has also been proposed and obtained favorable results. Despite the success of syntax-agnostic models, several works focus on leveraging the advantages of syntax. Roth and Lapata (2016) employed dependency path embedding to model syntactic information and exhibited notable success. Marcheggiani and Titov (2017) leveraged the graph convolutional network to incorporate syntax into a neural SRL model. Qian et al. (2017) proposed SA-LSTM to model the whole tree structure of dependency relations through architecture engineering.
Besides, syntax encoding has also successfully promoted other NLP tasks. Tree-LSTM (Tai et al., 2015) is a variant of the standard LSTM that can encode a dependency tree with arbitrary branching factors, which has shown effectiveness on semantic relatedness and the sentiment classification tasks. In this work, we extend the Tree-LSTM with a relation specific gate and employ it to recursively encode the syntactic dependency tree for SRL. RCNN (Zhu et al., 2015) is an extension of the recursive neural network (Socher et al., 2010) which has been popularly used to encode trees with fixed branching factors. The RCNN is able to encode a tree structure with arbitrary number of factors and is useful in a re-ranking model for dependency parsing (Zhu et al., 2015).
In our experiments, we simplify and reformulate the RCNN model. However, the simplified model performs poorly on the development and the test sets. The reason might be that the RCNN model with a single global composition parameter is too simple to cover all types of syntactic relation in a dependency tree. Because of the poor performance of the modified RCNN, we do not include it in this work. Considering there might be other approach to incorporate the recursive network in SRL model, we leave it as our future work and just provide a brief discussion here.
In this work, we extend existing methods and introduce the Tree-LSTM for incorporating syntax into SRL. Rather than proposing a completely new model, we synthesize these techniques and present a unified framework to take genuine advantage of syntactic information.

Conclusion
This paper presents a unified neural framework for dependency-based SRL that effectively incorporates syntactic information by directly modeling the syntactic parse tree. Rather than proposing a completely new model, we extend existing models and apply the tree-structured LSTM to SRL. Our approach significantly outperforms all previous models, achieving state-of-the-art results on the CoNLL-2009 benchmarks for both English and Chinese.
Our experiments show that, given the enlarged performance gap from the syntax-agnostic to the syntax-aware setting, SRL can be further promoted by deep enhanced representations and effective methods of integrating syntax. Furthermore, we explore the impact of the quality of the syntactic input; the results indicate that a high-quality syntactic parse is indeed favorable to semantic role labeling.