Syntax for Semantic Role Labeling, To Be, Or Not To Be

Semantic role labeling (SRL) is dedicated to recognizing the predicate-argument structure of a sentence. Previous studies have shown that syntactic information contributes remarkably to SRL performance. However, this view has been challenged by a few recent neural SRL models that perform impressively without a syntactic backbone. This paper intends to quantify the importance of syntactic information to dependency SRL in a deep learning framework. We propose an enhanced argument labeling model accompanied by an extended k-order argument pruning algorithm for effectively exploiting syntactic information. Our model achieves state-of-the-art results on the CoNLL-2008 and CoNLL-2009 benchmarks for both English and Chinese, showing the quantitative significance of syntax to neural SRL together with a thorough empirical survey over existing models.


Introduction
Semantic role labeling (SRL) is a shallow semantic parsing task that aims to recognize the predicate-argument structure of each predicate in a sentence, such as who did what to whom, where and when. Specifically, we seek to identify arguments and label their semantic roles given a predicate. SRL is an important method for obtaining semantic information that benefits a wide range of natural language processing (NLP) tasks, including machine translation (Shi et al., 2016), question answering (Berant et al., 2013; Yih et al., 2016) and discourse relation sense classification (Mihaylov and Frank, 2016).
There are two formulations of semantic predicate-argument structures: one is based on constituents (i.e., phrases or spans), the other on dependencies. The latter, proposed by the CoNLL-2008 shared task (Surdeanu et al., 2008), is also called semantic dependency parsing and annotates the heads of arguments rather than phrasal arguments. Generally, SRL is decomposed into multi-step classification subtasks in pipeline systems, consisting of predicate identification and disambiguation, and argument identification and classification.
In prior work on SRL, considerable attention has been paid to feature engineering, which struggles to capture sufficient discriminative information, whereas neural network models are capable of extracting features automatically. In particular, syntactic information, including syntactic tree features, has been shown to be extremely beneficial to SRL since the large-scale empirical verification of Punyakanok et al. (2008). However, all such work had to take the risk of erroneous syntactic input, leading to unsatisfactory performance.
To alleviate the above issues, recent work proposed a simple but effective model for dependency SRL without syntactic input. It seems that neural SRL does not have to rely on syntactic features, contradicting the belief, held since as early as Gildea and Palmer (2002), that syntax is a necessary prerequisite for SRL. This dramatic contradiction motivates us to thoroughly explore the contribution of syntax to SRL. This paper focuses on semantic dependency parsing and formulates SRL as one or two sequence tagging tasks with predicate-specific encoding. With the help of the proposed k-order argument pruning algorithm over the syntactic tree, our model obtains state-of-the-art scores on the CoNLL benchmarks for both English and Chinese.
In order to quantitatively evaluate the contribution of syntax to SRL, we adopt as an evaluation metric the ratio between the labeled F1 score for semantic dependencies (Sem-F1) and the labeled attachment score (LAS) for syntactic dependencies, introduced by the CoNLL-2008 shared task (see footnote 1). Considering that different syntactic parsers provide syntactic inputs of widely varying quality, the ratio provides a fairer comparison between syntactically-driven SRL systems, which will be surveyed in our empirical study.

Model
To fully disclose the predicate-argument structure, typical SRL systems have to perform four subtasks step by step. Since the predicates in the CoNLL-2009 corpus (Hajič et al., 2009) have been pre-identified, we need to tackle the three other subtasks, which are formulated in this work as a two-step pipeline: predicate disambiguation and argument labeling. Namely, we perform argument identification and classification in one model. The argument structure for each known predicate is disclosed by our argument labeler over a sequence that includes possible arguments (candidates). There are two ways to determine the sequence: one is to simply input the entire sentence, as a syntax-agnostic SRL system does; the other is to select words around the predicate according to the syntactic parse tree, as most previous SRL systems did. The latter strategy usually works through a syntactic-tree-based argument pruning algorithm. We use the proposed k-order argument pruning algorithm (Section 2.1) to get a sequence $w = (w_1, \ldots, w_n)$ for each predicate. Then, we represent each word $w_i \in w$ as $x_i$ (Section 2.2). Finally, we obtain contextual features with a sequence encoder (Section 2.3). The overall role labeling model is depicted in Figure 1.

Argument Pruning
As pointed out by Punyakanok et al. (2008), syntactic information is most relevant in identifying the arguments, and the most crucial contribution of full parsing is in the pruning stage. In this paper, we propose a k-order argument pruning algorithm inspired by Zhao et al. (2009b). First of all, for a node $n$ and its descendant $n_d$ in a syntactic dependency tree, we define the order to be the distance between the two nodes, denoted as $D(n, n_d)$. Then we define the k-order descendants of a given node as those satisfying $D(n, n_d) = k$, and the k-order traversal as the traversal that visits each node from the given node to its descendant nodes within the k-th order. Note that this definition of k-order traversal differs somewhat from tree traversal in the usual terminology. A brief description of the proposed k-order pruning algorithm is as follows. Initially, we set a given predicate as the current node in a syntactic dependency tree. Then, we collect all its argument candidates by k-order traversal. Afterwards, we reset the current node to its syntactic head and repeat the previous step until the root of the tree is reached. Finally, we collect the root and stop. The k-order argument pruning algorithm is presented in detail in Algorithm 1. An example of a syntactic dependency tree for the sentence "She began to trade the art for money" is shown in Figure 2.

Footnote 1: CoNLL-2008 is an English-only task, while CoNLL-2009 extends it to a multilingual one. Their main difference is that predicates are indicated beforehand in the latter.
Algorithm 1: k-order argument pruning algorithm
Input: A predicate p, the root node r given a syntactic dependency tree T, the order k
Output: The set of argument candidates S
1: initialization: set p as current node c, c = p
2: for each descendant n_i of c in T do
   ...
   goto step 2
12: end if
13: return argument candidates set S

The main reasons for applying the extended k-order argument pruning algorithm are two-fold. First, the previous standard pruning algorithm may hurt argument coverage too much, even though arguments do tend to surround their predicate at a close distance. Because a sequence tagging model is applied, it can effectively handle the imbalanced distribution between arguments and non-arguments, which is hardly tackled by the early argument classification models that commonly adopt the standard pruning algorithm. Second, the extended pruning algorithm provides a better trade-off between computational cost and performance through careful tuning of k.
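The prose description of the pruning procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `k_order_prune`, the head-array encoding with an artificial ROOT at index 0, and the toy parse of the example sentence are all hypothetical, and the sketch assumes that the root is collected but its own descendants are not expanded.

```python
def k_order_prune(heads, predicate, k):
    """Collect argument candidates for one predicate.

    heads[i] is the index of the syntactic head of node i; node 0 is an
    artificial ROOT whose head is None (an assumed encoding).
    """
    if heads[predicate] is None:
        raise ValueError("the predicate must not be the root node")

    # Build child lists once so that descendants can be visited level by level.
    children = [[] for _ in heads]
    for node, head in enumerate(heads):
        if head is not None:
            children[head].append(node)

    candidates = set()
    current = predicate
    while True:
        # k-order traversal: descendants of `current` at distance 1..k.
        frontier = [current]
        for _ in range(k):
            frontier = [child for node in frontier for child in children[node]]
            candidates.update(frontier)
        # Move up to the syntactic head; stop once the root is reached.
        current = heads[current]
        if heads[current] is None:
            candidates.add(current)
            break
    return sorted(candidates)


# Assumed toy parse (not taken from Figure 2):
# 0 ROOT, 1 She, 2 began, 3 to, 4 trade, 5 the, 6 art, 7 for, 8 money
toy_heads = [None, 2, 0, 2, 3, 6, 4, 4, 7]
first_order = k_order_prune(toy_heads, predicate=4, k=1)
second_order = k_order_prune(toy_heads, predicate=4, k=2)
```

On this toy tree, raising k from 1 to 2 adds "the" and "money" to the candidates of "trade", which illustrates how a larger k trades pruning strength for coverage.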

Word Representation
We produce a predicate-specific word representation $x_i$ for each word $w_i$, where $i$ stands for the word position in an input sequence, following previous work. However, we differ by (1) leveraging a predicate-specific indicator embedding, (2) using a deeper refined representation, including character and dependency relation embeddings, and (3) applying recent advances in RNNs, such as highway connections (Srivastava et al., 2015).
In this work, the word representation $x_i$ is the concatenation of four types of features: predicate-specific, character-level, word-level and linguistic features. Unlike previous work, we leverage a predicate-specific indicator embedding $x_i^{ie}$ rather than directly using a binary flag of either 0 or 1. At the character level, we exploit a convolutional neural network (CNN) with a bidirectional LSTM (BiLSTM) to learn the character embedding $x_i^{ce}$. As shown in Figure 1, the representation calculated by the CNN is fed as input to the BiLSTM. At the word level, we use a randomly initialized word embedding $x_i^{re}$ and a pre-trained word embedding $x_i^{pe}$. For linguistic features, we employ a randomly initialized lemma embedding $x_i^{le}$ and a randomly initialized POS tag embedding $x_i^{pos}$. In order to incorporate more syntactic information, we adopt an additional feature, the dependency relation to the syntactic head, which is likewise a randomly initialized embedding $x_i^{de}$. The resulting word representation is the concatenation

$$x_i = [x_i^{ie}, x_i^{ce}, x_i^{re}, x_i^{pe}, x_i^{le}, x_i^{pos}, x_i^{de}].$$
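The concatenation of the seven embeddings named above can be sketched as follows. The feature names, the ordering, and the toy dimensions are assumptions made for illustration; the `use_dependency_relation` flag mirrors the variant without the dependency relation feature that is used for predicate disambiguation (Section 2.4).

```python
# Assumed feature names, listed in the order the embeddings are introduced in the text.
EMBEDDING_ORDER = [
    "indicator",        # predicate-specific indicator embedding
    "char",             # character-level embedding (CNN followed by BiLSTM)
    "word_random",      # randomly initialized word embedding
    "word_pretrained",  # pre-trained word embedding
    "lemma",            # lemma embedding
    "pos",              # POS tag embedding
    "deprel",           # dependency relation to the syntactic head
]


def build_word_representation(embeddings, use_dependency_relation=True):
    """Concatenate the per-word feature vectors into one flat vector."""
    vector = []
    for name in EMBEDDING_ORDER:
        if name == "deprel" and not use_dependency_relation:
            continue
        vector.extend(embeddings[name])
    return vector


# Toy vectors with made-up sizes, only to show the shapes involved.
toy = {name: [0.1] * size for name, size in zip(EMBEDDING_ORDER, [2, 3, 4, 4, 3, 2, 2])}
argument_input = build_word_representation(toy)
predicate_input = build_word_representation(toy, use_dependency_relation=False)
```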

Sequence Encoder
Since long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have shown significant representational effectiveness in NLP tasks, we use a BiLSTM as the sentence encoder. Given an input sequence $x = (x_1, \ldots, x_n)$, the BiLSTM processes the sequence in both forward and backward directions to obtain two separate hidden states for each word representation: $\overrightarrow{h}_i$, which handles data from $x_1$ to $x_i$, and $\overleftarrow{h}_i$, which handles data from $x_n$ to $x_i$. Finally, we get a contextual representation $h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$ by concatenating the states of the BiLSTM networks.
To get the final predicted semantic roles, we place a multi-layer perceptron (MLP) with highway connections on top of the BiLSTM networks, which takes as input the hidden representations $h_i$ of all time steps. The MLP consists of 10 layers with highway connections, and we employ ReLU activations for the hidden layers. Finally, we use a softmax layer over the outputs to maximize the likelihood of labels.
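A highway layer with a ReLU transform can be sketched in plain Python as below. The gating form, y = t * H(x) + (1 - t) * x with a sigmoid gate t, follows Srivastava et al. (2015); how exactly the layers are wired in this model is not spelled out in the text, so the function names, the weight layout, and the toy weights are assumptions.

```python
import math


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: a gated mix of a ReLU transform and the input itself."""
    transformed = [max(0.0, v) for v in affine(w_h, b_h, x)]              # H(x), ReLU
    gate = [1.0 / (1.0 + math.exp(-v)) for v in affine(w_t, b_t, x)]      # t, sigmoid
    return [t * h + (1.0 - t) * v for t, h, v in zip(gate, transformed, x)]


def highway_mlp(x, layers):
    """Apply a stack of highway layers; each entry is (w_h, b_h, w_t, b_t)."""
    for w_h, b_h, w_t, b_t in layers:
        x = highway_layer(x, w_h, b_h, w_t, b_t)
    return x


# Toy weights: identity transform and a gate fixed at 0.5 (zero gate weights and bias).
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_matrix = [[0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0]
one_layer = highway_layer([1.0, -2.0], identity, zero_bias, zero_matrix, zero_bias)
ten_layers = highway_mlp([1.0, -2.0], [(identity, zero_bias, zero_matrix, zero_bias)] * 10)
```

Because the input and output must have the same size for the gated mix, every layer in such a stack keeps the hidden dimension fixed.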

Predicate Disambiguation
Although predicates have already been identified in a given sentence, predicate disambiguation is an indispensable task, which aims to determine the predicate-argument structure for an identified predicate in a particular context. Here, we use the identical model (BiLSTM composed with MLP) for predicate disambiguation; the only difference is that we remove the syntactic dependency relation feature from the corresponding word representation (Section 2.2). Specifically, given a predicate p, the resulting word representation is $x_i = [x_i^{ie}, x_i^{ce}, x_i^{re}, x_i^{pe}, x_i^{le}, x_i^{pos}]$, i.e., the representation of Section 2.2 without $x_i^{de}$.

Experiments
Our model is evaluated on the CoNLL-2009 shared task for both English and Chinese datasets, following the standard training, development and test splits. The hyperparameters of our model were selected based on the development set and are summarized in Table 1. Note that the parameters of the predicate model are the same as those of the argument model. All real vectors are randomly initialized, and the pre-trained word embeddings for English are GloVe vectors (Pennington et al., 2014). For Chinese, we exploit Wikipedia documents to train Word2Vec embeddings (Mikolov et al., 2013). During training, we use categorical cross-entropy as the objective, with the Adam optimizer (Kingma and Ba, 2015). We train models for a maximum of 20 epochs and obtain a near-best model based on development results. For argument labeling, we preprocess the corpus with the k-order argument pruning algorithm. In addition, we use four CNN layers with a single-layer BiLSTM to induce character representations derived from sentences. For English, to further enhance the representation, we adopt the CNN-BiLSTM character embedding structure from the AllenNLP toolkit (Peters et al., 2018).

Preprocessing
During the pruning of argument candidates, we use the officially predicted syntactic parses provided by the CoNLL-2009 shared-task organizers for both English and Chinese. Figure 3 shows how coverage and reduction change with k on the English training set. According to our statistics, non-arguments are ten times more numerous than arguments, so the data distribution is fairly unbalanced. However, a proper pruning strategy can alleviate this problem. First-order pruning removes more than 50% of the candidates at the cost of missing 5.5% of the true ones on average, and second-order pruning removes about 40% of the candidates with nearly 2.0% loss. Third-order pruning achieves 99% coverage and reduces the corpus size by approximately one third.
It is worth noting that when k is larger than 19, full coverage of all argument candidates is reached on the English training set, which lets our high-order pruning algorithm degrade into a syntax-agnostic setting. In this work, we use tenth-order pruning to pursue the best performance.
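The two statistics discussed above can be computed as in the following sketch. The definitions are assumptions read off the text: coverage is taken to be the fraction of gold arguments that survive pruning, and reduction the fraction of candidate tokens that pruning removes. The function name and the toy data are hypothetical.

```python
def coverage_and_reduction(examples):
    """Aggregate pruning statistics over (length, gold_positions, kept_positions) triples."""
    total_tokens = total_kept = total_gold = total_covered = 0
    for length, gold, kept in examples:
        gold, kept = set(gold), set(kept)
        total_tokens += length
        total_kept += len(kept)
        total_gold += len(gold)
        total_covered += len(gold & kept)   # gold arguments still present after pruning
    if total_gold == 0 or total_tokens == 0:
        raise ValueError("need at least one token and one gold argument")
    coverage = total_covered / total_gold
    reduction = 1.0 - total_kept / total_tokens
    return coverage, reduction


# Two made-up predicates: sentence length, gold argument positions, positions kept by pruning.
toy_examples = [
    (8, {0, 5}, {0, 2, 5, 6}),
    (10, {1, 4, 7}, {1, 3, 4}),
]
coverage, reduction = coverage_and_reduction(toy_examples)
```

Sweeping k and plotting both values against it gives curves of the kind the text attributes to Figure 3.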

Results
Compared with ensemble models, our single model even provides better performance (+0.4% F1), and significantly surpasses all the remaining models. In the syntax-agnostic setting (without pruning and dependency relation embedding), we also reach a new state of the art, achieving a performance gain of 1% F1. On the out-of-domain (Brown) test set, we achieve new best results of 79.3% (syntax-aware) and 78.8% (syntax-agnostic) in F1 score. Moreover, our syntax-aware model performs better than the syntax-agnostic one. Table 4 presents the results on the Chinese test set. Even though we use the same parameters as for English, our model also outperforms the best reported results by 0.3% (syntax-aware) and 0.6% (syntax-agnostic) in F1 score.

Table 6: Ablation on development set. The "+" denotes a specific version over the basic model.

Analysis
To evaluate the contributions of key factors in our method, a series of ablation studies are performed on the English development set. In order to demonstrate the effectiveness of our k-order pruning algorithm, we report SRL performance excluding predicate senses in evaluation, which eliminates the performance gain from predicate disambiguation. Table 5 shows the results of our syntax-aware model with lower-order argument pruning. Compared to the best previous model, our system still yields an increase in recall of more than 1%, leading to improvements in F1 score. This demonstrates that refining syntactic-parse-tree-based candidate pruning does help argument recognition. Table 6 presents the performance of our syntax-agnostic SRL system with a basic configuration, which removes components including the indicator and character embeddings. Note that the first row gives the results of the BiLSTM alone (removing the MLP from the basic model), whose encoding is the same as in previous work. Experiments show that both enhanced representations improve over our basic model, and that our adopted labeling model is superior to the simple BiLSTM.

Figure 4 shows F1 scores for different k-order pruning settings together with our syntax-agnostic model. It indicates that first-order pruning, the lowest order, fails to give satisfactory performance, that the best-performing setting is a moderate k = 10, and that with the largest k our argument pruning falls back to the syntax-agnostic type. Meanwhile, performance drops much faster from the best k towards lower-order pruning than towards higher-order pruning up to the completely syntax-agnostic case. The proposed k-order pruning algorithm always works, even when it reaches the syntax-agnostic setting, which empirically explains why current syntax-aware and syntax-agnostic SRL models show little performance difference: maximum k-order pruning actually removes few words, just like a syntax-agnostic model.

End-to-end SRL
In this work, we consider an additional model that integrates predicate disambiguation and argument labeling into one sequence labeling model. In order to implement an end-to-end model, we introduce a virtual root (VR) for predicate disambiguation, similar to Zhao et al. (2013), who handled the entire SRL task as word pair classification. Concretely, we add a predicate sense feature to the input sequence by concatenating a VR. The word representation of the VR is randomly initialized during training. In Figure 5, we give an example sequence with the labels for the given sentence. We also report results of our end-to-end model on the CoNLL-2009 test set with syntax-aware and syntax-agnostic settings. As shown in Table 7, our end-to-end model yields slightly weaker performance than our pipeline. A reasonable account of this degradation is that the training data has completely different genre distributions over predicate senses and argument roles, which may make classification decisions somewhat confusing for the integrated model.

Figure 5: An example sequence with labels of end-to-end model (makes is the given predicate).
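The idea of attaching the predicate sense to a virtual root can be illustrated with the short sketch below. The position of the VR (prepended here), the token string `<VR>`, the function name, and the example sentence with its labels are all assumptions for illustration and are not taken from Figure 5.

```python
def add_virtual_root(words, role_labels, predicate_sense, vr_token="<VR>"):
    """Return one tagging sequence whose VR slot carries the predicate sense label."""
    if len(words) != len(role_labels):
        raise ValueError("each word needs exactly one role label")
    # The VR is tagged with the sense; every real word keeps its role label.
    return [vr_token] + list(words), [predicate_sense] + list(role_labels)


# Hypothetical sentence and labels; "_" marks a non-argument word.
tokens, labels = add_virtual_root(
    ["He", "makes", "toys"],
    ["A0", "_", "A1"],
    predicate_sense="01",
)
```

With this encoding a single tagger predicts the sense and the argument roles in one pass, at the price of mixing two label inventories in one output space.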

CoNLL-2008 SRL Setting
For a full SRL task, the predicate identification subtask is also indispensable, which has been included in CoNLL-2008 shared task. We thus evaluate our model in terms of data and setting of the CoNLL-2008 benchmark (WSJ).
To identify predicates, we train the BiLSTM-MLP sequence labeling model with the same parameters as in Section 2.4 to tackle the predicate identification and disambiguation subtasks in one shot; the only difference is that we remove the predicate-specific indicator feature. The F1 score of our predicate labeling model is 90.53% on in-domain (WSJ) data. Compared with the best reported results, we observe an absolute improvement in semantic F1 of 0.8% (Table 8). Note that once predicate identification is introduced, the same model shows about 6% performance loss in both the syntax-agnostic and the syntax-aware case, which indicates that predicate identification should be handled carefully, as it is essential in a complete practical SRL system.

Syntactic Contribution
Syntactic information plays an informative role in semantic role labeling. However, few studies have quantitatively evaluated the syntactic contribution to SRL. Furthermore, we observe that most of the neural SRL systems compared above took the syntactic parser of Björkelund et al. (2010) as syntactic input instead of the one from the CoNLL-2009 shared task, which adopted a much weaker syntactic parser. In particular, one of them adopted an external syntactic parser with even higher parsing accuracy. In contrast, our SRL model is based on the automatically predicted parse with moderate performance provided by the CoNLL-2009 shared task, but outperforms their models. This section thus attempts to explore how much syntax contributes to dependency-based SRL in a deep learning framework and how to effectively evaluate the relative performance of syntax-based SRL. To this end, we conduct experiments for empirical analysis with different syntactic inputs.

System                         LAS     Sem-F1
Johansson and Nugues (2008)    90.13   81.75
Zhao and Kit (2008)            87.52   77.67
Zhao et al. (2009b)            88.
Syntactic Input. In order to obtain different syntactic inputs, we design a faulty syntactic tree generator (referred to as STG hereafter), which is able to produce random errors in the output parse tree as a true parser does. To simplify implementation, we construct a new syntactic tree based on the gold standard parse tree. Given an input error probability distribution estimated from a true parser output, our algorithm, presented in Algorithm 2, stochastically modifies the syntactic heads of nodes on the premise of a valid tree.

Algorithm 2
    ...
    end if
14: end for
15: return the new generative tree NT

Evaluation Measure. For the SRL task, the primary evaluation measure is the semantic labeled F1 score. However, the score is influenced by the quality of the syntactic input to some extent, so it does not faithfully reflect the competence of a syntax-based SRL system. Namely, it is not the outcome of a true and fair quantitative comparison for these types of SRL models. To normalize the semantic score relative to the syntactic parse, we take into account an additional evaluation measure to estimate the actual overall performance of SRL. Here, we use the ratio between the labeled F1 score for semantic dependencies (Sem-F1) and the labeled attachment score (LAS) for syntactic dependencies, proposed by Surdeanu et al. (2008), as the evaluation metric (see footnote 6). The benefits of this measure are twofold: quantitatively evaluating the syntactic contribution to SRL, and impartially estimating the true performance of SRL independent of the performance of the input syntactic parser. Table 9 reports the performance of existing models (see footnote 7) in terms of the Sem-F1/LAS ratio on the CoNLL-2009 English test set.

Table 9: Results on English test set, in terms of labeled attachment score for syntactic dependencies (LAS), semantic precision (P), semantic recall (R), semantic labeled F1 score (Sem-F1), and the ratio Sem-F1/LAS. Columns: System, LAS (%), P (%), R (%), Sem-F1 (%), Sem-F1/LAS (%).
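As a small worked example of the ratio, the sketch below applies it to the LAS and Sem-F1 values quoted earlier in this section for Johansson and Nugues (2008) and Zhao and Kit (2008). The function name is hypothetical, and expressing the ratio as a percentage is an assumption based on the "Sem-F1/LAS (%)" column of Table 9.

```python
def sem_f1_las_ratio(sem_f1, las):
    """Return Sem-F1 / LAS as a percentage; both inputs are percentages."""
    if las <= 0:
        raise ValueError("LAS must be positive")
    return 100.0 * sem_f1 / las


# (LAS, Sem-F1) pairs as quoted in the text.
systems = {
    "Johansson and Nugues (2008)": (90.13, 81.75),
    "Zhao and Kit (2008)": (87.52, 77.67),
}
ratios = {name: sem_f1_las_ratio(sem_f1, las) for name, (las, sem_f1) in systems.items()}
for name, ratio in ratios.items():
    print(f"{name}: Sem-F1/LAS = {ratio:.2f}%")
```

A system with a weaker parser but a stronger SRL component can score a higher ratio than one with a better parser, which is the property the measure is meant to expose.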
Interestingly, even though the syntactic component of our system scores significantly lower than the others, by 3.8% LAS, we obtain the highest results on both Sem-F1 and the Sem-F1/LAS ratio. These results show that our SRL component is relatively much stronger. Moreover, the ratio comparison in Table 9 also shows that since the CoNLL-2009 shared task, most SRL work has actually benefited from an enhanced syntactic component rather than from an improved SRL component itself. None of the post-CoNLL SRL systems, whether traditional or neural, exceeded the top systems of the CoNLL-2009 shared task, Zhao et al. (2009c) (SRL-only track using the provided predicted syntax) and Zhao et al. (2009a) (joint track using a self-developed parser). We believe that this work reports, for the first time since the CoNLL-2009 shared task, both a higher Sem-F1 and a higher Sem-F1/LAS ratio.

Figure 6: The Sem-F1 scores of our models with different quality of syntactic inputs vs. GCNs on test set. Legend: 1st-order SRL, 10th-order SRL, GCNs.

Footnote 6: The idea of the ratio score in Surdeanu et al. (2008) actually came from an author of this paper, Hai Zhao, as indicated in the acknowledgement part of Surdeanu et al. (2008).

Footnote 7: Note that several SRL systems that do not provide syntactic information are not listed in the table.
We also run our first-order and tenth-order pruning models with different erroneous syntactic inputs generated by the STG and evaluate their performance using the Sem-F1/LAS ratio. Figure 6 shows Sem-F1 scores at different qualities of syntactic parse input on the English test set, whose LAS varies from 85% to 100%. Compared to the previous state of the art, our tenth-order pruning model gives quite stable SRL performance even though the syntactic input quality varies over a broad range, while our first-order pruning model yields overall lower results (1-5% F1 drop), owing to missing too many true arguments. These results show that high-quality syntactic parses may indeed enhance dependency SRL. Furthermore, they indicate that our model, given a syntactic input as accurate as that of Marcheggiani and Titov (2017), namely 90% LAS, will give a Sem-F1 exceeding 90% for the first time in the research timeline of semantic role labeling.
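The faulty syntactic tree generator used for these inputs can be approximated by the following sketch. It is not Algorithm 2: a single per-node error probability stands in for the error distribution estimated from a real parser, dependency labels are left untouched, and the function names and the toy tree are hypothetical. What it keeps is the stated constraint that head modifications must leave a valid tree.

```python
import random


def is_in_subtree(heads, node, subtree_root):
    """True if `subtree_root` is `node` itself or one of its ancestors."""
    while node is not None:
        if node == subtree_root:
            return True
        node = heads[node]
    return False


def perturb_heads(heads, error_prob, rng):
    """Randomly reattach nodes of a gold tree without ever creating a cycle.

    heads[i] is the head of node i; node 0 is an artificial ROOT with head None.
    """
    new_heads = list(heads)
    for node in range(1, len(new_heads)):
        if rng.random() >= error_prob:
            continue
        # A new head is valid if it differs from the current head and does not
        # lie inside the subtree of `node`, so the result stays a tree.
        options = [
            head for head in range(len(new_heads))
            if head != new_heads[node] and not is_in_subtree(new_heads, head, node)
        ]
        if options:
            new_heads[node] = rng.choice(options)
    return new_heads


# Assumed toy gold tree: 0 ROOT, 1 She, 2 began, 3 to, 4 trade, 5 the, 6 art, 7 for, 8 money
gold_heads = [None, 2, 0, 2, 3, 6, 4, 4, 7]
rng = random.Random(0)
unchanged = perturb_heads(gold_heads, 0.0, rng)
noisy = perturb_heads(gold_heads, 1.0, rng)
```

Lowering `error_prob` yields trees closer to the gold parse, which is one way to sweep the attachment accuracy of the input over a range such as the one shown in Figure 6.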

Related Work
Semantic role labeling was pioneered by Gildea and Jurafsky (2002). Most traditional SRL models rely heavily on feature templates (Pradhan et al., 2005; Zhao et al., 2009b; Björkelund et al., 2009). Among them, Pradhan et al. (2005) combined features derived from different syntactic parses based on an SVM classifier, while Zhao et al. (2009b) presented an integrative approach for dependency SRL using a greedy feature selection algorithm. Later, Collobert et al. (2011) proposed a convolutional neural network model that induces word embeddings as a substitute for hand-crafted features, which was a breakthrough for the SRL task.
With the impressive success of deep neural networks in various NLP tasks (Zhang et al., 2016; Qin et al., 2017; Cai et al., 2017), a series of neural SRL systems have been proposed. Foland and Martin (2015) presented a dependency semantic role labeler using convolutional and time-domain neural networks, while FitzGerald et al. (2015) exploited a neural network to jointly embed arguments and semantic roles, akin to the work of Lei et al. (2015), which induced a compact feature representation by applying a tensor-based approach. Recently, researchers have considered multiple ways to effectively integrate syntax into SRL learning. Roth and Lapata (2016) introduced dependency path embeddings to model syntactic information and achieved notable success. Other work leveraged graph convolutional networks to incorporate syntax into neural models. In contrast, a syntax-agnostic model using effective word representations was proposed for dependency SRL, which for the first time achieved performance comparable to state-of-the-art syntax-aware SRL models.
However, most neural SRL work pays little attention to the impact of the input syntactic parse on the resulting SRL performance. This work thus does more than propose a high-performance SRL model: it reviews the highlights of previous models and presents an effective syntactic-tree-based argument pruning method. Our work is also closely related to Punyakanok et al. (2008) and He et al. (2017). Using traditional methods, Punyakanok et al. (2008) investigated the significance of syntax to SRL systems and showed that syntactic information is most crucial in the pruning stage. He et al. (2017) presented an extensive error analysis of a deep learning model for span SRL, including a discussion of how a constituent syntactic parser could be used to improve SRL performance.

Conclusion and Future Work
This paper presents a simple and effective neural model for dependency-based SRL, incorporating syntactic information through the proposed extended k-order pruning algorithm. With a large enough setting of k, our pruning algorithm results in a syntax-agnostic setting for the argument labeling model, which smoothly unifies syntax-aware and syntax-agnostic SRL in a consistent way. Experimental results show that with the help of deep enhanced representations, our model outperforms the previous state-of-the-art models in both syntax-aware and syntax-agnostic situations.
In addition, we consider the Sem-F1/LAS ratio as a means of evaluating the contribution of syntax to SRL and the true performance of SRL independent of the quality of the syntactic parser. Though we again confirm the importance of syntax to SRL with empirical experiments, we are aware that since Pradhan et al. (2005), the gap between syntax-aware and syntax-agnostic SRL has been greatly reduced, from as high as 10% to only 1-2% performance loss in this work. However, we may never reach a satisfying conclusion: whenever someone proposes a syntax-agnostic SRL system that outperforms all syntax-aware ones of its time, there will always come the argument that creative new methods for effectively exploiting the syntactic input have never been fully explored.