Semantic Role Labeling with Heterogeneous Syntactic Knowledge

Recently, due to the interplay between syntax and semantics, incorporating syntactic knowledge into neural semantic role labeling (SRL) has received much attention. Most previous syntax-aware SRL works focus on explicitly modeling homogeneous syntactic knowledge over tree outputs. In this work, we propose to encode heterogeneous syntactic knowledge for SRL from both explicit and implicit representations. First, we introduce graph convolutional networks to explicitly encode multiple heterogeneous dependency parse trees. Second, we extract implicit syntactic representations from a syntactic parser trained with heterogeneous treebanks. Finally, we inject the two types of heterogeneous syntax-aware representations into the base SRL model as extra inputs. We conduct experiments on two widely used benchmark datasets, i.e., Chinese Proposition Bank 1.0 and the English CoNLL-2005 dataset. Experimental results show that incorporating heterogeneous syntactic knowledge brings significant improvements over strong baselines. We further conduct detailed analyses to gain insights into the usefulness of heterogeneous (vs. homogeneous) syntactic knowledge and the effectiveness of our proposed approaches for modeling such knowledge.


Introduction
Semantic role labeling (SRL) is a fundamental task in natural language processing (NLP), which aims to find the predicate-argument structures (who did what to whom, when, where, etc.) in a sentence (see Figure 1 for an example). Recent SRL works can mostly be divided into two categories, i.e., syntax-aware (Roth and Lapata, 2016; He et al., 2018b; Strubell et al., 2018) and syntax-agnostic (He et al., 2017; He et al., 2018a) approaches, according to whether they incorporate syntactic knowledge.
Most syntax-agnostic works employ a deep BiLSTM or self-attention encoder to encode the contextual information of natural sentences, with various kinds of scorers to predict the probabilities of BIO-based semantic roles (He et al., 2017; Tan et al., 2018) or predicate-argument-role tuples (He et al., 2018a). Motivated by the strong interplay between syntax and semantics, researchers have explored various approaches to integrate syntactic knowledge into syntax-agnostic models. Roth and Lapata (2016) propose to use dependency-based embeddings in a neural SRL model for dependency-based SRL. He et al. (2018b) introduce a k-order pruning algorithm to prune arguments according to dependency trees. However, previous syntax-aware works mainly employ singleton (homogeneous) automatic dependency trees, which are generated by a syntactic parser trained on a single syntactic treebank, such as the Penn Treebank (PTB) (Marcus et al., 1994).
Our work follows the syntax-aware approach and enhances SRL with heterogeneous syntactic knowledge. We define heterogeneous syntactic treebanks as treebanks that follow different annotation guidelines. As is well known, there exist many published dependency treebanks that follow different annotation guidelines, e.g., the English PTB (Marcus et al., 1994), Universal Dependencies (UD) (Silveira et al., 2014), the Penn Chinese Treebank (PCTB) (Xue et al., 2005), the Chinese Dependency Treebank (CDT) (Che et al., 2012), and so on. These dependency treebanks contain high-quality dependency trees and provide rich syntactic knowledge. Due to their different construction purposes, these treebanks have different annotation emphases and data domains. For example, Xue et al. (2005) mainly follow the annotation guideline of PTB to annotate the PCTB treebank on news data, while Che et al. (2012) use a different annotation guideline with fewer syntactic labels on news and story data. Figure 2 shows an example of automatic heterogeneous trees, where several dependencies differ between the two trees. The word "traveling" is the conjunction modifier of "to" in the PCTB tree, while it is the attribute modifier of "tourists" in the CDT tree. We think both dependencies are grammatically reasonable. Thus, we believe that such heterogeneous syntactic treebanks provide more valid information than any single homogeneous treebank.
In this work, we propose two types of methods, explicit and implicit, to take advantage of heterogeneous syntactic knowledge, which we believe are highly complementary. Our baseline model follows the architecture of He et al. (2018a). On top of it, we inject heterogeneous syntactic knowledge into the base model using the two proposed methods. For the explicit method, we encode the heterogeneous automatic dependency trees with the recently popular graph convolutional networks (GCN) (Kipf and Welling, 2016). For the implicit method, inspired by the powerful representations from pre-trained language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), we introduce a method to extract implicit syntactic representations from a dependency parser trained with heterogeneous syntactic treebanks. It is well known that the main reason for the success of pre-trained language model representations is the use of large amounts of natural text. However, large amounts of syntactic data are difficult to obtain and costly to annotate. Therefore, making full use of existing heterogeneous data is the most feasible and natural idea. Intuitively, the explicit method models the syntactic structure of a sentence, providing valuable syntactic position information, while the implicit method aims to capture the syntactic representation of each word as a vector. These two methods carry different types of syntactic knowledge and are thus highly complementary.
To verify the effectiveness of injecting heterogeneous syntactic knowledge, we conduct experiments on the widely used Chinese and English SRL benchmarks, i.e., Chinese Proposition Bank 1.0 and English CoNLL-2005. Our contributions are listed as follows:
• To the best of our knowledge, we are the first to utilize heterogeneous syntactic knowledge to help neural semantic role labeling.
• We introduce two kinds of methods that effectively encode heterogeneous syntactic knowledge for SRL, and achieve significant improvements over strong baselines.
• Detailed analyses clearly show that integrating heterogeneous syntactic knowledge outperforms homogeneous syntactic knowledge, and also demonstrate the effectiveness of our methods for encoding heterogeneous syntactic knowledge.

Figure 3: Overview of our model. The middle component is our basic SRL model, the left is the ExpHDP module, and the right is the ImpHDP component. Our model concatenates the ExpHDP representation and the ImpHDP representation with the basic SRL input, as shown by the red and blue dashed lines. For clarity, we only show the ExpHDP and ImpHDP representation flow of the word "guitar".

Base Model
Following He et al. (2018a), we treat SRL as a predicate-argument-role tuple identification task. Formally, given a sentence $S = w_1, w_2, \ldots, w_n$, we denote the candidate predicates as $P = \{w_1, w_2, \ldots, w_n\}$, the candidate arguments as $A = \{(w_i, \ldots, w_j) \mid 1 \le i \le j \le n\}$, and the semantic roles as $R$. The goal is to predict a set of predicate-argument-role tuples $Y \subseteq P \times A \times R$.
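For concreteness, the following minimal Python sketch enumerates this candidate space (an illustration under our own naming, not the released code): every word is a candidate predicate, and every span with $1 \le i \le j \le n$ is a candidate argument.

```python
# Minimal sketch of the tuple-based SRL candidate space (illustration only).
def candidate_space(n, roles):
    predicates = list(range(n))                                   # every word
    arguments = [(i, j) for i in range(n) for j in range(i, n)]   # all spans
    return predicates, arguments, roles  # Y is scored over P x A x R

preds, args, _ = candidate_space(4, ["A0", "A1", "AM-TMP"])
print(len(preds), len(args))  # 4 candidate predicates, 10 candidate spans
```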
We basically use the framework of He et al. (2018a) as our baseline model. In general, the model consists of four modules, i.e., an input layer, a BiLSTM encoder layer, a predicate and argument representation layer, and an MLP scorer layer. The middle component of Figure 3 shows the architecture of the baseline model. In the following, we briefly introduce each module.
Input layer. The input of the $i$-th word in $S$ is composed of a fixed word embedding and a fine-tuned character representation. Formally, $x_i = \mathrm{emb}^{word}_{w_i} \oplus \mathrm{rep}^{char}_{w_i}$, where $\oplus$ is the concatenation operation. The character representation $\mathrm{rep}^{char}_{w_i}$ is generated by CNNs over the characters of the $i$-th word.

BiLSTM encoder layer. The baseline model employs a three-layer BiLSTM as the encoder, enhanced with highway connections. We denote the $i$-th output of the BiLSTM as $h_i$.
Predicate and argument representation layer. The model directly treats the output hidden states from the top BiLSTM layer as the representations of candidate predicates. The representation of the $k$-th word as a candidate predicate is denoted as $r^p_k = h_k$. The representation of a candidate argument is composed of four parts: 1) the BiLSTM output of the first word of the argument, 2) the BiLSTM output of the last word of the argument, 3) an embedding indicating the length of the argument, and 4) a softmax-weighted summation of the BiLSTM hidden outputs within the candidate argument, where the softmax weights are computed by an attention mechanism over the words in the argument span. Formally, for a candidate argument from the $i$-th word to the $j$-th word, its representation is defined as $r^a_{i,j} = h_i \oplus h_j \oplus \mathrm{emb}^{len}_{j-i+1} \oplus e_{i,j}$, where $e_{i,j}$ is the softmax-weighted summation. We use $p$ and $a$ as abbreviations for predicate and argument.
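A hedged PyTorch sketch of this four-part span representation follows; the module and parameter names (ArgumentRep, attn_scorer, len_emb) are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn

class ArgumentRep(nn.Module):
    """Sketch of r^a_{i,j} = h_i + h_j + length embedding + soft head word."""
    def __init__(self, hidden_dim: int, max_len: int, len_dim: int):
        super().__init__()
        self.len_emb = nn.Embedding(max_len, len_dim)   # argument length feature
        self.attn_scorer = nn.Linear(hidden_dim, 1)     # per-word attention score

    def forward(self, h: torch.Tensor, i: int, j: int) -> torch.Tensor:
        # h: BiLSTM outputs of shape (n, hidden_dim); span covers words i..j inclusive
        span = h[i : j + 1]
        alpha = torch.softmax(self.attn_scorer(span).squeeze(-1), dim=0)
        e_ij = (alpha.unsqueeze(-1) * span).sum(dim=0)  # softmax-weighted summation
        length = self.len_emb(torch.tensor(j - i))      # embedding of span length
        return torch.cat([h[i], h[j], length, e_ij], dim=-1)
```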
MLP scorer layer. Three MLP scorers are employed to compute the scores of the candidate predicates, the candidate arguments, and the semantic roles between predicted predicates and arguments, respectively.
The objective is to find the highest-scoring semantic structure:
$$\hat{Y} = \operatorname*{arg\,max}_{Y \subseteq P \times A \times R} \sum_{(p,a,r) \in Y} s(p, a, r; \theta),$$
where $\theta$ represents the model parameters, and $s(p, a, r) = s_p(p) + s_a(a) + s_r(p, a)$ is the score of a candidate predicate-argument-role tuple, with $s_r(p, a)$ being the score of role $r$ for the pair $(p, a)$.
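The factorized score can be sketched in PyTorch as below; the single-hidden-layer MLPs and all dimension names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class TupleScorer(nn.Module):
    """Sketch of s(p, a, r) = s_p(p) + s_a(a) + s_r(p, a); sizes are illustrative."""
    def __init__(self, p_dim, a_dim, hidden, num_roles):
        super().__init__()
        self.s_p = mlp(p_dim, hidden, 1)                  # predicate score
        self.s_a = mlp(a_dim, hidden, 1)                  # argument score
        self.s_r = mlp(p_dim + a_dim, hidden, num_roles)  # one score per role

    def forward(self, r_p, r_a):
        # the scalar predicate/argument scores broadcast over all roles
        return self.s_p(r_p) + self.s_a(r_a) + self.s_r(torch.cat([r_p, r_a], dim=-1))
```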

Syntactic Knowledge
In this section, we introduce the proposed methods for extracting syntactic knowledge from heterogeneous dependency treebanks. We first introduce the methods for encoding singleton dependency trees and then describe in detail the variations that extract heterogeneous syntactic knowledge.

Syntactic Representation
We employ two different methods to fully encode homogeneous syntactic trees: a GCN that encodes the syntactic structures, and implicit representations that encode the syntactic features.
Explicit Method. Graph convolutional networks (GCN) are neural networks that operate on graph structures and have been explored in many NLP tasks (Guo et al., 2019; Zhang et al., 2020). Formally, we denote an undirected graph as $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. The GCN computation for node $v \in V$ at the $l$-th layer is defined as
$$h^l_v = \rho\Big(\sum_{u \in N(v)} \big(W^l h^{l-1}_u + b^l\big)\Big),$$
where $W^l \in \mathbb{R}^{m \times m}$ is the weight matrix, $b^l \in \mathbb{R}^m$ is the bias term, $N(v)$ is the set of all one-hop neighbour nodes of $v$, and $\rho$ is an activation function. In particular, $h^0_u \in \mathbb{R}^m$ is the initial input representation, and $m$ is the representation dimension. In our work, we employ a 1-layer BiLSTM encoder over the input layer and treat the BiLSTM outputs as the input of the GCN module, as depicted in the left component of Figure 3. We enhance the basic GCN module with dense connections (Huang et al., 2017; Guo et al., 2019). The key idea is that node $v$ at the $l$-th layer takes as input the concatenation of $h^{l-1}_u$ and all the representations from previous layers. Formally, the input representation at the $l$-th layer is defined as
$$r^l_u = h^0_u \oplus h^1_u \oplus \cdots \oplus h^{l-1}_u.$$
The GCN computation at the $l$-th layer is then modified as
$$h^l_v = \rho\Big(\sum_{u \in N(v)} \big(W^l r^l_u + b^l\big)\Big),$$
where the weight matrix $W^l$ increases its column dimension by $d_{hidden}$ per layer, i.e., $W^l \in \mathbb{R}^{d_{hidden} \times (m + (l-1)\, d_{hidden})}$.

Implicit Method. Recently popular pre-trained language model embeddings (such as ELMo and BERT) have received much attention. These language models are trained on large amounts of natural text and can produce powerful implicit representations, whose effectiveness has been shown in many NLP tasks. Inspired by these pre-trained language model representations and by previous works on syntactic representations (Yu et al., 2018; Xia et al., 2019a), we train a syntactic parser and extract similar implicit syntactic representations for SRL. We choose the state-of-the-art BiAffine parser (Dozat and Manning, 2017) as our basic dependency parser module. Concisely, the BiAffine parser consists of an input layer, a BiLSTM encoder layer, and a BiAffine scorer layer, as shown by the right component of Figure 3. We extract the hidden outputs from the 3-layer BiLSTM encoder of the dependency parser module and compute a softmax-weighted summation of the outputs as the implicit syntactic representations.
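A simplified PyTorch sketch of such a densely connected GCN stack is given below, assuming a symmetric adjacency matrix as input; it is a minimal reading of the equations above, not the released implementation.

```python
import torch
import torch.nn as nn

class DenseGCN(nn.Module):
    """Sketch of a GCN stack with dense connections over a dependency graph."""
    def __init__(self, in_dim: int, d_hidden: int, num_layers: int):
        super().__init__()
        # layer l consumes the input plus all previous outputs, so the weight
        # matrix grows by d_hidden input columns per layer
        self.layers = nn.ModuleList(
            nn.Linear(in_dim + l * d_hidden, d_hidden) for l in range(num_layers)
        )

    def forward(self, h0: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h0: (n, in_dim) BiLSTM outputs; adj: (n, n) adjacency / edge weights
        states = [h0]
        for layer in self.layers:
            r = torch.cat(states, dim=-1)    # dense connections: r^l_u
            h = torch.relu(adj @ layer(r))   # aggregate one-hop neighbours
            states.append(h)
        return states[-1]
```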

Heterogeneous Dependency Parsing (HDP)
ExpHDP. In order to encode the automatic heterogeneous syntactic trees, we extend the basic GCN module to the heterogeneous scenario with two techniques. First, we propose to use the prior probability of two neighbouring nodes as the edge weight, namely the probability-based GCN (P-GCN). Specifically, the probability between nodes $v$ and $u$ in an automatic tree is the softmax score of node $u$ as $v$'s parent node. Our preliminary experiment shows that this modification yields a slight improvement of +0.2 F1 on the CPB1.0 test data. Second, we propose to sum the prior probabilities between each node pair over the heterogeneous syntactic trees of a sentence, which we call the explicit heterogeneous dependency parsing (ExpHDP) method. Intuitively, this approach combines the different syntactic structures and reinforces the common node pairs, such as verb-subject and verb-object, making the graph denser. Finally, the ExpHDP computation of node $v$ at the $l$-th layer becomes
$$h^l_v = \rho\Big(\sum_{T \in HTs}\sum_{u \in N_T(v)} P^T_{uv}\big(W^l r^l_u + b^l\big)\Big),$$
where $HTs$ is the set of automatic heterogeneous syntactic trees of the sentence, $N_T(v)$ is the neighbourhood of $v$ in tree $T$, and $P^T_{uv}$ represents the prior probability between nodes $u$ and $v$ in tree $T$. We treat the outputs of ExpHDP as the explicit representations and concatenate them with the input representations of the SRL module to enhance the basic SRL model, as demonstrated by the left two components in Figure 3.

ImpHDP. We cannot directly train a dependency parser on heterogeneous syntactic treebanks because of their different annotation guidelines. To solve this problem, we adapt the vanilla BiAffine parser into a heterogeneous dependency parser by adding one BiAffine scorer module per heterogeneous treebank, as shown by the rightmost part of Figure 3. Thus, the shared input and BiLSTM layers can learn more knowledge by training on heterogeneous dependency treebanks. Afterwards, we extract the hidden outputs from the 3-layer BiLSTM encoder of the dependency parser module and compute a softmax-weighted summation of the outputs as the implicit syntactic representations, as depicted by the orange dashed lines in the ImpHDP module in Figure 3; we call this the ImpHDP (implicit heterogeneous dependency parsing) method. Formally, the implicit syntactic representation of the $i$-th word is $h^{isr}_i = \sum_{j=1}^{N} \alpha_j h^{dep}_{j,i}$, where $N$ is the number of BiLSTM layers in the dependency parser encoder, $\alpha_j$ is the softmax weight, and $h^{dep}_{j,i}$ is the $i$-th output hidden state of the $j$-th BiLSTM layer. Considering the relatively small size of the syntactic data, the representation ability of ImpHDP alone may not be strong enough, so we perform multi-task learning between SRL and dependency parsing; the workflow is shown by the right two parts of Figure 3. Please note that the losses of SRL and dependency parsing are not accumulated: our model back-propagates and updates the gradients once a batch of SRL data or dependency data completes the forward pass.
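The two HDP variants can be sketched as follows, under the assumption that each parser exposes an $(n \times n)$ head-probability matrix and that the parser encoder's per-layer outputs are available; names and the symmetrization step are illustrative, not from the released code.

```python
import torch
import torch.nn as nn

# ExpHDP: sum the parsers' prior probabilities over all heterogeneous trees;
# the result serves as the (P-)GCN adjacency / edge weights.
def exphdp_adjacency(prob_matrices):
    # prob_matrices: one (n, n) tensor per treebank, where entry (v, u) is the
    # softmax probability that u is v's head in that parser's output
    adj = torch.stack(prob_matrices).sum(dim=0)
    return adj + adj.t()  # assumption: symmetrize so information flows both ways

# ImpHDP: softmax-weighted summation over the N BiLSTM layers of the shared
# heterogeneous dependency parser encoder.
class ImpHDPMix(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))  # one logit per layer

    def forward(self, layer_outputs):
        # layer_outputs: list of N tensors of shape (n, d), one per BiLSTM layer
        alpha = torch.softmax(self.w, dim=0)
        return sum(a * h for a, h in zip(alpha, layer_outputs))
```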

Hybrid HDP
Our model combines the two representations, in line with our intuition that explicit and implicit syntactic representations are highly complementary; we denote this as "HybridHDP" (hybrid heterogeneous dependency parsing) in later sections. In detail, we concatenate the two heterogeneous syntactic representations with the SRL input, formulated as $x'_i = x_i \oplus h^{exp}_i \oplus h^{isr}_i$, where $h^{exp}_i$ is the ExpHDP output of the $i$-th word.

Experimental Setup
We conduct experiments on the commonly used Chinese Proposition Bank 1.0 (CPB1.0) (Xue, 2008) and English CoNLL-2005 (Carreras and Màrquez, 2005) benchmarks. We implement our methods and the baseline model with PyTorch, and our code, configurations, and models are released at https://github.com/KiroSummer/HDP-SRL.

Heterogeneous Dependency Treebanks. We employ the PCTB7 and CDT as the heterogeneous dependency treebanks for Chinese, and the PTB and UD dependency treebanks for English. We train BiAffine parsers (Dozat and Manning, 2017) to generate automatic dependency trees for ExpHDP, which achieve 89.12% and 89.72% UAS on the development data of the Chinese PCTB7 and CDT datasets, and 95.73% and 90.14% UAS on the development data of the English PTB and UD datasets, respectively. In addition, we use 5-way jackknifing to obtain automatic dependency trees for the training data of CPB1.0 and CoNLL-2005 from parsers trained on PCTB7 and PTB, respectively.

RoBERTa Representations. We employ pre-trained language model embeddings from RoBERTa (Liu et al., 2019) to boost the performance of both Chinese and English SRL. In detail, we average the fixed subword-level RoBERTa representations of each word as its word-level representation in each of the last four encoder layers. Then, we employ a softmax-weighted operation to sum the four word-level vectors into the final RoBERTa representation for each word in a sentence. Please note that we only use the RoBERTa representations in the SRL module.
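A hedged sketch of this subword-to-word conversion and layer mixing is shown below; the alignment format (one subword index range per word) is our assumption.

```python
import torch
import torch.nn as nn

class RobertaWordRep(nn.Module):
    """Sketch: average subwords per word in each of the last four RoBERTa
    layers, then mix the four layers with softmax weights."""
    def __init__(self, num_mixed_layers: int = 4):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_mixed_layers))

    def forward(self, hidden_states, word_spans):
        # hidden_states: tuple of (num_subwords, d) tensors, last four layers
        # word_spans: [(start, end), ...] subword range (end exclusive) per word
        per_layer = [
            torch.stack([h[s:e].mean(dim=0) for s, e in word_spans])  # (n_words, d)
            for h in hidden_states
        ]
        alpha = torch.softmax(self.w, dim=0)
        return sum(a * rep for a, rep in zip(alpha, per_layer))
```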
Hyperparameters and Training Criterion. We employ word2vec (Mikolov et al., 2013) to train the Chinese word embeddings on the Chinese Gigaword corpus. The English word embeddings are 300-dimensional GloVe vectors (Pennington et al., 2014). We choose the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001, decayed by 0.1% every 100 steps. Gradients with norm larger than 1.0 are clipped. All models are trained for at most 180k steps, with early stopping when there is no further improvement for 30 epochs. We pick the best model according to performance on the development data and evaluate it on the test data.
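In PyTorch, this training configuration can be sketched as follows; the exact decay implementation is an assumption (we read "decayed by 0.1% every 100 steps" as a StepLR schedule with gamma 0.999).

```python
import torch

def make_optimizer(model: torch.nn.Module):
    # Adam with initial lr 1e-3; multiply lr by 0.999 every 100 steps
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=100, gamma=0.999)
    return opt, sched

def train_step(model, loss, opt, sched):
    opt.zero_grad()
    loss.backward()
    # clip gradients whose global norm exceeds 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()
```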
Evaluation. We adopt the official evaluation script from CoNLL-2005 to evaluate our system outputs. Significance tests are conducted using Dan Bikel's randomized parsing evaluation comparator.

Results and Analyses on CPB1.0
Results and comparisons with previous works on CPB1.0 are shown in Table 1. Our full model, i.e., "HybridHDP", brings absolute improvements of +3.26 and +3.39 F1 on the development and test data over our baseline model, and outperforms Xia et al. (2019a).

Ablation study on heterogeneous treebanks. To clearly show the performance gains from singleton versus heterogeneous syntactic knowledge, we report the results of the two methods when using only single syntactic trees in Table 2 and Table 3. For the ExpHDP method, we can see that using single automatic syntactic trees from either $\text{Parser}_{\text{PCTB}}$ or $\text{Parser}_{\text{CDT}}$ achieves substantial improvements. Combining the two automatic dependency trees with the proposed ExpHDP reaches a higher performance, which demonstrates the effectiveness of our proposed ExpHDP in explicitly encoding heterogeneous syntactic knowledge. The results in Table 3 likewise verify that encoding implicit syntactic knowledge from heterogeneous syntactic treebanks is better than using only a singleton treebank.
Ablation study on syntactic representations. Table 4 gives the results of models with each of the two key components of our method ablated, which clearly shows the contribution of each module. The model without ImpHDP, i.e., "baseline + ExpHDP", drops from 85.25 to 84.24 F1 (-1.01) on the development data and from 84.65 to 83.41 F1 (-1.24) on the test data. The model without ExpHDP, i.e., "baseline + ImpHDP", drops from 85.25 to 84.19 F1 (-1.06) on the development data and from 84.65 to 84.12 F1 (-0.53) on the test data. From the results, we can conclude that 1) both methods effectively encode syntactic knowledge and 2) simultaneously integrating the explicit and implicit heterogeneous syntactic knowledge performs better than using either alone, which supports our intuition that the explicit and implicit methods are highly complementary.
Error breakdown. In order to understand where syntactic knowledge helps in SRL, we follow the work of He et al. (2017); Figure 4 shows the results after incrementally fixing various kinds of prediction errors in the model outputs. In simple terms, the smoother the curve, the fewer errors the model makes on each error type. "Orig." represents the F1 scores of the original model outputs. From Figure 4, we can see that our two syntax-aware components both clearly outperform our baseline model, especially on span mistakes, as shown by the "Merge Spans" and "Fix Span Boundary" errors, demonstrating that syntactic knowledge can effectively help determine argument boundaries. To better understand the improvements in span prediction, we report the Binary, Proportional, and Exact F1 scores on spans, where Binary treats an argument as correct if it overlaps with a gold-standard argument, and Proportional measures the overlapped region between a predicted argument and a gold-standard argument.

Figure 5: Results of our methods on span performance.

Figure 6: A case study of using heterogeneous syntactic knowledge with the ExpHDP and ImpHDP methods, where the blue blocks mark the gold predicate or arguments and the red block marks the wrongly predicted argument.
Table 5 shows the results, and we can see that introducing syntactic knowledge consistently helps determine the spans of SRL arguments.

Case Study. To better understand the usefulness of heterogeneous syntactic knowledge, we give a case study of the ExpHDP and ImpHDP methods in Figure 6. We observe that the baseline model cannot correctly predict the "AM-ADV" argument and wrongly treats "Hainan" as an "AM-ADV" argument. With the help of ExpHDP, our model successfully excludes the wrongly predicted "AM-ADV" of "Hainan". We think this is because our model learned that "Hainan" and "traveling" are not syntactically related in the trees, especially in the CDT tree, as shown in Figure 2. However, using only the explicit syntactic knowledge cannot fix the "AM-TMP" error. Finally, adding the implicit syntactic representation successfully predicts the correct predicate-argument structure.

Results and Analyses on English SRL
We report the results of our methods on the development data when using only singleton dependency trees in Table 6. We can clearly see that, compared with the corresponding improvements on the CPB1.0 dataset, the performance gains on CoNLL-2005 are relatively small, especially when using the UD dependency treebank. We attribute this to the purpose behind the construction of the UD treebank, which is designed for cross-lingual studies and aims to capture similarities among different languages, and thus may be relatively weak at language-specific morphosyntax.

Limits of syntactic knowledge. Another possible reason for the relatively smaller improvements is the larger number of training samples in the CoNLL-2005 dataset, which strengthens the basic SRL model and thus weakens the effect of syntactic information. Table 7 shows the results of models trained with different numbers of SRL training samples on the CoNLL-2005 WSJ test data. This indicates that syntactic knowledge works better on tasks with relatively fewer training samples. We also find that syntactic knowledge brings nearly no performance gains on CoNLL-2005 when the model uses RoBERTa representations. We think this is understandable because the RoBERTa model is trained on very large-scale text data with advanced training techniques. So, as pre-trained language models become more and more powerful, is syntactic knowledge no longer useful for other tasks? Apparently not: integrating syntactic knowledge into pre-trained language models has attracted attention (Wang et al., 2020). Utilizing heterogeneous syntactic knowledge there would be a direct and natural idea, which we leave for future work.

Related Work
Recently, SRL has achieved significant improvements thanks to the development of deep learning. Previous works can mostly be divided into two kinds of methods: syntax-agnostic methods, which focus on the SRL problem itself, and syntax-aware methods, which explore various ways to integrate syntactic knowledge into SRL models. Zhou and Xu (2015) propose to use deep BiLSTMs for English span-based SRL. He et al. (2017) further bring several advanced deep learning practices into stacked BiLSTMs. With the rise of the Transformer in machine translation, Tan et al. (2018) employ a deep self-attention encoder for SRL, achieving strong performance. Marcheggiani et al. (2017) propose a simple and fast model with rich input representations. Cai et al. (2018) present a fully end-to-end model composed of a BiLSTM encoder and a BiAffine scorer. He et al. (2018a) first treat SRL as a predicate-argument-role tuple identification task. Following this trend, later work extends this framework to both span-based and dependency-based SRL by constraining the argument length to 1 for dependency-based SRL.
Syntactic knowledge has been explored in various ways to promote the performance of SRL models. Roth and Lapata (2016) propose to integrate dependency path embeddings into the basic SRL model for dependency-based SRL. Strubell et al. (2018) propose a multi-task learning framework based on the self-attention encoder, which treats dependency parsing as an auxiliary task. Other work proposes an argument pruning method based on dependency tree positions for multilingual SRL. Recently, Xia et al. (2019a) propose a similar framework to extract syntactic representations for SRL, but they only focus on Chinese SRL. Xia et al. (2019b) compare four explicit methods to encode automatic dependency trees for SRL. Further work presents different methods to encode dependency trees and compares various ways of incorporating them into a self-attention-based SRL model. These previous works mainly focus on encoding a single-source dependency treebank, which can only provide limited syntactic knowledge. Our work focuses on exploiting heterogeneous dependency treebanks, and the results verify our intuition that heterogeneous syntactic knowledge can provide more valid information.

Conclusion
We propose to encode heterogeneous syntactic knowledge with explicit and implicit methods to help SRL. For the explicit aspect, we propose ExpHDP to encode the heterogeneous automatic dependency trees, which provide more information than singleton automatic dependency trees. For the implicit aspect, we extract implicit syntactic representations from a dependency parser trained with heterogeneous dependency treebanks. Experimental results and detailed analyses demonstrate that our proposed methods effectively capture heterogeneous syntactic knowledge and thus achieve larger improvements than using singleton dependency trees. We also discuss the limitations of syntactic knowledge, motivating future work on integrating heterogeneous syntactic knowledge into pre-trained language models.