A Syntax-aware Multi-task Learning Framework for Chinese Semantic Role Labeling

Semantic role labeling (SRL) aims to identify the predicate-argument structure of a sentence. Inspired by the strong correlation between syntax and semantics, previous works have paid much attention to improving SRL performance by exploiting syntactic knowledge, achieving significant results. Pipeline methods based on automatic syntactic trees and multi-task learning (MTL) approaches using standard syntactic trees are two common research directions. In this paper, we adopt a simple unified span-based model for both span-based and word-based Chinese SRL as a strong baseline. In addition, we present an MTL framework that includes the basic SRL module and a dependency parser module. Different from the commonly used hard parameter sharing strategy in MTL, the main idea is to extract implicit syntactic representations from the dependency parser as external inputs for the basic SRL model. Experiments on the Chinese Proposition Bank 1.0 and CoNLL-2009 Chinese benchmarks show that our proposed framework effectively improves performance over the strong baselines. With external BERT representations, our framework achieves new state-of-the-art F1 scores of 87.54 and 88.5 on the test sets of the two benchmarks, respectively. In-depth analyses are conducted to gain more insight into the proposed framework and the effectiveness of syntax.


Introduction
Semantic role labeling (SRL) is a fundamental and important task in natural language processing (NLP), which aims to identify the semantic structure (who did what to whom, when and where, etc.) of each given predicate in a sentence. Semantic knowledge has been widely exploited in many downstream NLP tasks, such as information extraction (Bastianelli et al., 2013), machine translation (Liu and Gildea, 2010; Gao and Vogel, 2011) and question answering (Shen and Lapata, 2007; Wang et al., 2015a).
There are two formulations of SRL in the community, according to the definition of semantic roles. The first is called span-based SRL, which employs a continuous word span as a semantic role and follows the manual annotations in PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004). The second is word-based SRL (Surdeanu et al., 2008), also called dependency-based SRL, whose semantic role is usually the syntactic or semantic head word of the manually annotated word span. Figure 1 gives an example of the two forms in a sentence, where "bought" is the given predicate.
Intuitively, syntax and semantics are strongly correlated. For example, the semantic A0 and A1 roles are usually the syntactic subject and object, as shown in Figure 1. Inspired by this correlation, researchers have tried to improve SRL performance by exploring various ways to integrate syntactic knowledge (Roth and Lapata, 2016; He et al., 2018b). In contrast, some recent works (Tan et al., 2018; Cai et al., 2018) propose deep neural models for SRL without considering any syntactic information, achieving promising results. Most recently, He et al. (2018a) extend the span-based models to jointly tackle the predicate and argument identification sub-tasks of SRL.
Compared with the large amount of research on English SRL, work on Chinese SRL is rare, mainly because of the limited amount of data and limited research attention. For Chinese, the commonly used datasets are Chinese Proposition Bank 1.0 (CPB1.0) (span-based) (Xue, 2008) and CoNLL-2009 Chinese (word-based) (Hajič et al., 2009). The CPB1.0 dataset follows the same annotation guidelines as the English PropBank benchmark (Palmer et al., 2005). Wu and Palmer (2015) present a topic-model-based selectional preference approach to improve Chinese SRL. Since the CPB1.0 dataset is small, Xia et al. (2017) exploit heterogeneous SRL data to improve performance via a progressive learning approach. The CoNLL-2009 benchmark was released by the CoNLL-2009 shared task (Hajič et al., 2009). Previous works on this dataset (He et al., 2018b; Cai et al., 2018) mainly focus on building more powerful models or exploring the usage of external knowledge.
Inspired by the development of neural models and the exploration of syntactic information, this paper proposes an MTL framework to extract syntactic representations as external input features for the simple unified SRL model. The contributions of our paper are threefold: 1. We introduce a simple unified model for span-based and word-based Chinese SRL. 2. We propose an MTL framework to extract implicit syntactic representations for the SRL model, which significantly outperforms the baseline model. 3. Detailed analysis yields crucial insights into the effectiveness of our proposed framework.
We conduct experiments on the benchmarks of CPB1.0 and CoNLL-2009. The results show that our framework achieves new state-of-the-art F1 scores of 87.54 and 88.5 on the two test sets, respectively.

Basic SRL Model
Motivated by the recently presented span-based models (He et al., 2018a) for jointly predicting predicates and arguments, we introduce a simple unified span-based model. Formally, given a sentence $s = w_1, w_2, ..., w_n$, the span-based model aims to predict a set of labeled predicate-argument relationships $Y \subseteq P \times A \times R$, where $P = \{w_1, w_2, ..., w_n\}$ is the set of all candidate predicates, $A = \{(w_i, ..., w_j) \mid 1 \le i \le j \le n\}$ is the set of all candidate arguments, and $R$ is the set of semantic roles. Following He et al. (2018a), we also include a null label in the role set $R$, indicating no relation between the focused predicate and argument. The model objective is to maximize the probability of the predicate-argument-role tuples $y \in Y$ in a sentence $s$, which is defined in terms of the tuple score $\phi(p, a, r) = \phi_p(p) + \phi_a(a) + \phi_r(p, a)$, the score of the predicate-argument-relation tuple. We directly adopt the model architecture of He et al. (2018a) as our basic SRL model, with a modification to the argument representation. The architecture of the basic SRL module is shown in the right part of Figure 2, and we describe it in the following subsections.
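As a sketch of the objective described above, and assuming it takes the per-pair softmax form of He et al. (2018a), whose architecture is adopted here, the probability can be written as follows. The symbols $y_{p,a}$ (the role assigned to the pair $(p, a)$, possibly the null label) and $r'$ (a summation variable) are notation introduced for this sketch only.

```latex
P(y \mid s)
  = \prod_{p \in P,\; a \in A} P(y_{p,a} \mid s)
  = \prod_{p \in P,\; a \in A}
    \frac{\exp \phi(p, a, y_{p,a})}{\sum_{r' \in R} \exp \phi(p, a, r')}
```

Under this reading, training maximizes the log of this product for the gold structure, which is the negative log-likelihood loss referred to later in the training objective.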

Input Layer
Following He et al. (2018a), we employ CNNs to encode the Chinese characters of each word $w_i$ into its character representation, denoted as $\mathrm{rep}_i^{char}$. Then, we concatenate $\mathrm{rep}_i^{char}$ with the word embedding $\mathrm{emb}_i^{word}$ to represent the word-level features as our basic model input. In addition, we also employ BERT representations (Devlin et al., 2019) to boost the performance of our baseline model, which we denote as $\mathrm{rep}_i^{BERT}$. Formally, the input representation of $w_i$ is
$$ x_i = \mathrm{rep}_i^{char} \oplus \mathrm{emb}_i^{word} \oplus \mathrm{rep}_i^{BERT}, $$
where $\oplus$ is the concatenation operation. Our basic SRL model and the BERT-enhanced baseline differ only in whether the BERT representation $\mathrm{rep}_i^{BERT}$ is included.
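The input construction described above can be sketched in a few lines. This is a minimal illustration with plain Python lists standing in for vectors; the function name `concat_input` and the toy dimensions are assumptions, not values from the paper.

```python
def concat_input(rep_char, emb_word, rep_bert=None):
    """Concatenate the per-word feature vectors into one input vector.

    rep_bert is optional: leaving it out corresponds to the basic SRL model,
    passing it corresponds to the BERT-enhanced baseline.
    """
    x = list(rep_char) + list(emb_word)
    if rep_bert is not None:
        x += list(rep_bert)
    return x


# Toy vectors (hypothetical sizes, chosen only to keep the example readable).
rep_char = [0.1, 0.2, 0.3]
emb_word = [1.0, 2.0]
rep_bert = [9.0, 8.0, 7.0, 6.0]

x_basic = concat_input(rep_char, emb_word)
x_bert = concat_input(rep_char, emb_word, rep_bert)
print(len(x_basic), len(x_bert))
```

The only difference between the two configurations is the width of the input vector, so the rest of the network can stay unchanged.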

BiLSTM Encoder
Over the input layer, we employ BiLSTMs with highway connections (Zhang et al., 2016b) to encode long-range dependencies and obtain rich representations, denoted as $h_i$ for time step $i$. The highway connections are used to alleviate the vanishing gradient problem when training deep neural networks.
Figure 2: The detailed architecture of our proposed framework, where the left part is the dependency parser and the right part is the basic SRL module.

Predicate and Argument Representations
We directly employ the output of the top BiLSTM layer as the predicate representation at each time step. For all the candidate arguments, we simplify the representations by taking the mean of the BiLSTM outputs within the corresponding argument span, which achieves results similar to the attention-based span representations (He et al., 2018a) on English SRL in our preliminary experiments. Formally, for a candidate predicate $w_i$ and a candidate argument $(w_i, ..., w_j)$,
$$ \mathrm{rep}_i^{p} = h_i, \qquad \mathrm{rep}_{i,j}^{a} = \frac{1}{j - i + 1} \sum_{k=i}^{j} h_k. $$
For word-based SRL, we only need to restrict the length of candidate arguments to 1.
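A minimal sketch of the mean-pooled argument representation, and of how the same code covers word-based SRL by limiting spans to length 1. The helper names, the 0-based indexing, and the toy encoder outputs are assumptions made for illustration.

```python
def span_mean(h, i, j):
    """Mean of the encoder outputs h[i..j] (0-based, inclusive on both ends)."""
    span = h[i:j + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


def candidate_spans(n, max_len):
    """All spans (i, j) over n words with at most max_len words."""
    return [(i, j) for i in range(n) for j in range(i, min(n, i + max_len))]


# Toy BiLSTM outputs for a 3-word sentence, 2 dimensions each.
h = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

arg_rep = span_mean(h, 0, 2)            # span-based: average over the whole span
word_based = candidate_spans(3, 1)      # word-based: every span has length 1
span_based = candidate_spans(3, 2)      # span-based with spans of up to 2 words
print(arg_rep, word_based, len(span_based))
```

With `max_len=1` each span mean reduces to the encoder output of a single word, which is why one model can serve both formulations.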

MLP Scorer
We employ MLP scorers as the scoring functions that determine whether candidate predicates or arguments should be pruned. Another MLP scorer is employed to compute the score of whether the focused candidate predicate and argument form a semantic relation.
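How the three scores combine into a role distribution for one predicate-argument pair can be sketched as follows. The snippet assumes the scorers have already produced the unary scores `phi_p` and `phi_a` and one relation score per role. The toy numbers, the role names, and the convention that the null label has a fixed score of 0 (as in He et al., 2018a) are assumptions of this sketch, not details confirmed by the text.

```python
import math


def role_distribution(phi_p, phi_a, phi_r, null_label="O"):
    """Softmax over roles of phi(p, a, r) = phi_p + phi_a + phi_r[r].

    The null label (no relation) is given a fixed score of 0 (assumption).
    """
    scores = {role: phi_p + phi_a + s for role, s in phi_r.items()}
    scores[null_label] = 0.0
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {role: math.exp(v - top) for role, v in scores.items()}
    total = sum(exps.values())
    return {role: e / total for role, e in exps.items()}


probs = role_distribution(0.5, 0.3, {"A0": 2.0, "A1": -1.0})
best = max(probs, key=probs.get)
print(best, round(sum(probs.values()), 6))
```

Because the unary scores are added to every non-null role, a low predicate or argument score pushes probability mass toward the null label for all roles at once.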

Proposed Framework
Our framework includes two modules, a basic SRL module and a dependency parser module, as shown in Figure 2. In this section, we will first describe the architecture of the employed dependency parser, and then illustrate the integration of the syntactic parser into the basic SRL model.

Dependency Parser Module
We employ the state-of-the-art biaffine parser proposed by Dozat and Manning (2017) as the dependency parser module in our framework, as shown in the left part of Figure 2. To better fit the dependency parser into our framework, we make some modifications to the original model architecture. First, we remove the part-of-speech (PoS) tag embeddings and add the Chinese character CNN representations, so the resulting input representation is the same as that of the SRL module.
Second, we replace the BiLSTMs in the original biaffine parser with the same 3-layer highway BiLSTMs used in our SRL module. The biaffine scorer, which computes the score of each candidate syntactic head-modifier pair, remains unchanged.
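For readers unfamiliar with the biaffine scorer, the following sketch computes arc scores in the form given by Dozat and Manning (2017): a bilinear term between head and dependent vectors plus a head-only bias term. It omits the MLP projections applied before scoring, and all vectors and weights are toy values chosen for illustration.

```python
def biaffine_arc_scores(dep, head, U, u):
    """Return scores[i][j], the score of word j being the head of word i.

    scores[i][j] = head[j]^T U dep[i] + head[j] . u
    """
    d = len(u)
    scores = []
    for dep_vec in dep:
        # U dep[i] is shared across all candidate heads of word i.
        u_dep = [sum(U[a][b] * dep_vec[b] for b in range(d)) for a in range(d)]
        row = []
        for head_vec in head:
            bilinear = sum(head_vec[a] * u_dep[a] for a in range(d))
            bias = sum(head_vec[a] * u[a] for a in range(d))
            row.append(bilinear + bias)
        scores.append(row)
    return scores


# Toy dependent/head vectors for a 2-word sentence (hypothetical values).
dep = [[1.0, 0.0], [0.0, 1.0]]
head = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
u = [0.5, 0.0]

scores = biaffine_arc_scores(dep, head, U, u)
predicted_heads = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(scores, predicted_heads)
```

The bias term depends only on the head vector, so it captures how likely a word is to act as a head regardless of the dependent.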

Details of Integration
Multi-task learning (MTL) approaches can effectively exploit standard dependency trees to improve SRL performance, treating dependency parsing as the auxiliary task. Hard parameter sharing (Ruder, 2017) is the most commonly used method in MTL; it shares several common layers between tasks and keeps the task-specific output layers, as illustrated by sub-figure (a) of Figure 3. We propose a better way to integrate the two tasks in this work. In the following, we describe the intuition and details of the integration.
As is well known, the hard parameter sharing approach can provide representations for all the shared tasks and reduce the probability of overfitting on the main task. However, this kind of sharing strategy somewhat weakens the task-specific representations of the main task, due to the neutralization of knowledge introduced by the auxiliary task. Different from the hard parameter sharing strategy, we propose to integrate the syntactic information into the input layer of the basic SRL module, as illustrated by sub-figure (b) of Figure 3; Figure 2 shows the detailed architecture. First, we extract the hidden outputs of all 3 BiLSTM layers of the dependency parser as the syntactic representations. Second, we sum the extracted representations with normalized weights to obtain the final syntactic representation for word $w_i$, denoted as $\mathrm{rep}_i^{syn}$. Formally,
$$ \mathrm{rep}_i^{syn} = \sum_{j=1}^{N} \alpha_j \, h_i^{j}, $$
where $N$ is the number of highway BiLSTM layers, $h_i^{j}$ is the output of the $j$-th layer for word $w_i$, and $\alpha_j$ is the $j$-th softmax weight. Finally, the extracted syntactic representations are fed into the input layer of the SRL module and concatenated with the original SRL module input.
We design this framework with two considerations in mind: 1) the proposed framework keeps separate model parameters for each task, thereby maximizing task-specific information for the main task; 2) the dependency parser module can be updated by the gradients returned through the extracted syntactic representations, which encourages it to produce semantically preferred representations.
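The normalized weighted sum over the parser's layer outputs can be sketched as below. The raw weights stand for trainable scalars that are passed through a softmax; the function names and toy vectors are assumptions for illustration.

```python
import math


def softmax(weights):
    """Normalize raw scalar weights so that they are positive and sum to 1."""
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def syntactic_representation(layer_outputs, raw_weights):
    """Softmax-weighted sum of one word's hidden vectors across N layers."""
    alphas = softmax(raw_weights)
    dim = len(layer_outputs[0])
    return [
        sum(alpha * layer[d] for alpha, layer in zip(alphas, layer_outputs))
        for d in range(dim)
    ]


# Outputs of the 3 parser BiLSTM layers for a single word (toy values).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rep_syn = syntactic_representation(layers, [0.0, 0.0, 0.0])
print(rep_syn)
```

With equal raw weights the result is the plain average of the layers; during training the weights can shift toward whichever layer is most useful for SRL.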

Training Objective
Given the sets of SRL data $S$ and dependency data $D$, the framework loss function is defined as the sum of the negative log-likelihood losses of the two tasks:
$$ \mathcal{L} = -\Big( \sum_{s \in S} \log P(Y_s^{*} \mid s) + \alpha \sum_{d \in D} \log P(Y_d^{*} \mid d) \Big), $$
where $Y_s^{*}$ and $Y_d^{*}$ are the gold semantic and syntactic structures, respectively, and $\alpha$ is a corpus weighting factor that controls the loss contribution of the dependency data in each batch, as discussed in the experiments.
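The combined loss for one batch can be sketched as follows, assuming each model has already assigned a probability to the gold structure of every instance in the batch. The function names and toy probabilities are hypothetical.

```python
import math


def nll(gold_probs):
    """Negative log-likelihood of the gold structures, summed over instances."""
    return -sum(math.log(p) for p in gold_probs)


def framework_loss(srl_gold_probs, dep_gold_probs, alpha):
    """SRL loss plus the dependency-parsing loss scaled by the factor alpha."""
    return nll(srl_gold_probs) + alpha * nll(dep_gold_probs)


# Two SRL instances and one dependency instance in the batch (toy values).
loss = framework_loss([0.5, 0.25], [0.5], alpha=0.5)
print(round(loss, 4))
```

Setting `alpha` to 0 recovers plain SRL training, while larger values let the dependency data contribute more to each update.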

Settings
We evaluate the proposed MTL framework on two commonly used Chinese benchmark datasets: Chinese Proposition Bank 1.0 (CPB1.0) (span-based) (Xue, 2008) and CoNLL-2009 (word-based) (Hajič et al., 2009). Following previous works, we report the results of span-based SRL in two setups: pre-identified predicates and end-to-end. For word-based SRL, we only report the results in the pre-identified predicates setting. Following Roth and Lapata (2016), we employ mate-tools (Björkelund et al., 2010) for predicate disambiguation, which achieves 94.87% and 94.91% F1 scores on the CoNLL-2009 Chinese development and test data, respectively.
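All results in the experiments are reported as F1. As a reminder of how such a score is computed, the following minimal sketch evaluates F1 over sets of predicted and gold (predicate, argument span, role) tuples. The tuple encoding and the toy example are assumptions for illustration; the official evaluation scripts of the benchmarks may differ in details.

```python
def f1_score(gold, pred):
    """F1 over exact matches between gold and predicted tuples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Each tuple is (predicate index, (span start, span end), role); toy values.
gold = {(1, (0, 0), "A0"), (1, (2, 3), "A1")}
pred = {(1, (0, 0), "A0"), (1, (2, 2), "A1")}

score = f1_score(gold, pred)
print(score)
```

In the span-based setting a predicted argument with a wrong boundary counts as an error even when its role label is right, as the second tuple shows.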
Dependency Parsing Data. We employ the Chinese Open Dependency Treebank (CODT) constructed at Soochow University. The treebank construction project aims to continually build a large-scale Chinese dependency treebank that covers up-to-date texts from different domains and sources (Peng et al., 2019). So far, CODT contains 67,679 sentences from 9 different domains or sources.
Recently, BERT (Bidirectional Encoder Representations from Transformers) was proposed by Devlin et al. (2019); it uses Transformers to learn contextual representations of words. In this paper, we use the pre-trained Chinese model to extract the BERT representations for our span-based and word-based SRL datasets. We extract fixed BERT representations from the last four hidden layers of the pre-trained model. Finally, we again employ a normalized weighted sum to obtain the final BERT representation for each word $w_i$, denoted as $\mathrm{rep}_i^{BERT}$.

Hyperparameters.
We employ word2vec (Mikolov et al., 2013) to train the Chinese word embeddings on the Chinese Gigaword dataset. The Chinese character embeddings are randomly initialized with dimension 100. We employ a CNN to obtain the Chinese character representations, with window sizes of 3, 4 and 5 and an output channel size of 100. For other parameter settings in the SRL module, we mostly follow He et al. (2018a). As for the pruning of candidate predicates and arguments, we choose the pruning ratios according to the training data, using $\lambda_p = 0.4$ for predicates and $\lambda_a = 0.8$ for arguments, with argument spans of up to 30 words.
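The pruning step can be sketched as follows, assuming (as in He et al., 2018a) that a ratio λ means keeping the ⌊λ · n⌋ highest-scoring candidates for a sentence of n words. The function name, the toy scores, and the rule of always keeping at least one candidate are assumptions of this sketch.

```python
def prune(candidates, scores, ratio, n_words):
    """Keep the floor(ratio * n_words) highest-scoring candidates (at least one)."""
    keep = max(1, int(ratio * n_words))
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:keep]]


# A 5-word sentence: every word is a candidate predicate (toy scores).
predicate_candidates = [0, 1, 2, 3, 4]
predicate_scores = [0.1, 2.0, -1.0, 0.7, 0.3]

kept = prune(predicate_candidates, predicate_scores, ratio=0.4, n_words=5)
print(kept)
```

The same function applies to argument spans with the larger ratio, where the candidate list holds (start, end) pairs instead of word indices.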
Training Criterion. We choose the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001 and a decay rate of 0.1% every 100 steps. Each data batch is composed of both SRL and dependency instances. We randomly reshuffle the SRL and dependency training datasets once the smaller SRL dataset is used up. All baseline models are trained for at most 180,000 steps, and the other models for 100,000 steps. In addition, we pick the model that performs best on the development data for testing. We apply 0.5 dropout to the word embeddings and Chinese character representations and 0.2 dropout to all hidden layers. For the highway BiLSTMs, we employ variational dropout masks that are shared across all time steps (Gal and Ghahramani, 2016), with a dropout rate of 0.4.
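One plausible reading of the learning-rate schedule is a staircase exponential decay, in which the rate is multiplied by 0.999 after every 100 steps. The staircase interpretation and the function name are assumptions; the text does not say whether the decay is applied continuously or in discrete steps.

```python
def learning_rate(step, base_lr=0.001, decay=0.001, every=100):
    """Staircase exponential decay: shrink by `decay` once per `every` steps."""
    return base_lr * (1.0 - decay) ** (step // every)


lr_start = learning_rate(0)
lr_after_100 = learning_rate(100)
lr_end = learning_rate(100_000)
print(lr_start, lr_after_100, lr_end)
```

Under this reading the rate has been multiplied by 0.999 a thousand times after 100,000 steps, which leaves a bit more than a third of the initial value.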

Syntax-aware Methods
To illustrate the effectiveness and advantage of our proposed framework (Integration of Implicit Representations, IIR), we conduct several experiments with recently employed syntax-aware methods on the CPB1.0 dataset for comparison:
• HPS: We employ the commonly used hard parameter sharing (HPS) strategy of MTL as a strong baseline, which shares the word and character embeddings and the 3-layer BiLSTMs between the dependency parser and the basic SRL module.

Main Results
Results of Syntax-aware Methods. Table 1 shows the results of these syntax-aware methods on the CPB1.0 dataset. First, the first line shows the results of our baseline model, which only employs the word embeddings and character representations as the inputs of the basic SRL model. Second, the Tree-GRU method only achieves an F1 score of 80.06 on the test data, which falls short of even the baseline model. We think this is caused by the relatively low accuracy of Chinese dependency parsing. Third, the FIR approach outperforms the baseline by 2.17 F1 on the test data, demonstrating the effectiveness of introducing fixed implicit syntactic representations. Fourth, the HPS strategy performs better still, with an F1 score of 83.51. Finally, our proposed framework achieves the best performance among these methods, an F1 score of 83.91, outperforming the baseline by 3.43 F1. All the improvements are statistically significant (p < 0.0001). From these experimental results, we can conclude that: 1) the quality of syntax has a crucial impact on methods that depend on automatically predicted dependency trees, like Tree-GRU; 2) implicit syntactic features have the potential to improve downstream NLP tasks; and 3) learning the syntactic features together with the main task performs better than extracting them from a fixed dependency parser.
Results on CPB1.0. Table 2 shows the results of our baseline model and our proposed framework using external dependency trees on CPB1.0, as well as the corresponding results when adding BERT representations. It is clear that adding dependency trees to the baseline SRL model effectively improves performance (p < 0.0001), whether or not the BERT representations are employed. In particular, our proposed framework (IIR) consistently outperforms the hard parameter sharing strategy, so we only report the results of our proposed framework in later experiments. Our final results outperform the best previous model (Xia et al., 2017) by 7.87 F1 with BERT representations and by 4.24 F1 without them. Table 3 shows the results of our framework in the end-to-end setting. To the best of our knowledge, we are the first to present end-to-end results on the CPB1.0 dataset. We achieve an F1 score of 85.57, which is a strong baseline for later works. It is clear that our framework still achieves better results than the strong baseline, which employs BERT representations as the external input.

Table 2 (rows for previous works; F1 on the CPB1.0 test data):
Sun et al. (2009)      74.12
Wang et al. (2015b)    77.59
Sha et al. (2016)      77.69
Xia et al. (2017)      79.67

Results on CoNLL-2009. We compare performance with Cai et al. (2018), an end-to-end neural model that consists of a BiLSTM encoder and a biaffine scorer. Our proposed framework outperforms the best reported result (Cai et al., 2018) by 0.8 F1 and brings a significant improvement (p < 0.0001) of 0.9 F1 over our baseline model. Our result rises to 88.5 F1 when the framework is enhanced with BERT representations. However, compared with the setting without BERT, the improvement of the proposed framework over the BERT-enhanced baseline is fairly small (88.53 − 88.47 = 0.06 F1, p > 0.1), which we discuss in Section 5.3.

Analysis
In this section, we conduct detailed analysis to understand the improvements introduced by our proposed framework.

Long-distance Dependencies
To analyze the effect of the proposed framework with respect to sentence length, we report the F1 scores on groups of sentences of different lengths, as shown in Figure 4. We can see that improvements are obtained for nearly all groups, especially for long sentences. This demonstrates that syntactic knowledge is beneficial for SRL and effective at capturing long-distance dependencies.

Improvements on Semantic Roles
To find which semantic roles benefit from our syntax-aware framework, we report the F1 scores on several semantic role labels in Figure 5. We can see that syntax helps most on the A0 and A1 roles, which is consistent with the intuition that the semantic A0 and A1 roles are usually the syntactic subject and object of a verb predicate. Other adjunct semantic roles like ADV, LOC, MNR and TMP all benefit from the introduction of syntactic information. Interestingly, the DIS role performs worse when syntactic information is introduced. We conducted an error analysis of this phenomenon and found that the framework mostly confuses DIS with ADV. A possible reason is that the two semantic roles are both labeled as adv in syntax.

Integration with BERT
BERT is employed to boost the performance of our basic SRL model and our proposed framework. Compared with the results in the settings without BERT, the improvements of our framework over the BERT-enhanced baseline are fairly small on CoNLL-2009, as shown by the last two lines in Table 4. To analyze the difference between the two models (Baseline + BERT and Baseline + BERT + Dep (IIR)), we compare their sentence-level performance, inspired by Zhang et al. (2016a). As shown in Figure 6, most of the scatter points are off the diagonal line, demonstrating strong differences between the two models. Based on this finding, how to better integrate syntactic knowledge and BERT representations becomes an interesting and meaningful question, which we leave for future work.

Related Work
Traditional discrete-feature-based SRL works (Swanson and Gordon, 2006; Zhao et al., 2009) make heavy use of syntactic information. Along with the impressive development of neural-network-based approaches in the NLP community, much attention has been paid to building more powerful neural models without considering any syntactic information (e.g., Zhou and Xu, 2015).
Apart from the above syntax-free works, researchers have also paid much attention to improving neural SRL approaches by introducing syntactic knowledge. Roth and Lapata (2016) introduce dependency path embeddings to a neural model and achieve substantial improvements. Graph convolutional neural networks have also been employed on top of the BiLSTM encoder to encode syntactic information. He et al. (2018b) propose a k-th order argument pruning algorithm based on automatically predicted dependency trees. Strubell et al. (2018) propose a self-attention-based neural MTL model which incorporates dependency parsing as an auxiliary task for SRL. Another MTL framework uses the hard parameter sharing strategy to incorporate a constituent parsing loss into semantic tasks, i.e., SRL and coreference resolution, and outperforms its baseline by 0.8 F1. Xia et al. (2019) investigate and compare several syntax-aware methods on span-based SRL, showing the effectiveness of integrating syntactic information.
Compared with the large amount of work on English SRL, work on Chinese SRL is rare, mainly because of the limited data size and limited research attention. Sun et al. (2009) treat Chinese SRL as a sequence labeling problem and build an SVM-based model by exploiting morphological and syntactic features. Wang et al. (2015b) build a basic BiLSTM model and introduce a way to exploit heterogeneous data by sharing word embeddings. Xia et al. (2017) propose a progressive model to learn and transfer knowledge from heterogeneous SRL data. The above works all focus on span-based Chinese SRL, and we compare with their results in Table 2. Different from them, we propose an MTL framework to integrate implicit syntactic representations into a simple unified model for both span-based and word-based SRL, achieving substantial improvements.
In addition to the hard parameter sharing strategy that we discuss in Section 3.2, the partial parameter sharing strategy is also a commonly studied approach in MTL and domain adaptation. Kim et al. (2016) introduce simple neural extensions of feature augmentation by employing a global LSTM used across all domains and independent LSTMs used within individual domains. Peng et al. (2017) explore a multi-task learning approach which shares parameters across formalisms for semantic dependency parsing. In addition, Peng et al. (2018) present a multi-task approach for frame-semantic parsing and semantic dependency parsing with latent structured variables.

Conclusion
This paper proposes a syntax-aware MTL framework to integrate implicit syntactic representations into a simple unified SRL model. The experimental results show that our proposed framework can effectively improve the basic SRL model, even when the basic model is enhanced with BERT representations. In particular, our proposed framework is more effective at utilizing syntactic information than the hard parameter sharing strategy of MTL. By utilizing BERT representations, our framework achieves new state-of-the-art performance on both span-based and word-based Chinese SRL benchmarks, i.e., CPB1.0 and CoNLL-2009, respectively. Detailed analysis shows that syntax helps most on long sentences, because of the long-distance dependencies captured by syntax trees. Moreover, the comparison of sentence-level performance indicates that much work remains to better integrate syntactic information and BERT representations.