Syntax-aware Multilingual Semantic Role Labeling

Recently, semantic role labeling (SRL) has achieved a series of successes with ever higher performance, which can mainly be attributed to syntactic integration and enhanced word representations. However, most of these efforts focus on English, while SRL for languages other than English has received relatively little attention and thus remains underdeveloped. This paper intends to fill that gap in multilingual SRL, with a special focus on the impact of syntax and contextualized word representations. Unlike existing work, we propose a novel method guided by a syntactic rule to prune arguments, which enables us to integrate syntax into a multilingual SRL model simply and effectively. We present a unified SRL model designed for multiple languages, together with the proposed uniform syntax enhancement. Our model achieves new state-of-the-art results on the CoNLL-2009 benchmarks for all seven languages. Besides, we discuss the role of syntax across different languages and verify the effectiveness of deep enhanced representations for multilingual SRL.


Introduction
Semantic role labeling (SRL) aims to derive a meaning representation, such as an instantiated predicate-argument structure, for a sentence. The currently popular formalisms for representing the semantic predicate-argument structure are based on dependencies and spans. Their main difference is that dependency SRL annotates the syntactic head of an argument rather than the entire constituent (span); this paper focuses on dependency-based SRL. Be it dependency or span, SRL plays a critical role in many natural language processing (NLP) tasks, including information extraction (Christensen et al., 2011), machine translation (Xiong et al., 2012) and question answering (Yih et al., 2016).

[* These authors made equal contribution. † Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of the National Natural Science Foundation of China (No. U1836222 and No. 61733011).]
Almost all traditional SRL methods relied heavily on syntactic features, which suffered from the risk of erroneous syntactic input, leading to undesired error propagation. To alleviate this problem, researchers as early as Zhou and Xu (2015) proposed neural SRL models without syntactic input, and later work employed the biaffine attentional mechanism (Dozat and Manning, 2017) for dependency-based SRL. In the meantime, a series of studies (Roth and Lapata, 2016; Strubell et al., 2018) have introduced syntactic clues in creative ways for further performance improvement, achieving favorable results. However, applying k-order syntactic tree pruning to the biaffine SRL model does not boost performance as expected, which indicates that exploiting syntactic clues in state-of-the-art SRL models still deserves deeper exploration. Besides, most SRL literature is dedicated to impressive performance gains on English and Chinese, while other languages have received relatively little attention. We even observe that to date the best reported results for some languages (Catalan and Japanese) are still from the initial CoNLL-2009 shared task (Hajič et al., 2009). Therefore, we launch this multilingual SRL study to fill this long-standing gap. In particular, we attempt to improve the overall performance of multilingual SRL by incorporating syntax and introducing contextualized word representations, and to explore the syntactic effect on these other languages.
Multilingual SRL needs to be carefully handled given the diversity of syntactic and semantic representations among quite different languages. Despite this diversity, we manage in this paper to develop a simple and effective neural SRL model that integrates syntactic information by applying an argument pruning method in a uniform way. Specifically, we introduce a new pruning rule based on the syntactic parse tree, unlike k-order pruning, which is determined simply by the relative distance between predicate and argument. Furthermore, we propose a novel method guided by a syntactic rule to prune arguments for dependency SRL, different from existing work.
With the help of the proposed pruning method, our model can effectively alleviate the imbalanced distribution of arguments and non-arguments, achieving faster convergence during training.
To verify the effectiveness and applicability of the proposed method, we evaluate the model on all seven languages of the CoNLL-2009 datasets. Experimental results indicate that our argument pruning method is generally effective for multilingual SRL under our unified modeling. Moreover, our model with contextualized word representations achieves new best results on all seven datasets, the first overall update since 2009. To the best of our knowledge, this is the first attempt to study all seven languages comprehensively with deep learning models.

Model
Given a sentence, SRL can be decomposed into four classification subtasks: predicate identification and disambiguation, and argument identification and classification. Since the CoNLL-2009 shared task indicates all predicates beforehand, we focus on identifying arguments and labeling them with semantic roles. Our model builds on a recent syntax-agnostic SRL model by introducing argument pruning and enhanced word representations. In this work, we handle argument identification and classification in one shot, treating SRL as a classification task over predicate-argument word pairs. Figure 1 illustrates the overall architecture of our model, which consists of three modules: (1) a bidirectional LSTM (BiLSTM) encoder, (2) an argument pruning layer which takes as input the BiLSTM representations, and (3) a biaffine scorer which takes as input the predicate and its argument candidates.

BiLSTM Encoder
Given a sentence and marked predicates, we adopt a bidirectional Long Short-Term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) to encode the sentence, taking word representations as input. Following previous work, the word representation is the concatenation of five vectors: a randomly initialized word embedding, a lemma embedding, a part-of-speech (POS) tag embedding, a pre-trained word embedding and a predicate-specific indicator embedding.
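As a concrete sketch, the concatenation can be expressed as follows (the dimensions follow the experimental setup later in the paper; the function name and the simplified constant indicator vector are illustrative, since the actual indicator embedding is learned):

```python
import numpy as np

def word_representation(word_emb, lemma_emb, pos_emb, pretrained_emb,
                        is_predicate, indicator_dim=16):
    """Concatenate the five component vectors into a single word representation."""
    # Predicate-specific indicator: simplified here to a constant vector toggled
    # by the flag; the real model looks up a learned 16-dimensional embedding.
    indicator = np.full(indicator_dim, 1.0 if is_predicate else 0.0)
    return np.concatenate([word_emb, lemma_emb, pos_emb, pretrained_emb, indicator])

# 100-dim word/lemma/POS/pre-trained embeddings + 16-dim indicator = 416 dims
x = word_representation(np.zeros(100), np.zeros(100), np.zeros(100),
                        np.zeros(100), is_predicate=True)
```

When ELMo or BERT is used, its 1024-dimensional contextual embedding is simply appended to this concatenation.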
Besides, recent work has demonstrated that the contextualized representation ELMo (Embeddings from Language Models) (Peters et al., 2018) can boost the performance of dependency SRL models on English and Chinese. To explore whether such deep enhanced representations can help other languages, we further enhance the word representation by concatenating an external embedding from one of the recent successful language models, ELMo or BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018), both of which are contextualized representations. It is worth noting that we use ELMo or BERT only to obtain fixed pre-trained contextual embeddings rather than fine-tuning the models.

Argument Pruning Layer
For word pair classification modeling, one major performance bottleneck is caused by unbalanced data, especially in SRL, where more than 90% of argument candidates are non-arguments. A series of pruning methods have been proposed to alleviate this imbalanced distribution, such as k-order pruning. However, that method does not extend well to other languages, and can even hinder a syntax-agnostic SRL model, as previous experiments with different k values on English have shown. The reason might be that this pruning method breaks up the whole sentence, leading the BiLSTM encoder to take an incomplete sentence as input and fail to learn a sufficient sentence representation.
To alleviate this drawback of previous syntax-based pruning methods, we propose a novel pruning rule extraction method based on the syntactic parse tree, which suits multilingual cases generally. In the detailed model implementation, we add an argument pruning layer guided by the syntactic rule after the BiLSTM layers, which can absorb the syntactic clue simply and effectively.
Syntactic Rule Considering that all arguments are predicate-specific instances, it is generally observed that the distances between a predicate and its arguments on the syntactic tree lie within a certain range for most languages. Therefore, we introduce a language-specific rule based on syntactic dependency parses to prune unlikely arguments, henceforth the syntactic rule. Specifically, given a predicate p and a candidate argument a, we define d_p and d_a as the distances from p and a, respectively, to their nearest common ancestor node (namely, the root of the minimal subtree that includes both p and a). For example, 0 denotes that the predicate or argument itself is the nearest common ancestor, while 1 denotes that the nearest common ancestor is the parent of the predicate or argument. We then use the distance tuple (d_p, d_a) as their relative position representation inside the parse tree. Finally, we rank all tuples by how many times each distance tuple occurs in the training data, counted for each language independently.
It is worth noting that our syntactic rule is determined by the top-k most frequent distance tuples. During training and inference, the syntactic rule takes effect by excluding all candidate arguments whose predicate-argument relative position in the parse tree is not in the list of top-k frequent tuples. Figure 2 shows simplified examples of syntactic dependency trees. Given the English sentence in (a), the current predicate is likes, whose arguments are cat and fish. For likes and cat, the predicate (likes) is their common ancestor (denoted as Root_arg) according to the syntax tree. Therefore, the relative position representation of predicate and argument is (0, 1), and so it is for likes and fish. As for the tree in (b), suppose the marked predicate has two arguments, arg1 and arg2; the common ancestors of the predicate and these arguments are Root_arg1 and Root_arg2, respectively. In this case, the relative position representations are (0, 1) and (1, 2).
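The distance-tuple computation described above can be sketched as follows (a minimal illustration assuming 0-based token indices and a `heads` array where the root's head is -1; the example tree is hypothetical, not one from the paper's figures):

```python
def relative_position(heads, p, a):
    """Return (d_p, d_a): distances from predicate p and argument a to their
    nearest common ancestor in the dependency tree.

    heads[i] is the head (parent) index of token i; the root's head is -1.
    """
    def ancestors(i):
        # Path from token i up to the root, starting with i itself.
        chain = [i]
        while heads[i] != -1:
            i = heads[i]
            chain.append(i)
        return chain

    path_p, path_a = ancestors(p), ancestors(a)
    # Nearest common ancestor: the first node on p's path that also lies on a's path.
    common = next(n for n in path_p if n in set(path_a))
    return path_p.index(common), path_a.index(common)

# Hypothetical tree: token 2 is the root, token 1 is its child,
# token 0 depends on 1, token 3 on 2, and token 4 on 3.
heads = [1, 2, -1, 2, 3]
```

With this tree, a predicate at index 2 and an argument at index 1 yield the tuple (0, 1): the predicate itself is the common ancestor and the argument is its child.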

Argument Pruning Method
To maintain the integrity of the sequential input from the whole sentence, we propose a novel syntax-based method to prune arguments, unlike most existing work (Xue and Palmer, 2004; Zhao et al., 2009a), which prunes argument candidates in the pre-processing stage. As shown in Figure 1, the argument pruning strategy is very straightforward: in the argument pruning layer, our model drops those candidate arguments (more exactly, their BiLSTM representations) that do not comply with the syntactic rule. In other words, only predicates and arguments that satisfy the syntactic rule are passed to the next layer.
For example (in Figure 1), given the sentence Keep your heart and mind open, the predicate Keep and the corresponding syntactic dependencies (bottom), suppose that only (0, 1) is inside the syntactic rule on this occasion. Then the candidate arguments your, and, and mind will be pruned by the argument pruning layer.
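The rule extraction and pruning steps might look as follows (a sketch under stated assumptions: function names and the candidate-list format are illustrative, and the actual layer drops BiLSTM state vectors rather than token indices):

```python
from collections import Counter

def build_syntactic_rule(training_tuples, k):
    """Keep the top-k most frequent (d_p, d_a) tuples observed on training data."""
    return {t for t, _ in Counter(training_tuples).most_common(k)}

def prune_arguments(candidates, rule):
    """Drop candidates whose relative position tuple falls outside the rule.

    candidates: list of (token_index, (d_p, d_a)) pairs for one predicate.
    Returns the indices of the surviving argument candidates.
    """
    return [i for i, tup in candidates if tup in rule]

# Toy counts: (0, 1) is by far the most frequent tuple, as observed in the paper.
rule = build_syntactic_rule(
    [(0, 1), (0, 1), (0, 1), (1, 2), (1, 2), (2, 3)], k=2)
survivors = prune_arguments([(0, (0, 1)), (1, (2, 3)), (2, (1, 2))], rule)
```

Because pruning happens after the BiLSTM encoder, the encoder still sees the complete sentence, which is exactly the property the k-order pre-processing pruning lacks.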

Biaffine Scorer
As mentioned above, our model treats the SRL task as a word pair classification problem, tackling argument identification and classification in one shot. To label the arguments of given predicates, we employ a scorer with biaffine attention (Dozat and Manning, 2017) (biaffine scorer for short) as the role classifier on top of the argument pruning layer for the final prediction. The biaffine scorer takes as input the BiLSTM hidden states of the predicate and the candidate arguments filtered by the argument pruning layer, denoted by h_p and h_a respectively, and then computes the probability of the corresponding semantic labels using a biaffine transformation:

P(label | p, a) = softmax(h_p^T W_1 h_a + W_2 (h_p ⊕ h_a) + b),

where ⊕ represents the concatenation operator, W_1 and W_2 denote the weight matrices of the bilinear and linear terms respectively, and b is the bias term. Note that the predicate itself is also included in its own argument candidate list and scored, because a nominal predicate sometimes takes itself as its own argument.
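A minimal NumPy sketch of this scoring step for a single predicate-argument pair (the function name and the batchless formulation are illustrative; the actual classifier applies learned parameters over batches):

```python
import numpy as np

def biaffine_scores(h_p, h_a, W1, W2, b):
    """Label distribution for one (predicate, argument) pair.

    Implements softmax(h_p^T W1 h_a + W2 (h_p ⊕ h_a) + b), where
    W1 has shape (labels, d, d) for the bilinear term and
    W2 has shape (labels, 2d) for the linear term over the concatenation.
    """
    bilinear = np.einsum('d,lde,e->l', h_p, W1, h_a)   # one bilinear form per label
    linear = W2 @ np.concatenate([h_p, h_a])           # linear term on h_p ⊕ h_a
    scores = bilinear + linear + b
    exp = np.exp(scores - scores.max())                # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, n_labels = 4, 3
probs = biaffine_scores(rng.normal(size=d), rng.normal(size=d),
                        rng.normal(size=(n_labels, d, d)),
                        rng.normal(size=(n_labels, 2 * d)),
                        rng.normal(size=n_labels))
```

The bilinear term captures predicate-argument interaction directly, while the linear term lets each state contribute on its own, matching the two weight matrices in the formula above.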

Experiments
Our model is evaluated on the CoNLL-2009 benchmark datasets, covering Catalan, Chinese, Czech, English, German, Japanese and Spanish; the statistics of the training datasets can be seen in Table 1. (The code is available at https://github.com/bcmi220/multilingual_srl.) For predicate disambiguation, we adopt the previously predicted senses of Zhao et al. (2009a) for Catalan and Spanish, and those of Björkelund et al. (2009) for the other languages. Besides, we use the officially predicted POS tags and syntactic parses provided by the CoNLL-2009 shared task for all languages. As for the contextualized representations, for ELMo we employ the multilingual version from Che et al. (2018), and for BERT we use the BERT-Base, Multilingual Cased model (Devlin et al., 2018). For the syntactic rule in the argument pruning layer, to ensure more than 99% coverage of true arguments in the pruning output, we use the top-120 distance tuples for Japanese and the top-20 for the other languages, for a better trade-off between computation and coverage.

Model Setup
In our experiments, all real vectors are randomly initialized, including the 100-dimensional word, lemma and POS tag embeddings and the 16-dimensional predicate-specific indicator embedding. The pre-trained word embeddings are 100-dimensional GloVe vectors (Pennington et al., 2014) for English and 300-dimensional fastText vectors (Grave et al., 2018) trained on Common Crawl and Wikipedia for the other languages, while the dimension of the ELMo or BERT word embedding is 1024. Besides, we use a 3-layer BiLSTM with 400-dimensional hidden states, applying dropout with an 80% keep probability between time-steps and layers. For the biaffine scorer, we employ two 300-dimensional affine transformations with the ReLU non-linear activation, also setting the dropout probability to 0.2. During training, we use categorical cross-entropy as the objective, with the Adam optimizer (Kingma and Ba, 2015) and an initial learning rate of 2e-3. All models are trained for up to 500 epochs with batch size 64.

[2: There were two tracks in the CoNLL-2009 shared task, SRL-only and joint. For the former, participants did not have to develop their own syntactic parsers and could focus on SRL model development, while for the latter, participants had to build their own syntactic parser as well. To keep the focus on SRL, in this work we take the official syntax provided by CoNLL-2009.]

Results and Discussion
In Table 2, we compare our single model (AP is an acronym for argument pruning) against previous work on the English in-domain data and the Chinese test set. Our baseline is a modification of a prior model that uniformly handles predicate disambiguation. For English, our baseline gives slightly weaker performance than the work of Li et al. (2019), which used ELMo and employed a sophisticated span selection model to predict predicates and arguments jointly. Our model with the proposed argument pruning layer (+ AP) brings absolute improvements of 0.35% and 0.5% F1 on English and Chinese, respectively, which is on par with the best published scores. Moreover, we introduce deep enhanced representations on top of argument pruning: our model utilizing BERT (+ AP + BERT) achieves the new best results on the English and Chinese benchmarks. Table 3 presents all test results on the seven languages of the CoNLL-2009 datasets. So far, the best previously reported results for Catalan, Japanese and Spanish are still from the CoNLL-2009 shared task. Compared with previous methods, our baseline yields strong performance on all datasets except German. Especially for Catalan, Czech, Japanese and Spanish, our baseline outperforms existing methods by a large margin of 3.5% F1 on average. Nevertheless, applying our argument pruning to this strong syntax-agnostic baseline still boosts performance, which demonstrates the effectiveness of the proposed method. It also indicates that syntax is generally beneficial to multiple languages and can enhance multilingual SRL performance given effective syntactic integration.
Besides, we report the scores of leveraging ELMo and BERT for multiple languages (the last three rows in Table 3). The use of contextualized word representations further improves model performance, overwhelmingly outperforming previously published best results and achieving a new state of the art in multilingual SRL for the first time. Furthermore, we find that while ELMo promotes the overall performance of the SRL model, BERT gives a more significant performance increase than ELMo on all languages, which suggests that BERT is better at enriching contextual information. More interestingly, we observe that the performance gains from the proposed argument pruning method or these deep enhanced representations are relatively marginal on Japanese; one possible reason is the relatively small size of its training set. This observation also indicates that ELMo and BERT are more suitable for learning on large annotated corpora.

End-to-end SRL
As mentioned above, we combine the predicate sense output of previous work to make results directly comparable, since the official evaluation script includes this prediction in the F1-score calculation. However, although predicate disambiguation is considered a simpler task that inflates the semantic F1 score, it still deserves further research. To this end, we present a full end-to-end neural model for multilingual SRL, namely Biaffine SRL.
Unlike most SRL work, which treats predicate sense disambiguation and semantic role assignment as independent tasks, we jointly handle predicate disambiguation and argument labeling in one shot by introducing a virtual node <VR> as the nominal semantic head of the predicate. It should be noted that the predicate sense annotation of Czech and Japanese is simply the lemmatized token of the predicate, a one-to-one predicate-sense mapping. Therefore, we ignore these two languages and conduct experiments on the other five. Table 4 shows the results of the end-to-end setting. Compared to the baseline, our full end-to-end model (Biaffine SRL) yields slightly higher precision on predicate disambiguation as a whole, which gives rise to a corresponding gain in semantic F1. What is more, our model (using argument pruning and BERT) reaches the highest scores on the five benchmarks. Besides, the experiments indicate that argument pruning promotes role labeling performance while BERT significantly improves the performance of predicate disambiguation.

[Table 5 (fragment): F1 with POS tag or lemma removed, e.g. Japanese 83.08 → 82.02 (-1.06) / 82.80 (-0.28); Spanish 83.47 → 83.15 (-0.32) / 83.00 (-0.47).]

Analysis

The following analyses examine the syntactic contribution to multilingual SRL. Since recent work has studied dependency SRL on English and Chinese in depth, we focus on the other five languages; these analyses are performed on the CoNLL-2009 test sets without using ELMo or BERT embeddings.

Effectiveness of Language Feature
As prior work points out, POS tag information is highly beneficial for English. Consequently, we conduct an ablation study on the in-domain test sets to explore how these language features impact our model. Table 5 reports the F1 scores of the model with the POS tag or lemma embedding removed from the baseline. Results show that omitting the POS tag or the lemma leads to slight performance degradation (-0.5% and -0.32% F1 on average, respectively), indicating that both help improve the performance of multilingual SRL. Interestingly, we see a drop of 1.0% F1 for Japanese when not using POS tags, which demonstrates their importance.

Effectiveness of Syntactic Rule
According to our statistics of the syntactic rule on the training data for the seven languages, based on the automatically predicted parses provided by the CoNLL-2009 shared task, the total number of distance tuples in the syntactic rule is no more than 120 for all languages except Japanese, where it is about 260. This supports the earlier hypothesis that the relative distances between a predicate and its arguments lie within a certain range. Besides, (0, 1) is the most frequently occurring tuple in all languages except Japanese, which indicates that arguments most frequently appear as children of their predicate in the dependency syntax tree. Figure 3 shows F1 scores on the test sets under top-k argument pruning for German, Catalan and Japanese, where k = 0 represents the baseline without pruning. We observe that k = 20 yields the best performance on German and Catalan. As for Japanese, pruning falls short of the baseline in our current observation range, but our experiments have shown that the top-120 setting achieves the best results.
To reveal the strengths of the proposed syntactic rule, we conduct further experiments, replacing our syntactic-rule-based pruning with k-order argument pruning. Following the original setting, we use tenth-order pruning for the best performance. Table 6 shows the performance gaps between the two pruning methods. Compared with the syntactic rule, k-order pruning degrades model performance by 0.2% F1 on average, showing that our syntactic-rule-based pruning method is more effective and extends well to multiple languages, especially Japanese.

Syntactic Impact
In this part, we attempt to explore the syntactic impact on the other five languages. To investigate the maximum contribution of syntax to multilingual SRL, we perform experiments using the gold syntactic parses officially provided by the CoNLL-2009 shared task instead of the predicted ones. To be more precise, the syntactic rule is counted on the gold syntactic trees and applied in the argument pruning layer. The corresponding results of our syntax-agnostic and syntax-aware models are summarized in Table 7. We also report the unlabeled attachment scores (UAS) of the predicted syntax as a measure of syntactic accuracy, considering that we do not use the dependency labels.
Results indicate that high-quality syntax can further improve model performance, showing that syntactic information is generally effective for multilingual SRL. In particular, with gold syntax, top-1 argument pruning for Catalan and Spanish reaches 100 percent coverage (namely, for these two languages, all arguments are children of their predicates in the gold dependency syntax trees), and hence our syntax-aware model obtains significant gains of 1.43% and 1.35% F1, respectively.
In addition, combining the results of Tables 6 and 7, we find that applying k-order argument pruning to the syntax-agnostic model still yields better performance on most languages, even though prior work argues that k-order pruning does not boost performance for English. One reason for this finding may be the lack of effective approaches for incorporating syntactic information into sequential neural networks. Nevertheless, the syntactic contribution is overall limited for multilingual SRL in this work, owing to the strong syntax-agnostic baseline. Therefore, more effective methods of incorporating syntax into neural SRL models are worth exploring. Besides, the utilization of syntactic dependency label information is also a promising direction, which we leave for future work.

Related Work
In early work on semantic role labeling, most researchers were dedicated to feature engineering (Pradhan et al., 2005; Punyakanok et al., 2008; Zhao et al., 2009b, 2013). The first neural SRL model was proposed by Collobert et al. (2011), which used a convolutional neural network, but their efforts fell short. Later, Foland and Martin (2015) effectively extended their work by using syntactic features as input, and Roth and Lapata (2016) introduced syntactic paths to guide neural architectures for dependency SRL.
However, putting syntax aside has sparked much research interest since Zhou and Xu (2015) employed deep BiLSTMs for span SRL, and a series of neural SRL models without syntactic input have been proposed. One line of work applied a simple LSTM model with effective word representations, achieving encouraging results on English, Chinese, Czech and Spanish; another built a full end-to-end SRL model with biaffine attention, providing strong performance on English and Chinese. Li et al. (2019) also proposed an end-to-end model for both dependency and span SRL with a unified argument representation, obtaining favorable results on English.
Despite the success of syntax-agnostic SRL models, more recent work attempts to further improve performance by integrating syntactic information, following the impressive success of deep neural networks in dependency parsing (Zhang et al., 2016). Marcheggiani and Titov (2017) used a graph convolutional network to encode syntax for dependency SRL. Subsequent work proposed an extended k-order argument pruning algorithm based on the syntactic tree to boost SRL performance, and presented a unified neural framework providing multiple methods for syntactic integration. Our method is closely related to this line of work, being designed to prune as many unlikely arguments as possible.
Multilingual SRL To promote NLP applications, the CoNLL-2009 shared task advocated performing SRL for multiple languages. Among the participating systems, Zhao et al. (2009a) proposed an integrated approach exploiting a large-scale feature set, while Björkelund et al. (2009) used a generic feature selection procedure. Until now, only a few works (Lei et al., 2015; Swayamdipta et al., 2016; Mulcaire et al., 2018) have seriously considered multilingual SRL. Among them, Mulcaire et al. (2018) built a polyglot model (training one model on multiple languages) for multilingual SRL, but their results were far from satisfactory. Therefore, this work aims to complete the overall upgrade since the CoNLL-2009 shared task, and leaves polyglot training as future work.

Conclusion
This paper is dedicated to filling the long-standing performance gap in multilingual SRL with a newly proposed syntax-based argument pruning method. Experimental results demonstrate its effectiveness and shed light on a new perspective for many NLP tasks to incorporate syntax simply and effectively. Besides, our model substantially boosts multilingual SRL performance by introducing deep enhanced representations, achieving new state-of-the-art results on the in-domain CoNLL-2009 benchmarks for Catalan, Chinese, Czech, English, German, Japanese and Spanish. These results further show that syntactic information and deep enhanced representations can benefit multiple languages rather than only English.