High-order Semantic Role Labeling

Semantic role labeling (SRL) identifies predicates, arguments, and the semantic relationships between them. Owing to the limitations of modeling methods and the common assumption of pre-identified predicates, previous work has at most considered the relationships between predicates and arguments and the correlations among arguments, while the correlations between predicates have long been neglected. Before the neural network era, high-order features and structure learning were common means of modeling such correlations. In this paper, we introduce a high-order graph structure into the neural semantic role labeling model, which enables the model to explicitly consider not only isolated predicate-argument pairs but also the interactions between predicate-argument pairs. Experimental results on 7 languages of the CoNLL-2009 benchmark show that high-order structural learning techniques benefit strong-performing SRL models and further boost our baseline to new state-of-the-art results.


Introduction
Linguistic parsing seeks the syntactic/semantic relationships between language units, such as words or spans (chunks, phrases, etc.). Parsing algorithms usually work over a factored representation of a graph, i.e., a set of nodes and relational arcs, and the types of features that a model can exploit during inference depend on the information included in the factorized parts.
Before the introduction of deep neural networks, several works in syntactic parsing (a kind of linguistic parsing) (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Zhang and McDonald, 2012; Ma and Zhao, 2012) showed that high-order parsers utilizing richer factorization information achieve higher accuracy than low-order ones, since the extensive decision history can lead to significant improvements in inference (Chen et al., 2010).
Semantic role labeling (SRL) (Gildea and Jurafsky, 2002; Zhao and Kit, 2008; Zhao et al., 2009b, 2013) captures the predicate-argument structure of a given sentence and is defined as a shallow semantic parsing task, which is also a typical linguistic parsing task. Recent high-performing SRL models (He et al., 2017; Marcheggiani et al., 2017; He et al., 2018a; Strubell et al., 2018; He et al., 2018b; Cai et al., 2018), whether labeling the arguments of a single predicate at a time with a sequence tagging model or classifying candidate predicate-argument pairs, are mainly first-order parsers. High-order information is an overlooked potential performance enhancer; however, it suffers from enormous spatial complexity and expensive time cost at inference. As a result, most previous algorithms for high-order syntactic dependency tree parsing are not directly applicable to neural parsing. In addition, the target of model optimization, the high-order relationship, is very sparse, so training with a negative log-likelihood loss is not as convenient as for first-order structures, while efficient gradient backpropagation of parsing errors from the high-order parsing target is indispensable in neural parsing models.
To alleviate the computational and graphic memory occupation challenges of explicit high-order modeling in the training and inference phases, we propose a novel high-order scorer and an approximate high-order decoding layer for the SRL parsing model. For the high-order scorer, we adopt a triaffine attention mechanism, extended from biaffine attention (Dozat and Manning, 2017), to score the second-order parts. To ensure that high-order errors backpropagate during training, and to output the fused first-order and highest-order part scores during the search for the highest-scoring parse at decoding time, we apply recurrent layers, inspired by (Lee et al., 2018; Wang et al., 2019), to approximate high-order decoding iteratively and hence make it differentiable.

Figure 1: Left: the second-order parts (structures) considered in this paper, where P stands for a predicate and A for an argument. Right: an example of semantic role labeling from the CoNLL-2009 training dataset.
We conduct experiments on popular English and multilingual benchmarks. From the evaluation results on both in-domain and out-of-domain test sets, we observe a statistically significant increase in semantic F1 score from the second-order enhancement, and we report new state-of-the-art performance on the test sets of all 7 languages except the English out-of-domain test set. Additionally, we evaluate the setting without pre-identified predicates and compare the effects of different high-order structure combinations across all languages to explore how the high-order structures contribute and how their effect differs from language to language. Our analysis of the experimental results shows that explicit high-order structure learning yields steady improvements over our replicated strong BERT baseline in all scenarios.
High-order Structures in SRL

High-order features and structure learning are known to improve linguistic parser accuracy. In dependency parsing, high-order dependency features encode more complex sub-parts of a dependency tree structure than features based on first-order, bigram head-modifier relationships. A clear trend in dependency parsing is that adding such high-order features improves parse accuracy (McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Zhang and McDonald, 2012; Ma and Zhao, 2012). We find that this addition can also benefit semantic parsing: a tree is a specific form of graph, and the high-order properties that exist in a tree also apply to the graphs of semantic parsing tasks.
For a long time, SRL has been formulated as a sequential tagging problem or a candidate pair (word pair) classification problem. In the sequential tagging pattern, only the arguments of a single predicate are labeled at a time, and a CRF layer is generally used to model the relationships between arguments implicitly (Zhou and Xu, 2015). In the candidate pair classification pattern, He et al. (2018a) proposed an end-to-end approach for jointly predicting all predicates, arguments, and their relationships. This pattern focuses on the first-order relationship between predicates and arguments and adopts dynamic programming decoding to enforce argument constraints. From the perspective of existing SRL models, high-order information has long been ignored. Although current first-order neural parsers can encode high-order relationships implicitly through stacked self-attention layers, explicit modeling offers lower training cost and better stability than implicit modeling. The performance improvements from high-order features and structure learning in dependency parsing suggest that the same benefits might be observed in SRL. Thus, this paper explores the integration and effect of high-order structure learning in a neural SRL model.
The trade-offs between rich high-order structures (features), decoding time complexity, and memory requirements need to be weighed carefully, especially in current neural models. The work of Li et al. (2020) suggests that, with the help of deep neural network design and training, exact decoding can be replaced with an approximate decoding algorithm, which significantly reduces decoding time complexity at a very small performance loss; however, high-order structures unavoidably bring problematically high graphic memory demands due to the gradient-based learning methods of neural network models. Given an input sentence of length L and a parsing model of order J, the required memory is O(L^(J+1)). Under current GPU memory conditions, second order (J = 2) is the practical upper limit without pruning. Therefore, we enumerate all three second-order structures as objects of study in SRL, as shown in the left part of Figure 1: sibling (sib), co-parents (cop), and grandparent (gp).
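To make the O(L^(J+1)) growth concrete, here is a back-of-the-envelope sketch of the memory needed merely to store dense score tensors; the float32 assumption and tensor shapes are our illustrative choices, not from the paper:

```python
# Illustrative sketch (assumes dense float32 score tensors): an order-J
# part relates J+1 words, so a length-L sentence needs an L^(J+1)-entry
# score tensor, i.e. O(L^(J+1)) memory.
def score_tensor_bytes(L: int, J: int, bytes_per_score: int = 4) -> int:
    return bytes_per_score * L ** (J + 1)

if __name__ == "__main__":
    for J in (1, 2, 3):
        mib = score_tensor_bytes(100, J) / 2**20
        print(f"order J={J}: {mib:.1f} MiB for a length-100 sentence")
```

For L = 100, the jump from second to third order is roughly 100x, which is why J = 2 is the practical ceiling without pruning.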
As shown in the SRL example in the right part of Figure 1, our second-order SRL model looks at several pairs of arcs:
• sibling (Smith and Eisner, 2008; Martins et al., 2009): arguments of the same predicate;
• co-parents (Martins and Almeida, 2014): predicates sharing the same argument;
• grandparent (Carreras, 2007): a predicate that is the argument of another predicate.
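The three definitions above can be sketched as a small helper that enumerates second-order parts from a set of predicate-argument arcs (an illustrative helper of ours, not the authors' implementation):

```python
from itertools import combinations

def second_order_parts(arcs):
    """Enumerate sib/cop/gp parts from (predicate, argument) index pairs."""
    arcs = sorted(set(arcs))
    sib, cop = set(), set()
    for (p1, a1), (p2, a2) in combinations(arcs, 2):
        if p1 == p2:            # sibling: two arguments of one predicate
            sib.add((p1, a1, a2))
        if a1 == a2:            # co-parents: two predicates of one argument
            cop.add((p1, p2, a1))
    # grandparent: predicate p is itself the argument of predicate g
    gp = {(g, p, a) for (g, p) in arcs for (q, a) in arcs if p == q}
    return sib, cop, gp

# Toy graph: predicate 1 has arguments 0 and 3; predicates 1 and 2 share
# argument 0 (a co-parents structure).
sib, cop, gp = second_order_parts([(1, 0), (1, 3), (2, 0)])
assert sib == {(1, 0, 3)} and cop == {(1, 2, 0)}
```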
Although some high-order structures have been studied in related work (Yoshikawa et al., 2011; Ouchi et al., 2015; Shibata et al., 2016; Ouchi et al., 2017; Matsubayashi and Inui, 2018) on Japanese Predicate-Argument Structure (PAS) analysis (Iida et al., 2007) and English SRL (Yang and Zong, 2014), this paper is the first to integrate multiple high-order structures into a single framework and to comprehensively explore their effects, and the effects of their combinations, across multiple languages on the popular CoNLL-2009 benchmark; this constitutes the main novelty of our work.

Overview
SRL can be decomposed into four subtasks: predicate identification, predicate disambiguation, argument identification, and argument classification. Since the CoNLL-2009 shared task provides pre-identified predicates, we mainly focus on identifying arguments and labeling them with semantic roles. We formulate SRL as a set of arc (and label) assignments between a subset of the words in the given sentence, rather than focusing on the roles played by predicates and arguments individually. The predicate-argument structure is regarded as a general dependency relation, with the predicate as the head and the argument as the dependent (dep). Formally, we describe the task with a word sequence X = w_1, w_2, ..., w_n, a set of unlabeled arcs Y_arc ⊆ W × W, where × is the Cartesian product, and a set of labeled predicate-argument relations Y_label ⊆ W × W × R, which, together with the arc set, is the prediction target of the model. W = {w_1, w_2, ..., w_n} denotes the set of all words, and R the set of candidate semantic role labels.
Our proposed model architecture for second-order SRL is shown in Figure 2; it is inspired by and extends (Lee et al., 2018; Li et al., 2019a; Wang et al., 2019) 1 . The baseline is a first-order SRL model (Li et al., 2019a) that considers only predicate-argument pairs. Our proposed model is composed of three modules: a contextualized encoder, scorers, and variational inference layers. Given an input sentence, the model first computes contextualized word representations by running a BiLSTM encoder over the concatenated embeddings. The contextualized word representations are then fed into three scorers to produce the arc scores, arc label scores, and high-order part scores, following the practice of Dozat and Manning (2017). Rather than seeking a model for which exact decoding is tractable, a requirement even more stringent for semantic graphs than for dependency trees, we embrace approximate decoding strategies and introduce variational inference layers to make the high-order error fully differentiable.

Encoder
Our model builds contextualized representations by encoding the input sentence with a stacked bidirectional Long Short-term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997). Following (He et al., 2018b; Cai et al., 2018; Li et al., 2019a), the input vector is the concatenation of multiple source embeddings: a pre-trained word embedding, a randomly initialized lemma embedding, a predicate indicator embedding, and pre-trained language model layer features. Unlike their work, however, we do not use Part-Of-Speech (POS) tag embeddings 2 , which makes our model truly syntax-agnostic. We use pre-trained language model (PLM) layer features because recent work (He et al., 2018b; Li et al., 2018b, 2019a; He et al., 2019) has demonstrated that they boost the performance of SRL models. Since these language models are trained at the character or subword level and largely solve the out-of-vocabulary (OOV) problem, we do not use the bidirectional LSTM-CNN architecture in which convolutional neural networks (CNNs) encode the characters of a word into a character-level representation. Finally, the contextualized representation is obtained as:

H = BiLSTM(E),

where e_i = e_i^word ⊕ e_i^lemma ⊕ e_i^indicator ⊕ e_i^plm is the concatenation (⊕) of the multiple source embeddings of word w_i, E = [e_1, e_2, ..., e_n], and H = [h_1, h_2, ..., h_n] denotes the hidden states (i.e., the contextualized representations) of the BiLSTM encoder.

1 Code available at https://github.com/bcmi220/hosrl.
2 POS tags are also considered a kind of syntactic information.
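The input construction can be sketched as follows; the embedding dimensions are illustrative assumptions of ours, not the paper's configuration:

```python
import numpy as np

# Sketch: build the encoder input E by concatenating the per-token source
# embeddings; E is then fed to a stacked BiLSTM (not shown) to obtain H.
rng = np.random.default_rng(0)
n = 6                                       # sentence length
e_word = rng.standard_normal((n, 100))      # pre-trained word embedding
e_lemma = rng.standard_normal((n, 100))     # randomly initialized lemma embedding
e_ind = rng.standard_normal((n, 16))        # predicate indicator embedding
e_plm = rng.standard_normal((n, 768))       # PLM layer features (e.g. a BERT layer)
E = np.concatenate([e_word, e_lemma, e_ind, e_plm], axis=-1)
assert E.shape == (n, 100 + 100 + 16 + 768)
```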

Scorers
Before scoring the arcs and their corresponding role labels, we apply two multi-layer perceptron (MLP) layers in the different scorers to obtain lower-dimensional, role-specific representations of the encoder outputs, stripping away information irrelevant to each scoring task.
First-order Arc and Label Scorers: To score the first-order parts (arcs and labels), we adopt the biaffine classifier proposed by Dozat and Manning (2017), computing the probability of arc existence and the label for a dependency i → j via biaffine attention.
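A NumPy sketch of biaffine arc scoring (shapes are illustrative; the bias augmentation follows the usual Dozat-and-Manning formulation):

```python
import numpy as np

def biaffine(H_head, H_dep, U):
    """S[i, j] scores arc i -> j. H_head, H_dep: (n, d); U: (d+1, d+1)."""
    n = H_head.shape[0]
    ones = np.ones((n, 1))
    Hh = np.concatenate([H_head, ones], axis=-1)  # append bias term
    Hd = np.concatenate([H_dep, ones], axis=-1)
    return Hh @ U @ Hd.T                          # (n, n) arc scores

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 8))   # role-specific MLP outputs (toy sizes)
U = rng.standard_normal((9, 9))
S_arc = biaffine(H, H, U)
assert S_arc.shape == (5, 5)
```

The label scorer works the same way, with one (d+1) × (d+1) weight matrix per role label (or a single (d+1) × |R| × (d+1) tensor).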
To reduce the computation and memory cost, we use only an arc triaffine function to compute the scores of the second-order parts; a label triaffine scorer is not considered. The triaffine function is defined as:

S_{i,j,k} = TriAF(h_i, h_j, h_k) = (h_i ⊕ 1)^T (h_k^T U^2nd) (h_j ⊕ 1),

where the weight tensor U^2nd is (d × (d + 1) × (d + 1))-dimensional.
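A sketch of one plausible reading of this triaffine function; the assignment of h_k to the first (d-sized) mode of U^2nd is our assumption, since the text only gives the tensor's shape:

```python
import numpy as np

def triaffine(H, U):
    """S[i, j, k] = [h_i; 1]^T (h_k . U) [h_j; 1], with U: (d, d+1, d+1).
    Contracting h_k with the first mode leaves a biaffine form over the
    bias-augmented h_i and h_j."""
    n, d = H.shape
    Ha = np.concatenate([H, np.ones((n, 1))], axis=-1)   # (n, d+1)
    return np.einsum("ia,kc,cab,jb->ijk", Ha, H, U, Ha)  # (n, n, n)

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 5))
U2 = rng.standard_normal((5, 6, 6))
S2 = triaffine(H, U2)
assert S2.shape == (4, 4, 4)
```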

Variational Inference Layers
In the first-order model, we train with the negative log-likelihood of the gold structure as the loss; in the second-order module of our proposed model, however, the same approach encounters a sparsity problem: maximum likelihood estimates cannot be obtained when the number of trainable variables is much larger than the number of observations. In other words, it is not feasible to directly approximate the true distribution with the output distribution of the second-order scorer, because the true distribution is sparse.
Computing the arc probabilities from the first-order and the multiple second-order score outputs can be seen as posterior inference on a Conditional Random Field (CRF). As exact inference on this CRF is intractable (Wang et al., 2019), we resort to variational inference algorithms that allow the model to condition on high-order structures while remaining fully differentiable.
Variational inference computes the posterior distribution of the unobserved variables in a probabilistic graphical model; parameter learning is then carried out with the observed variables and the inferred unobservable variables. Mean field variational inference approximates the true posterior with a factorized variational distribution and iteratively minimizes the KL divergence between them. We therefore use mean field variational inference to obtain the final arc distribution. The inference performs T iterations of updating the arc probabilities, where Q^(t)_{i,j} denotes the probability of arc i → j at iteration t. The iterative update process is:

F^(t)_{i,j} = Σ_k ( Q^(t)_{i,k} S^sib_{i,j,k} + Q^(t)_{k,j} S^cop_{i,j,k} + Q^(t)_{k,i} S^gp_{i,j,k} ),
Q^(t+1)_{i,j} ∝ exp( S^arc_{i,j} + F^(t)_{i,j} ),

where F^(t)_{i,j} is the second-order voting score, Q^(0)_{i,j} = softmax(S^arc_{i,j}), and t is the updating step.

Zheng et al. (2015) showed that such iterative mean field updates can be unrolled as a recurrent neural network, where each iteration takes the Q estimates from the previous iteration together with the unary values (first-order scores) in their original form. In this RNN structure, CRF-RNN, the model parameters can therefore be optimized from the second-order error using standard backpropagation through time (Rumelhart et al., 1985; Mozer, 1995). Notably, the number of stacked layers equals the number of iteration steps T. Since increasing the number of iterations beyond T = 5 usually does not significantly improve results (Krähenbühl and Koltun, 2011), T stays small, so training does not suffer from the vanishing and exploding gradient problems inherent to deep RNNs; this allows us to use a plain RNN architecture instead of more sophisticated architectures such as LSTMs.

Table 1: Semantic F1 score on the CoNLL-2009 English treebanks. WSJ is used for evaluating in-domain performance and Brown for out-of-domain. "*" denotes that the model uses syntactic information for enhancement, and "†" that the model is trained jointly with other tasks. "+E" stands for using ELMo as pre-trained PLM features, "+B" for using BERT.
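The unrolled update loop can be sketched as follows; this simplified version keeps only the sibling term (the cop and gp terms enter the voting sum analogously), and the softmax axis is an illustrative choice:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field(S_arc, S_sib, T=3):
    """Unrolled mean field updates (sibling-only sketch).
    S_arc: (n, n) unary arc scores; S_sib: (n, n, n), where
    S_sib[i, j, k] couples arcs i->j and i->k."""
    Q = softmax(S_arc)                          # Q^(0) from unary scores
    for _ in range(T):
        F = np.einsum("ik,ijk->ij", Q, S_sib)   # second-order voting score
        Q = softmax(S_arc + F)                  # re-condition on the unaries
    return Q

rng = np.random.default_rng(3)
Q = mean_field(rng.standard_normal((5, 5)), rng.standard_normal((5, 5, 5)))
assert np.allclose(Q.sum(axis=-1), 1.0)
```

Because each iteration reuses the same scores and only T steps are unrolled, the loop is exactly the plain-RNN structure described above.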

Training Objective
The full model is trained to learn the conditional distribution P_θ(Ŷ|X) of the predicted graph Ŷ given the gold parse graph Y*. Since the parse graph factorizes into arcs and their corresponding labels, the conditional distribution P_θ(Ŷ|X) also factorizes into P_θ(Ŷ^(arc)|X) and P_θ(Ŷ^(label)|X):

P_θ(Ŷ|X) = P_θ(Ŷ^(arc)|X) P_θ(Ŷ^(label)|X),

where θ represents the model parameters. The model is optimized with cross-entropy losses, i.e., the negative log-likelihood of the gold parse:

L^(arc)(θ) = −Σ_{(i,j)∈Y*} log P_θ(ŷ^(arc)_{i,j}|X),
L^(label)(θ) = −Σ_{(i,j,r)∈Y*} log P_θ(ŷ^(label)_{i,j} = r|X),

where r ∈ R is the semantic role label of the arc (predicate-argument pair) i → j. The final loss is the weighted average of the arc loss L^(arc)(θ) and the label loss L^(label)(θ):

L(θ) = λ L^(arc)(θ) + (1 − λ) L^(label)(θ),

where λ is the balance hyper-parameter.
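A toy sketch of the factorized loss; the array shapes and the gold-parse encoding are illustrative choices of ours:

```python
import numpy as np

def nll(probs, gold_indices):
    """Negative log-likelihood of the gold items under `probs`."""
    return -sum(np.log(probs[idx]) for idx in gold_indices)

def srl_loss(P_arc, P_label, gold_arcs, gold_labels, lam=0.5):
    """P_arc: (n, n) arc probabilities; P_label: (n, n, |R|) label
    probabilities; lam is the balance hyper-parameter lambda."""
    return lam * nll(P_arc, gold_arcs) + (1 - lam) * nll(P_label, gold_labels)

# Toy example: one gold arc (0 -> 1) with role index 2.
P_arc = np.full((3, 3), 0.5)
P_label = np.full((3, 3, 4), 0.25)
loss = srl_loss(P_arc, P_label, [(0, 1)], [(0, 1, 2)], lam=0.5)
assert loss > 0
```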

Setup
We conduct experiments and evaluate our model on the CoNLL-2009 (Hajič et al., 2009) benchmark datasets covering 7 languages: Catalan (CA), Czech (CS), German (DE), English (EN), Spanish (ES), Japanese (JA), and Chinese (ZH). To better compare with previous work, and to bring the model closer to real-world usage, we consider two SRL setups for all 7 languages: w/ pre-identified predicates and w/o pre-identified predicates. The former follows the official requirements, with predicates identified beforehand in the corpora, and allows comparison with most previous models. The latter is consistent with a real scenario, where the model must predict all predicates and their arguments, and is therefore relatively more difficult. Since the predicates need to be predicted in the w/o pre-identified predicates setup, we treat predicate identification and disambiguation as a single sequence tagging task and adopt BiLSTM+MLP and BERT+MLP sequence tagging architectures to fit the different requirements. We directly adopt most hyper-parameter settings and training details (see Appendix A.1). The results in Table 2 are averaged over 5 training runs with different random seeds to avoid the impact of random initialization on the model. We compare our baseline and full model with previous multilingual work. The performance of our baseline is similar to the model of He et al. (2019), which integrated syntactic information and achieved the best previous results. This shows that our baseline is a very strong SRL model, which owes its success to modeling the full semantic graph directly rather than predicate by predicate. Moreover, our model with the proposed high-order structure learning (+HO) obtains absolute improvements of 0.49% and 0.42% F1 without pre-training and with BERT, respectively, achieving new best results on all benchmarks.
Because the quantities of high-order structures differ among languages, the consistent improvement across 7 languages already shows that our empirical results are convincing.
In addition, we also report results in the w/o pre-identified predicates setup for all languages, which is the more realistic scenario. The overall decline without pre-identified predicates shows that predicate recognition has a great impact. The especially obvious drop for German is probably because the ratio of predicates in the German evaluation set is relatively small, making it sensitive to the model parameters. Nevertheless, in this setup our high-order structure learning still leads to consistent improvements across all languages, just as in the w/ pre-identified predicates setup, demonstrating the effectiveness of the proposed method.
To show the statistical significance of our results, in addition to the common SRL practice at the model level of reporting averaged results over multiple runs with different random seeds, we follow the practice in machine translation (Koehn, 2004) and conduct a significance test at the example level. We sampled the prediction results 500 times, 50 sentences each time, and evaluated each sampled subset. The result of +HO is significantly higher than that of the baseline model (p < 0.01), verifying the significance of the results.
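The example-level test can be sketched as a simple bootstrap; the per-sentence `score` lists stand in for the actual Semantic-F1 computation on each subset:

```python
import random

def bootstrap_better_rate(scores_base, scores_ho, rounds=500, k=50, seed=0):
    """Fraction of sampled k-sentence subsets on which the +HO system
    outscores the baseline; a rate near 1.0 over many rounds supports
    a significant difference."""
    rng = random.Random(seed)
    n = len(scores_base)
    wins = 0
    for _ in range(rounds):
        sample = rng.sample(range(n), k)
        if sum(scores_ho[i] for i in sample) > sum(scores_base[i] for i in sample):
            wins += 1
    return wins / rounds
```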
Out-of-domain Results Besides English, there are also out-of-domain test sets for German and Czech. To verify the generalization capability of our model, we further conduct experiments on these test sets under the w/ pre-identified predicates setup and compare with existing work (Table 3). Our model achieves new state-of-the-art results of 70.49% (German) and 90.75% (Czech) F1, significantly outperforming the previous best system (Lyu et al., 2019). Furthermore, there is an additional gain from using pre-trained BERT, showing that BERT improves the generalization ability of the model. We also observe that the +HO model yields stable improvements in recall, which shows that the proposed high-order structure learning is beneficial to identifying arguments.
Time Complexity and Parsing Speed The time complexity and parsing speed of high-order models are perennial concerns. In our proposed high-order model, the time complexity comes from two parts: the matrix operations in the biaffine and triaffine attention (O(d_BiAF^2) and O(d_TriAF^3), respectively, where d_BiAF and d_TriAF are the hidden sizes of the scorers) and the inference procedure (O(n^3)), for a total time complexity of O(d_BiAF^2 + d_TriAF^3 + n^3); for our baseline, the total is O(d_BiAF^2 + n^2). Additionally, when leveraging pre-trained PLM features, the time cost of encoders such as BERT cannot be ignored. We measured the parsing speed of our baseline and high-order models on the English test set, with and without BERT pre-training, on CPU and GPU respectively (an Intel Xeon 6150 CPU and a Titan V100 GPU). The comparison is shown in Figure 3. The speed loss of +HO is 26.7%, 5.5%, 15.4%, and 7.6% in the respective four scenarios, while the speed loss brought by BERT is 84.1% and 79.5% on CPU and 60.0% and 55.2% on GPU. Thus +HO costs some speed, but with GPU acceleration the loss ratio shrinks, and in the case of BERT pre-training, +HO is no longer the bottleneck of parsing speed.

High-order Structures Contribution
To explore the contribution of the high-order structures in depth, we consider all possible combinations of the structures and conduct experiments on the English test set under the w/ pre-identified predicates setup. Table 4 shows the results for the two baseline models (with and without BERT pre-training). Using each structure separately improves our model, e.g., +sib yields a 0.36 F1 gain; however, the further improvement from applying two structures is limited. For example, the model (+sib) even performs better than (+sib+gp). The reason might be that sib (between arguments) and gp (between predicates) are two unrelated structures. Regardless, the +ALL model (the combination of all three structures) achieves the best performance (up to 0.65 F1). One possible reason is that cop (between arguments and predicates) builds a bridge between the sib and gp structures. In other words, these observations suggest that learning the three structures may be complementary. We further explored the sources of the high-order structures' improvement in SRL performance. We split the test set into two parts, one containing high-order relationships (cop and gp) and one without. Taking the CoNLL-2009 English test set as an example, the test set contains 2,399 sentences in total, of which 1,936 contain high-order relationships. We recalculated Sem-F1 on these two subsets and found that the gain on the subset with high-order relationships is significantly larger than on the subset without (>0.4% F1). This shows that our model does improve the prediction of high-order structures, rather than of a specific type of semantic role. For simple sentences (without high-order structures), the baseline can already parse very well, which also explains why the improvement in some languages is not large.
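The split used in this analysis can be sketched as a predicate over a sentence's gold arcs (a hypothetical helper of ours, not the authors' code):

```python
def has_high_order(arcs):
    """True if the (predicate, argument) arc set contains a cop structure
    (an argument shared by two predicates) or a gp structure (a predicate
    that is itself an argument)."""
    heads = {p for p, _ in arcs}
    n_preds = {}
    for p, a in arcs:
        n_preds[a] = n_preds.get(a, 0) + 1
    gp = any(a in heads for _, a in arcs)        # grandparent present?
    cop = any(c > 1 for c in n_preds.values())   # co-parents present?
    return gp or cop

assert not has_high_order([(1, 2), (1, 3)])   # only a sibling structure
assert has_high_order([(1, 2), (3, 2)])       # co-parents
assert has_high_order([(1, 2), (2, 3)])       # grandparent
```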

Related Work
The CoNLL-2009 shared task advocated performing SRL for multiple languages to promote multilingual NLP applications. Zhao et al. (2009a) proposed an integrated approach exploiting large-scale feature sets, while Björkelund et al. (2009) used a generic feature selection procedure, which yielded significant gains in the multilingual SRL shared task. With the development of deep neural networks (Li et al., 2018a; Xiao et al., 2019; Zhang et al., 2019c,a; Li et al., 2019c; Luo et al., 2020; Li et al., 2019b; Zhang et al., 2019b), SRL has made remarkable progress.

High-order parsing is a research hotspot wherever first-order parsers meet performance bottlenecks, and it has been extensively studied in the syntactic dependency parsing literature (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010; Martins et al., 2011; Ma and Zhao, 2012; Gormley et al., 2015; Zhang et al., 2020). In semantic parsing, Martins and Almeida (2014) encoded high-order parts with hand-crafted features and introduced a novel co-parent part for semantic dependency parsing. Cao et al. (2017) proposed a quasi-second-order semantic dependency parser with dynamic programming. Wang et al. (2019) trained a second-order parser end-to-end with the help of mean field variational inference and loopy belief propagation approximations. In SRL and related fields, there is also work on improving performance with high-order structural information. On the Japanese NAIST Predicate-Argument Structure (PAS) dataset, several works (Yoshikawa et al., 2011; Ouchi et al., 2015; Iida et al., 2015; Shibata et al., 2016; Ouchi et al., 2017; Matsubayashi and Inui, 2018) mainly studied the relationships among multiple predicates, i.e., the gp and cop high-order relations described in our paper. Yang and Zong (2014) considered the interactions between predicate-argument pairs on the Chinese PropBank dataset. Although their motivation is consistent with ours, we are the first to consider multiple high-order relationships at the same time within a more uniform framework, on more popular benchmarks, and for more languages.

Conclusion and Future Work
In this work, we propose high-order structure learning for dependency-based semantic role labeling. The proposed framework explicitly models high-order graph structures on top of a strong first-order baseline, scoring the correlations between predicted predicate-argument pairs. The resulting model achieves state-of-the-art results on all 7 languages of the CoNLL-2009 test sets except the English out-of-domain benchmark. In addition, we consider both given and not-given predicates for all languages, explore the impact of every high-order structure combination on performance, and reveal how the benefit of high-order structure learning varies across languages. In future work, we will continue to explore higher-order structures and pruning strategies to reduce time complexity and memory occupation.

A.1 Hyper-parameters and Training Details
In our experiments, the BiLSTM+MLP predicate tagging model takes only words and lemmas as input; its encoder structure is the same as our main model's, so its hyper-parameters are also consistent with the main model. As for the BERT+MLP predicate tagging model, the motivation for choosing it instead of using BERT as embeddings in the BiLSTM+MLP architecture is fair comparability with previously reported results. For the hyper-parameters of our main model, we borrow most settings from (Dozat and Manning, 2017; Wang et al., 2019), including the dropout and initialization strategies. Hyper-parameters for our baseline and the proposed high-order model are shown in Table 5. We use 100-dimensional GloVe (Pennington et al., 2014) pre-trained word embeddings for English and 300-dimensional FastText embeddings (Bojanowski et al., 2017; Grave et al., 2018) for all other languages. As for pre-training, ELMo (Peters et al., 2018) is used only for English, and we take the weighted sum of its 3 layers as the final features; different versions of BERT (Devlin et al., 2019) are used for the different languages, as shown in Table 6, and we always use the second-to-last layer outputs as the pre-trained features.
Following Wang et al. (2019), training proceeds in two phases. In the first phase, we use Adam (Kingma and Ba, 2014) and anneal the learning rate by a factor of 0.5 every 10,000 steps. When training reaches 5,000 steps without improvement, optimization enters the second phase, in which the Adam optimizer is replaced by AMSGrad (Reddi et al., 2018). We train each model for at most 100K update steps with batch sizes of {4K, 2K, 3K, 4K, 6K, 6K, 6K} tokens for CA, CS, DE, EN, ES, JA, and ZH, respectively. Training is terminated by an early stopping mechanism when there is no improvement on the development set after 10,000 steps.
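The first-phase schedule can be sketched as follows; the base learning rate is an illustrative assumption, and the phase-two switch to AMSGrad is not shown:

```python
def lr_at(step, base_lr=1e-3, decay=0.5, interval=10_000):
    """Learning rate after annealing by `decay` every `interval` steps.
    `base_lr` is a hypothetical starting value, not the paper's setting."""
    return base_lr * decay ** (step // interval)

assert lr_at(0) == 1e-3
assert lr_at(10_000) == 5e-4
assert lr_at(25_000) == 2.5e-4
```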

A.2 Detail Results
The proverb that there is no such thing as a free lunch tells us that no method works in every condition and scope. We explore our proposed high-order structure learning for SRL in different languages and conditions: with or without pre-training, and with given or not-given predicates; the detailed results are shown in Tables 7, 8, 9, and 10. The experimental results illustrate the following points:
1. In different languages, combinations of high-order structures bring different improvements. Some high-order structure combinations even hurt performance in some languages.
2. Pre-training brings significant improvements on both in-domain and out-of-domain test sets; however, the in-domain improvement is significantly greater than the out-of-domain one when the two domains are far apart. In particular, the gap between the in-domain and out-of-domain sets is large for German and English, while the two domains are similar for Czech.
3. The SRL results for German are lower than for the other languages. Data analysis shows that the proportion of predicates is very small, resulting in sparse targets that cannot train the model well, especially when no predicates are pre-identified.