Bridging the Gap in Multilingual Semantic Role Labeling: a Language-Agnostic Approach

Recent research indicates that taking advantage of complex syntactic features leads to favorable results in Semantic Role Labeling. Nonetheless, an analysis of the latest state-of-the-art multilingual systems reveals the difficulty of bridging the wide gap in performance between high-resource (e.g., English) and low-resource (e.g., German) settings. To overcome this issue, we propose a fully language-agnostic model that does away with morphological and syntactic features to achieve robustness across languages. Our approach outperforms the state of the art in all the languages of the CoNLL-2009 benchmark dataset, especially when only a scarce amount of training data is available. Our objective is not to reject approaches that rely on syntax, but rather to set a strong and consistent language-independent baseline for future innovations in Semantic Role Labeling. We release our model code and checkpoints at https://github.com/SapienzaNLP/multi-srl.


Introduction
Semantic Role Labeling (SRL), the task of automatically addressing "Who did What to Whom, How, When and Where?" (Gildea and Jurafsky, 2000), is a long-standing open problem in Natural Language Processing (NLP), and a central task required to complete the puzzle of Natural Language Understanding (Navigli, 2018). Its roots date back several decades, to when Fillmore (1968) first theorized the existence of deep semantic relations between a predicate and other sentential constituents. Over the years, different linguistic formalisms and their corresponding predicate-argument structure inventories expanded Fillmore's seminal intuition (Dowty, 1991; Levin, 1993), yet the need to rely on manually designed complex feature templates severely limited early SRL models (Zhao et al., 2009). Fortunately, the recent great success of neural networks in NLP has drawn attention back to SRL and has led to considerable performance gains. This has been particularly the case for recurrent neural networks, thanks to their ability to better capture relations over sequences (He et al., 2017). The positive results obtained in SRL were rapidly extended to other fields, where they proved to be beneficial to several downstream tasks, from Machine Translation (Marcheggiani et al., 2018) to Information Extraction (Christensen et al., 2011), Opinion Role Labeling (Zhang et al., 2019a), and Question Answering (He et al., 2015).
As researchers constantly explored new approaches to improve SRL, the exploitation of syntactic features soon emerged as a natural choice. Cai and Lapata (2019b) suggested that syntax ought to help semantic role labelers since i) a significant portion of the predicate-argument relations in a semantic dependency graph mirrors the edges that appear in a syntactic dependency graph, and ii) there is often a deterministic mapping from syntactic to semantic roles. Following this line of thought, numerous papers from major venues reported improvements in SRL by explicitly taking advantage of different properties of syntactic dependency trees to various extents (Zhang et al., 2019b, inter alia).
However, if we step back to observe the larger picture, a significant gap among languages strikes the eye. Indeed, among the state-of-the-art multilingual SRL systems that reported their results on all the languages of the CoNLL-2009 benchmark dataset (Hajic et al., 2009), the currently best-performing system manifests a very significant discrepancy in performance between high- and low-resource languages (around 10% in F1 score); the works of Chen et al. (2019), inter alia, also show the same behavior, with gaps in F1 score that fluctuate around 14-15%. These large differences from language to language suggest that recently proposed innovations do not seem to generalize consistently across different languages, especially when syntax plays a central role or the novelty is tested on out-of-domain data. Perhaps most importantly, these approaches do not address the ever-present performance gap between high-resource and low-resource languages, such as English and German, respectively, where the disparity in F1 score still hovers around 10% in the in-domain evaluations of CoNLL-2009, and is even wider in the out-of-domain evaluations.

(This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.)
We argue that correctly harnessing contextual intra-sentence information can lead to large gains in performance on single languages and a more robust generalization capability when scaling across languages, especially when these belong to different linguistic families (e.g., English and Chinese). More precisely, the main contributions of this paper are as follows:

• We propose the first language-agnostic SRL model that achieves state-of-the-art performance across 6 languages without the use of any morphological, part-of-speech or syntactic information.
• In the wake of the recent interest in cross-linguality, we report promising results in zero-shot crosslingual SRL.
• We conduct an analysis to provide an empirical demonstration of the robustness of our approach in low-resource settings.
• We release our code and model checkpoints to allow easy reproduction of our experiments and facilitate the integration of future innovations on top of our model.
We stress that our objective is not to reject syntax in SRL. On the contrary, we strongly believe that clever integration of syntactic features is a promising avenue to advancing research. Here, however, our aim is to provide a strong language-agnostic baseline that is robust and consistent across languages, especially low-resource ones. We hope our effort can become a stepping stone for future developments of both syntax-agnostic and syntax-focused SRL.

Related Work
Dependency or Span? Currently, SRL is cast as either a span-based or a dependency-based labeling task. Given a predicate in a sentence, the main difference between the two settings is that, in the former, semantic role labels are assigned to the entire span of an argument, whereas, in the latter, semantic role labels are assigned only to the semantic head of the argument. Both span- and dependency-based SRL have continued to be developed and supported in parallel over the years with the organization of the CoNLL-2005 (Carreras and Màrquez, 2005) and CoNLL-2012 (Pradhan et al., 2012) tasks for span-based SRL, and the CoNLL-2008 (Surdeanu et al., 2008) and CoNLL-2009 (Hajic et al., 2009) tasks for dependency-based SRL. The debate over which representation is best is still open and subject to active investigation, with ongoing efforts aimed at merging the two into a unified formalism. In this work, we focus mainly on dependency-based SRL, as CoNLL-2009 includes the widest and most varied set of languages, but we also report results on the span-based CoNLL-2012 English benchmark.
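To make the difference concrete, here is a toy, purely illustrative example (the sentence, role labels and helper function are ours, not drawn from any CoNLL gold file) contrasting the two annotation styles for a single predicate:

```python
# Toy illustration of span- vs dependency-based SRL for the predicate
# "bought" in "The tired cat bought fresh fish". Labels are illustrative.
sentence = ["The", "tired", "cat", "bought", "fresh", "fish"]
predicate_index = 3  # "bought"

# Span-based: each role labels a whole argument span (start, end inclusive).
span_annotation = {
    "A0": (0, 2),  # "The tired cat"
    "A1": (4, 5),  # "fresh fish"
}

# Dependency-based: each role labels only the semantic head of the argument.
dependency_annotation = {
    "A0": 2,  # "cat"
    "A1": 5,  # "fish"
}

def span_words(annotation, words):
    """Recover the words covered by each span-based role label."""
    return {role: words[s:e + 1] for role, (s, e) in annotation.items()}
```

In the span-based setting an exact-match evaluation rewards recovering the full argument boundaries, while in the dependency-based setting only the head index and its label matter.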
End-to-end approaches. Regardless of the formalism of choice, SRL is traditionally divided into a set of simpler subtasks: predicate identification, predicate sense disambiguation, argument identification and argument classification. Early work used different sets of template features and statistical/neural models to tackle each predicate and argument subtask separately (Zhao et al., 2009), but the recent success of the multi-task learning paradigm (Caruana, 1997) prompted the development of end-to-end models that jointly address some of the subtasks (Cai et al., 2018, inter alia). Since the CoNLL-2009 shared task provides pre-identified predicates, systems (end-to-end approaches included) usually process the same sentence n_p times, where n_p is the number of predicates in the sentence; at the cost of longer training times, this approach lets a system contextualize a sentence with respect to a specific pre-identified predicate. In our work we take another approach: our system contextualizes a sentence with respect to all the predicates it contains in a single forward pass. This results in shorter inference time and a significant reduction in training time, because our model converges in less than 30 epochs, compared to the 300 epochs required by the current state-of-the-art multilingual system.

Syntax-agnostic SRL. The initial wave of syntax-agnostic models for SRL efficiently employed a BiLSTM-based encoder to capture longer predicate-argument relations within an input sequence, thereby outperforming syntax-aware systems on the CoNLL-2009 English, Czech and Spanish evaluation sets for the first time. Cai et al. (2018) proposed the first full end-to-end syntax-agnostic SRL model, which jointly learns to disambiguate predicate senses and recognize their corresponding semantic arguments, and further enhanced the model with an attentive biaffine scorer (Dozat and Manning, 2018) to better condition argument predictions on a given predicate in the input sentence.
The combined contribution of these innovations realigned the performances of syntax-agnostic systems with those of the best syntax-aware systems. Most recently, the use of contextualized word embeddings such as ELMo (Peters et al., 2018) was shown to lead to further progress, thus lending support to the hunch that high-quality contextual information is key to enabling high-performing SRL systems. We stress that there is a subtle catch in the definition of syntax-agnostic: the foregoing approaches do not make use of sentence-level syntax, but they do still consider lexical-level syntactic features, such as part-of-speech tags. As a result, their input is still language-dependent. In contrast, our approach does away with any lexical- or sentence-level syntactic feature and is therefore truly syntax-agnostic.
Syntax-aware SRL. Syntactic features have recently gained traction in SRL research, mostly due to the diversity of the available representations, the quality of the information they encode, and the wide range of techniques that can be used to take advantage of them. Notable work includes the use of graph convolutional networks to capture short-distance relations between neighbors in the syntactic dependency graph (Marcheggiani and Titov, 2017a), deriving argument pruning rules from syntactic dependency trees (He et al., 2018b), clustering dependency relations (Kasai et al., 2019), and syntax-based attention mechanisms (Strubell et al., 2018; Zhang et al., 2019b). However, most of the work that reported performance improvements on the CoNLL-2009 benchmark dataset only did so on a subset of its in- and out-of-domain evaluations of the 6 available languages, with few noteworthy exceptions. This may fuel the suspicion that either there is a lack of real interest in syntax-aware multilingual SRL, or syntax-focused innovations do not scale immediately across languages. In any case, reporting fragmented results strongly limits full-fledged comparisons and, therefore, hinders progress in multilingual SRL.

Model Description
Building on top of recent successes in deep learning, our model learns to tackle predicate sense disambiguation and argument identification/classification jointly. In particular:

• our model features a word encoder which, in contrast to current work in SRL, exploits the internal states of a language model to compute the contextualized representation of a word (Section 3.1);

• while previous work uses a single sequence encoder to capture predicate-argument relations, we adopt a two-stage sequence encoding strategy: a predicate-aware word encoder, which re-contextualizes the word representations with respect to all the predicates in a sentence in a single forward pass (Section 3.2), and a predicate-argument encoder, which specializes the representation of each argument to a single predicate, for each predicate in the input sentence (Section 3.3).

Contextualized Word Representation
The prior knowledge encoded in pretrained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018) and XLM-RoBERTa (Conneau et al., 2020) is currently showing significant benefits in an ever-increasing array of NLP tasks. Taking inspiration from recent work (Bevilacqua and Navigli, 2020), our model computes word-level representations by combining the knowledge encoded at different hidden layers of a pretrained language model. More formally, let L : V^{|t|} \to R^{n \times h} be a language model with input vocabulary V that, given a sequence t = t_1, t_2, \dots, t_{|t|} of tokens, produces a hidden state of size h for each token. Also, let L be structured in K inner layers l_1, l_2, \dots, l_K, such that L(t) = l_K(\cdots(l_1(t))) and, for the k-th layer l_k, the corresponding output states o^k = l_k(o^{k-1}) are accessible. Then, given an input sentence w = w_1, w_2, \dots, w_n, where each word w_i can be tokenized into m_i subwords belonging to the language model vocabulary, i.e., w_i = t_{i1}, t_{i2}, \dots, t_{i m_i} with t_{ij} \in V, we define our input sequence t as:

t = t_{START}, t_{11}, \dots, t_{1 m_1}, t_{21}, \dots, t_{n m_n}, t_{END}

where t_{START} and t_{END} are special tokens that indicate the beginning and the end of a sentence, respectively. We compute the contextualized word representation e_i of w_i as the average of the Swish activation (Ramachandran et al., 2018) of an affine transformation of the concatenated hidden states extracted from the K layers of L for each subword t_{ij} \in w_i. More formally:

c_{ij} = o^1_{ij} \oplus o^2_{ij} \oplus \cdots \oplus o^K_{ij}

e_i = \frac{1}{m_i} \sum_{j=1}^{m_i} \mathrm{Swish}(W_c \, c_{ij} + b_c)

where o^k_{ij} is the output state for the subword t_{ij} from the k-th layer of L, c_{ij} is the concatenation of the output states for the subword t_{ij} from the K layers of L, W_c \in R^{d_w \times K \cdot h} and b_c \in R^{d_w}.
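As a minimal sketch of the pooling described above, the following pure-Python function (toy dimensions and illustrative names; a real implementation would operate on tensors extracted from a pretrained language model) computes e_i from the per-layer subword states:

```python
import math

def swish(x):
    # Swish(x) = x * sigmoid(x) (Ramachandran et al., 2018)
    return x * (1.0 / (1.0 + math.exp(-x)))

def word_representation(subword_layer_states, w_c, b_c):
    """Compute e_i for a single word w_i.

    subword_layer_states: for each subword t_ij of w_i, a list of K
        per-layer output states o^k_ij (each a list of h floats).
    w_c: matrix of shape (d_w, K*h); b_c: vector of d_w floats.
    Returns the average over subwords of Swish(W_c c_ij + b_c).
    """
    d_w = len(b_c)
    e_i = [0.0] * d_w
    for layer_states in subword_layer_states:
        # c_ij: concatenation of the K layer states for subword t_ij
        c_ij = [v for state in layer_states for v in state]
        for r in range(d_w):
            z = sum(w_c[r][c] * c_ij[c] for c in range(len(c_ij))) + b_c[r]
            e_i[r] += swish(z)
    m_i = len(subword_layer_states)  # number of subwords of w_i
    return [v / m_i for v in e_i]
```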
While previous approaches often enriched their representations with lexical-level morphological or syntactic information, e.g., lemma and POS embeddings, or with sentence-level syntactic information, e.g., syntactic dependency embeddings, GCNs over syntactic trees, and syntactic pruning rules (Marcheggiani and Titov, 2017b; Cai and Lapata, 2019b), we emphasize that our word encoder relies only on raw text as input, and is therefore independent of any linguistic feature. However, while this word encoder provides a contextualized representation e_i of each word w_i with respect to a sentence w, the resulting representations are still unaware of the predicates appearing in w.

Predicate-Aware Word Representation
As mentioned in Section 2, predicate-specific information is usually provided at the input level by means of a predicate indicator flag (Marcheggiani and Titov, 2017b), a predicate embedding (Cai et al., 2018), or by prepending or appending the predicate word to the input sentence (Shi and Lin, 2019). Instead, we adopt the opposite approach, in which predicate-specific information is explicitly expressed only at the output level: for each sentence, the model is tasked, in a multi-objective fashion, with i) learning whether a word is a predicate, and ii) disambiguating the sense of each predicate. With this approach, predicate-specific information can back-propagate into a tailor-made sequence encoder to specialize the contextual word representations with respect to all the predicates in the input sentence simultaneously.
In particular, our model features a "fully-connected" stacked-BiLSTM sequence encoder where each BiLSTM layer l_k is purposely tweaked to be directly connected not only to the layer l_{k-1} immediately below it, but also to every underlying BiLSTM layer l_{k-i} with i \in \{1, 2, \dots, k\}. More formally, the contextualized word embeddings E = e_1, \dots, e_n are re-contextualized into predicate-aware word embeddings H = h_1, \dots, h_n as follows:

h^0_i = e_i

h^k_i = \mathrm{LayerNorm}(\mathrm{BiLSTM}_k(h^0 \oplus h^1 \oplus \cdots \oplus h^{k-1}))_i

h_i = h^K_i

s^p_i = W_p \, h_i + b_p

s^s_i = W_s \, h_i + b_s

where LayerNorm denotes layer normalization (Ba et al., 2016) and K is the total number of layers in the sequence encoder, whereas s^p_i is the unnormalized log probability that word w_i is a predicate and s^s_i is the unnormalized log probability over the predicate sense vocabulary. We highlight that, thanks to the direct input-output connections in our fully-connected BiLSTM encoder, predicate-specific information coming from s^p_i and s^s_i is immediately back-propagated into the i-th hidden state h^k_i of the k-th BiLSTM layer l_k.
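The dense connectivity of the fully-connected stack can be sketched as follows (pure Python, with placeholder layer functions standing in for the BiLSTM and layer-normalization blocks; the function name is ours):

```python
def fully_connected_stack(e, layers):
    """Run a 'fully-connected' stack of sequence encoders.

    Layer k receives the position-wise concatenation of the initial
    embeddings AND the outputs of ALL layers below it, not just layer k-1.

    e: initial sequence of word vectors (list of lists of floats).
    layers: K callables mapping a sequence of vectors to a sequence of
        vectors (stand-ins for BiLSTM + layer norm). Returns h^K.
    """
    outputs = [e]  # h^0 = e
    for layer in layers:
        # concatenate h^0 ... h^{k-1} position-wise to build the layer input
        x = [[v for h in outputs for v in h[i]] for i in range(len(e))]
        outputs.append(layer(x))
    return outputs[-1]
```

With standard stacking, layer k would see only h^{k-1}; here it sees every previous output directly, which is what lets predicate-specific gradients from the output layer reach each BiLSTM layer without passing through all the layers above it.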

Predicate-Argument Representation
Even if the above predicate-aware word representations encode valuable information about the position and the sense/meaning of the predicates in the input sentence, the same word may play a different semantic role for each predicate. While previous systems for dependency-based SRL relied on a single sequence encoder to capture predicate-argument relations, our model instead features a second sequence encoder dedicated to capturing them. To obtain a predicate-argument representation, i.e., a specialization of a word representation with respect to a single predicate, our predicate-argument encoder first projects each predicate-aware word representation h_i, obtained in the previous step for a word w_i, onto two distinct vector spaces, a predicate-specific representation p_i and an argument-specific representation a_i:

p_i = \mathrm{Swish}(W_{pred} \, h_i + b_{pred})

a_i = \mathrm{Swish}(W_{arg} \, h_i + b_{arg})

Then, for each predicate w_p with 1 \le p \le n, we obtain a w_p-specific representation of the sentence by concatenating the predicate-specific representation p_p to each argument-specific representation a_i of every word w_i in the input sentence:

(P \oplus A)_p = p_p \oplus a_1, \; p_p \oplus a_2, \; \dots, \; p_p \oplus a_n

Finally, these representations are further refined by an argument-specific fully-connected BiLSTM encoder, with the same dense layer connectivity as in Section 3.2, in order to capture the role each word w_i plays with respect to the predicate w_p:

g^p = \mathrm{BiLSTM}_{1:K}((P \oplus A)_p)

s^r_{i|p} = W_r \, g^p_i + b_r

where K is the number of layers in the argument-specific fully-connected BiLSTM encoder and s^r_{i|p} is the score distribution over the role vocabulary for a word w_i with respect to a predicate w_p.
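The construction of (P ⊕ A)_p can be sketched with plain lists (illustrative names; in the actual model p_i and a_i are learned projections and the result feeds the argument-specific encoder):

```python
def predicate_argument_sequences(p, a, predicate_positions):
    """Build (P ⊕ A)_p for each predicate position in the sentence.

    p: predicate-specific vectors p_1..p_n; a: argument-specific vectors
    a_1..a_n (both lists of lists of floats). For each predicate w_p,
    returns the sequence p_p ⊕ a_1, ..., p_p ⊕ a_n that the
    argument-specific encoder would then refine.
    """
    return {
        pos: [p[pos] + a_i for a_i in a]  # list concatenation plays ⊕
        for pos in predicate_positions
    }
```

Note that every word's argument vector is paired with the same predicate vector p_p, so each predicate yields its own length-n sequence: the sentence is encoded once, then specialized per predicate at this cheap concatenation step.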

Multitask Training Objective
The model is trained to jointly minimize the sum of the categorical cross-entropy losses on predicate identification L(s^p) (see Section 3.2), predicate sense disambiguation L(s^s) (see Section 3.2), and argument identification/classification L(s^r) (see Section 3.3). The cumulative loss is the average of the individual subtask losses, weighted by the number of training instances for each subtask:

L = \frac{N_p \, L(s^p) + N_p \, L(s^s) + N_r \, L(s^r)}{2 N_p + N_r}

where N_p and N_r are the number of instances of predicates and arguments/roles in the training set, respectively.
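A sketch of the weighted objective (the exact weighting is our reconstruction of the description above, with the sense-disambiguation loss weighted by N_p like predicate identification; the function name is ours):

```python
def multitask_loss(loss_p, loss_s, loss_r, n_p, n_r):
    """Instance-weighted average of the three subtask losses.

    loss_p, loss_s: cross-entropy losses for predicate identification and
        sense disambiguation, both weighted by the number of predicate
        instances n_p (N_p in the text).
    loss_r: argument identification/classification loss, weighted by the
        number of argument instances n_r (N_r in the text).
    """
    total = n_p + n_p + n_r
    return (n_p * loss_p + n_p * loss_s + n_r * loss_r) / total
```

Weighting by instance counts keeps the abundant argument annotations from being drowned out by (or from drowning out) the rarer predicate-level signals when the three losses are combined.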

Experiments
We evaluate our model in multilingual dependency-based SRL (CoNLL-2009) and English span-based SRL (CoNLL-2012). Hereafter, we provide a brief overview of the CoNLL-2009 and CoNLL-2012 benchmark datasets (Section 4.1), we describe our experimental setup (Section 4.2), and report results for each dataset and language (Section 4.3).

The CoNLL-2009 and CoNLL-2012 Benchmark Datasets
The CoNLL-2009 dataset is the most comprehensive multilingual SRL benchmark to date, featuring 6 languages from dissimilar linguistic families with 6 in-domain and 3 out-of-domain test sets. Predicates are already identified and, therefore, systems are evaluated on predicate sense disambiguation, argument identification and argument classification. Performing multilingual SRL on CoNLL-2009 is not trivial, since the annotation process and methodology vary considerably from language to language: German only includes verbal predicates, English also considers nominal predicates, and Czech includes adjectival and adverbial predicates as well. Moreover, English uses separate predicate-argument structure inventories for verbal and nominal predicates, i.e., PropBank and NomBank, while Chinese uses a unified inventory. Finally, training sets vary dramatically in size, from more than 400,000 predicate instances in Czech to less than 20,000 in German. CoNLL-2009 is therefore the ideal benchmark for assessing the ability of SRL models to scale multilingually.
While our focus is on multilingual SRL, we also evaluate our approach on CoNLL-2012, which is, to the best of our knowledge, the largest dataset with gold annotations for span-based SRL, and show that our model can also achieve state-of-the-art results on span-based English SRL. 2

Experimental Setup
We implemented the model in PyTorch and PyTorch Lightning, using the Transformers library (Wolf et al., 2019) for language modeling. We selected the hyperparameter values according to the best F1 score on the English development split; for every other language, the model was trained with the same hyperparameter configuration used for English. We trained each model configuration for at most 30 epochs using Adam (Kingma and Ba, 2015) with an initial linear learning rate warmup followed by a linear learning rate cooldown. Table 1b reports the hyperparameters of the best configuration. All the results reported in the following sections refer to the output of the official scoring scripts of CoNLL-2009 and CoNLL-2005, for dependency- and span-based SRL, respectively.
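The learning-rate schedule can be sketched as follows (schedule shape only; the step counts and base learning rate are hyperparameters reported in Table 1b, and the function name is ours):

```python
def linear_warmup_cooldown(step, warmup_steps, total_steps, base_lr):
    """Learning rate at a given optimizer step.

    Linearly increases the rate from 0 to base_lr over the first
    warmup_steps steps, then linearly decays it back to 0 over the
    remaining (total_steps - warmup_steps) steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # linear cooldown from base_lr down to 0 over the remaining steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)
```

The warmup phase avoids large, destabilizing updates while the freshly initialized task-specific layers (and, when fine-tuned, the language model weights) are still far from a sensible region of the loss surface.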
Language Models. We compare our model to current state-of-the-art multilingual SRL systems which, however, use contextualized word representations from different pretrained language models. For a fair comparison, we report the performance of our model when the underlying input representation is ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), and multilingual BERT (m-BERT hereafter). We also compare the SRL performance when the weights of the underlying language model are kept frozen or fine-tuned during training. Lastly, we also include the results of XLM-RoBERTa (Conneau et al., 2020), a recent language model trained with an explicit cross-lingual objective.

Results
English Results. The results on the CoNLL-2009 English in-domain test set are outlined in Table 1a, where, for completeness, we report the scores achieved by our model both when the underlying language model weights are kept frozen and when they are fine-tuned during training. Independently of the pretrained language model used to represent the input, our syntax-agnostic model outperforms the best scores reported by the most recent state-of-the-art systems, both syntax-agnostic and syntax-aware. In particular, among ELMo-based models, our approach is slightly more effective than the work of Cai and Lapata (2019a) (+0.2% in F1), despite the latter approach being self-supervised on additional training data. Our approach achieves a new state of the art not only when using frozen BERT weights, where the previous best-performing approach was the syntax-aware system of He et al. (2019) (+0.6%), which takes advantage of purposely-built pruning rules based on syntactic dependency trees, but also when fine-tuning BERT, where it surpasses the syntax-agnostic system of Shi and Lin (2019) (+0.2%). Our model is also able to achieve a new state of the art in span-based English SRL, as shown in Table 2; for span-based SRL, we use the BIO format to convert span-level tags to token-level tags. We include in our comparisons all the systems that reported results in at least 3 languages.
Multilingual Results. While English is certainly the most studied language in SRL, the core of our contribution lies in bridging the gap in performance between high-resource languages and low-resource languages. Indeed, Table 3 shows that high-resource languages, e.g., Chinese, witnessed comparatively larger improvements over the years, whereas lower-resource languages lagged behind. However, as Table 3 also shows, our language-agnostic approach is able to bring larger improvements on low-resource languages such as German, with a 5.4% absolute improvement in F1 score over the previous state-of-the-art multilingual system when both systems use m-BERT to build a contextualized representation of the input sentences. We also find that our model benefits greatly from using XLM-RoBERTa, a language model pretrained with an explicit cross-lingual objective: our best result on German is remarkably close to the performance obtained on high-resource languages (89.1% vs 92.4% in F1 score on German and English, respectively). We argue that improving the results on languages where a large margin for progress still exists should not be dismissed as an easier task, and that closing the gap between high-resource and low-resource settings should not be regarded as a foregone conclusion: as a matter of fact, among the systems that report results on all 6 languages in Table 3, the narrowest performance gap between English and German hovers around 10% in F1. Our model significantly closes this gap to just 3% in F1, while advancing the state of the art across the board.
The robustness of our model is particularly evident when considering the results on the CoNLL-2009 out-of-domain test sets (Table 4a), where, once again, it outperforms the state of the art in all 3 languages. Up until now, supervised techniques had not been able to surpass the performance of the best system based on manual feature engineering on the challenging out-of-domain German test set, since the sentences included in this split were specifically designed to contain a large number of infrequent predicates (Hajic et al., 2009). To the best of our knowledge, our approach results in the first supervised model capable of outperforming manual feature engineering in such a setting.
Zero-shot Cross-Lingual Results. Having parity in the quality and quantity of data across languages is often an unrealistic expectation, especially whenever the task requires expert annotators (Pasini, 2020): this is the reason why, over the last few years, cross-lingual transfer learning techniques and benchmarks have garnered attention in NLP (Barba et al., 2020; Blloshmi et al., 2020; Hu et al., 2020). While transfer learning techniques are becoming increasingly popular, their application to SRL is not straightforward. Indeed, as mentioned in Section 4.1, in CoNLL-2009 different languages use different, non-overlapping inventories for predicate senses and their argument structures. The only exceptions in CoNLL-2009 are Catalan and Spanish, which both use the AnCora inventory (Taulé et al., 2008). We take advantage of these two languages to evaluate how our system performs in a zero-shot cross-lingual setting. Table 4b reports the results obtained when our model is trained on Catalan and evaluated on Spanish, and vice versa. While our model shows a significant drop in performance when compared to the original setting (trained and evaluated on the same language), we note that both training sets are relatively small (less than 15,000 sentences) and that the results are still promising (above 80% in both CA-ES and ES-CA), providing an estimate of the performance our model would obtain when trained on a high-resource language and evaluated on a low-resource language. To the best of our knowledge, we are the first to report results on zero-shot cross-lingual SRL: we hope that our effort can be a basis for further development in this direction.

Analysis and Discussion
Ablation Study. We conducted an ablation study of our model architecture to better understand the individual contribution of its main components. As shown in Table 5, a baseline model, where the predicate-aware sequence encoder (see Section 3.2) is completely removed, shows a large drop in performance on the English development split of CoNLL-2009 (86.3% against 90.4% in F1 score). Indeed, this encoder is central to our architecture and represents around 70% of the total parameter count in the model (pretrained language model excluded). Rather than removing the entire predicate-aware sequence encoder, we also evaluated the difference in performance when using simple stacked-BiLSTM layers instead of the fully-connected stacked-BiLSTM layers described in Section 3.2 (+2.0% and +3.2% in F1 score over the baseline, respectively). Finally, we found it beneficial to rely on features and mechanisms that do not exploit language-specific peculiarities: we perform layer normalization on the output of each BiLSTM layer (+0.3%), adopt Swish in place of the more traditional ReLU (+0.1%), and add predicate identification as a secondary task (+0.5%). This demonstrates that it is possible not only to achieve and surpass the state of the art without syntax, but also, and most importantly, to bridge the gap between languages.
Do We Need More Training Data? Our model significantly narrows the gap between high- and low-resource languages; nevertheless, one may wonder whether the amount of annotated data is still too limited for some of the languages in CoNLL-2009. In particular, German should be the most affected by this bottleneck, as it comes with the smallest number of annotated predicates (see Appendix A). However, surprisingly perhaps, Figure 1 shows that greatly reducing the already limited German training set does not degrade performance as much as one would expect: our model is still able to achieve state-of-the-art results in German on the in-domain evaluation with just 50% of the training data, and on the out-of-domain evaluation with just 25% of the training data.

Conclusion and Future Work
Recently, research in Semantic Role Labeling has revolved predominantly around syntax, with several studies showing the benefits of integrating syntactic features into existing models. Syntax-based innovations, however, have struggled to transfer improvements from high-to low-resource languages, as they appear to require a substantial amount of high-quality annotations.
In this paper, we have gone against the flow and proposed a model that puts both sentence- and lexical-level syntax aside, in order to avoid relying on noisy language-specific features. Our truly syntax-agnostic approach surpasses the previous state of the art among all syntax-agnostic and syntax-aware systems in the in- and out-of-domain evaluations of all 6 languages in the CoNLL-2009 benchmark dataset. Most crucially, our model seamlessly scales across languages, finally bridging the long-standing gap between high- and low-resource settings. Our analysis delineates the strengths of our approach and highlights its exceptional robustness: our model only needs 50% and 25% of the training data to surpass the previous best-performing system on the in-domain and out-of-domain German test sets, respectively. To the best of our knowledge, we are the first to evaluate a model on zero-shot cross-lingual SRL, where we obtain results that are promising with a view to further developments in transfer learning techniques. Finally, we demonstrate that, not only is our model able to achieve state-of-the-art results in dependency-based SRL, but it also surpasses current syntax-agnostic and syntax-aware techniques in span-based SRL on CoNLL-2012.
As previously stated, our objective has not been to dismiss the undeniable importance of syntax-based innovation in SRL, but rather to establish a launch pad from which future syntactic developments can take off. In order to encourage future work on joint syntactic and semantic dependency parsing (Cai and Lapata, 2019b), the use of more powerful or cleverly trained language models (Lewis et al., 2020), the integration of SRL into other cross-lingual semantics-first tasks such as Semantic Parsing (Blloshmi et al., 2020) and Word Sense Disambiguation (Scarlini et al., 2020), and the exploitation and integration of newly available knowledge from recently released resources, such as VerbAtlas (Di Fabio et al., 2019) and Conception, we make available not only the code for our SRL model and experiments, but also our model checkpoints and training/validation logs at https://github.com/SapienzaNLP/multi-srl.

A Dataset Statistics

Table 6 reports, for each dataset, the number of sentences (Total_s), the number of sentences with at least one annotated predicate (Annotated), the average number of tokens per sentence (Avg. Len.), the number of predicates (Total_p) and predicate senses (Senses), as well as the number of arguments (Total_a) and argument roles (Roles). Table 6 shows the composition of the training sets of CoNLL-2009 and CoNLL-2012. While German is not usually considered a low-resource language, in the CoNLL-2009 dataset it is the language with the lowest number of annotated predicate instances. As shown in the table, German has less than 50% of the predicate instances of Catalan and less than 10% of the predicate instances of English.

B Computing Infrastructure
All the experiments were performed on a x86-64 architecture with 64GB of RAM, an 8-core CPU running at 3.60GHz, and an Nvidia RTX 2080Ti.