Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources

While cross-lingual techniques are finding increasing success in a wide range of Natural Language Processing tasks, their application to Semantic Role Labeling (SRL) has been strongly limited by the fact that each language adopts its own linguistic formalism, from PropBank for English to AnCora for Spanish and PDT-Vallex for Czech, inter alia. In this work, we address this issue and present a unified model to perform cross-lingual SRL over heterogeneous linguistic resources. Our model implicitly learns a high-quality mapping for different formalisms across diverse languages without resorting to word alignment and/or translation techniques. We find that not only is our cross-lingual system competitive with the current state of the art, but it is also robust to low-data scenarios. Most interestingly, our unified model is able to annotate a sentence in a single forward pass with all the inventories it was trained with, providing a tool for the analysis and comparison of linguistic theories across different languages. We release our code and model at https://github.com/SapienzaNLP/unify-srl.


Introduction
Semantic Role Labeling (SRL) -a long-standing open problem in Natural Language Processing (NLP) and a key building block of language understanding (Navigli, 2018) -is often defined as the task of automatically addressing the question "Who did what to whom, when, where, and how?" (Gildea and Jurafsky, 2000; Màrquez et al., 2008). While the need to manually engineer and fine-tune complex feature templates severely limited early work (Zhao et al., 2009), the great success of neural networks in NLP has resulted in impressive progress in SRL, thanks especially to the ability of recurrent networks to better capture relations over sequences (He et al., 2017). Owing to the recent wide availability of robust multilingual representations, such as multilingual word embeddings (Grave et al., 2018) and multilingual language models (Devlin et al., 2019; Conneau et al., 2020), researchers have been able to shift their focus to the development of models that work on multiple languages (Cai and Lapata, 2019b).
A robust multilingual representation is nevertheless just one piece of the puzzle: a key challenge in multilingual SRL is that the task is tightly bound to linguistic formalisms (Màrquez et al., 2008) which may present significant structural differences from language to language (Hajic et al., 2009). In the recent literature, it is standard practice to sidestep this issue by training and evaluating a model on each language separately (Cai and Lapata, 2019b; Chen et al., 2019; Kasai et al., 2019). Although this strategy allows a model to adapt itself to the characteristics of a given formalism, it is burdened by the non-negligible need for training and maintaining one model instance for each language, resulting in a set of monolingual systems. Instead of dealing with heterogeneous linguistic theories, another line of research consists in actively studying the effect of using a single formalism across multiple languages through annotation projection or other transfer techniques (Akbik et al., 2015, 2016; Daza and Frank, 2019; Cai and Lapata, 2020; Daza and Frank, 2020). However, such approaches often rely on word aligners and/or automatic translation tools which may introduce a considerable amount of noise, especially in low-resource languages. More importantly, they rely on the strong assumption that the linguistic formalism of choice, which may have been developed with a specific language in mind, is also suitable for other languages.
In this work, we take the best of both worlds and propose a novel approach to cross-lingual SRL. Our contributions can be summarized as follows:
• We introduce a unified model to perform cross-lingual SRL with heterogeneous linguistic resources;
• We find that our model is competitive against state-of-the-art systems on all 6 languages of the CoNLL-2009 benchmark;
• We show that our model is robust to low-resource scenarios, thanks to its ability to generalize across languages;
• We probe our model and demonstrate that it implicitly learns to align heterogeneous linguistic resources;
• We automatically build and release a cross-lingual mapping that aligns linguistic formalisms from diverse languages.
We hope that our unified model will further advance cross-lingual SRL and represent a tool for the analysis and comparison of linguistic theories across multiple languages.

Related Work
End-to-end SRL. The SRL pipeline is usually divided into four steps: predicate identification, predicate sense disambiguation, argument identification, and argument classification. While early research focused its efforts on addressing each step individually (Xue and Palmer, 2004; Björkelund et al., 2009; Zhao et al., 2009), recent work has successfully demonstrated that tackling some of these subtasks jointly with multitask learning (Caruana, 1997) is beneficial. In particular, Cai et al. (2018), among others, indicated that predicate sense signals aid the identification of predicate-argument relations. Therefore, we follow this line and propose an end-to-end system for cross-lingual SRL.
Multilingual SRL. Current work in multilingual SRL revolves mainly around the development of novel neural architectures, which fall into two broad categories: syntax-aware and syntax-agnostic. On the one hand, the quality and diversity of the information encoded by syntax is an enticing prospect that has resulted in a wide range of contributions: Graph Convolutional Networks (GCNs) have been used to better capture relations between neighboring nodes in syntactic dependency trees, while Strubell et al. (2018) demonstrated the effectiveness of linguistically-informed self-attention layers in SRL. On the other hand, Cai and Lapata (2019a) proposed a semi-supervised syntax-agnostic approach that scales across different languages. While we follow the latter trend and develop a syntax-agnostic model, we underline that both the aforementioned syntax-aware and syntax-agnostic approaches suffer from a significant drawback: they require training one model instance for each language of interest. Their two main limitations are, therefore, that i) the number of trainable parameters increases linearly with the number of languages, and ii) the information available in one language cannot be exploited to make SRL more robust in other languages. In contrast, one of the main objectives of our work is to develop a unified cross-lingual model which can mitigate the paucity of training data in some languages by exploiting the information available in other, resource-richer languages.
Cross-lingual SRL. A key challenge in performing cross-lingual SRL with a single unified model is the dissimilarity of predicate sense and semantic role inventories between languages. For example, the multilingual dataset distributed as part of the CoNLL-2009 shared task (Hajic et al., 2009) adopts the English Proposition Bank (Palmer et al., 2005) and NomBank (Meyers et al., 2004) to annotate English sentences, the Chinese Proposition Bank (Xue and Palmer, 2009) for Chinese, the AnCora (Taulé et al., 2008) predicate-argument structure inventory for Catalan and Spanish, the German Proposition Bank which, differently from the other PropBanks, is derived from FrameNet (Hajic et al., 2009), and PDT-Vallex (Hajic et al., 2003) for Czech. Many of these inventories are not aligned with each other as they follow and implement different linguistic theories which, in turn, may pose different challenges. Padó and Lapata (2009), and Akbik et al. (2015, 2016) worked around these issues by making the English PropBank act as a universal predicate sense and semantic role inventory and projecting PropBank-style annotations from English onto non-English sentences by means of word alignment techniques applied to parallel corpora such as Europarl (Koehn, 2005). These efforts resulted in the creation of the Universal PropBank, a multilingual collection of semi-automatically annotated corpora for SRL, which is actively in use today to train and evaluate novel cross-lingual methods such as word alignment techniques (Aminian et al., 2019). In the absence of parallel corpora, annotation projection techniques can still be applied by automatically translating an annotated corpus and then projecting the original labels onto the newly created silver corpus (Daza and Frank, 2020; Fei et al., 2020), whereas Daza and Frank (2019) have recently found success in training an encoder-decoder architecture to jointly tackle SRL and translation.
While the foregoing studies have greatly advanced the state of cross-lingual SRL, they suffer from an intrinsic downside: using translation and word alignment techniques may result in a considerable amount of noise, which automatically puts an upper bound to the quality of the projected labels. Moreover, they are based on the strong assumption that the English PropBank provides a suitable formalism for non-English languages, and this may not always be the case. Among the numerous studies that adopt the English PropBank as a universal predicate-argument structure inventory for cross-lingual SRL, the work of Mulcaire et al. (2018) stands out for proposing a bilingual model that is able to perform SRL according to two different inventories at the same time, although with significantly lower results compared to the state of the art at the time. With our work, we go beyond current approaches to cross-lingual SRL and embrace the diversity of the various representations made available in different languages. In particular, our model has three key advantages: i) it does not rely on word alignment or machine translation tools; ii) it learns to perform SRL with multiple linguistic inventories; iii) it learns to link resources that would otherwise be disconnected from each other.

Model Description
In the wake of recent work in SRL, our model falls into the broad category of end-to-end systems as it learns to jointly tackle predicate identification, predicate sense disambiguation, argument identification and argument classification. The model architecture can be roughly divided into the following components:
• A universal sentence encoder whose parameters are shared across languages and which produces word encodings that capture predicate-related information (Section 3.2);
• A universal predicate-argument encoder whose parameters are also shared across languages and which models predicate-argument relations (Section 3.3);
• A set of language-specific decoders which indicate whether words are predicates, select the most appropriate sense for each predicate, and assign a semantic role to every predicate-argument couple, according to several different SRL inventories (Section 3.4).
Unlike previous work, our model does not require any preexisting cross-resource mappings, word alignment techniques, translation tools, other annotation transfer techniques, or parallel data, to perform high-quality cross-lingual SRL, as it relies solely on implicit cross-lingual knowledge transfer.

Input representation
Pretrained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020), inter alia, are becoming the de facto input representation method, thanks to their ability to encode vast amounts of knowledge. Following recent studies (Hewitt and Manning, 2019; Kuznetsov and Gurevych, 2020), which show that different layers of a language model capture different syntactic and semantic characteristics, our model builds a contextual representation for an input word by concatenating the corresponding hidden states of the four top-most inner layers of a language model. More formally, given a word w_i in a sentence w = w_0, w_1, ..., w_i, ..., w_{n-1} of n words and its hidden state h_i^k = l_k(w_i | w) from the k-th inner layer l_k of a language model with K layers, the model computes the word encoding e_i as follows:

e_i = Swish(W_e (h_i^{K-3} ⊕ h_i^{K-2} ⊕ h_i^{K-1} ⊕ h_i^K))

where x ⊕ y is the concatenation of the two vectors x and y, and Swish(x) = x · sigmoid(x) is a non-linear activation which was found to produce smoother gradient landscapes than the more traditional ReLU (Ramachandran et al., 2018).
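The layer-concatenation scheme described above can be sketched as follows. This is a minimal NumPy sketch: the number of layers, the hidden dimensions and the weight matrix `W_e` are illustrative assumptions, not the actual model configuration.

```python
import numpy as np

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def word_encoding(hidden_states, W_e):
    """Concatenate the four top-most layer states of one word and
    project them through a Swish-activated linear map.

    hidden_states: array of shape (K, d), one d-dim state per layer.
    W_e: projection matrix of shape (d_out, 4 * d).
    """
    top4 = np.concatenate(hidden_states[-4:])  # shape (4 * d,)
    return swish(W_e @ top4)                   # shape (d_out,)

# Toy example: K = 12 layers of dimension d = 8, projected to d_out = 6.
rng = np.random.default_rng(0)
h = rng.normal(size=(12, 8))
W_e = rng.normal(size=(6, 32))
e_i = word_encoding(h, W_e)
```

In a real system, `hidden_states` would come from a pretrained encoder (e.g., m-BERT via the Transformers library) rather than random numbers.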

Universal sentence encoder
Expanding on the seminal intuition of Fillmore (1968), who suggests the existence of deep semantic relations between a predicate and other sentential constituents, we argue that such semantic relations may be preserved across languages. With this reasoning in mind, we devise a universal sentence encoder whose parameters are shared across languages. Intuitively, the aim of our universal sentence encoder is to capture sentence-level information that is not formalism-specific and spans across languages, such as information about predicate positions and predicate senses. In our case, we implement this universal sentence encoder as a stack of BiLSTM layers (Hochreiter and Schmidhuber, 1997), similarly to Cai et al. (2018), among others, with the difference that we concatenate the output of each layer to its input in order to mitigate the problem of vanishing gradients. More formally, given a sequence of word encodings e = e_0, e_1, ..., e_{n-1}, the model computes a sequence of timestep encodings t as follows:

t_i^j = BiLSTM_i^j(t^{j-1}) ⊕ t_i^{j-1},  with t^0 = e and t = t^K

where BiLSTM_i^j(·) is the i-th timestep of the j-th BiLSTM layer and K is the total number of layers in the stack. Starting from each timestep encoding t_i, the model produces a predicate representation p_i, which captures whether the corresponding word w_i is a predicate, and a sense representation s_i which encodes information about the sense of a predicate at position i:

p_i = Swish(W_p t_i),  s_i = Swish(W_s t_i)

We stress that the vector representations obtained for each timestep, each predicate and each sense lie in three spaces that are shared across the languages and formalisms used to perform SRL.
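The layer-wise concatenation pattern of the stacked encoder can be sketched as follows. This is a toy sketch: `toy_bilstm` is a per-timestep linear stand-in for a real bidirectional LSTM (which would be, e.g., `torch.nn.LSTM(bidirectional=True)`), and all dimensions are invented for illustration.

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def toy_bilstm(seq, W):
    # Stand-in for a real BiLSTM layer: a per-timestep non-linear map.
    return [np.tanh(W @ x) for x in seq]

def universal_sentence_encoder(e, layers):
    """Stack of (toy) BiLSTM layers, concatenating each layer's output
    with its own input: t^j = BiLSTM^j(t^{j-1}) (+) t^{j-1}."""
    t = e
    for W in layers:
        out = toy_bilstm(t, W)
        t = [np.concatenate([o, x]) for o, x in zip(out, t)]
    return t

rng = np.random.default_rng(1)
d, h = 6, 4
e = [rng.normal(size=d) for _ in range(5)]  # a 5-word sentence
# Each layer's input grows by h because of the concatenation skip.
layers = [rng.normal(size=(h, d)), rng.normal(size=(h, d + h))]
t = universal_sentence_encoder(e, layers)

# Predicate representations from the shared timestep encodings.
W_p = rng.normal(size=(3, d + 2 * h))
p = [swish(W_p @ t_i) for t_i in t]
```

Note how the concatenation skip makes each layer's output dimension grow, which is what lets gradients flow directly to lower layers.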

Universal predicate-argument encoder
In the same vein, and for the same reasoning that motivated the design of the above universal sentence encoder, our model includes a universal predicate-argument encoder whose parameters are also shared across languages. The objective of this second encoder is to capture the relations between each predicate-argument couple that appears in a sentence, independently of the input language. Similarly to the universal sentence encoder, we implement this universal predicate-argument encoder as a stack of BiLSTM layers. More formally, let w_p be a predicate in the input sentence w = w_0, w_1, ..., w_p, ..., w_{n-1}, then the model computes a sequence of predicate-specific argument encodings a as follows:

a_i^j = BiLSTM_i^j(a^{j-1}) ⊕ a_i^{j-1},  with a_i^0 = t_i ⊕ t_p and a = a^K

where t_i is the i-th timestep encoding from the universal sentence encoder and K is the total number of layers in the stack. Starting from each predicate-specific argument encoding a_i, the model produces a semantic role representation r_i for word w_i:

r_i = Swish(W_r a_i)

Similarly to the predicate and sense representations p and s, since the predicate-argument encoder is one and the same for all languages, the semantic role representation r obtained must draw upon cross-lingual information in order to abstract from language-specific peculiarities.
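The predicate-specific encoding can be sketched as follows. This is a toy sketch under an assumption: we condition on the predicate by concatenating every timestep encoding with the predicate's own timestep encoding, which is one plausible reading of "predicate-specific"; the real conditioning and all dimensions may differ.

```python
import numpy as np

def toy_bilstm(seq, W):
    # Stand-in for a real bidirectional LSTM layer.
    return [np.tanh(W @ x) for x in seq]

def predicate_argument_encoder(t, p_idx, layers):
    """Concatenate every timestep encoding with the predicate's own
    encoding (assumption, see lead-in), then run the BiLSTM-like stack
    with the same concatenation-skip pattern as the sentence encoder."""
    a = [np.concatenate([t_i, t[p_idx]]) for t_i in t]
    for W in layers:
        out = toy_bilstm(a, W)
        a = [np.concatenate([o, x]) for o, x in zip(out, a)]
    return a

rng = np.random.default_rng(2)
t = [rng.normal(size=6) for _ in range(4)]  # 4 timestep encodings
layers = [rng.normal(size=(4, 12))]         # one toy layer
a = predicate_argument_encoder(t, p_idx=1, layers=layers)
```

Each word thus gets one encoding per predicate in the sentence, which is what allows the same word to fill different roles for different predicates.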

Language-specific decoders
The aforementioned predicate encodings p, sense encodings s and semantic role encodings r are shared across languages, forcing the model to learn from semantics rather than from surface-level features such as word order, part-of-speech tags and syntactic rules, all of which may differ from language to language. Ultimately, however, we want our model to provide semantic role annotations according to an existing predicate-argument structure inventory, e.g., PropBank, AnCora, or PDT-Vallex. Our model, therefore, includes a set of linear decoders that indicate whether a word w_i is a predicate, what the most appropriate sense for a predicate w_p is, and what the semantic role of a word w_r with respect to a specific predicate w_p is, for each language l:

pred_l(w_i) = softmax(W_pred^l p_i)
sense_l(w_p) = softmax(W_sense^l s_p)
role_l(w_r, w_p) = softmax(W_role^l r_r)

Although we could have opted for more complex decoding strategies, in our case linear decoders have two advantages: 1) they keep the language-specific part of the model as simple as possible, pushing the model into learning from its universal encoders; 2) they can be seen as linear probes, providing an insight into the quality of the cross-lingual knowledge that the model can capture.
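The language-specific role decoders can be sketched as one linear map per inventory over the shared role encoding. The inventory names and role counts below are illustrative assumptions, not the actual label sets.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

rng = np.random.default_rng(3)

# One linear role decoder per inventory (sizes are invented).
inventories = {"en_propbank": 10, "es_ancora": 7, "cz_pdt_vallex": 12}
role_decoders = {name: rng.normal(size=(n_roles, 16))
                 for name, n_roles in inventories.items()}

# A single shared semantic role encoding for one word...
r_i = rng.normal(size=16)

# ...decoded under every formalism in one forward pass.
role_probs = {name: softmax(W @ r_i) for name, W in role_decoders.items()}
```

Because the decoders all read the same shared encoding, labelling a sentence under every inventory at once costs only one extra matrix multiplication per inventory.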

Training objective
The model is trained to jointly minimize the sum of the categorical cross-entropy losses on predicate identification, predicate sense disambiguation and argument identification/classification over all the languages in a multitask learning fashion. More formally, given a language l and the corresponding predicate identification loss L_{p|l}, predicate sense disambiguation loss L_{s|l} and argument identification/classification loss L_{r|l}, the cumulative loss L is:

L = Σ_{l ∈ L} ( L_{p|l} + L_{s|l} + L_{r|l} )

where L is the set of languages -and the corresponding formalisms -in the training set.
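The cumulative loss is simply an unweighted sum over languages and subtasks, which can be sketched as follows (the per-language loss values below are invented for illustration):

```python
import numpy as np

def cross_entropy(probs, gold):
    """Categorical cross-entropy for one prediction."""
    return -np.log(probs[gold] + 1e-12)

def cumulative_loss(losses_per_language):
    """L = sum over languages l of (L_p|l + L_s|l + L_r|l)."""
    return sum(lp + ls + lr for lp, ls, lr in losses_per_language)

# Toy cross-entropy for one gold label with predicted distribution.
ce = cross_entropy(np.array([0.7, 0.2, 0.1]), gold=0)

# Toy (predicate, sense, role) losses for two languages.
losses = [(0.3, 0.8, 1.1), (0.2, 0.5, 0.9)]
L = cumulative_loss(losses)
```

In practice each batch contributes the losses of whichever language it was drawn from, so gradients from all formalisms flow into the shared encoders.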

Experiments
We evaluate our model in dependency-based multilingual SRL. The remainder of this Section describes the experimental setup (Section 4.1), provides a brief overview of the multilingual dataset we use for training, validation and testing (Section 4.2), and shows the results obtained on each language (Section 4.3).

Experimental Setup
We implemented the model in PyTorch and PyTorch Lightning, and used the pretrained language models for multilingual BERT (m-BERT) and XLM-RoBERTa (XLM-R) made available by the Transformers library (Wolf et al., 2020). We trained each model configuration for 30 epochs using Adam (Kingma and Ba, 2015) with a "slanted triangle" learning rate scheduling strategy which linearly increases the learning rate for 1 epoch and then linearly decreases the value for 15 epochs. We did not perform hyperparameter tuning and opted instead for standard values used in the literature; we provide more details about our model configuration and its hyperparameter values in Appendix A. In the remainder of this Section, we report the F1 scores of the best models selected according to the highest F1 score obtained on the validation set at the end of a training epoch. [3]
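The "slanted triangle" schedule can be sketched at epoch granularity as follows. This is a hypothetical sketch: the peak learning rate, the epoch-level granularity, and the behaviour after the decay phase (here, holding the minimum) are our assumptions, since the text only specifies 1 epoch of increase and 15 of decrease.

```python
def slanted_triangle_lr(epoch, max_lr=1e-3, warmup=1, decay=15, min_lr=0.0):
    """Linear warm-up for `warmup` epochs, then linear decay over
    `decay` epochs; afterwards the rate is held at `min_lr`
    (an assumption, not stated in the paper)."""
    if epoch < warmup:
        return max_lr * (epoch + 1) / warmup
    if epoch < warmup + decay:
        frac = (epoch - warmup) / decay
        return max_lr + (min_lr - max_lr) * frac
    return min_lr

# Learning rate over the 30 training epochs.
schedule = [slanted_triangle_lr(e) for e in range(30)]
```

A production setup would typically compute this per optimizer step rather than per epoch, e.g., via a `LambdaLR` scheduler in PyTorch.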

Dataset
To the best of our knowledge, the dataset provided as part of the CoNLL-2009 shared task (Hajic et al., 2009) is the largest and most diverse collection of human-annotated sentences for multilingual SRL. It comprises 6 languages [4], namely, Catalan, Chinese, Czech, English, German and Spanish, which belong to different linguistic families and feature significantly varying amounts of training samples, from 400K predicate instances in Czech to only 17K in German; we provide an overview of the statistics of each language in Appendix B. CoNLL-2009 is the ideal testbed for evaluating the ability of our unified model to generalize across heterogeneous resources since each language adopts its own linguistic formalism, from English PropBank to PDT-Vallex, from Chinese PropBank to AnCora. We also include VerbAtlas (Di Fabio et al., 2019), a recently released resource for SRL [5], with the aim of understanding whether our model can learn to align inventories that are based on "distant" linguistic theories; indeed, VerbAtlas is based on clustering WordNet synsets into frames that share similar semantic behavior, whereas PropBank-based resources enumerate and define the possible senses of a lexeme. As a final note, we did not evaluate our model on Universal PropBank [6] since it was semi-automatically generated through annotation projection.

[3] Hereafter, all the results of our experiments are computed by the official scorer of the CoNLL-2009 shared task, available at https://ufal.mff.cuni.cz/conll2009-st/scorer.html.
[4] The CoNLL-2009 shared task originally included a seventh language, Japanese, which is no longer available on LDC due to licensing issues.
[5] We build a training set for VerbAtlas using the mapping from PropBank available at http://verbatlas.org.
[6] https://github.com/System-T/UniversalPropositions
Low-resource cross-lingual SRL. We evaluate the robustness of our model in low-resource cross-lingual SRL by artificially reducing the training set of each language to 10% of its original size. Table 3 (top) reports the results obtained by our model when trained separately on the reduced training set of each language (monolingual), and the results obtained by the same model when trained on the union of the reduced training sets (cross-lingual). The improvements of our cross-lingual approach compared to the more traditional monolingual baseline are evident, especially in lower-resource scenarios, with absolute improvements in F1 score of 25.5%, 9.7% and 26.9% on the Catalan, German and Spanish test sets, respectively. This is thanks to the ability of the model to use the knowledge from one language to improve its performance on other languages.

One-shot cross-lingual SRL.
An interesting open question in SRL is whether a system can learn to model the semantic relations between a predicate sense s and its arguments, given a limited number of training samples in which s appears. In our case, in particular, we are interested in understanding how the model fares in a synthetic scenario where each sense appears at most once in the training set, that is, we evaluate our model in a one-shot learning setting. As we can see from Table 3 (bottom), our cross-lingual approach outperforms its monolingual counterpart trained on each synthetic dataset separately by a wide margin, once again providing strong absolute improvements -18.7% in Catalan, 9.2% in German and 16.1% in Spanish in terms of F1 score -for languages where the number of training instances is smaller. It is not uncommon for supervised cross-lingual tasks to feature different amounts of data for each language, depending on how difficult it is to get manual annotations for each language of interest. We simulate this setting in SRL by training our model on 100% of the training data available for the English language, while keeping the one-shot learning setting for all the other languages. As Table 3 (bottom) shows, non-English languages exhibit further improvements as the number of English training samples increases, lending further credibility to the idea that SRL can be learnt across languages even when using heterogeneous resources. Not only do these results suggest that a cross-lingual/cross-resource approach might mitigate the need for a large training set in each language, but also that reasonable cross-lingual results may be obtained by maintaining a single large dataset for a high-resource language, together with several small datasets for low-resource languages.
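The synthetic one-shot setting described above can be sketched as follows. This is a simplified sketch: we represent each training sample as a (sentence, predicate sense) pair and keep only the first occurrence of each sense; how sentences containing multiple predicates are handled is an assumption.

```python
def one_shot_subset(samples):
    """Keep each predicate sense at most once, in corpus order,
    so that every sense appears no more than one time."""
    seen = set()
    subset = []
    for sentence, sense in samples:
        if sense not in seen:
            seen.add(sense)
            subset.append((sentence, sense))
    return subset

# Toy corpus of (sentence id, predicate sense) pairs.
corpus = [("s1", "eat.01"), ("s2", "eat.01"), ("s3", "run.02"),
          ("s4", "run.02"), ("s5", "eat.02")]
subset = one_shot_subset(corpus)
```

The resulting subset still covers the full sense inventory of the corpus, while giving the model exactly one training example per sense.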

Analysis and Discussion
Cross-formalism SRL. In contrast to existing multilingual systems, a key benefit of our unified cross-lingual model is its ability to provide annotations for predicate senses and semantic roles in any linguistic formalism. As we can see from Figure 1 (left), given the English sentence "the cat threw its ball out of the window", our language-specific decoders produce predicate sense and semantic role labels not only according to the English PropBank inventory, but also for all the other resources, as it correctly identifies the agentive and patientive constituents independently of the formalism of interest. And this is not all: our model may potentially work on any of the 100 languages supported by the underlying language model (m-BERT or XLM-RoBERTa), e.g., in Italian, as shown in Figure 1 (right). This is vital for those languages for which a predicate-argument structure inventory has not yet been developed -an endeavor that may take years to come to fruition -and for which, therefore, manually annotated data are unavailable. Thus, as long as a large amount of pretraining data is openly accessible, our system provides a robust cross-lingual tool to compare and analyze different linguistic theories and formalisms across a wide range of languages, on the one hand, and to overcome the issue of performing SRL on languages where no inventory is available, on the other.

Figure 1: Thanks to its universal encoders, our unified cross-lingual model is able to provide predicate sense and semantic role labels according to several linguistic formalisms. Left: SRL labels for an English input sentence. Right: SRL labels for an Italian input sentence, which can be translated into English as "The president refuses the help of the opponents". Notice that Italian is not among the languages in the training set.
Aligning heterogeneous resources. As briefly mentioned previously, the universal encoders in the model architecture force our system to learn cross-lingual features that are important across different formalisms. A crucial consequence of this approach is that the model learns to implicitly align the resources it is trained on, without the aid of word aligners and translation tools, even when these resources may be designed around specific languages and, therefore, present significant differences. In order to bring to light what our model implicitly learns to align in its shared cross-lingual space (see Sections 3.2 and 3.3), we exploit its language-specific decoders to build a mapping from any source inventory, e.g., AnCora, to a target inventory, e.g., the English PropBank. In particular, we use our cross-lingual model to label a training set originally tagged with a source inventory to produce silver annotations according to a target inventory, similarly to what is shown in Figure 1. While producing the silver annotations, we keep track of the number of times each predicate sense in the source inventory is associated by the model with a predicate sense of the target inventory. As a result, we produce a weighted directed graph in which the nodes are predicate senses and an edge (a, b) with weight w indicates that our model maps the source predicate sense a to the target predicate sense b at least w times. A portion of this graph is displayed in Figure 2 where, for visualization purposes, we show the most frequent alignments for each language, i.e., the top-3 edges with largest weight from the nodes of each inventory to the nodes of the English PropBank (Figure 2, left) and to the nodes of VerbAtlas (Figure 2, right).
For example, Figure 2 (left) shows that our model learns to map the Spanish AnCora sense empezar.c1 and the German PropBank sense starten.2 to the English PropBank sense start.01, but also that, depending on the context, the Chinese PropBank sense 开始.01 can correspond to both start.01 and begin.01. Figure 2 (right) also shows that our model learns to map senses from different languages and formalisms to the coarse-grained senses of VerbAtlas, even though the latter formalism is quite distant from the others as its frames are based on clustering WordNet synsets -sets of synonymous words -that share similar semantic behavior, rather than enumerating and defining all the possible senses of a lexeme as in the English and Chinese PropBanks. To the best of our knowledge, our unified model is the first transfer-based tool to automatically align diverse linguistic resources across languages without relying on human supervision.
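The construction of the weighted alignment graph described above can be sketched with simple counting. The sense labels below are taken from the examples in the text; the counts themselves are invented for illustration.

```python
from collections import Counter

def build_alignment_graph(pairs):
    """Count how often each source sense is mapped to each target
    sense; the result is a weighted edge list (a, b) -> w."""
    return Counter(pairs)

def top_k_targets(graph, source, k=3):
    """The k most frequent target senses for one source sense."""
    edges = [(b, w) for (a, b), w in graph.items() if a == source]
    return sorted(edges, key=lambda e: -e[1])[:k]

# Toy silver annotations: (source sense, model-predicted target sense),
# one pair per predicate occurrence in the relabeled training set.
pairs = ([("empezar.c1", "start.01")] * 5 +
         [("empezar.c1", "begin.01")] * 2 +
         [("empezar.c1", "open.01")] * 1)
graph = build_alignment_graph(pairs)
top = top_k_targets(graph, "empezar.c1", k=2)
```

Keeping only the top-k edges per source node is exactly the pruning used for the visualization in Figure 2.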

Conclusion and Future Work
On the one hand, recent research in multilingual SRL has focused mainly on proposing novel model architectures that achieve state-of-the-art results, but require one model instance to be trained for each language of interest. On the other hand, the latest developments in cross-lingual SRL have revolved around using the English PropBank inventory as a universal resource for other languages through annotation transfer techniques. Following our hunch that semantic relations may be deeply rooted beyond the surface realizations that distinguish one language from another, we propose a new approach to cross-lingual SRL and present a model which learns from heterogeneous linguistic resources in order to obtain a deeper understanding of sentence-level semantics. To achieve this objective, we equip our model architecture with "universal" encoders which share their weights across languages and are, therefore, forced to learn knowledge that spans across varying formalisms.
Our unified cross-lingual model, evaluated on the gold multilingual benchmark of CoNLL-2009, outperforms previous state-of-the-art multilingual systems over 6 diverse languages, ranging from Catalan to Czech, from German to Chinese, and, at the same time, also considerably reduces the number of trainable parameters required to support different linguistic formalisms. And this is not all: we find that our approach is robust to low-resource scenarios, where the model is able to exploit the complementary knowledge contained in the training sets of different languages.
Most importantly, our model is able to provide predicate sense and semantic role labels according to 7 predicate-argument structure inventories in a single forward pass, facilitating comparisons between different linguistic formalisms and investigations about interlingual phenomena. Our analysis shows that, thanks to the prior knowledge encoded in recent pretrained language models and our focus on learning from cross-lingual features, our model can be used on languages that were never seen at training time, opening the door to alignment-free cross-lingual SRL on languages where a predicateargument structure inventory is not yet available. Finally, we show that our model implicitly learns to align heterogeneous resources, providing useful insights into inter-resource relations. We leave an in-depth qualitative and quantitative analysis of the learnt inter-resource mappings for future work.
We hope that our work can serve as a stepping stone for future developments towards the unification of heterogeneous SRL. We release the code to reproduce our experiments and the checkpoints of our best models at https://github.com/SapienzaNLP/unify-srl.

B Data Statistics
Tables 5, 6 and 7 provide an overview of the training sets provided as part of the CoNLL-2009 shared task, with statistics about sentences, predicates and arguments.

C Hardware Infrastructure
All the experiments were performed on a x86-64 architecture with 64GB of RAM, an 8-core CPU running at 3.60GHz, and a single Nvidia RTX 2080Ti with 11GB of VRAM.

E Other Results
Predicate identification. In Table 8 we report the results of our model on predicate identification.
Predicate sense disambiguation. In Table 9 we report the results of our model on predicate sense disambiguation.

Table 6: Overview of the CoNLL-2009 development datasets. For each dataset we report the number of sentences (Total_s), the number of sentences with at least one annotated predicate (Annotated), the average number of tokens per sentence (Avg. Len.), the number of predicates (Total_p) and predicate senses (Senses), and also the number of arguments (Total_a) and argument roles (Roles).

Table 9: Accuracy on the predicate sense disambiguation subtask computed by the official CoNLL-2009 scorer which, by default, takes into account only the sense numbers, e.g., 01 of eat.01.

Figure 3: Output of our cross-lingual system for a French (left) and a Catalan (right) sentence.