Alignment-free Cross-lingual Semantic Role Labeling

Cross-lingual semantic role labeling (SRL) aims at leveraging resources in a source language to minimize the effort required to construct annotations or models for a new target language. Recent approaches rely on word alignments, machine translation engines, or preprocessing tools such as parsers or taggers. We propose a cross-lingual SRL model which only requires annotations in a source language and access to raw text in the form of a parallel corpus. The backbone of our model is an LSTM-based semantic role labeler jointly trained with a semantic role compressor and multilingual word embeddings. The compressor collects useful information from the output of the semantic role labeler, filtering noisy and conflicting evidence. It lives in a multilingual embedding space and provides direct supervision for predicting semantic roles in the target language. Results on the Universal Proposition Bank and manually annotated datasets show that our method is highly effective, even against systems utilizing supervised features.


Introduction
Semantic role labeling (SRL) is the task of identifying the arguments of semantic predicates in a sentence and labeling them with a set of predefined relations (e.g., "who" did "what" to "whom," "when," and "where"). It has emerged as an important technology for a wide spectrum of applications ranging from machine translation (Aziz et al., 2011; Marcheggiani et al., 2018) to information extraction (Christensen et al., 2011) and summarization (Khan et al., 2015).
There have been considerable efforts on developing annotated resources for semantic role labeling (Palmer et al., 2005; Zaghouani et al., 2010) which in turn have greatly facilitated the development of various models designed to automatically predict semantic roles. Recent years have seen the successful application of neural network models to SRL (Zhou and Xu, 2015; He et al., 2017; Marcheggiani et al., 2017) which forgo the need for extensive feature engineering. Despite recent advances in representation learning, a perennial problem with building SRL systems lies in the paucity of training data, since semantic role annotations are available for only a handful of the world's languages. As a result, much previous work has focused on cross-lingual SRL, which aims at leveraging existing resources in a source language to minimize the effort required to construct a model or annotations for a new target language.
Annotation projection is a popular approach which transfers annotations from a source to a target language via automatic word alignments (Padó and Lapata, 2005; van der Plas et al., 2011; Aminian et al., 2019). Although very intuitive, it is sensitive to the quality of the parallel data, the performance of the source-language SRL model, and the accuracy of the alignment tools, all of which introduce noise. Translation-based approaches (Täckström et al., 2012; Fei et al., 2020; Rasooli and Collins, 2015) aim to alleviate the noise introduced by the source-side labeler by directly translating the gold-standard data into the target language. A third alternative is model transfer, where a source-language model is modified so that it can be directly applied to a new language, e.g., by employing cross-lingual word representations (Täckström et al., 2012; Swayamdipta et al., 2016; Daza and Frank, 2019a) and universal POS tags (McDonald et al., 2013).
Word alignment noise poses serious problems for both annotation-projection and translation-based methods (the latter still rely on alignment tools to transfer word-level labels from source to target). For example, there can be many-to-one alignments, leading to semantic role conflicts in the target language. Some form of filtering is often introduced to reduce the impact of this noise; e.g., parallel sentence pairs are discarded according to projection density (Aminian et al., 2019) or alignment confidence (Fei et al., 2020). In addition, translation-based approaches rely on high-performance translation engines, which are typically trained on large-scale parallel corpora. Unfortunately, neither adequate MT nor high-quality parallel data can be guaranteed when dealing with low-resource languages. Model transfer is an appealing alternative; however, it relies on accurate features based on lemmas, POS tags, and syntactic parse trees (Kozhevnikov and Titov, 2013; Fei et al., 2020), which are themselves obtained with access to additional annotation. It is not realistic to assume that treebank-style resources will be available for low-resource languages.
In this paper, we propose a novel method for cross-lingual SRL which does not rely on word alignments, machine translation or pre-processing tools such as parsers or taggers. Aside from semantic role annotations in the source language, we only assume access to raw text in the form of a parallel corpus. The backbone of our model is an LSTM-based semantic role labeler jointly trained with multi-lingual word embeddings and a semantic role compressor. The compressor distills useful information pertaining to arguments, predicates and their roles from the output of the semantic role labeler (e.g., by automatically filtering unrelated or conflicting information). Importantly, the compressor lives in a multilingual space and can provide direct supervision for predicting semantic roles in the target language, sidestepping intermediaries like word-level alignments and machine translation.
For evaluation, we make use of several multilingual benchmarks. These include the Universal Proposition Bank (UPB; Akbik et al. 2016), a recently released resource which contains semi-automatically created annotations under a unified labeling scheme for several languages, and a French corpus (van der Plas et al., 2010) which follows PropBank-style annotations (Palmer et al., 2005). We also release two additional manually labeled resources in Chinese and German, which we hope will be useful for future research (our annotations are available from https://github.com/RuiCaiNLP/ZH_DE_Datasets). Experimental results show that our method is highly effective across languages and annotation schemes, even compared against systems making use of supervised features.
Our contributions can be summarized as follows: (a) we propose a knowledge-lean model which does not rely on alignments, machine translation, or sophisticated linguistic preprocessing; (b) we introduce the concept of a semantic role compressor, which is effective at filtering noisy information and could be useful for other cross-lingual tasks (e.g., dependency parsing); and (c) we release two manually annotated datasets which will further advance cross-lingual semantic role labeling, complementing previous work (Aminian et al., 2019; Fei et al., 2020) which reports results on semi-automatically created annotations.

Model
Figure 1 provides a schematic overview of our model. We assume access to semantic role annotations in a source language (e.g., English) and a parallel corpus of source-target sentences (e.g., English-French). Our model is jointly trained to predict semantic roles in the source and target languages. It has two main components, namely a semantic role labeler and a semantic role compressor. The role labeler consists of:
• an input layer which takes multilingual word embeddings and predicate indicator embeddings as input;
• a bidirectional LSTM (BiLSTM) encoder which takes as input the representation of each word in a sentence and produces context-dependent representations;
• a biaffine scorer which calculates the score of each semantic role for each word.
The semantic role compressor in turn consists of:
• an input layer which combines multilingual word embeddings and semantic role distributions for each word in the sentence;
• a bidirectional LSTM (BiLSTM) encoder which produces compressed semantic role information for an input sentence;
• a biaffine scorer which calculates the similarity between compressed representations of semantic roles and input words.
In the following sections we describe these two components more formally.

Semantic Role Labeler
Input Layer and Encoder  For each sentence, the representation x_i of the i-th word w_i is the concatenation of its multilingual contextualized word embedding e^w_{w_i} and a predicate indicator embedding e^p_{w_i}. The former are pretrained on a large-scale unlabeled corpus and their parameters stay frozen during the training of our model. Predicate indicator embeddings are randomly initialized and updated during model training. Unlike previous supervised SRL approaches (Roth and Lapata, 2016; Cai and Lapata, 2019; He et al., 2019), our model does not make use of any syntactic information (e.g., POS tags, dependency relations), since we cannot assume it will be available for low-resource languages.
Following Marcheggiani et al. (2017), sentences are represented using a multi-layer bi-directional LSTM (Hochreiter and Schmidhuber, 1997); at time step t the BiLSTM receives the representation x_t of a word and recursively computes two hidden states, one for the forward pass (→h_t) and one for the backward pass (←h_t). Each word is represented by the concatenation of its forward and backward LSTM states, h_t = [→h_t; ←h_t].

Biaffine Role Scorer  Once the high-level BiLSTM encoder produces a representation h for each word, two distinct non-linear transformations are applied to the predicate w_p (being considered at the time) and to word w_i, respectively:

h'_{w_p} = f(W_p h_{w_p}),   h'_{w_i} = f(W_a h_{w_i}),

where f is a non-linear activation function (we use Leaky ReLU). The score s(r_j, h_{w_i}, h_{w_p}) of semantic role r_j between the current predicate w_p and word w_i is calculated as:

s(r_j, h_{w_i}, h_{w_p}) = h'^T_{w_i} W_{r_j} h'_{w_p} + U_{r_j} [h'_{w_i}; h'_{w_p}] + b_{r_j},

where W_{r_j}, U_{r_j}, and b_{r_j} are parameters specific to role r_j and are updated during training. Both the biaffine role scorer and the SRL encoder are illustrated in Figure 1 (bottom left).
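As an illustration, the biaffine role scorer can be sketched in PyTorch as follows. This is a minimal sketch, not the released implementation: the layer sizes, the initialization scale, and the exact factorization of the role-specific bilinear and linear terms are assumptions.

```python
import torch
import torch.nn as nn

class BiaffineRoleScorer(nn.Module):
    """Biaffine scoring of each semantic role r_j for a (word, predicate)
    pair, as described above. Sizes and parameterization are illustrative
    assumptions, not the authors' released code."""

    def __init__(self, hidden_dim, mlp_dim, n_roles):
        super().__init__()
        # Two distinct non-linear transformations (Leaky ReLU)
        self.mlp_pred = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.LeakyReLU())
        self.mlp_arg = nn.Sequential(nn.Linear(hidden_dim, mlp_dim), nn.LeakyReLU())
        # Role-specific parameters: W_{r_j} (bilinear), U_{r_j} and b_{r_j} (linear)
        self.W = nn.Parameter(torch.randn(n_roles, mlp_dim, mlp_dim) * 0.01)
        self.U = nn.Linear(2 * mlp_dim, n_roles)

    def forward(self, h_words, h_pred):
        """h_words: (n, hidden_dim) BiLSTM states; h_pred: (hidden_dim,)."""
        a = self.mlp_arg(h_words)                           # (n, mlp_dim)
        p = self.mlp_pred(h_pred.unsqueeze(0)).squeeze(0)   # (mlp_dim,)
        bilinear = torch.einsum('nd,rde,e->nr', a, self.W, p)
        linear = self.U(torch.cat([a, p.expand(a.size(0), -1)], dim=-1))
        return bilinear + linear                            # (n, n_roles) scores
```

A softmax over the last dimension then yields a role distribution per word.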

Predicate Identification and Disambiguation
The SRL labeler presented thus far assumes that predicates are known. Although in most SRL datasets predicates are explicitly annotated, such annotations are absent from unlabeled parallel data, and our model needs to identify predicates automatically if it is to be useful in practice. To this end, we run two modules on top of the sentence encoder to identify each predicate and disambiguate its sense. Each module is a multilayer perceptron (MLP) with a softmax layer and is trained jointly with the semantic role labeler.
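Such an MLP-with-softmax head can be sketched as follows (a sketch under assumptions: the layer sizes and the two-way output for identification are ours, not reported hyperparameters):

```python
import torch
import torch.nn as nn

class PredicateClassifier(nn.Module):
    """MLP-with-softmax head over BiLSTM states; usable both for predicate
    identification (n_out=2: predicate vs. not) and for sense
    disambiguation (n_out = number of senses). Sizes are illustrative."""

    def __init__(self, hidden_dim, mlp_dim, n_out):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim), nn.LeakyReLU(),
            nn.Linear(mlp_dim, n_out))

    def forward(self, h):  # h: (n, hidden_dim) BiLSTM states
        return torch.softmax(self.mlp(h), dim=-1)
```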

Semantic Role Compressor
The semantic role compressor operates over the output of the semantic role labeler; it aims to relate each semantic role to specific words and compress this information into a fixed-size matrix.

Semantic Information Compression
Although the semantic role labeler produces a label for each word in the sentence, most words will bear the label "NULL", which indicates that they are not arguments of the predicate of interest. In order to provide useful supervision to the target language, we filter out information about non-argument words. Specifically, we compress the output of the semantic role labeler into a hidden representation which only records information about arguments. In theory, each semantic role appears no more than once in a sentence, so we propose to use a fixed-size matrix R ∈ R^{n_r × d_r} to represent the compressed information, where n_r is the size of the semantic role set and d_r is the dimensionality of the hidden representation for each semantic role.
The semantic role compressor binds each word w_i to its corresponding role. Like the semantic role labeler, the compressor also operates over word embeddings (see the upper left part of Figure 1); for sentence S, word w_i is represented by P_θ(r|w_i, w_p, S) · e^w_{w_i}, where e^w_{w_i} is the multilingual embedding of w_i and P_θ(r|w_i, w_p, S) is the probability distribution over roles produced by the semantic role labeler:

P_θ(r_j|w_i, w_p, S) = softmax_j(s(r_j, h_{w_i}, h_{w_p})),

where θ are the parameters of the semantic role labeler. Analogously to the semantic role labeler, a multi-layer BiLSTM yields sentence representations (see the upper block in Figure 1). At time step t, the forward and backward hidden states →h_t and ←h_t are concatenated and fed to a non-linear layer. A max-pooling layer then gathers global information from the hidden features at each time step and compresses them into a fixed-size vector:

R = max_{t=1..n} tanh(W_1 [→h_t; ←h_t] + b_1),

where W_1 is a weight matrix, b_1 is a bias term for the hidden state vector, and n is the length of the sentence. For the sake of decompression (see next section), R is reshaped from a vector into a matrix with n_r rows and d_r columns (see the very top of Figure 1, left side).
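The compression step can be sketched as follows. Note the hedges: the paper combines each word's role distribution with its embedding, and the sketch below uses concatenation, which is one possible reading of that combination; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Sketch of the semantic role compressor: a BiLSTM over role-aware
    word representations, a non-linear layer, max-pooling over time, and
    a reshape into the fixed-size matrix R in R^{n_r x d_r}. Combining
    the role distribution with the embedding via concatenation is an
    assumption, as are all layer sizes."""

    def __init__(self, d_emb, n_roles, d_role, d_lstm=64):
        super().__init__()
        self.lstm = nn.LSTM(d_emb + n_roles, d_lstm,
                            bidirectional=True, batch_first=True)
        self.hidden = nn.Linear(2 * d_lstm, n_roles * d_role)  # W_1, b_1
        self.n_roles, self.d_role = n_roles, d_role

    def forward(self, embeddings, role_probs):
        # embeddings: (n, d_emb); role_probs: (n, n_roles) from the labeler
        x = torch.cat([role_probs, embeddings], dim=-1).unsqueeze(0)
        h, _ = self.lstm(x)                        # (1, n, 2*d_lstm)
        z = torch.tanh(self.hidden(h))             # non-linear layer
        r = z.max(dim=1).values                    # max-pool over time steps
        return r.view(self.n_roles, self.d_role)   # fixed-size role matrix R
```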
Decompression Semantic roles in a sentence can be obtained by combining compressed information in R with the multilingual embedding of each word, and this process is referred to as decompression.
Concretely, for the i-th word and j-th role, we use a biaffine scorer to calculate the similarity between e^w_{w_i} and R_j. We first perform a non-linear transformation of the word embedding e^w_{w_i}:

z_i = f(W_z e^w_{w_i} + b_z),

where z_i contains hidden features for word w_i. We then use a biaffine scorer to calculate the similarity score between z_i and R_j:

sim(z_i, R_j) = z_i^T W_sim R_j + U_sim [z_i; R_j] + b_sim,

where W_sim, U_sim, and b_sim are parameters updated during training. For word w_i, the final probability distribution over semantic roles is obtained by applying a softmax over the scores of all semantic roles:

P_θ̂(r_j|w_i, S) = softmax_j(sim(z_i, R_j)),

where θ̂ are the parameters of the compressor.

Gaussian Noise  In order to improve the robustness of the compressor, we inject Gaussian noise into the word embeddings. This is an effective regularization method (Liu et al., 2019) which improves the model's ability to generalize to unseen inputs from different languages. The final embeddings are:

ê^w_{w_i} = e^w_{w_i} + ε_i,  ε_i ∼ N(0, σ²I),  i = 1, …, n,

where n is the length of the sentence.
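Decompression and the Gaussian-noise regularizer can be sketched together; parameter shapes, the noise placement, and the biaffine factorization are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Decompressor(nn.Module):
    """Sketch of decompression: a biaffine similarity between each
    (optionally noised) word embedding and each row R_j of the compressed
    matrix, followed by a softmax over roles. Shapes are assumptions."""

    def __init__(self, d_emb, d_hidden, d_role):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(d_emb, d_hidden), nn.LeakyReLU())
        self.W_sim = nn.Parameter(torch.randn(d_hidden, d_role) * 0.01)
        self.U_sim = nn.Linear(d_hidden + d_role, 1)

    def forward(self, embeddings, R, noise_std=0.0):
        # Gaussian-noise regularization on the input embeddings
        if noise_std > 0:
            embeddings = embeddings + noise_std * torch.randn_like(embeddings)
        z = self.transform(embeddings)               # (n, d_hidden)
        bilinear = z @ self.W_sim @ R.t()            # (n, n_roles)
        n, n_roles = z.size(0), R.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n_roles, -1),
                           R.unsqueeze(0).expand(n, n_roles, -1)], dim=-1)
        scores = bilinear + self.U_sim(pairs).squeeze(-1)
        return torch.softmax(scores, dim=-1)         # P(role | word) per word
```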

Training
In our learning setting, semantic role annotations are only available in the source language. We therefore rely on (unlabeled) parallel data to provide cross-lingual supervision for the target language.
During each iteration, we randomly select a batch from the annotated source-language data for supervised training and a batch from the parallel data for cross-lingual training.
Supervised Training  We train the semantic role labeler in the source language in a supervised fashion, using a cross-entropy objective:

L_ce = −(1/n) Σ_{i=1}^{n} t_i · log P_θ(r|w_i, w_p, S),

where n is the length of the sentence and t_i ∈ R^{n_r} are one-hot ground-truth representations. When training the compressor network, the objective is defined as the divergence between the input distribution (produced by the semantic role labeler) and the output distribution of the compressor:

L_com = (1/n) Σ_{i=1}^{n} D(P_θ(r|w_i, w_p, S) ‖ P_θ̂(r|w_i, S)),

where D is a distance function between probability distributions (we use the Kullback-Leibler divergence). The final objective for supervised learning is the sum of the two losses: L_sup = L_ce + L_com.
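A minimal sketch of this supervised objective follows; the function name and the exact averaging are ours, and the labeler logits / compressor probabilities stand in for the two networks described above.

```python
import torch
import torch.nn.functional as F

def supervised_loss(labeler_logits, gold_roles, compressor_probs):
    """L_sup = cross-entropy for the labeler plus KL divergence between
    the labeler's output distribution and the compressor's. A sketch of
    the objective described above; names and reductions are assumptions.

    labeler_logits:   (n, n_roles) unnormalized scores
    gold_roles:       (n,) gold role indices
    compressor_probs: (n, n_roles) compressor output distribution
    """
    l_ce = F.cross_entropy(labeler_logits, gold_roles)
    labeler_probs = F.softmax(labeler_logits, dim=-1)
    # KL(P_labeler || P_compressor), averaged over words;
    # kl_div expects log-probabilities as its first argument
    l_com = F.kl_div(compressor_probs.clamp_min(1e-9).log(),
                     labeler_probs, reduction='batchmean')
    return l_ce + l_com
```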
Cross-lingual Training  Given an unlabeled parallel source-target sentence pair (S_S, S_T), we first perform predicate identification on both sentences and randomly choose a predicate w^S_p in S_S as the current predicate of interest. Amongst all words identified as predicates in S_T, we then select the predicate w^T_p whose word embedding is most similar to that of w^S_p. By feeding word embeddings and predicate information into our model, we obtain compressed role representations R_S and R_T for the source and target sentences S_S and S_T. Recall that we must apply decompression in order to obtain role-specific information for S_S and S_T. Since decompression operates over multilingual representations, it is relatively straightforward to obtain semantic roles for source and target sentences. In fact, we apply both R_S and R_T to each of S_S and S_T and compare the outcomes (see Figure 1, bottom right). The training objectives L_S and L_T are defined as the divergence between the role distributions obtained with R_S and those obtained with R_T, averaged over the n_S words of S_S and the n_T words of S_T, respectively.
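One reading of this cross-application step can be sketched as follows. This is our interpretation of the agreement objective, not the paper's exact loss: the symmetric KL pairing and the `decompress` interface are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_lingual_agreement(decompress, emb_src, emb_tgt, R_src, R_tgt):
    """Apply both compressed matrices to both sentences and encourage the
    resulting role distributions to agree. `decompress(emb, R)` returns a
    per-word role distribution; the symmetric-KL pairing below is an
    assumption about the exact objective.

    emb_src: (n_S, d); emb_tgt: (n_T, d); R_src, R_tgt: (n_r, d_r)."""
    losses = []
    for emb in (emb_src, emb_tgt):
        p_s = decompress(emb, R_src)   # roles predicted with the source R
        p_t = decompress(emb, R_tgt)   # roles predicted with the target R
        losses.append(
            F.kl_div(p_s.clamp_min(1e-9).log(), p_t, reduction='batchmean')
            + F.kl_div(p_t.clamp_min(1e-9).log(), p_s, reduction='batchmean'))
    return sum(losses)
```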
In order to improve the performance of the semantic role compressor on both the source and target language, we additionally train it on parallel sentence pairs by minimizing the divergence between the labeler's output distribution and the compressor's output distribution on each side of the pair. The final cross-lingual training loss L_cross is the sum of the above losses.

Experiments

Datasets
We trained our model using English as the source language and obtained semantic role labelers for German (DE), Spanish (ES), Finnish (FI), French (FR), Italian (IT), Portuguese (PT), and Chinese (ZH). For English, we used the Proposition Bank (v3; Palmer et al. 2005) and the annotations provided as part of the CoNLL-09 shared task (Hajič et al., 2009). We used the Europarl parallel corpus (Koehn, 2005) for the European languages and a large-scale EN-ZH parallel corpus (Xu, 2019) for Chinese. We provide details regarding the size of the parallel corpora in the Appendix. We compared our model against previous methods on the Universal Proposition Bank (UPB, v1.0; Akbik et al. 2016), which is built upon the Universal Dependency Treebank (UDT, v1.4) and the Proposition Bank (PB, v3.0). All languages in the UPB follow a unified dependency-based SRL annotation scheme.
In order to comply with this scheme, we converted argument spans in the English Proposition Bank to dependency-based arguments by labeling the syntactic head of each span. As UPB adopts a semi-automatic annotation procedure, it unavoidably contains a certain amount of errors. We therefore also tested our model on manually annotated datasets, which are few and far between, presumably due to the labeling effort involved. An existing dataset (van der Plas et al., 2010) provides SRL labels for French following an annotation scheme similar to that of CoNLL-09 for English (Hajič et al., 2009). The CoNLL-09 shared task provides semantic role annotations for seven languages, but the role sets differ across languages, and it is far from trivial to unify them. We therefore created two manually annotated resources ourselves, by randomly sampling 258 German and 304 Chinese sentences from UPB. The manual annotation was performed by native speakers following the annotation guidelines of UPB, which in turn follow the English Proposition Bank. Table 1 provides a breakdown of the labeled data used in our experiments.
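Labeling the syntactic head of a span is commonly done by selecting the token whose syntactic head lies outside the span; a minimal sketch of that heuristic follows (the paper does not spell out its exact procedure, so treat this as one standard implementation):

```python
def span_to_dependency_arg(span, heads):
    """Return the head token of a span-based argument: the token inside
    the span whose syntactic head lies outside it. A common heuristic;
    the exact conversion used in the paper may differ.

    span:  (start, end) inclusive token indices
    heads: head index per token (-1 for the root)"""
    start, end = span
    for i in range(start, end + 1):
        if heads[i] < start or heads[i] > end:
            return i
    return start  # fallback for degenerate/cyclic annotations
```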

Model Configuration
Our model was implemented in PyTorch and optimized using the Adam optimizer (Kingma and Ba, 2014). Word embeddings were initialized using the officially released multilingual BERT (base, cased; Devlin et al. 2019). The parameters of BERT are kept frozen during training in order to preserve the cross-lingual nature of the embeddings. Hyperparameter values (identical for all languages) are shown in Table 2.
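In PyTorch, freezing pretrained embeddings simply means disabling gradients on their parameters; a minimal sketch with a stand-in embedding layer (not the actual multilingual BERT encoder):

```python
import torch.nn as nn

# Stand-in for the pretrained multilingual encoder; freezing its
# parameters preserves the cross-lingual alignment of the embeddings.
embedder = nn.Embedding(1000, 32)
for p in embedder.parameters():
    p.requires_grad_(False)

# Only parameters that still require gradients go to the optimizer.
trainable = [p for p in embedder.parameters() if p.requires_grad]
```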

Results on Universal Proposition Bank
We compared our model against several baselines on the UPB test set. These include two transfer methods, Bootstrap (Aminian et al., 2017) and CModel (Aminian et al., 2019), which perform annotation projection through parallel data and filter word alignments empirically. We also report the results of two strong mixture-of-experts models, one which combines language-specific features automatically (MOE; Guo et al. 2018) and one which learns language-invariant features with a multinomial adversarial network as a shared feature extractor (MAN-MOE; Chen et al. 2019). We further include a recently proposed translation-based model (PGN; Fei et al. 2020) which performs competitively on UPB; this system directly translates the source annotated corpus into the target language, and then performs annotation projection and filtering similar to Bootstrap and CModel. Table 3 shows labeled F-scores (using automatically predicted predicate senses) on the test portion of the Universal Proposition Bank. The various languages are ordered according to their typological distance to English based on word order (Ahmad et al., 2019a), with Portuguese being closest and Finnish farthest. As can be seen, our model outperforms previous systems on DE, FR, and PT, and is better on average. It is worth noting that, in addition to pretrained word-alignment tools, both Bootstrap and PGN utilize supervised part-of-speech (POS) tags for the target language. Our model nevertheless achieves the best average F-score (61.1%) without employing any additional features. Pairwise differences in F1 between our model and MAN-MOE, CModel, and PGN are all statistically significant (p < 0.05) using stratified shuffling (Noreen, 1989).

Results on Human-labeled Data
As UPB annotations are semi-automatic and possibly contain projection errors, we further compared our model against manual annotations for French, German, and Chinese (see Table 1). Since previous models have not reported results on these datasets, we re-implemented three strong comparison systems, i.e., CModel, MAN-MOE, and PGN. Details on our implementation are in the Appendix. Our results are summarized in Table 4, where languages are ordered in terms of their word order distance to English (Ahmad et al., 2019a). We note that our approach significantly outperforms previously published models on all three languages. All systems perform best on French, which is perhaps unsurprising given that it is closest to English, and worst on Chinese, which is least related to English. This suggests that transferring SRL annotations between languages with similar word orders is an easier task.

Ablation Study and Analysis
To investigate the contribution of the semantic role compressor and cross-lingual training, we conducted a series of ablation studies on the manually annotated DE, FR, and ZH datasets. Evaluation in these experiments excludes the accuracy of predicate disambiguation, since we wish to focus on the SRL model per se.
Our experiments are summarized in Table 5. The first block shows the performance of the full model. In the second block, we assess the effect of different kinds of word representations. First, we substitute multilingual BERT embeddings with MUSE embeddings, which were obtained by aligning (monolingual) fastText embeddings for various languages onto a universal space. We can see that the performance of our model drops significantly. One important reason is that MUSE embeddings are not contextualized; as a result, a word appearing multiple times in the same sentence will receive the same embedding, even when it occupies different semantic roles, which in turn leads to conflicts during decompression. One solution is concatenating MUSE with word position embeddings during compression and decompression (see Appendix for details). This improves SRL performance from 47.7% (DE), 52.6% (FR), and 44.5% (ZH) to 55.3%, 60.5%, and 53.0%, but is still inferior to the original model. Next, we remove Gaussian noise from the model; the resulting drop in performance indicates that it further boosts SRL accuracy.
In the third block, we remove cross-lingual training and observe a significant drop in F-score over the full model. In order to verify the need for semantic role compression, we substitute the compressor with an attention-based module (Bahdanau et al., 2015) and proceed to train our model as described in Section 2.3. Specifically, we obtain soft alignments and use these to weight all annotations P_θ(r|w_i, w_p, S), thereby obtaining an expectation over role assignments. The alignment module and the basic semantic role labeler are trained jointly during cross-lingual training. We can see that performance drops substantially for all three languages compared to the full model. The reason might be that the output of the semantic role labeler is noisy and attention often creates labeling conflicts (e.g., two words showing high confidence for the same semantic role). Our compressor, in contrast, can filter out this noise and resolve conflicts more effectively.

In Table 6, we present model performance for French and Chinese for different (gold) role labels. We compare the full model against an SRL-only model without cross-lingual training. As shown in Table 6, cross-lingual training improves SRL performance in French and Chinese on all semantic roles. For French, the most significant improvement comes from A1; for Chinese, cross-lingual training benefits the labeling of A1 and A2 significantly. Compared with A0, A1, and A2, the improvements on AM-* (modifiers of the current predicate) are modest for both French and Chinese. One possible reason is that the head words of A0, A1, and A2 are usually nouns or adjectives, which tend to have fixed positions in parallel sentence pairs. Modifiers, however, can be optional and occupy more varied positions within and across languages, which makes cross-lingual learning harder.

Related Work
There has been a great deal of interest in cross-lingual transfer learning for SRL (Padó and Lapata, 2009; van der Plas et al., 2011; Kozhevnikov and Titov, 2013; Tiedemann, 2015; Zhao et al., 2018; Chen et al., 2019; Aminian et al., 2019; Fei et al., 2020). The majority of previous work has focused on two types of approaches, namely annotation projection and model transfer.
A variety of methods have been proposed to improve the quality of annotation projections in the face of alignment noise. These range from word and argument filtering techniques (Padó and Lapata, 2005, 2009), to learning syntax and semantics jointly (van der Plas et al., 2011), and iterative bootstrapping (Akbik et al., 2015; Aminian et al., 2017). In an attempt to reduce the reliance on supervised lexico-syntactic features for the target language, Aminian et al. (2019) make use of word and character features, and filter projected annotations according to projection density. Model transfer does not require parallel corpora or word alignment tools; nevertheless, it relies on accurate features such as POS tags (McDonald et al., 2013) or syntactic parse trees (Kozhevnikov and Titov, 2013) to generalize across languages. Adversarial training is commonly used to extract language-agnostic features, thereby improving the performance of cross-lingual systems (Chen et al., 2019; Ahmad et al., 2019b).
Translation-based approaches have been gaining popularity in cross-lingual dependency parsing (Rasooli and Collins, 2015; Tiedemann, 2015) and have recently been applied to SRL (Fei et al., 2020). Daza and Frank (2019b) propose a cross-lingual encoder-decoder model that simultaneously translates and generates sentences with semantic role annotations in a resource-poor target language. Rather than creating annotations or models for a target language, other work aims to exploit the similarities between languages. Mulcaire et al. (2018) combine resources for multiple languages to create polyglot semantic role labelers and show that polyglot training can result in better labeling accuracy than a monolingual labeler.
An obstacle to developing cross-lingual SRL models is the absence of a unified annotation scheme for all languages. Although the CoNLL-09 shared task (Hajič et al., 2009) provides annotations for seven languages, the labeling schemes and role sets are not shared. To this end, van der Plas et al. (2010) built a French SRL dataset following an annotation scheme similar to that of CoNLL-09 for English. Some recent cross-lingual SRL models (Aminian et al., 2017, 2019; Fei et al., 2020) make use of the publicly available Universal Proposition Bank (UPB; Akbik et al. 2015; Akbik and Li 2016), which annotates predicates and semantic roles following the English Proposition Bank 3.0 (Palmer et al., 2005). Since annotation projection is involved in the construction of UPB, its quality is also influenced by the quality of the parallel data, the performance of the source-language SRL model, and the accuracy of the alignment tools.

Conclusions
In this paper we developed a cross-lingual SRL model and demonstrated that it can effectively leverage unlabeled parallel data without relying on word alignments or any other external tools. We have also contributed two quality-controlled datasets (compatible with PropBank-style guidelines) which we hope will be useful for the development of cross-lingual models. Directions for future work are many and varied. Although our focus has been on dependency-based SRL, our model can be easily adapted to span-based annotations (Carreras and Màrquez, 2005; Pradhan et al., 2013). In this case, the semantic role compressor could be modified to represent entire spans rather than just head words, while decompression would remain unchanged (it would still output a probability distribution over all semantic roles for each word). We also plan to extend our framework to semi-supervised learning, where a small number of annotations might also be available in the target language.

A Positional Features

We adopt these new positional features for two reasons. First, during cross-lingual training, the lengths of parallel sentences S_S and S_T usually differ. More importantly, for the i-th word w_i in S_S, its counterpart w_j in S_T is in most cases not the i-th word of S_T. When performing cross-lingual training, it is important that w_i and w_j have the same position embeddings, so that they obtain similar results after decompression. As shown in Figure 2, "he" ("il" in French) appears twice in the English sentence, and its French counterpart shares the same positional features. Experimental results show that positional features can effectively improve cross-lingual training. However, there are still cases where the word order changes dramatically after translation and our position features do not help. The only solution seems to be to use contextualized embeddings such as multilingual BERT or multilingual ELMo, where every word in a sentence is assigned a unique embedding.
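The positional scheme above can be sketched as follows. Dimensions are illustrative, and the positions are assumed to be supplied externally so that aligned counterparts in a parallel pair share the same positional feature.

```python
import torch
import torch.nn as nn

# Sketch: concatenating non-contextual (MUSE-style) word embeddings with
# learned position embeddings. The shared positions for aligned words in
# a parallel sentence pair are an external input in this sketch.
d_word, d_pos, max_len = 300, 32, 128
pos_table = nn.Embedding(max_len, d_pos)

def with_positions(word_embs, positions):
    """word_embs: (n, d_word) float tensor; positions: (n,) long tensor."""
    return torch.cat([word_embs, pos_table(positions)], dim=-1)
```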

B External Tools
When implementing previous models, we used Google Translate as our translation engine and giza++ to obtain word alignments. Besides the source-language corpus, the translated corpus is also used for training PGN and MAN-MOE. When preprocessing the Chinese side of the EN-ZH parallel corpus (containing about 5 million sentence pairs), we use Jieba for tokenization. The Chinese test set in UPB is in traditional Chinese, and we use Zhtools to convert it to simplified Chinese so as to be compatible with our EN-ZH parallel corpus, which is also in simplified Chinese.