Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus

Semantic role labeling (SRL), which is crucial for natural language understanding, has attracted a great deal of research attention. Supervised approaches achieve impressive performance when large-scale corpora are available for resource-rich languages such as English, but for low-resource languages with no annotated SRL dataset, it remains challenging to obtain competitive performance. Cross-lingual SRL is one promising way to address this problem, and has achieved great advances with the help of model transferring and annotation projection. In this paper, we propose a novel alternative based on corpus translation, constructing high-quality training datasets for the target languages from the gold-standard SRL annotations of the source language. Experimental results on the Universal Proposition Bank show that the translation-based method is highly effective, and that the automatic pseudo datasets improve target-language SRL performance significantly.


Introduction
Semantic role labeling (SRL), which aims to capture the high-level meaning of a sentence, such as who did what to whom, is an underlying task for facilitating a broad range of natural language processing (NLP) tasks (Shen and Lapata, 2007; Liu and Gildea, 2010; Genest and Lapalme, 2011; Gao and Vogel, 2011; Wang et al., 2015; Khan et al., 2015). Currently, the majority of research on SRL is dedicated to English, due to the availability of large quantities of labeled data. In this regard, cross-lingual SRL, especially transferring from a resource-rich source language (e.g., English) to a target language where labeled data is scarce or even unavailable, is of great importance (Kozhevnikov and Titov, 2013; Aminian et al., 2019).

Figure 1: Comparison with previous cross-lingual SRL approaches; our method is in the dotted blue box.

Previous work on cross-lingual SRL can generally be divided into two categories: model transferring and annotation projection. The former builds cross-lingual models on language-independent features such as cross-lingual word representations and universal POS tags, which can be transferred to target languages directly (McDonald et al., 2013; Swayamdipta et al., 2016; Daza and Frank, 2019). The latter relies on a large-scale parallel corpus between the source and target languages, where the source-side sentences are annotated automatically with SRL tags by a source SRL labeler, and the source annotations are then projected onto the target-side sentences according to word alignments (Yarowsky et al., 2001; Hwa et al., 2005; van der Plas et al., 2011; Kozhevnikov and Titov, 2013; Pado and Lapata, 2014; Akbik et al., 2015). In addition, annotation projection can be combined naturally with model transferring.
In particular, the projected SRL tags in annotation projection can contain much noise because of the automatic source-side annotations. A straightforward solution is the translation-based approach, which has been demonstrated effective for cross-lingual dependency parsing (Täckström et al., 2012; Rasooli and Collins, 2015; Guo et al., 2016). The key idea is to translate the gold-standard source training data directly into the target language, avoiding the problem of low-quality source annotations. Fortunately, due to recent great advances in neural machine translation (NMT) (Bahdanau et al., 2015; Wu et al., 2016), this approach has great potential for cross-lingual transfer.
To this end, in this paper we study the translation-based method for cross-lingual SRL. Figure 1 illustrates the differences between our method and previous approaches. Sentences of the source-language training corpus are translated into the target language, and the source SRL annotations are then projected onto the target side, resulting in a high-quality target-language SRL corpus, which is used to train the target SRL model. Further, we merge the gold-standard source corpus and the translated target corpus together, which can be regarded as a combination of the translation-based method and model transferring. Our baseline is a simple BiLSTM-CRF model using multilingual contextualized word representations (Peters et al., 2018; Devlin et al., 2019). For a better exploitation of the blended corpus, we adopt a parameter generation network (PGN) to enhance the BiLSTM module, which can capture language differences effectively (Platanios et al., 2018; Jia et al., 2019).
We conduct experiments on the Universal Proposition Bank corpus (v1.0) (Akbik et al., 2015; Akbik and Li, 2016) over seven languages. First, we verify the effectiveness of our method on single-source SRL transfer, where English is adopted as the source language and the remaining languages are used as targets. Results show that the translation-based method is highly effective for cross-lingual SRL, and that performance is further improved when the PGN-BiLSTM is used. We then conduct experiments on multi-source SRL transfer, where for each target language the remaining six languages are all used as sources. The same tendencies as in the single-source setting can be observed. We provide detailed analyses for both settings to understand the proposed method comprehensively.
In summary, we make the following two main contributions in this work:
• We present the first translation-based approach for unsupervised cross-lingual SRL. We build a high-quality pseudo training corpus for each target language, and verify the effectiveness of the corpus under a range of settings.
• We take advantage of multilingual contextualized word representations, and strengthen multilingual model training with the PGN-BiLSTM model.
All code and datasets are publicly released for research purposes.1

Related Work
There is extensive work on cross-lingual transfer learning (van der Plas et al., 2011; Kozhevnikov and Titov, 2013; Pado and Lapata, 2014; Rasooli and Collins, 2015; Tiedemann and Agic, 2016; Chen et al., 2019; Aminian et al., 2019). Model transferring and annotation projection are the two mainstream categories. The first category builds a model on the source-language corpus and then adapts it to the target languages (Yarowsky et al., 2001; Hwa et al., 2005; Tiedemann, 2015). The second category produces a set of automatic training instances for the target language with a source-language model and a number of parallel sentences, and then trains a target model on that dataset (Björkelund et al., 2009; McDonald et al., 2013; Lei et al., 2015; Swayamdipta et al., 2016; Mulcaire et al., 2018; Daza and Frank, 2019). For cross-lingual SRL, annotation projection has received the most attention (Pado and Lapata, 2014). A range of strategies has been proposed to enhance target-language SRL performance, including improving the projection quality (Tiedemann, 2015), joint learning of syntax and semantics (Kozhevnikov and Titov, 2013), iterative bootstrapping to reduce the influence of noise in the target corpus (Akbik et al., 2015), and joint translation and SRL (Aminian et al., 2019).
Our work is mainly inspired by recent work on treebank translation for cross-lingual dependency parsing (Tiedemann et al., 2014; Tiedemann, 2015; Rasooli and Collins, 2015; Guo et al., 2016; Tiedemann and Agic, 2016), referred to as translation-based approaches. These approaches directly project the gold-standard annotations onto the target side, alleviating the problem of erroneous source annotations in standard annotation projection. In addition, we combine the approach with model transferring, which has received little attention for cross-lingual SRL. Model transferring benefits much from recent advances in cross-lingual contextualized word representations. The development of universal annotation schemes for a variety of NLP tasks can greatly facilitate cross-lingual transfer, including POS tagging (Petrov et al., 2012), dependency parsing (McDonald et al., 2013; Przepiórkowski and Patejuk, 2018), morphology (Sylak-Glassman et al., 2015) and SRL (Aminian et al., 2019). Our work makes use of the publicly available Universal Proposition Bank (UPB) (Akbik et al., 2015; Akbik and Li, 2016), which annotates predicates and semantic roles following the frame and role schemes of the English Proposition Bank 3.0 (Kingsbury and Palmer, 2002; Palmer et al., 2005).
Supervised SRL models are also closely related to our work (He et al., 2017, 2018a). A great deal of work attempts an end-to-end solution with sophisticated neural networks, detecting the predicates as well as the corresponding argument roles in one shot (He et al., 2017; Tan et al., 2018). There also exist a number of studies that adapt various powerful features for the task (Strubell et al., 2018; Li et al., 2018). In this work, we exploit a multilingual PGN-BiLSTM model (Jia et al., 2019) with contextualized word representations, which obtains state-of-the-art performance for cross-lingual SRL.

SRL Translation
We induce automatic target-language data from the gold-standard source data by full translation, and then project the SRL predicates and arguments onto their corresponding words by word alignment, producing the final translated SRL corpus for the target language automatically. The method has been demonstrated effective for cross-lingual dependency parsing (Tiedemann et al., 2014; Tiedemann, 2015; Tiedemann and Agic, 2016). Compared with annotation projection, we can ensure the annotation quality on the source side, so a higher-quality target corpus is also expected. In addition, dependency-based SRL can benefit more from this method, as only the predicate words and their arguments need to be projected onto the target side, while dependency parsing must consider all sentential words. The overall process is accomplished in two steps: translating and projecting.

Translating. First, we use a state-of-the-art translation system to produce target translations of the sentences in the source SRL data. Given a source sentence e_1 · · · e_n, we translate it into f_1 · · · f_m in the target language. It is worth noting that the recent impressive advances in NMT (Bahdanau et al., 2015; Wu et al., 2016) facilitate our work greatly, enabling our method to rely on high-quality translations.
Projecting. Then we incrementally project the predicates and arguments of each source sentence e_1 · · · e_n onto its target translation f_1 · · · f_m. We adopt two kinds of information to assist the projection: (1) the alignment probabilities a(f_j | e_i) from the source word e_i to the target word f_j, which can be calculated by a word-alignment tool, and (2) the POS tag distributions p(t* | f_j) of the target sentential words, which can be derived from a supervised target-language POS tagger, where i ∈ [1, n], j ∈ [1, m], and t* denotes an arbitrary POS tag.
We focus only on the SRL-related words of the source sentence, and perform the process predicate by predicate. For each predicate in a sentence, we collect the predicate word as well as its role words, and then project their role labels onto the target sentence. Formally, for each such word e_i, we have the SRL role tag r_{e_i} as well as its POS tag t_{e_i}, both of which are already annotated in the UPB. First, we find the target word f_j with the highest alignment probability, regarding f_j as the corresponding projection carrying the semantic role r_{e_i}. Then we calculate the confidence score of this projection by the following formula:

    c(e_i → f_j) = a(f_j | e_i) · p(t_{e_i} | f_j),

which is a joint probability of word-alignment correspondence and POS tag consistency.
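The projection step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function names and the toy probabilities are ours:

```python
def projection_confidence(align_prob, pos_dist, src_pos_tag):
    """Confidence of projecting a source role word onto a target word:
    the alignment probability a(f_j | e_i) times the probability
    p(t_{e_i} | f_j) that the target word carries the source word's POS tag."""
    return align_prob * pos_dist.get(src_pos_tag, 0.0)

def project_word(align_probs, pos_dists, src_pos_tag):
    """Pick the target position j with the highest alignment probability
    (align_probs[j] = a(f_j | e_i)), then score that projection."""
    j = max(range(len(align_probs)), key=lambda k: align_probs[k])
    return j, projection_confidence(align_probs[j], pos_dists[j], src_pos_tag)
```

For example, a source NOUN aligned to target position 1 with probability 0.7, where the target tagger assigns NOUN probability 0.5, yields a confidence of 0.35.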
The one-to-one source-target alignment shown in Figure 2(a) is the ideal condition for projection. However, there can be many-to-one cases, leading to semantic role conflicts at target-language words. For these cases, we give precedence to predicate projections, and otherwise keep only the highest-confidence projection. Figure 2(b) shows a predicate-argument conflict, where the predicate projection is kept, and Figure 2(c) shows an argument-argument conflict, where the projection with the higher confidence score is kept.
Finally, we set a threshold value α to remove low-confidence projections. If the confidence score of a predicate projection is below α, all the roles of that predicate are removed as well. Argument projections whose confidence is below α are removed individually, with no influence on the other projections.
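The conflict-resolution and filtering rules for one predicate can be put together as follows. This is a hedged sketch of our reading of the rules (predicates beat arguments, ties broken by confidence, α-filtering with predicate removal cascading to its arguments); the data layout is ours:

```python
def resolve_projections(projections, alpha):
    """projections: (target_index, role, confidence, is_predicate) tuples
    for one predicate. Keeps at most one projection per target word:
    predicates take precedence over arguments, otherwise the higher
    confidence wins; projections below alpha are dropped, and if the
    predicate itself is dropped, all of its arguments are dropped too."""
    best = {}
    for proj in projections:
        j, role, conf, is_pred = proj
        cur = best.get(j)
        if cur is None:
            best[j] = proj
        elif is_pred and not cur[3]:          # predicate beats argument
            best[j] = proj
        elif is_pred == cur[3] and conf > cur[2]:  # higher confidence wins
            best[j] = proj
    kept = list(best.values())
    pred_kept = any(is_pred and conf >= alpha for _, _, conf, is_pred in kept)
    result = []
    for j, role, conf, is_pred in kept:
        if conf < alpha:
            continue                          # threshold filtering
        if not is_pred and not pred_kept:
            continue                          # predicate filtered: drop its roles
        result.append((j, role))
    return sorted(result)
```

With α = 0.4, a predicate-argument conflict at position 0 is resolved in favor of the predicate, an argument-argument conflict keeps the more confident role, and a predicate below the threshold takes its arguments down with it.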

The SRL Model
In this work, we focus on dependency-based SRL, recognizing semantic roles for a given predicate (He et al., 2017). The task can be treated as a standard sequence labeling problem, and a simple multi-layer BiLSTM-CRF model is exploited here, which has achieved state-of-the-art performance with contextualized word representations (He et al., 2018b). In particular, we adapt the model to better support multilingual inputs by using a PGN module on the BiLSTM (Hochreiter and Schmidhuber, 1997). Figure 3 shows the overall architecture.

Word Representation
Given an input sentence s = w_1 · · · w_n of a specific language L, where w_p (p denotes the position) is the predicate word, we use three sources of features to represent each word: (1) the word form, (2) the universal POS tag, and (3) a predicate indicator, where t_1 · · · t_n is the universal POS tag sequence of the input sentence. For the POS tags and the predicate indicators, we use embeddings to obtain their vectorial representations. We compare three kinds of word-form representations for cross-lingual SRL: (1) multilingual word embeddings, (2) the multilingual ELMo representation (Peters et al., 2018), and (3) the multilingual BERT representation (Devlin et al., 2019). Note that we use the averaged vector of the inner word-piece representations from the BERT outputs as the full word representation.
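The token representation described above can be sketched as a simple concatenation; the function names and dimensions here are illustrative (the embedding sizes given later are 300/1024/768 for the word form and 100 for POS and predicate indicator):

```python
import numpy as np

def bert_word_vector(piece_vectors):
    """Average the word-piece vectors of one word into a single word vector,
    as done for the BERT outputs."""
    return np.mean(piece_vectors, axis=0)

def token_representation(word_vec, pos_emb, predicate_emb):
    """x_i: concatenation of the word-form vector, the universal POS tag
    embedding, and the predicate-indicator embedding."""
    return np.concatenate([word_vec, pos_emb, predicate_emb])
```

For a BERT word-form vector of size 768 and 100-dimensional POS and predicate embeddings, each token representation x_i has 968 dimensions.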

Encoding Layer
We employ the PGN-BiLSTM (Platanios et al., 2018; Jia et al., 2019) to encode the input sequence x_1 · · · x_n. The PGN-BiLSTM was first introduced for cross-domain transfer learning to capture domain differences; here we use it in the multilingual setting to model language characteristics.
Compared with the vanilla BiLSTM module, the PGN-BiLSTM dynamically selects language-aware parameters for the BiLSTM. Let V be the flattened vector of all the parameters of a BiLSTM cell; the language-aware V_L is produced by

    V_L = W_PGN × e_L,

where W_PGN is the parameter-generation matrix that produces the vanilla-BiLSTM part of the PGN-BiLSTM, including the weights of the input, forget and output gates and the cell modules, and e_L is the embedding of language L. The parameter generation mechanism of the PGN-BiLSTM is illustrated in Figure 4. We then derive the module parameters from V_L to compute the BiLSTM outputs. The overall process can be formalized as

    h_1 · · · h_n = PGN-BiLSTM(x_1 · · · x_n, e_L),

which differs from the vanilla BiLSTM in that e_L is an extra input used to obtain the BiLSTM parameters. Specifically, we adopt a 3-layer bidirectional PGN-LSTM as the encoder.
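The parameter generation step V_L = W_PGN × e_L can be illustrated with plain matrix arithmetic; the sizes below are toy values (only the 32-dimensional language embedding matches the hyperparameters reported later), and this sketch omits reshaping V_L back into the individual gate weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, lang_dim = 240, 32                     # |V| and language-embedding size (toy)
W_pgn = rng.normal(size=(n_params, lang_dim))    # shared parameter generator
lang_emb = {"EN": rng.normal(size=lang_dim),     # language ID embeddings e_L
            "DE": rng.normal(size=lang_dim)}

def generate_params(language):
    """V_L = W_PGN @ e_L: the flattened, language-specific BiLSTM parameters."""
    return W_pgn @ lang_emb[language]

v_en = generate_params("EN")
v_de = generate_params("DE")
```

All languages share the generator W_pgn, so the per-language parameter count grows only with the language embedding, while each language still gets its own effective BiLSTM weights.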

Output Layer
Given the encoder outputs h_1 · · · h_n for sentence s = w_1 · · · w_n, we use a CRF (Lafferty et al., 2001) to compute the probability of each candidate output sequence y = y_1 · · · y_n:

    p(y | s) = (1/Z) exp( Σ_{i=1}^{n} [ (W h_i)_{y_i} + T_{y_{i-1}, y_i} ] ),

where W and T are the parameters of the CRF, and Z is a normalization factor for probability calculation. The Viterbi algorithm is used to search for the highest-probability output SRL tag sequence.

Table 1 shows the data statistics in detail.
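The Viterbi search over the CRF scores can be sketched as below. This is a generic linear-chain Viterbi decoder, not the authors' code; emissions stand for the per-token scores (W h_i) and transitions for T:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the highest-scoring label sequence under a linear-chain CRF.
    emissions[i, y] is the per-token score (W h_i)_y; transitions[y_prev, y]
    is the transition score T_{y_prev, y}."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # total[y_prev, y] = best score ending in y_prev, plus transition and emission
        total = score[:, None] + transitions + emissions[i][None, :]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

With zero transition scores the decoder reduces to a per-token argmax; non-zero transitions let the CRF penalize invalid tag sequences such as an argument continuation without its beginning.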

SRL Translation
We focus on unsupervised cross-lingual SRL, assuming that no gold-standard target-language SRL corpus is available. Our goal is to construct pseudo training datasets by corpus translation from the gold-standard source-language SRL datasets. The Google Translation System6 is adopted for sentence translation, and the fastAlign toolkit (Dyer et al., 2013) is used to obtain word alignments. In order to obtain accurate word alignments, we collect a set of parallel corpora to augment the training data of fastAlign.7 The universal POS tags of the translated sentences are produced by supervised monolingual POS taggers, trained on the corresponding UDT v1.4 datasets.8
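As a small practical note, fastAlign consumes one sentence pair per line with the source and target tokens separated by ` ||| `, and emits alignments as `i-j` index pairs. A minimal helper for preparing its input and reading its output might look like this (the function names are ours):

```python
def to_fastalign_input(bitext):
    """Format (source_tokens, target_tokens) pairs as fast_align input:
    one pair per line, tokens separated by ' ||| '."""
    return "\n".join(
        " ".join(src) + " ||| " + " ".join(tgt) for src, tgt in bitext
    )

def parse_alignments(line):
    """Parse one line of fast_align output ('i-j' pairs) into
    (source_index, target_index) tuples."""
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]
```

The parsed index pairs can then be turned into the alignment probabilities a(f_j | e_i) used by the projection step, e.g. by normalizing alignment counts per source word.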

Settings
Multilingual word representations. As mentioned in Section 4.1, we investigate three kinds of multilingual word representations: (1) Word Embedding (Emb): MUSE is exploited to align all monolingual fastText word embeddings into a universal space.9 (2) ELMo: a blended dataset10 of the seven languages is used to train the multilingual ELMo (Mulcaire et al., 2019). (3) BERT: the officially released multilingual BERT is used. (Footnote 5: we merge the Spanish and Spanish-AnCora datasets as one.)

Hyperparameters. For SRL translation, there is only one hyperparameter, the projection confidence threshold α, for filtering low-quality translated SRL sentences. Figure 5 shows the performance in preliminary experiments for each language under different values of α; accordingly, we set α to 0.4 universally for all languages. For the neural SRL models, the dimensions of the multilingual word embeddings, ELMo and BERT are 300, 1024 and 768, respectively. The POS tag, predicate-indicator and language ID embedding sizes are 100, 100 and 32, respectively. The hidden size of the LSTM is set to 650. We exploit online training with a batch size of 50, and the model parameters are optimized by the Adam algorithm with an initial learning rate of 0.0005. Training is performed over the whole training dataset without early stopping, for 80 iterations for bilingual transfer and 300 iterations for multilingual transfer.
Baselines. In order to test the effectiveness of our PGN model, we compare it with several baselines. First, we denote the variant of our model using a vanilla BiLSTM as BASIC; in particular, this model is used for all monolingual training throughout this work. Further, we adopt two much stronger baselines: the MoE model proposed by Guo et al. (2018) and the MAN-MoE model of Chen et al. (2019). We report the F1 score on each target language. Each model is trained five times and the averaged value is reported. We conduct significance tests using Dan Bikel's randomized parsing evaluation comparator.12

Cross-Lingual Transfer from English
We first conduct experiments on cross-lingual transfer from the English source to each of the six target languages, which has been a typical setting for cross-lingual investigations. The results are summarized in Table 2. We list the F1 scores obtained using only the source corpus (SRC), only the translated target corpus (TGT), and the mixture of the two (SRC & TGT), comparing the performance of different multilingual word representations as well as different multilingual SRL models.
Multilingual word representations. First, we evaluate the effectiveness of the three multilingual word representations. We compare their performance under two settings, using the SRC and TGT corpora, respectively. According to the results, the multilingual contextualized word representations (i.e., BERT and ELMo) are better in both settings, which is consistent with previous studies (Mulcaire et al., 2019; Schuster et al., 2019). Interestingly, multilingual BERT performs worse than ELMo, which can be explained by the fact that the ELMo representation is pre-trained on a corpus covering exactly the seven languages in focus. This indicates that the officially released multilingual BERT can be further improved, since monolingual BERT has been demonstrated to outperform ELMo (Tenney et al., 2019).
Translated target. Next, we use only the translated target data for training, to examine the effectiveness of the pseudo datasets. As shown in Table 2, the translated datasets bring significantly better performance than the source baseline over all languages, with an average F1 increase of 51.1 − 44.4 = 6.7. The results demonstrate that corpus translation is an effective approach to cross-lingual SRL. This observation is in line with previous work on cross-lingual dependency parsing (Tiedemann and Agic, 2016). By translating the gold-standard corpus directly, the produced pseudo training data not only retains high-quality SRL annotations but also captures the language divergences effectively, leading to better performance than the source baseline model.

Combining source and pseudo target. Further, we combine the pseudo translated target corpus with the source-language corpus to train the target SRL models. According to the numbers in Table 2, further gains are achieved for all languages, with an average improvement of 55.8 − 51.1 = 4.7 (BASIC is used for a fair comparison). Since several source sentences are filtered during translation, which might account for part of the gains, we make a fairer offline comparison by setting α = 0 (i.e., no sentence filtering); similar gains are still achieved. Considering that the translated sentences are semantically equivalent to their gold-standard source counterparts, the possible reasons are twofold: (1) the translated sentences may be biased in linguistic expression due to the data-driven translation models, and (2) the annotations discarded due to conflicts during corpus translation are important and complementary to our model.
Language-aware encoder. Finally, we investigate the effectiveness of the PGN-BiLSTM module, which is exploited to capture language-specific information when the mixed corpus of source and target datasets is used for training. As shown in Table 2, the language-aware PGN encoder boosts the F1 scores significantly, with an average improvement of 60.3 − 55.8 = 4.5. In addition, we report the results of MoE and MAN-MoE, which also exploit language information. All the results demonstrate the usefulness of language-specific information, and our PGN model is the most effective.

Multi-Source Transfer
Further, we investigate multi-source transfer learning, where for a given target language all the other languages are used as sources, aiming to study the effectiveness of our translation-based method comprehensively.
Overall performance. The results of multi-source SRL transfer are shown in Table 3. Generally, the results share similar tendencies with the single-source transfer from English: multilingual ELMo performs best, the SRL models trained on the translated target datasets outperform those trained on the source datasets, and the mixed corpora of source and target datasets bring the best performance, which can be further improved by our final PGN model with its language-aware encoder. We also compare the PGN model with MoE and MAN-MoE, showing slightly better performance, which indicates the effectiveness of the PGN-BiLSTM module. In addition, the multi-source models outperform the single-source models in all cases, which is intuitive and consistent with previous studies (Lin et al., 2019).

Fine-grained bilingual transfer. Next, we investigate individual bilingual SRL transfer by examining the performance of each source-target language pair, aiming to uncover which languages benefit a target most and whether all source languages are useful for a given target. Table 4 shows the results, where the cross-lingual models are trained on the mixture of the source and translated target datasets. First, languages belonging to the same family benefit each other greatly, bringing better performance than the other languages in the majority of cases (i.e., EN-DE and FR-IT-ES-PT). Second, the multi-source transfer, indicated by All, obtains better performance across all languages, which further demonstrates its advantage over single-source transfer. We then look into the PGN model in detail, aiming to understand its capability of modeling language-specific information.
We examine this by visualizing the language ID embeddings e_L: for each source-target language pair, we compute the Euclidean distance between their embeddings. Intuitively, better performance should be achieved when the distance between the target and source languages is smaller. Figure 6 shows the heatmap matrix. The overall tendency is highly similar to the results in Table 4, which is consistent with our intuition.
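The distance matrix behind such a heatmap is straightforward to compute from the learned language embeddings; a minimal sketch, with toy two-dimensional embeddings standing in for the learned 32-dimensional e_L:

```python
import numpy as np

def embedding_distances(lang_embs):
    """Pairwise Euclidean distances between language ID embeddings e_L.
    A smaller source-target distance is expected to correlate with
    better cross-lingual transfer."""
    langs = sorted(lang_embs)
    return {(a, b): float(np.linalg.norm(lang_embs[a] - lang_embs[b]))
            for a in langs for b in langs}
```

Feeding the resulting dictionary into any heatmap plotting routine reproduces a matrix like Figure 6.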

Analysis
Here we conduct detailed analyses to understand the gains from the translated target datasets. We select three representative languages, German (DE), French (FR) and Finnish (FI), one for each family.

Performance by SRL role. First, we investigate cross-lingual SRL performance in terms of SRL roles. We select four representative roles for comparison, A0 (Agent), A1 (Patient), A2 (Instrument, Benefactive, Attribute) and AM-TMP (Temporal), and report their F1 scores. Figure 7 shows the results. Overall, the role A0 achieves the best F1 scores across all languages and models, A1 ranks second, and A2 and AM-TMP are slightly worse. This tendency can be accounted for by the distribution of the labels: A0 is the most frequent, and A2 and AM-TMP have lower frequencies than A0 and A1. A second possible reason is that most A0 and A1 words are notional words, which may be more easily transferred by cross-lingual models.
In addition, the tendencies across the different models are identical for all three languages and all labels: multi-source transfer performs best, single-source SRC & TGT ranks second, and our baseline model is last. This observation is consistent with the overall tendency, demonstrating the stability of our proposed models and further verifying their effectiveness.
Performance by distance to the predicate. Second, we study SRL performance in terms of the distance between the role word and the predicate word. Intuitively, long-distance relations are more difficult, so we expect SRL performance to decrease as the distance increases, since SRL essentially detects the relationship between role words and their predicates. Figure 8 shows the F1 scores. First, for all settings, SRL performance drops with longer distances, confirming our intuition. In addition, the tendency across the different models is the same as in the overall results, demonstrating the effectiveness of our method.
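A distance-bucketed breakdown of this kind can be computed as follows. The bucket edges and the use of simple per-bucket accuracy are our illustrative choices, a rough stand-in for the per-distance F1 curves of Figure 8:

```python
from collections import defaultdict

def distance_bucket(distance, edges=(1, 3, 7)):
    """Map |position(argument) - position(predicate)| to a coarse bucket."""
    for e in edges:
        if distance <= e:
            return f"<={e}"
    return f">{edges[-1]}"

def accuracy_by_bucket(predictions):
    """predictions: (distance, correct) pairs; returns accuracy per
    distance bucket."""
    total, hits = defaultdict(int), defaultdict(int)
    for dist, correct in predictions:
        b = distance_bucket(dist)
        total[b] += 1
        hits[b] += int(correct)
    return {b: hits[b] / total[b] for b in total}
```

Plotting the per-bucket scores against the bucket edges makes the expected performance drop at longer distances directly visible.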

Conclusion
We proposed a translation-based alternative for cross-lingual SRL. The key idea is to construct high-quality datasets for the target languages by corpus translation from the gold-standard SRL annotations of the source languages. In addition, we combined the gold-standard source SRL corpora and the pseudo translated target corpora to enhance the cross-lingual SRL models. We investigated cross-lingual SRL models with different kinds of multilingual word representations. Further, we presented a PGN-BiLSTM encoder to better exploit the mixed corpora of different languages. Experimental results on the UPB v1.0 dataset show that the translation-based method is effective for cross-lingual SRL transfer. Significant improvements are achieved by using the translated datasets for all selected languages, in both single-source and multi-source transfer. Detailed analyses are offered to understand the proposed method in depth.

A Bilingual Transfer by Each Source
In the paper, we investigate individual bilingual SRL transfer for each source-target language pair. Here we list the detailed results of the bilingual transfer in Table 5.

B Extended SRL Analysis
We conduct further detailed analyses of role labeling and distance to the predicate for the Italian, Spanish and Portuguese languages. Figures 9 and 10 show the results.