CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling

Semantic role labeling (SRL) identifies predicate-argument structures in a given sentence. Although different languages have different argument annotations, polyglot training, the idea of training one model on multiple languages, has previously been shown to outperform monolingual baselines, especially for low-resource languages. In fact, even a simple combination of data has been shown to be effective with polyglot training by representing the distinct vocabularies in a shared representation space. Meanwhile, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages (e.g. adjuncts have broadly similar semantic meaning across languages). To leverage such similarity in annotation space across languages, we propose a method called Cross-Lingual Argument Regularizer (CLAR). CLAR identifies such linguistic annotation similarity across languages and exploits this information to map the target language arguments using a transformation of the space on which the source language arguments lie. Our experimental results show that CLAR consistently improves SRL performance over monolingual and polyglot baselines for multiple low-resource languages.


Introduction
Semantic Role Labeling (SRL) is the task of labeling each predicate and its corresponding arguments in a given sentence. SRL provides a more stable meaning representation across syntactically different sentences and has been shown to help a wide range of NLP applications such as question answering (Maqsud et al., 2014; Yih et al., 2016) and machine translation (Shi et al., 2016).

* Work done while at IBM Research.

Recent end-to-end deep neural networks for SRL, though performing well for languages with large training data (Tan et al., 2018), are much less effective for low-resource languages due to the very limited annotated data for these languages. Methods such as polyglot training (Mulcaire et al., 2018) seek to make these models perform better on low-resource languages by combining supervision from multiple languages. The key idea in polyglot training is to combine the training data from multiple languages by using multilingual word embeddings from a shared space and a common encoder model (e.g. an LSTM). The argument sets for the languages are kept separate by using different classification layers, because the semantic label spaces are usually language-specific (Mulcaire et al., 2018).
However, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages. Fig. 1 shows three different sentences from Chinese, German, and English, respectively, with defined predicate-argument structures. Although the predicates are essentially the same, their arguments are labeled differently across languages in the training data. For instance, all sentences contain words representing the same underlying semantic meaning that is temporal but with different argument labels (TMP in Chinese, A4 in German, AM-TMP in English).
We hypothesize that we can improve the SRL performance of low resource languages during cross-lingual transfer by identifying such arguments with similar semantic meaning across languages and representing them close to each other in the feature space. This requires: (1) Detecting the correspondence between the labels in different languages; and (2) Representing arguments with similar semantic meaning in the feature space for better SRL performance.
We propose a method called Cross-Lingual Argument Regularizer (CLAR) with a two-step process: Step 1: Pair Matching: Detecting a number of label pairs between the source and target languages during polyglot training. We call these arguments common arguments. Given the multilingual embedding already used in polyglot training, CLAR does not require additional cross-lingual alignments on parallel data.
Step 2: Regularization: Given the identified common arguments, find a transformation that brings the paired arguments close together. This transformation is learned and used during the polyglot training process so that knowledge of the labels in the source language can be better transferred to the corresponding labels in the target language.
We evaluate CLAR on the SRL portion of the CoNLL 2009 dataset (Hajič et al., 2009) and compare its performance against baseline and polyglot training methods. The main contributions of this work are:
• We propose CLAR, a simple yet effective method for better cross-lingual transfer by detecting similar semantic role arguments between languages without requiring cross-lingual alignments or parallel data, and by learning a transformation for paired labels via regularization during SRL model training.
• We conduct comprehensive empirical studies and demonstrate the effectiveness of CLAR over both monolingual and polyglot baselines.
• We perform an ablation study and detailed analysis to understand why CLAR leads to better cross-lingual transfer and how its performance differs with different levels of correspondence among arguments.
The rest of the paper is organized as follows: Sec. 2 describes the base model. Sec. 3 describes CLAR. Sec. 4 demonstrates its efficacy with extensive empirical evaluation. Sec. 5 reviews the existing literature. Sec. 6 makes concluding remarks.
Model Architecture

As shown in Fig. 2, our model architecture consists of four main modules: (1) the sentence encoder takes the raw tokens sequentially and outputs a fixed sentence representation; (2) the role labeler takes the sentence encoder output and identifies and predicts the roles of the tokens; (3) the predicate sense disambiguator takes the sentence encoder output and predicts the sense of each predicate; and (4) the CLAR regularizer first detects the common arguments and then learns a manifold on which the arguments of the target language lie. We now describe each of these modules in more detail.

Sentence Encoder
Word Representation: Knowing the predicate position has previously been shown to improve the argument labeling task, and since the predicate position is marked in the CoNLL 2009 dataset, we use this information to obtain a predicate-specific word representation for each word in the sentence. In addition to the predicate-specific flag w_i^f, we represent each word w_i in the sentence by several word features: randomly initialized word embeddings w_i^r, pre-trained word embeddings w_i^p, randomly initialized lemma embeddings w_i^l, and randomly initialized POS tag embeddings w_i^s. Finally, each word is represented as the concatenation of these features. Since we combine the resources from a pair of languages, similar to polyglot training (Mulcaire et al., 2018), we use language-specific pre-trained word embeddings for w_i^p and train the SRL model on the source and target language simultaneously.

(Figure 2: A multitask framework for predicate sense disambiguation and argument classification with CLAR argument regularization.)

BiLSTM Encoder: To model the sequential input, we use bi-directional Long Short-Term Memory networks (Hochreiter and Schmidhuber, 1997), which take the concatenated word representations of the j-th sentence x_j = (w_j1, w_j2, ..., w_jn) and process them sequentially from both directions to obtain contextual representations.
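The word representation above is a simple concatenation of per-word feature vectors plus the predicate flag. A minimal sketch (the embedding dimensions and lookup tables here are hypothetical stand-ins for the trained embeddings):

```python
# Sketch: predicate-specific word representation as feature concatenation.
# The embedding tables below are random stand-ins for the learned ones.
import random

random.seed(0)

def make_table(vocab, dim):
    """Randomly initialized embedding lookup table."""
    return {w: [random.gauss(0, 0.1) for _ in range(dim)] for w in vocab}

vocab = ["the", "cat", "sat"]
word_emb = make_table(vocab, 8)              # randomly initialized word embeddings (w^r)
pre_emb = make_table(vocab, 8)               # stand-in for pre-trained embeddings (w^p)
lemma_emb = make_table(vocab, 4)             # lemma embeddings (w^l)
pos_emb = make_table(["DT", "NN", "VB"], 2)  # POS-tag embeddings (w^s)

def represent(word, pos, is_predicate):
    flag = [1.0] if is_predicate else [0.0]  # predicate-specific flag (w^f)
    return (flag + word_emb[word] + pre_emb[word]
            + lemma_emb[word] + pos_emb[pos])

vec = represent("sat", "VB", is_predicate=True)
# dimension = 1 + 8 + 8 + 4 + 2 = 23
```

The concatenated vector is what the BiLSTM encoder consumes, one vector per token.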

Semantic Role Labeler
Our role labeler consists of Multi-Layer Perceptron (MLP) layers with highway connections (Srivastava et al., 2015). It takes the contextualized word representations from the sentence encoder as input and outputs a probability distribution over the set of argument labels for each word in the sentence. Given a sentence, we maximize the likelihood of the labels by minimizing the negative log-likelihood

L_SRL = - (1/N) sum_{i=1..N} log p(y_i | w_i; θ),

where y_i is the argument label, w_i represents the input token, θ represents the model parameters, and N denotes the total number of samples.
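The objective amounts to the average negative log-likelihood of the gold label at each token. A minimal sketch (the per-token probability distributions here are toy values standing in for the MLP's softmax outputs):

```python
import math

def nll_loss(probs_per_token, gold_labels):
    """Average negative log-likelihood of the gold argument labels.

    probs_per_token: list of dicts mapping label -> predicted probability
    gold_labels:     list of gold argument labels, one per token
    """
    total = 0.0
    for probs, y in zip(probs_per_token, gold_labels):
        total += -math.log(probs[y])
    return total / len(gold_labels)

# Toy example: two tokens, three labels; gold labels get high mass
probs = [{"A0": 0.7, "A1": 0.2, "O": 0.1},
         {"A0": 0.1, "A1": 0.8, "O": 0.1}]
loss = nll_loss(probs, ["A0", "A1"])
```

Minimizing this loss is equivalent to maximizing the likelihood of the gold labels, as stated above.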

The CLAR Algorithm
The underlying motivation for polyglot training (Mulcaire et al., 2018) is that arguments from different languages often help enhance each other. It is reasonable to assume that if corresponding arguments from the source and target languages are located closer together in the feature space, their mutual enhancement can be strengthened. The possibility of doing so is based on the following observation.
In neural network models that generate labels, the last layer is usually a softmax layer of the form

y_i = softmax(H a_i),

where y_i ∈ R^k, its k components corresponding to the k output argument labels. Given a_i ∈ R^m as the representation of input token i computed by the previous layers, the rows h_1, ..., h_k of the weight matrix H are responsible for distinguishing the different argument labels from each other. During simple polyglot training, the k argument labels consist of k_s labels for the source language and k_t labels for the target language. Splitting the rows h_i into two sets, u_i for the source language and v_i for the target language, we observe that the Euclidean distance between u_i and v_j is often small when i and j are corresponding argument labels. These vectors can be brought even closer together by an affine transform (a linear transform plus a translation). We therefore propose the following approach (CLAR) consisting of two steps:

Step 1: Pair Matching: Detect the best pairing of the arguments between a pair of languages.
Step 2: Regularization: Find a transformation that brings the feature vectors corresponding to the paired argument labels close to each other.
These two steps are described in detail below.
Pair Matching: The goal of this step is to identify matching label pairs in the two languages. We start with simple polyglot training (Mulcaire et al., 2018) for the first few epochs without CLAR and collect the last-layer weights for all the source and target language arguments. Given the k_s vectors u_i and the k_t vectors v_j, we solve the constrained optimization problem

minimize_T  sum_{i,j} T_ij ||u_i - v_j||^2
subject to  sum_j T_ij <= 1 for all i,  sum_i T_ij <= 1 for all j,
            sum_{i,j} T_ij >= K,  T_ij ∈ {0, 1}.   (3)

Intuitively, this finds pairings between i and j such that the total squared distance between paired vectors (u_i, v_j) is minimized, subject to the constraints that each source argument matches at most one target argument and vice versa, and that at least K = min(k_t, k_s) argument pairs are identified. This identifies K semantically similar argument pairs in the source and target languages, represented in the binary matrix T, where T_ij = 1 means that argument i in the source language and argument j in the target language are paired together. Later on (Sec. 4.5) we will show that in certain situations it makes sense to relax the "at most one" constraint and allow many-to-one or one-to-many matching. This is an Integer Linear Programming problem, for which many excellent solvers exist; we use the GLPK solver from CVXOPT.
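Because the filtered label sets are small, the one-to-one pairing in (3) can be illustrated with an exhaustive search over assignments. This is a toy sketch, not the GLPK ILP solver the paper uses, but it reaches the same optimum for small inputs:

```python
from itertools import permutations

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pair_match(U, V):
    """Minimize total squared distance over one-to-one pairings.

    U: source-argument weight vectors; V: target-argument weight vectors.
    Returns a list of (i, j) pairs; K = min(len(U), len(V)) pairs are formed.
    """
    if len(U) <= len(V):
        # Assign each source vector to a distinct target vector.
        best = min(permutations(range(len(V)), len(U)),
                   key=lambda p: sum(sq_dist(U[i], V[j])
                                     for i, j in enumerate(p)))
        return [(i, j) for i, j in enumerate(best)]
    # Symmetric case: fewer target vectors than source vectors.
    return [(i, j) for j, i in pair_match(V, U)]

# Toy weight vectors: source labels {A0, A1}, target labels {ZH-A1, ZH-A0}
U = [[0.0, 0.0], [5.0, 5.0]]
V = [[4.9, 5.1], [0.1, -0.1]]
pairs = pair_match(U, V)  # A0 pairs with V[1], A1 pairs with V[0]
```

For the real label-set sizes an ILP or the Hungarian algorithm replaces the factorial search; the objective and constraints are otherwise the same as in (3).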
We observe that the frequency distribution of the argument labels is quite skewed in the training dataset: a few labels (e.g., A0, A1) have a much larger number of training examples than the other labels. Experiments show that low-frequency labels cause noisy pair matching that degrades the output quality. Therefore, we consider only labels that account for more than 1% of the total number of occurrences in the respective language's training data. Typically, 40-50% of the labels in each language meet this criterion. The k_s and k_t in the general algorithm are replaced by the numbers of arguments satisfying this criterion in the source and the target language, respectively.
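The frequency filter can be sketched as follows (the 1% threshold is the one stated above; the label counts are toy values):

```python
from collections import Counter

def frequent_labels(label_sequence, threshold=0.01):
    """Keep only labels covering more than `threshold` of all occurrences."""
    counts = Counter(label_sequence)
    total = sum(counts.values())
    return {lab for lab, c in counts.items() if c / total > threshold}

# Toy data: A0/A1 dominate, RARE appears once in 1000 labels (0.1%)
labels = ["A0"] * 600 + ["A1"] * 300 + ["AM-TMP"] * 99 + ["RARE"]
kept = frequent_labels(labels)  # RARE falls below the 1% threshold
```

Only the kept labels on each side enter the pair-matching problem (3).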
Regularization: The goal of this step is to learn an affine transform to bring the target vectors closest to the corresponding source vectors. This step is performed iteratively during the overall training process.
Given the K pairs (u_i, v_i) detected in the previous step, the overall optimization objective is amended as

L = L_SRL + λ sum_{i=1..K} ||u_i - (Ψ v_i + b)||^2,   (4)

where Ψ v_i + b is the affine transform that brings v_i close to u_i, L_SRL is the role-labeling loss, and λ controls the strength of the amendment by the paired labels. The transformation (Ψ, b) is learned iteratively by minimizing (4).
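The added penalty pulls each transformed target vector toward its paired source vector. A minimal sketch of the regularization term alone (Ψ and b are toy values here; in training they are learned jointly with the rest of the model):

```python
def clar_penalty(pairs_u, pairs_v, Psi, b, lam):
    """lambda * sum_i || u_i - (Psi v_i + b) ||^2 over the K matched pairs."""
    def affine(v):
        # Psi v + b, with Psi given as a list of rows
        return [sum(Psi[r][c] * v[c] for c in range(len(v))) + b[r]
                for r in range(len(Psi))]
    total = 0.0
    for u, v in zip(pairs_u, pairs_v):
        t = affine(v)
        total += sum((ui - ti) ** 2 for ui, ti in zip(u, t))
    return lam * total

# Toy pair: identity transform and zero translation leave a residual of 0.25
U = [[1.0, 2.0]]
V = [[1.0, 1.5]]
Psi = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
penalty = clar_penalty(U, V, Psi, b, lam=1.0)  # (2.0 - 1.5)^2 = 0.25
```

During training this term is added to the SRL loss, and gradients flow into Ψ, b, and the target-argument weight rows.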

Setup
We compare CLAR with several monolingual and polyglot methods. For the monolingual baselines, we train a separate SRL model for each language. For the Polyglot and CLAR methods, we train the SRL model on a pair of languages. We use pre-trained multilingual embeddings to allow multilingual sharing between languages: Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al., 2017) for all languages except Chinese, for which we use fastText aligned word embeddings (Joulin et al., 2018). We also use the pre-trained BERT multilingual cased embeddings (Devlin et al., 2019). As in Mulcaire et al. (2018), we also report polyglot results with our model architecture, keeping the same word representation, to avoid any ambiguity in the comparison between Polyglot and CLAR.

Results
Comparison Against Polyglot and Monolingual Training: Table 1 summarizes the performance of CLAR and all baselines for SRL. As can be seen, for both MUSE and BERT embeddings, CLAR results in better SRL models than those obtained via monolingual and polyglot training for all target languages. The improvement is particularly noticeable for languages with much fewer (< 1/3) training samples than EN (e.g. DE and ES). This result confirms that CLAR can effectively transfer knowledge from a high-resource language (EN) to languages with fewer resources. Note that for CS, neither CLAR nor polyglot training shows a performance gain over the baseline: CLAR outperforms the polyglot baseline but remains on par with the monolingual baseline. We investigate this further in Section 4.5.
Comparison Against SoTA: With the powerful BERT multilingual embeddings, CLAR surpasses the best previously reported results on 3 out of 6 languages (Table 1). In fact, its average performance surpasses that of any previously reported single system. The strong performance of CLAR confirms its promise for cross-lingual transfer. The knowledge transfer enabled by CLAR also helps improve the performance of languages with abundant training data: as illustrated in Table 2, transferring knowledge using CLAR from other languages to EN leads to small but consistent improvements for EN.

CLAR Performance on Arguments Alone:
Since CLAR mainly affects role labeling, we conduct further analysis of its performance on argument classification alone (i.e. predicate sense disambiguation is not evaluated). The results are summarized in Table 3 for Base SRL + MUSE embedding. One can observe that for all target languages, CLAR registers small but noticeable improvements (0.24% to 1.51%) for argument classification in comparison to both monolingual and polyglot methods. The consistent improvements confirm the effectiveness of CLAR in enabling better cross-lingual transfer.

What does CLAR do?
The results of our comparison studies clearly demonstrate that CLAR outperforms both baseline and polyglot training methods. In this subsection we first explain the intuition behind CLAR and then investigate how it regularizes the arguments.
Intuition: During polyglot training we examine the last-layer weights of the base SRL model and hypothesize that there exists a mapping between source and target language arguments. To evaluate this hypothesis, in Fig. 3 we project the output-layer weights learned by polyglot training (Row I) onto the two directions corresponding to the two largest singular values obtained via SVD. We draw a line between the arguments that are paired by Equation (3). As can be seen, the Euclidean distances between some of the paired arguments are similar. For instance, the Euclidean distance between the arguments A1 and ZH-A1 is similar to that between A2 and ZH-A2 in Fig. 3b. This pattern emerges from the training data for most of the target languages. Further, we observe that the Euclidean distances among the common arguments are also similar for the source and target languages. For example, in Fig. 3b, the Euclidean distance between the source (EN) arguments A1 and A2 is similar to that between the target language arguments ZH-A1 and ZH-A2. This observation holds true for most of the arguments across the target languages (Fig. 3a-3c).
The above observations confirm that there exist similar arguments in the source and target languages. The arguments in the target language lie on a manifold that is similar in structure, up to some translation and/or rotation, to the manifold on which the source language arguments lie.

Argument Matching and Regularization:
Therefore, we first match the arguments with similar meanings in the target and the source language. We observe that almost all the matched argument pairs have similar meaning: some are syntactically visible (e.g. ES-argM-adv in ES and AM-ADV in EN), whereas others are semantically similar (e.g. ES-argM-fin and AM-PNC both meaning purpose). After obtaining the matched argument pairs, we regularize the output-layer weights of the matched target arguments by pulling them, via the regularizer in (4), toward the manifold of the matched source arguments. A list of matched arguments for various language pairs is provided in Appendix C.
We plot the CLAR-learned weight vectors in Fig. 3 (Row II). We can observe the uniformity (in terms of length) of the lines drawn between paired target and source language arguments. Further, to quantify this, we compute the Euclidean distance matrix among the matched source language arguments and the corresponding matrix among the target language arguments, and compute the correlation coefficient between the two: 0.9984, 0.9531, and 0.9352 for EN-DE, EN-ZH, and EN-ES, respectively. The fact that all these coefficients are close to one indicates that CLAR is indeed able to detect a manifold for the target language arguments similar to the one for the source language arguments. Our experimental results (Table 3) demonstrate that allowing the paired target language arguments to lie on the detected manifold improves argument classification performance.
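The reported coefficients correlate the pairwise Euclidean distance matrices of the matched source and target argument vectors. A sketch of that computation (with toy vectors; in the paper these are the learned output-layer weight rows):

```python
import math

def dist_matrix(vectors):
    """Flattened upper triangle of the pairwise Euclidean distance matrix."""
    n = len(vectors)
    return [math.dist(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Target vectors = source vectors rotated 90 degrees and translated:
# pairwise distances are preserved, so the correlation is exactly 1.
src = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
tgt = [[-y + 1.0, x + 2.0] for x, y in src]
r = pearson(dist_matrix(src), dist_matrix(tgt))
```

A coefficient near 1 means the two label sets share the same distance structure, i.e. the manifolds match up to rotation and translation, which is exactly the claim above.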

Ablation and Analysis
Effect of K: We also examine the impact of K on argument classification performance in Table 4. We find that regularizing all the arguments obtained from (3), while performing better than polyglot training, is not the best choice overall. We suspect that considering all the paired arguments adds noise to the system. This is likely because some arguments in the target languages are language-specific and might be matched with a source language argument that has no close correspondence; for example, the Chinese argument ZH-C-C-A0 has no direct counterpart in English. Additionally, in some languages arguments are labeled at a very granular level, and multiple arguments in these languages may correspond to a single argument in the source language. For example, multiple arguments in Czech frequently map to only one corresponding argument in English.
Languages with Similar Linguistic Annotations: To further study the effectiveness of CLAR, we analyze the cross-lingual transfer between the languages known to have similar linguistic annotations. We expect to observe better cross-lingual transfer between such language pairs. Specifically, we examine Spanish (ES) and Catalan (CA) from the same AnCora corpus (Taulé et al., 2008). We consider ES as the source language because it has more training samples than CA.
In Table 6 we show the paired arguments detected by CLAR along with the euclidean distance between them. It can be seen that the euclidean distance for all paired arguments are close to 1, confirming that CLAR can effectively match semantically similar arguments across languages.
The experimental results are summarized in Table 5. As expected, CLAR surpasses all prior results on CA. With the semantically similar source language ES, the SRL performance on CA is better than with the monolingual and polyglot training methods. Further, we observe a 0.87-point absolute gain in F1 score when the cross-lingual transfer comes from a language with similar linguistic annotations (ES) rather than a less similar language (EN), despite a much smaller training data size (≤ 30% of EN). This observation strengthens our hypothesis that representing semantically similar arguments across languages on similar manifolds improves SRL performance.
To visualize the space on which the common source and target language arguments lie, we plot heatmaps of the Euclidean distances between the last-layer weights of the learned model in Fig. 4: one among the paired arguments of the source language (Fig. 4a) and one among those of the target language (Fig. 4b). The two heatmaps are nearly identical in distribution (a very high correlation coefficient of 0.9996 and a low squared Frobenius norm of the difference of 1.793). This means that CLAR transforms the weight vectors of the corresponding target language arguments so that they lie on a manifold similar to the one on which the source language argument weights lie, up to translation and/or rotation. This is also evident from Table 6, where we report the distances between these argument pairs.
Why is Czech an Exception? Though Czech (CS) has the most training samples in the CoNLL 2009 dataset, cross-lingual transfer to and from CS is not very significant, as apparent both from Table 3 and from previous work by Mulcaire et al. (2018). We observe that the arguments in CS are labeled at a significantly finer granularity than those of other languages. For example, for temporal arguments alone, the argument set in Czech contains 9 different labels at the finest granularity, whereas each of the other languages has only a single label for temporal arguments. Since CLAR performs one-to-one mapping to and from the source language, we suspect that it encounters difficulties in choosing one among many fine-grained arguments to map to a coarse argument in English. While it is possible to extend CLAR with many-to-one mapping, our preliminary study (Appendix D) suggests it may introduce additional noise. We plan to explore this direction in the future.

Related Work
Models for SRL largely fall into two categories: syntax-agnostic and syntax-aware. For a long time, syntax was considered a prerequisite for better SRL performance (Punyakanok et al., 2008;Gildea and Jurafsky, 2002). In the absence of syntactic information, these methods struggle to capture the discriminatory features and thus perform poorly.
Recently, end-to-end deep neural models have been shown to extract useful discriminatory features even without syntactic information (Zhou and Xu, 2015; Tan et al., 2018) and achieve state-of-the-art performance. However, some works (Roth and Lapata, 2016; He et al., 2017; Strubell et al., 2018) argue that given a high-quality syntax parser, it is possible to further improve SRL performance. Along this line, an SRL model based on graph convolutional networks has been proposed that incorporates syntactic information from a parser (Kiperwasser and Goldberg, 2016), followed by a more general framework to integrate syntax into SRL tasks. All these methods have been shown to perform well on high-resource languages.
Several recent attempts have been made to transfer knowledge from high-resource source languages to low-resource languages for SRL (Mulcaire et al., 2018, 2019) such that the knowledge transfer helps the model learn better feature representations for low-resource languages. In other NLP tasks, such as named entity recognition (Xie et al., 2018) and syntactic dependency parsing (Ammar et al., 2016), this kind of knowledge transfer also appears to help low-resource languages. Our experimental results further strengthen this claim and confirm that languages share knowledge at the semantic level as well.
An alternative line of work transfers cross-lingual knowledge to generate semantic labels for low-resource languages by exploiting a monolingual SRL model and multilingual parallel data, with the assumption that the sentences in parallel corpora are semantically equivalent. Similarly, Prazák and Konopík (2017) convert the monolingual dependency tree to a universal dependency tree for cross-lingual transfer. Though these methods do not require knowledge of semantic roles in the target language, they require the availability of massive parallel corpora. In contrast, CLAR is able to detect the similarity among arguments between language pairs even with less data.

Conclusion
We introduce CLAR, a Cross-Lingual Argument Regularizer. It identifies linguistic annotation similarity across languages and exploits this information during SRL model training to map the target language arguments via a transformation of the space on which the source language arguments lie. We confirm the effectiveness of CLAR for SRL on the CoNLL 2009 dataset over monolingual and polyglot methods, without prior knowledge of cross-lingual alignments or parallel data. This paper demonstrates the promise of understanding and exploiting linguistic annotation similarity across languages during polyglot training. We plan to explore other ways of identifying and leveraging such similarity.

A Dataset

Table 7 describes the training data statistics for each language. In the dataset, all sentences in every language are marked with predicate-argument structures. The argument label sets differ across languages.

B Hyperparameters
In our experiments, we randomly initialize the word and lemma embeddings with dimension 100 each, the POS embedding with dimension 32, and the flag embedding with dimension 16. We use the same model parameters as in prior work: a 4-layer BiLSTM with 512-dimensional hidden units and a 0.1 dropout rate for the sentence encoder. Our role labeler has 5 MLP highway layers with ReLU activations. We train the model with the Adam optimizer (Kingma and Ba, 2014) and minimize the final categorical cross-entropy objective. We train each model for 20 epochs and use early stopping with patience 5 on the target language development set. For all experiments, we repeat with 3 different initializations and report the average F1 score along with precision and recall.

C Paired Arguments
We present the list of matched arguments for source-target language pairs in Table 8. We observe that almost all the argument pairs have similar meaning: some are syntactically visible (e.g. ES-argM-adv in ES and AM-ADV in EN), whereas others are semantically similar (e.g. ES-argM-fin and AM-PNC both meaning purpose). We also plot the Euclidean distance matrix among the matched source language arguments and the corresponding matrix among the target language arguments; Fig. 5 shows these distance matrices for various language pairs. We compute the correlation coefficient between the matrices for each pair, and all the coefficients are close to 1, which shows that CLAR is indeed able to detect a manifold for the target language arguments similar to the one for the source language arguments.

D CLAR Extension to Many-to-one Mapping
We suspect that CLAR has difficulty choosing one among many fine-grained arguments to map to a coarse argument in the source language. Here we perform a preliminary investigation of the many-to-one extension of CLAR. Since CS has cases with both COND and CAUS, they are mapped together to a single argument in EN.
Although CLAR with many-to-one mapping is able to match multiple target language argument labels to a single source language argument label, it actually leads to a performance drop compared to one-to-one mapping (Table 10). This drop in performance is likely because, while learning many-to-one mappings, CLAR loses its discriminatory power among the multiple arguments that are mapped to a single label. To validate this phenomenon, at test time we combine all the argument labels mapped to a single label, both in the gold and the prediction set; that is, we merge {TWHEN, THL, THO} into a new label (say TWHEN) and observe a 1-point absolute improvement in F1 on these combined labels. However, how to effectively leverage CLAR with many-to-one mapping for SRL model training remains an open question and requires further exploration.
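The test-time label-merging check described above can be sketched as follows (the label names come from the Czech example; the toy gold/prediction sequences and the choice of TWHEN as merge target are illustrative):

```python
def merge_labels(labels, group, merged):
    """Collapse every label in `group` to the single label `merged`."""
    group = set(group)
    return [merged if lab in group else lab for lab in labels]

# Toy sequences: the model confuses the fine-grained temporal labels
gold = ["TWHEN", "THL", "A0", "THO"]
pred = ["THL", "THO", "A0", "TWHEN"]
group = {"TWHEN", "THL", "THO"}

gold_m = merge_labels(gold, group, "TWHEN")
pred_m = merge_labels(pred, group, "TWHEN")

# Before merging only A0 matches (1/4); after merging the temporal
# confusions disappear and every token matches (4/4).
acc_before = sum(g == p for g, p in zip(gold, pred)) / len(gold)
acc_after = sum(g == p for g, p in zip(gold_m, pred_m)) / len(gold)
```

This mirrors the diagnosis in the text: the many-to-one model confuses labels only within the merged group, so scoring on the merged label recovers the lost accuracy.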