Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles

We describe a transfer method based on annotation projection to develop a dependency-based semantic role labeling system for languages for which no supervised linguistic information other than parallel data is available. Unlike previous work that presumes the availability of supervised features such as lemmas, part-of-speech tags, and dependency parse trees, we only make use of word and character features. Our deep model uses character-based representations as well as unsupervised stem embeddings to alleviate the need for supervised features. In our experiments, our model outperforms a state-of-the-art method that uses supervised lexico-syntactic features on 6 out of 7 languages in the Universal Proposition Bank.


Introduction
Despite considerable efforts on developing semantically annotated resources for semantic role labeling (SRL) (Palmer et al., 2005; Erk et al., 2003; Zaghouani et al., 2010), the majority of languages do not have such annotated resources. The lack of annotated resources for SRL has led to a growing interest in transfer methods for developing semantic role labeling systems. The ultimate goal of transfer methods is to transfer supervised linguistic information from a rich-resource language to a target language of interest. Amongst transfer methods, annotation projection is a method that projects supervised annotation from a rich-resource language to a low-resource language through automatic word alignments in parallel data (Hwa et al., 2002; Padó and Lapata, 2009). Recent work on annotation projection for SRL (Kozhevnikov and Titov, 2013a; van der Plas et al., 2014; Akbik et al., 2015; Aminian et al., 2017) presumes the availability of accurate supervised features such as lemmas, part-of-speech (POS) tags, and syntactic parse trees. However, this is not a realistic assumption for truly low-resource languages, for which (accurate) supervised features are hardly available.
This paper considers the problem of annotation projection of dependency-based SRL in a scenario for which only parallel data is available for the target language. Recent state-of-the-art SRL systems have shown a significant reliance on the predicate lemma information, while in a low-resource language a lemmatizer might not be available. We first demonstrate that unsupervised stems can be used as an alternative to supervised lemma features. We further show that we can obtain a robust and simple SRL model for the target language without relying on any explicit linguistic feature (including lemmas), either supervised or unsupervised. We achieve this goal by changing the structure of a state-of-the-art deep SRL system (Marcheggiani et al., 2017) to make it independent of supervised features. Our model relies solely on word-level and character-level features in the target language.
The main contribution of this work is applying annotation projection without relying on supervised features in the target language of interest. To the best of our knowledge, this is the first study that builds a cross-lingual SRL transfer model in the absence of any explicit linguistic information in the target language. We make use of the recently released Universal Proposition Banks (Akbik et al., 2016), a semi-automatically annotated dataset that unifies the annotation scheme for all languages. We show the effectiveness of our method on a range of languages, namely German, Spanish, Finnish, French, Italian, Portuguese, and Chinese. We compare our model to a state-of-the-art baseline that uses a rich set of supervised features and show that our model outperforms it on six out of seven languages in the Universal Proposition Banks. Furthermore, for Finnish, a morphologically rich language, our model with unsupervised features improves over the model that relies on a supervised lemmatizer. This paper is structured as follows: §2 briefly overviews the dependency-based SRL task and annotation projection, §3 describes our approach, §4 shows the experimental results and analysis, §5 gives an overview of related work, and §6 concludes the paper and proposes suggestions for future work.
Figure 1: An example of annotation projection for an English-German sentence pair from the Europarl corpus (Koehn, 2005). Supervised predicate-argument structure of the English sentence (edges on top) is generated using our supervised SRL system trained on PropBank 3 (Palmer et al., 2005). Dashed lines in the middle show intersected word alignments from Giza++ (Och and Ney, 2003). Dashed edges at the bottom show the projected predicate-arguments.

Background
In this section, we provide a brief overview of dependency-based SRL and annotation projection.
Dependency-based SRL In dependency-based SRL, the goal is to find arguments along with their roles for each predicate in a sentence. Formally, in a sentence $x = [x_i]_{i=1}^{n}$ with $n$ words and $m$ predicates, where $\psi_i$ is the sense of the predicate with index $p_i$ in the sentence, we find the semantic dependencies between each word in the sentence with respect to each predicate:
$$\{(p_i \xrightarrow{r} j \mid \psi_i) : 1 \le i \le m,\ 1 \le j \le n\}$$
where $r$ is the role of the $j$th word as an argument for the predicate word $x_{p_i}$. If a word is not an argument, $r$ is NULL. Evaluation of the system output is conducted on semantic dependencies $(p_i \xrightarrow{r} j \mid \psi_i)$; thus the SRL system should find predicate senses as well as argument roles. During training, these dependencies are used as training instances for a machine learning algorithm. Previous work (Björkelund et al., 2009; Roth and Lapata, 2016; Marcheggiani et al., 2017) factorized this task into predicate sense disambiguation, argument identification, and argument classification.
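Treating each semantic dependency as a tuple makes the evaluation unit concrete. The following is a minimal sketch only: the `SemDep` type and the `labeled_f1` function are hypothetical names, and counting a dependency as correct only when predicate, sense, argument, and role all match is a simplifying assumption that may differ from the scorer actually used in the experiments.

```python
from typing import NamedTuple, Set


class SemDep(NamedTuple):
    """One semantic dependency (p_i --r--> j | psi_i); hypothetical container."""
    predicate: int  # index p_i of the predicate word
    sense: str      # predicate sense psi_i
    argument: int   # index j of the argument word
    role: str       # role label r


def labeled_f1(gold: Set[SemDep], predicted: Set[SemDep]) -> float:
    """Harmonic mean of precision and recall over exactly matching dependencies."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {SemDep(2, "eat.01", 1, "A0"), SemDep(2, "eat.01", 3, "A1")}
pred = {SemDep(2, "eat.01", 1, "A0"), SemDep(2, "eat.01", 3, "A2")}
print(labeled_f1(gold, pred))
```

Words whose role is NULL are simply absent from both sets here, so they affect the score only when one side labels them and the other does not.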
Annotation Projection In annotation projection, we assume that we have parallel data $P = [(s^{(1)}, t^{(1)}), \cdots, (s^{(k)}, t^{(k)})]$ such that each sentence $s^{(i)}$ is a translation of sentence $t^{(i)}$. Here, we assume that $s^{(i)}$ belongs to a rich-resource language in which annotated resources are available. In contrast, $t^{(i)}$ belongs to a low-resource target language where annotated data and tools such as semantic roles, dependency trees, part-of-speech tags, word senses, and lemmas might not be available.
For every sentence $s^{(i)}$, we run a supervised SRL system to obtain its supervised argument structure $L_{s^{(i)}}$. For every sentence pair $(s^{(i)}, t^{(i)})$, we use an automatic word alignment system to obtain one-to-one word alignments. We define $0 \le a^{(i)}_j \le l_i$ as the index of the source word that is aligned to the $j$th word in the $i$th target sentence, where $a^{(i)}_j = 0$ indicates a missing alignment. We use the following conditions to project a semantic dependency from a source sentence to a target sentence:
$$(p \xrightarrow{r} j \mid \psi) \in L_{t^{(i)}} \iff a^{(i)}_p \neq 0 \text{ and } a^{(i)}_j \neq 0 \text{ and } (a^{(i)}_p \xrightarrow{r} a^{(i)}_j \mid \psi) \in L_{s^{(i)}}$$
where $L_{s^{(i)}}$ is the supervised argument structure and $L_{t^{(i)}}$ is the projected argument structure for the $i$th sentence. We assume that there is a universal predicate sense that is common across languages (this is the case in the Universal Proposition Banks). Figure 1 shows an example for an English-German translation pair. We use the projected data as training data in a supervised learning system to train an SRL system in the target language. In practice, many words do not receive any projected label, mainly due to missing alignments. Thus, $L_{t^{(i)}}$ usually contains sentences with partially projected semantic dependencies.
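The projection step can be illustrated with a small sketch. It assumes one-to-one alignments stored as a list in which position $j$ holds the aligned source index (0 for no alignment), 1-based word indices, and a predicate sense that is copied unchanged. The `Dep` tuple layout and the `project` function are hypothetical names introduced for illustration, not the authors' implementation.

```python
from typing import List, Set, Tuple

# (predicate index, role, argument index, predicate sense); indices are 1-based
Dep = Tuple[int, str, int, str]


def project(source_deps: Set[Dep], alignment: List[int]) -> Set[Dep]:
    """Project source dependencies onto the target sentence.

    alignment[j - 1] is the source index aligned to target word j (0 = unaligned).
    """
    # Invert the alignment: source index -> target index.
    src_to_tgt = {a: j for j, a in enumerate(alignment, start=1) if a > 0}
    projected = set()
    for pred, role, arg, sense in source_deps:
        # A dependency is projected only if both of its ends are aligned.
        if pred in src_to_tgt and arg in src_to_tgt:
            projected.add((src_to_tgt[pred], role, src_to_tgt[arg], sense))
    return projected


# Source: word 2 is the predicate, word 1 its A0, word 3 its A1.
source_deps = {(2, "A0", 1, "eat.01"), (2, "A1", 3, "eat.01")}
full = project(source_deps, [1, 0, 3, 2])     # target word 2 is unaligned
partial = project(source_deps, [1, 0, 0, 2])  # the A1 word has no alignment
```

In the second call only the A0 dependency survives, which mirrors the partially projected sentences mentioned above.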

Our Model
Our goal is to train an SRL system on the projected predicate-argument structures without having supervised features such as supervised lemmas, dependency parse trees, and part-of-speech tags. Our model has two main components: 1) joint argument identification and classification, which we simply refer to as the argument classifier, and 2) predicate sense disambiguation. Our argument classifier is inspired by the model of Marcheggiani et al. (2017): we use predicate-specific BiLSTM encoders and a role+predicate-specific decoder. However, unlike the model of Marcheggiani et al. (2017), which relies heavily on POS tags and predicate lemmas, we do not use a supervised lemmatizer or POS tagger in any layer. Instead, we use character representations and unsupervised stems to bring unsupervised features into our model.

Joint Argument Identification and Classification
Given a sentence $s = [s_i]_{i=1}^{n}$ that contains $n$ tokens with $m$ predicates in the predicate set $P$, we run $m$ separate predicate-specific deep BiLSTM encoders $[E_j]_{j=1}^{m}$ to extract contextualized representations for each token given a predicate index $p_j$.
Input Representation For each encoder $E_j$, we represent each token $s_i$ as the concatenation of its word embeddings ($x^{re}_i$ and $x^{pe}_i$), character embedding ($x^{char}_i$), and predicate lemma embedding ($x^{le}_{i,j}$):
$$x_{i,j} = x^{re}_i \circ x^{pe}_i \circ x^{char}_i \circ x^{le}_{i,j}$$
where:
• $x^{re}_i \in \mathbb{R}^{d_w}$ is a randomly initialized word embedding vector;
• $x^{pe}_i \in \mathbb{R}^{d_w}$ is an external pre-trained word embedding that is fixed during training;
• $x^{char}_i \in \mathbb{R}^{d_{ch}}$ is the character representation of token $s_i$. For every token, we obtain $x^{char}_i$ by running a deep bidirectional LSTM (Hochreiter and Schmidhuber, 1997) over the characters of the word. We use the concatenation of the final backward representation of the first character and the final forward representation of the last character to represent each token:
$$x^{char}_i = \overleftarrow{\mathrm{LSTM}}(x^{c}_{1:|s_i|})_{1} \circ \overrightarrow{\mathrm{LSTM}}(x^{c}_{1:|s_i|})_{|s_i|}$$
where each $x^{c}_k \in \mathbb{R}^{d_c}$ is a randomly initialized character embedding and $|s_i|$ is the number of characters in token $s_i$;
• $x^{le}_{i,j} \in \mathbb{R}^{d_{le}}$ is a lemma vector for each word $s_i$ with respect to the predicate that is targeted in $E_j$. $x^{le}_{i,j}$ is active if $s_i$ is the predicate word; otherwise, a zero vector is used to represent the lemma embedding. A zero/one flag is concatenated to this vector to indicate whether the current token is the targeted lemma. In our model, we use one of the following options to represent the predicate lemma:
- Represent each lemma by a deep character BiLSTM. This BiLSTM is different from the character BiLSTM in $x^{char}$.
- Use an unsupervised morphological analyzer to give the surface-form stem of each word. This way, we can use a lemma embedding dictionary without requiring a lemmatizer.
Predicate-Specific Encoder A deep BiLSTM is used to get the final representation for each token in a sentence. In the following notation, $h_{i,j}$ is the final hidden state from the deep BiLSTM model for the $i$th token with respect to the $j$th predicate:
$$h_{i,j} = \mathrm{BiLSTM}(x_{1,j}, \dots, x_{n,j})_i$$
where $x_{i,j}$ denotes the input representation of the $i$th token for encoder $E_j$.
Role+Predicate-Specific Decoder Given the BiLSTM representations, we perform an affine transformation on the concatenation of $h_{p_j,j}$ (predicate representation) and $h_{i,j}$ (argument representation) to find the probability of having the $i$th token as the argument of predicate $p_j$ with role $r$ (including the NULL role):
$$p(r \mid h_{i,j}, h_{p_j,j}) \propto \exp\big(x_{j,r}\,(h_{p_j,j} \circ h_{i,j})\big)$$
where $x_{j,r}$ is a parameter matrix that encodes the information of role $r$ and the $j$th predicate. This matrix is calculated as follows:
$$x_{j,r} = \mathrm{ReLU}\big(U\,(u^{l}_j \circ v_r)\big)$$
where $u^{l}_j \in \mathbb{R}^{d_l}$ is another predicate lemma embedding parameter that is used specifically for the decoder layer, $v_r \in \mathbb{R}^{d_r}$ is a randomly initialized role embedding, $U$ is a parameter matrix, and ReLU is the rectified linear unit activation function (Nair and Hinton, 2010). Similar to the input layer, we represent $u^{l}_j$ by 1) a different deep character BiLSTM, or 2) a surface-form stem obtained from an unsupervised morphological analyzer.
A graphical depiction of the network, for the case in which lemmas are represented by character BiLSTMs, is shown in Figure 2. As shown in the figure, we use two different character BiLSTMs to represent lemmas: one for the input representation and the other for the decoder representation.
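The role+predicate-specific decoder described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' DyNet code: the role-specific weights are treated as a vector so that the score is a dot product, bias terms are omitted, the softmax is taken over all roles including NULL, and the function names and toy dimensions are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector: Vector) -> Vector:
    return [max(0.0, x) for x in vector]


def role_distribution(h_pred: Vector, h_arg: Vector, u_lemma: Vector,
                      role_embeddings: List[Vector], U: Matrix) -> Vector:
    """Return a probability for each role given encoder states and embeddings."""
    features = h_pred + h_arg  # concatenation of predicate and argument states
    scores = []
    for v_role in role_embeddings:
        # Weights specific to this (predicate lemma, role) pair.
        weights = relu(matvec(U, u_lemma + v_role))
        scores.append(sum(w * f for w, f in zip(weights, features)))
    # Numerically stable softmax over role scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 2-dimensional encoder states, 1-dimensional embeddings, 2 roles.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
probs = role_distribution([0.0, 1.0], [1.0, 0.0], [1.0], [[0.0], [1.0]], U)
```

Because the weights are computed from embeddings rather than stored per (predicate, role) pair, parameters are shared across predicates and roles; in the character-based variant, `u_lemma` would come from a character BiLSTM instead of a lookup table.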

Experiments
Datasets and Tools We use English as the source language and project SRL annotations to the following languages: German, Spanish, Finnish, French, Italian, Portuguese, and Chinese. We use the Europarl parallel corpus (Koehn, 2005) for the European languages and a random sample of 2 million sentence pairs from the MultiUN corpus (Eisele and Chen, 2010) for Chinese. We use the Giza++ tool (Och and Ney, 2003) with its default setting for word alignment. We run Giza++ in the source-to-target and the reverse direction and take the intersection of the alignment links. For English, we use the pre-trained embedding vectors generated using the structured skip-gram model of Ling et al. (2015). For the target languages, we train Word2vec (Mikolov et al., 2013) on Wikipedia data to generate embedding vectors. We implement our deep network using the Dynet library (Neubig et al., 2017). We use the dimension of 100 for word embeddings, 50 for characters, 512 for LSTM encoders, 128 for role and lemma embeddings in the decoder, and 100 for decoder lemma embedding. We pick random minibatches of size 1000 with a fixed learning rate of 0.001 for learning the parameter values with the Adam optimizer (Kingma and Ba, 2014). The depth of the BiLSTM network is set to one for the character representation ($x^{char}$) and three for the predicate-specific representations ($x^{le}$, $u^{l}$).
Figure 2: A graphical depiction of our joint argument identification and classification model without using part-of-speech tags, lemmas, and syntax. In this example, the predicate-specific encoder considers the word eats as the sentence predicate and the goal is to score the assignment of the argument apple with label A0. Our model contains three different character BiLSTMs; at the bottom, a character BiLSTM is run to acquire a character-based representation for all the words in the sentence in the absence of POS tags. There are two character BiLSTMs for the predicate lemma: one at the encoder level (next to the second word) to model the predicate lemma in the input layer and the other at the decoder level (top left). In this example, we show just one layer of BiLSTM, but we use a deep BiLSTM in our experiments.
Predicate Disambiguation Our model is agnostic to predicate senses, but since our automatic evaluation relies on automatic predicate senses, we need a disambiguation module. Predicate disambiguation systems typically contain separate classifiers for each predicate lemma (Björkelund et al., 2009). Since we do not have a reliable lemmatizer in the target language, we train a single classifier for all predicates. We encode a sentence with a three-layer deep BiLSTM and run a softmax layer on top of each predicate to disambiguate its sense.
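The alignment symmetrization described under Datasets and Tools (running the aligner in both directions and intersecting the links) can be sketched as follows. The link-set input format and both function names are assumptions made for illustration; they do not reflect the actual Giza++ output format, which would need to be parsed first.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # 1-based word indices


def intersect_alignments(src_to_tgt: Set[Link], tgt_to_src: Set[Link]) -> Set[Link]:
    """Keep only the links found in both alignment directions.

    src_to_tgt holds (source, target) pairs; tgt_to_src holds (target, source) pairs.
    """
    flipped = {(s, t) for t, s in tgt_to_src}
    return src_to_tgt & flipped


def to_alignment_vector(links: Set[Link], target_length: int) -> List[int]:
    """Convert (source, target) links to a per-target-word vector, 0 = unaligned.

    Assumes the links are one-to-one; otherwise later links overwrite earlier ones.
    """
    vector = [0] * target_length
    for s, t in links:
        vector[t - 1] = s
    return vector


forward = {(1, 1), (2, 2), (2, 3)}   # source-to-target run
backward = {(1, 1), (2, 2), (4, 3)}  # target-to-source run, stored as (target, source)
links = intersect_alignments(forward, backward)
vector = to_alignment_vector(links, target_length=4)
```

Intersection favors precision over recall, which is consistent with the observation elsewhere in the paper that many target words receive no projected label.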
Predicate identification on the source side For projection experiments, first of all we need to identify predicates in the source language. Input to our predicate identifier is the concatenation of word embed-

Projection Experiments
Our supervised SRL system is a reimplementation of the model of Marcheggiani et al. (2017). We generate automatic English predicate senses using a system similar to the predicate disambiguation module of Björkelund et al. (2009), except that we replace the logistic regression classifier with the averaged Perceptron algorithm (Collins, 2002). In order to comply with the Universal Proposition Bank annotation scheme, we convert the argument spans in the English PropBank v3 (Palmer et al., 2005) to dependency-based arguments by labeling the syntactic head of each span. For annotation projection, we define the density of alignments to find sentences with relatively dense alignments:
$$d^{(i)} = \frac{1}{l_i} \sum_{j=1}^{l_i} I(a^{(i)}_j > 0)$$
where $l_i$ is the length of the $i$th target sentence in the parallel data, $a^{(i)}_j$ is the alignment index for the $j$th word in the target sentence, and $I(a^{(i)}_j > 0)$ is an indicator for a non-NULL alignment. We prune sentence pairs with density less than 80% for all European languages. We set this threshold to 60% for Chinese in order to obtain a number of sentences comparable to that of the European languages. Table 1 summarizes the sizes of the projected datasets after applying the density filter. We set the number of training epochs to 2 for all languages based on development results obtained from the English to German projections.
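The density filter amounts to a few lines of code. The sketch below assumes each sentence pair is represented only by its alignment vector (one entry per target word, 0 for no alignment); the function names are hypothetical.

```python
from typing import List


def alignment_density(alignment: List[int]) -> float:
    """Fraction of target words that have a non-NULL alignment."""
    if not alignment:
        return 0.0
    return sum(1 for a in alignment if a > 0) / len(alignment)


def filter_by_density(corpus: List[List[int]], threshold: float) -> List[List[int]]:
    """Keep the sentence pairs whose density is at least the threshold."""
    return [a for a in corpus if alignment_density(a) >= threshold]


corpus = [
    [1, 2, 0, 4, 5],  # 4 of 5 target words aligned
    [1, 0, 0, 4],     # 2 of 4 target words aligned
]
kept_european = filter_by_density(corpus, threshold=0.8)
kept_chinese = filter_by_density(corpus, threshold=0.6)
```

With these toy inputs, both thresholds keep only the first pair; lowering the threshold trades alignment quality for a larger projected training set.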
Since the original model of Marcheggiani et al. (2017) relies heavily on the predicate lemma information for making robust predictions, we further assess the influence of using explicit linguistic features in our model by using a) supervised lemmas from the UDPipe pre-trained models (Straka and Straková, 2017), and b) unsupervised stems obtained from an unsupervised morphological analyzer. We use the unsupervised morphological analyzer of Virpioja et al. (2013), and obtain morpheme classes by running Morfessor FlatCat (Grönroos et al., 2014) on the output of the analyzer. We run the fixed-affix finite-state machine of Rasooli et al. (2014) to obtain a single stem for all words, including out-of-vocabulary words.

Results
We compare our character-based approach (CModel) with three different models: 1) the cross-lingual model of Aminian et al. (2017) (Bootstrap), which uses a rich set of supervised features including supervised lemmas, POS tags, and dependency parse information, 2) a variant of our model that uses supervised lemmas (SLem) generated by a lemmatizer to represent predicate lemmas in the input and the decoder layers, and 3) a model similar to the second model but using unsupervised stems (UStem) generated by an unsupervised morphological analyzer to represent predicate lemmas. Here, we aim to assess the effects of using different levels of explicit linguistic features, ranging from fully specified supervised features to unsupervised features, in our model. The Bootstrap model uses an iterative bootstrapping approach with a special cost function and benefits from a rich set of supervised lexico-syntactic features; we therefore consider it a strong baseline. Since Bootstrap has a large number of features, its memory use does not scale to our projection data sizes. Therefore, we train the Bootstrap model on a random sample of 20K sentences. This number is similar to the number of sentences used in the original experiments (Aminian et al., 2017).
Table 2 shows labeled F-scores using both gold and automatic predicate senses on the test portion of the Universal Proposition Banks. As shown in Table 2, our model (CModel) outperforms the Bootstrap model for all languages except French. Additionally, our model performs on par with the supervised lemma and unsupervised stem models. This demonstrates the power of our approach even though our model has access to fewer linguistic features in the target language. Using unsupervised stems outperforms using supervised lemmas on all languages except Portuguese and Italian. This further highlights the reliance of the model on the accuracy of the lemmatizer.
Analysis As shown in Table 2, using automatic predicate senses leads to a significant reduction in accuracy. This degradation has two causes. First, we train a single classifier for all predicates in the absence of explicit predicate lemma information. Second, using unified predicate senses for all languages leads to lower precision for out-of-vocabulary words, because we cannot fall back on the default sense of a predicate (lemma.01). Among all the languages in our experiments, French is the only language for which our model underperforms the Bootstrap model. Our analysis on French shows that our model failed to correctly predict A0 and A1 arguments in 20% and 30% of cases, respectively, and labeled them as NULL.

Related Work
There has been a great deal of interest in using transfer methods for SRL by different techniques such as enhancing the quality of projections (Padó and Lapata, 2005, 2009), joint learning of syntax and semantics (van der Plas et al., 2011; Kozhevnikov and Titov, 2013b), and iterative bootstrapping to learn a robust model from erroneous projections (Akbik et al., 2015; Aminian et al., 2017). Previous work presumes the availability of a wide range of supervised lexico-syntactic features for the target language. Consequently, its performance relies heavily on the accuracy of the available tagging tools (Akbik et al., 2015). For instance, Akbik et al. (2015) report lower argument precision for languages that do not have accurate syntactic parsers, such as Arabic and Hindi. In contrast to the previous studies, our work builds a cross-lingual SRL system without having any supervised features for the target language.
One obstacle for developing transfer models is the absence of a unified annotation scheme for all languages. There has been a great deal of work in developing universal annotation schemes for a variety of tasks such as POS tagging (Petrov et al., 2011), dependency parsing (Nivre et al., 2017), morphology (Kirov et al., 2018), and SRL (Kozhevnikov and Titov, 2013a; Wang et al., 2017). Our work makes use of the recently released Universal Proposition Bank (Akbik et al., 2016). This dataset maps every predicate lemma in every language to its corresponding English lemma following the frame and role label schemes of the English Proposition Bank 3.0 (Palmer et al., 2005).
In the realm of supervised SRL methods, however, there have been several efforts to build SRL models that do not need a wide range of linguistic features (specifically syntactic features) (Marcheggiani et al., 2017; Zhou and Xu, 2015; He et al., 2017, 2018; Cai et al., 2018; Mulcaire et al., 2018). In a more recent study, Mulcaire et al. (2018) proposed a polyglot SRL system that benefits from the similarities between the semantic structures of different languages to improve monolingual SRL. All those studies, however, assume the availability of semantically annotated datasets for the target language, which makes them inapplicable to low-resource languages.

Conclusion
We have described a method for cross-lingual transfer of dependency-based SRL systems via annotation projection. Our model is agnostic to linguistic features, leading to a robust model that can be trained on projected text in a target language without annotated data. We have shown that our model achieves comparable performance in annotation projection and also in supervised SRL. In addition to improving the performance of our model in the current setting, future work should study more effective ways to apply transfer methods, e.g., combining annotation projection with the direct transfer method in the absence of large parallel corpora.