Cross-Lingual Dependency Parsing with Unlabeled Auxiliary Languages

Cross-lingual transfer learning has become an important weapon to battle the unavailability of annotated resources for low-resource languages. One of the fundamental techniques for transferring across languages is learning language-agnostic representations, in the form of word embeddings or contextual encodings. In this work, we propose to leverage unannotated sentences from auxiliary languages to help learn language-agnostic representations. Specifically, we explore adversarial training for learning contextual encoders that produce invariant representations across languages to facilitate cross-lingual transfer. We conduct experiments on cross-lingual dependency parsing, where we train a dependency parser on a source language and transfer it to a wide range of target languages. Experiments on 28 target languages demonstrate that adversarial training significantly improves the overall transfer performance under several different settings. We also conduct a careful analysis to evaluate the language-agnostic representations resulting from adversarial training.


Introduction
Cross-lingual transfer, where a model learned from one language is transferred to another, has become an important technique to improve the quality and coverage of natural language processing (NLP) tools across the world's languages. This technique has been widely applied in many applications, including part-of-speech (POS) tagging (Kim et al., 2017), dependency parsing (Ma and Xia, 2014), named entity recognition (Xie et al., 2018), entity linking, coreference resolution, and question answering (Joty et al., 2017). Cross-lingual transfer learning has yielded noteworthy improvements in applications for low-resource languages.
In this paper, we study cross-lingual transfer for dependency parsing. A dependency parser consists of (1) an encoder that transforms an input text sequence into latent representations and (2) a decoding algorithm that generates the corresponding parse tree. In cross-lingual transfer, most recent approaches assume that the inputs from different languages are aligned into the same embedding space via multilingual word embeddings or multilingual contextualized word vectors, such that the parser trained on a source language can be transferred to target languages. However, when training a parser on the source language, the encoder not only learns to embed a sentence but it also carries language-specific properties, such as word order typology. Therefore, the parser suffers when it is transferred to a language with different language properties. Motivated by this, we study how to train an encoder for generating language-agnostic representations that can be transferred across a wide variety of languages.
We propose to utilize unlabeled sentences of one or more auxiliary languages to train an encoder that learns language-agnostic contextual representations of sentences to facilitate cross-lingual transfer. To utilize the unlabeled auxiliary language corpora, we adopt adversarial training of the encoder and a classifier that predicts the language identity of an input sentence from its encoded representation produced by the encoder. The adversarial training encourages the encoder to produce language-invariant representations such that the language classifier fails to predict the correct language identity. As the encoder is jointly trained with a loss for the primary task on the source language and an adversarial loss on all languages, we hypothesize that it will learn to capture task-specific features as well as generic structural patterns applicable to many languages, and thus have better transferability.
To verify the proposed approach, we conduct experiments on neural dependency parsers trained on English (source language) and directly transfer them to 28 target languages, with or without the assistance of unlabeled data from auxiliary languages. We chose dependency parsing as the primary task since it is one of the core NLP applications and the development of Universal Dependencies (Nivre et al., 2016) provides consistent annotations across languages, allowing us to investigate transfer learning in a wide range of languages. Thorough experiments and analyses are conducted to address the following research questions:
• Does an encoder trained with adversarial training generate language-agnostic representations?
• Do language-agnostic representations improve cross-language transfer?
Experimental results show that the proposed approach consistently outperforms a strong baseline parser (Ahmad et al., 2019), with a significant margin in two language families. In addition, we conduct experiments to consolidate our findings with different types of input representations and encoders. Our experiment code is publicly available to facilitate future research. 1

Training Language-agnostic Encoders
We train the encoder of a dependency parser in an adversarial fashion to guide it to avoid capturing language-specific information. In particular, we introduce a language identification task where a classifier predicts the language identity (id) of an input sentence from its encoded representation. Then the encoder is trained such that the classifier fails to predict the language id while the parser decoder predicts the parse tree accurately from the encoded representation. We hypothesize that such an encoder would have better cross-lingual transferability. The overall architecture of our model is illustrated in Figure 1. In the following, we present the details of the model and training method.

Architecture
Our model consists of three basic components: (1) a general encoder, (2) a decoder for parsing, and (3) a classifier for language identification. The encoder learns to generate contextualized representations for the input sentence (a word sequence), which are fed to the decoder and the classifier to predict the dependency structure and the language identity (id) of that sentence.
1 https://github.com/wasiahmad/cross_lingual_parsing
The encoder and the decoder jointly form the parsing model and we consider two alternatives 2 from (Ahmad et al., 2019): "SelfAtt-Graph" and "RNN-Stack". The "SelfAtt-Graph" parser consists of a modified self-attentional encoder (Shaw et al., 2018) and a graph-based deep bi-affine decoder (Dozat and Manning, 2017), while the "RNN-Stack" parser is composed of a Recurrent Neural Network (RNN) based encoder and a stack-pointer decoder (Ma et al., 2018).
We stack a classifier (a linear classifier or a multi-layer perceptron (MLP)) on top of the encoder to perform the language identification task. The identification task can be framed as either a word- or sentence-level classification task. For the sentence-level classification, we apply average pooling 3 on the contextual word representations generated by the encoder to form a fixed-length representation of the input sequence, which is fed to the classifier. For the word-level classification, we perform language classification for each token individually.
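The two classification granularities can be illustrated with a minimal sketch. It assumes the encoder output for a sentence is a list of per-token vectors; the function names `sentence_level_input` and `word_level_inputs` are hypothetical and are not taken from the released code.

```python
from typing import List

Vector = List[float]


def sentence_level_input(token_reprs: List[Vector]) -> Vector:
    """Average-pool contextual token vectors into one fixed-length vector."""
    if not token_reprs:
        raise ValueError("sentence must contain at least one token")
    dim = len(token_reprs[0])
    n = len(token_reprs)
    return [sum(vec[d] for vec in token_reprs) / n for d in range(dim)]


def word_level_inputs(token_reprs: List[Vector]) -> List[Vector]:
    """Word-level setup: every token vector is classified on its own."""
    return [list(vec) for vec in token_reprs]


# Toy encoder output: 3 tokens, 2 dimensions (illustrative values only).
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = sentence_level_input(encoded)   # one classifier input per sentence
per_token = word_level_inputs(encoded)   # one classifier input per token
```

In the sentence-level setup the classifier sees a single vector regardless of sentence length, whereas the word-level setup produces one language prediction per token.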
Algorithm 1 Training procedure.
Parameters to be trained: Encoder (θ_g), Decoder (θ_p), and Classifier (θ_d)
X_a = Annotated source language data
X_b = Unlabeled auxiliary language data
I = Number of warm-up iterations
k = Number of learning steps for the discriminator (D) at each iteration
λ = Coefficient of L_d
α_1, α_2 = learning rates; B = Batch size
for j = 0, ..., I do
    Update θ_g := θ_g − α_1 ∇_θg L_p
    Update θ_p := θ_p − α_1 ∇_θp L_p
for j = I, ..., num_iter do
    for k steps do
        Sample batches from X_a and X_b
        Update θ_d := θ_d − α_2 ∇_θd L_d
    Total loss L := L_p − λ L_d
    Update θ_g := θ_g − α_1 ∇_θg L
    Update θ_p := θ_p − α_1 ∇_θp L_p

In this work, following the terminology of the adversarial learning literature, we refer to the encoder as the generator, G, and the classifier as the discriminator, D.

Training
Algorithm 1 describes the training procedure. We have two types of loss functions: L p for the parsing task and L d for the language identification task. For the former, we update the encoder and the decoder as in the regular training of a parser. For the latter, we adopt adversarial training to update the encoder and the classifier. We present the detailed training schemes in the following.
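The control flow of the training procedure can be sketched as follows. This is only an outline of the update order, not the released implementation: the function name `training_schedule` and the string labels are hypothetical, the warm-up loop is assumed to run exactly I times, and the discriminator and decoder updates inside the main loop are inferred from the surrounding description.

```python
from typing import List


def training_schedule(warmup_iters: int, num_iters: int, k: int) -> List[str]:
    """Return the ordered parameter updates (as labels) for one training run."""
    steps: List[str] = []
    # Warm-up phase: encoder and decoder are trained on the parsing loss only.
    for _ in range(warmup_iters):
        steps.append("encoder <- L_p")
        steps.append("decoder <- L_p")
    # Adversarial phase: k discriminator steps, then one parser step.
    for _ in range(warmup_iters, num_iters):
        for _ in range(k):
            steps.append("discriminator <- L_d")
        steps.append("encoder <- L_p - lambda * L_d")
        steps.append("decoder <- L_p")
    return steps


# Toy values for illustration; the tuned range for I is reported in the settings.
schedule = training_schedule(warmup_iters=2, num_iters=5, k=3)
```

The warm-up phase lets the parser reach a reasonable state before the adversarial signal is introduced; afterwards the discriminator is refreshed k times for each encoder update.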

Parsing
To train the parser, we adopt the cross-entropy objectives for both types of parsers, following Dozat and Manning (2017) and Ma et al. (2018). The encoder and the decoder are jointly trained to maximize the probability of the dependency trees (y) given sentences (x). The probability of a tree further factorizes into the product of the probabilities of each token's (m) head decision (h(m)) for the graph-based parser, or of each transition step decision (t_i) for the transition-based parser.
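Written out, the parsing objective described above can be sketched as follows. This is a reconstruction from the prose rather than a verbatim copy of the original equations; the conditioning notation and the index bounds n (sentence length) and T (number of transitions) are assumptions.

```latex
\begin{align*}
\mathcal{L}_p &= -\log p(y \mid x) \\
p(y \mid x) &= \prod_{m=1}^{n} p\big(h(m) \mid x, m\big)
  && \text{(graph-based parser)} \\
p(y \mid x) &= \prod_{i=1}^{T} p\big(t_i \mid x, t_{<i}\big)
  && \text{(transition-based parser)}
\end{align*}
```

Under either factorization, minimizing the cross-entropy of the individual decisions is equivalent to maximizing the log-probability of the gold tree.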

Language Identification
Our objective is to train the contextual encoder in a dependency parsing model such that it encodes language-specific features as little as possible, which may help cross-lingual transfer. To achieve this goal, we utilize adversarial training by employing unlabeled auxiliary language corpora.
Setup. We adopt the basic generative adversarial network (GAN) for the adversarial training. Let X_a and X_b be the corpora of source and auxiliary language sentences, respectively. The discriminator acts as a binary classifier that distinguishes the source language from the auxiliary language. To train the discriminator, its weights (θ_d) are updated according to the original classification loss, L_d. For the training of dependency parsing, the generator, G, collaborates with the parser but acts as an adversary with respect to the discriminator. Therefore, the generator weights (θ_g) are updated by minimizing the total loss L = L_p − λ L_d (as in Algorithm 1), where λ is used to scale the discriminator loss (L_d). In this way, the generator is guided to build language-agnostic representations in order to fool the discriminator while being helpful for the parsing task. Meanwhile, the parser can be guided to rely more on the language-agnostic features.
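The interplay between the two losses can be made concrete with a small numeric sketch. It assumes that L_d is the standard binary cross-entropy of the discriminator (a quantity the discriminator minimizes) and that D(G(x)) is the predicted probability that a sentence comes from the source language; the function names and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def discriminator_loss(p_src_on_source: Sequence[float],
                       p_src_on_aux: Sequence[float]) -> float:
    """Binary cross-entropy L_d; inputs are D(G(x)) = P(sentence is source)."""
    loss_source = -sum(math.log(p) for p in p_src_on_source) / len(p_src_on_source)
    loss_aux = -sum(math.log(1.0 - p) for p in p_src_on_aux) / len(p_src_on_aux)
    return loss_source + loss_aux


def generator_loss(parsing_loss: float, disc_loss: float, lam: float) -> float:
    """Total loss L = L_p - lambda * L_d, minimized with respect to the encoder."""
    return parsing_loss - lam * disc_loss


# A discriminator at chance level (p = 0.5 everywhere) has L_d = 2 * ln 2.
l_d_chance = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator has a smaller L_d ...
l_d_confident = discriminator_loss([0.9, 0.9], [0.1, 0.1])
# ... which makes the encoder's total loss larger for the same parsing loss.
total_chance = generator_loss(parsing_loss=1.0, disc_loss=l_d_chance, lam=0.01)
total_confident = generator_loss(parsing_loss=1.0, disc_loss=l_d_confident, lam=0.01)
```

With this sign convention, the encoder lowers its total loss by making the discriminator less accurate, which is the adversarial pressure toward language-agnostic representations.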
Alternatives. We also consider two alternative techniques for the adversarial training: Gradient Reversal (GR) (Ganin et al., 2016) and Wasserstein GAN (WGAN). As opposed to GAN-based training, in the GR setup, the discriminator acts as a multi-class classifier that predicts the language identity of the input sentence, and we use a multi-class cross-entropy loss. WGAN was proposed to improve the stability of GAN-based learning; its loss function uses the same notation as in the GAN setting.
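For reference, the standard WGAN critic objective, written in the notation of the GAN setting, takes the following form. This is the textbook formulation and an assumption about the exact equation used here; in particular, the sign convention and the constraint that keeps D approximately Lipschitz (for example, weight clipping) are not specified by the surrounding text.

```latex
\mathcal{L}_d = \mathbb{E}_{x \sim X_a}\big[D(G(x))\big]
              - \mathbb{E}_{x \sim X_b}\big[D(G(x))\big]
```

Unlike in the GAN setting, D here outputs an unbounded score rather than a probability, so no logarithm appears in the loss.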

Experiments and Analysis
In this section, we discuss our experiments and analysis on cross-lingual dependency parsing transfer from a variety of perspectives and show the advantages of adversarial training.

Settings.
In our experiments, we study single-source parsing transfer, where a parsing model is trained on one source language and directly applied to the target languages. We conduct experiments on the Universal Dependencies (UD) Treebanks (v2.2) (Nivre et al., 2018) using 29 languages, as shown in Table 1. We use the publicly available implementation 4 of the "SelfAtt-Graph" and "RNN-Stack" parsers. 5 Ahmad et al. (2019) show that the "SelfAtt-Graph" parser captures less language-specific information and performs better than the "RNN-Stack" parser for distant target languages. Therefore, we use the "SelfAtt-Graph" parser in most of our experiments. Besides, the multilingual variant of BERT (mBERT) (Devlin et al., 2019) has been shown to perform well in cross-lingual tasks (Wu and Dredze, 2019) and to outperform models trained on multilingual word embeddings by a large margin. Therefore, we conduct experiments with both multilingual word embeddings and mBERT. As word representations, we use either aligned multilingual word embeddings (Smith et al., 2017; Bojanowski et al., 2017) with 300 dimensions or contextualized word representations provided by multilingual BERT 6 (Devlin et al., 2019) with 768 dimensions. In addition, we use gold universal POS tags to form the input representations. 7 We freeze the word representations during training to avoid the risk of disarranging the multilingual representation alignments.
We select six auxiliary languages 8 (French, Portuguese, Spanish, Russian, German, and Latin) for unsupervised language adaptation via adversarial training. We tune the scaling parameter λ over the values {0.1, 0.01, 0.001} on the source language validation set and report the test performance with the best value. For gradient reversal (GR) and GAN-based adversarial objectives, we use Adam (Kingma and Ba, 2015) to optimize the discriminator parameters, and for WGAN, we use RMSProp (Tieleman and Hinton, 2012). The learning rate is set to 0.001 and 0.00005 for Adam and RMSProp, respectively. We train the parsing models for 400 and 500 epochs with multilingual BERT and multilingual word embeddings, respectively. We tune the parameter I (as shown in Algorithm 1) over the values {50, 100, 150}.
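The reported hyperparameters can be gathered into a single configuration for reference. The sketch below is an illustrative layout, not the format of the released code: the dictionary keys and the function `candidate_configs` are hypothetical, and enumerating λ and I jointly as a grid is an assumption, since the text does not say whether they were tuned together.

```python
from itertools import product

# Values copied from the settings described above; key names are hypothetical.
SETTINGS = {
    "lambda_grid": [0.1, 0.01, 0.001],
    "warmup_grid": [50, 100, 150],
    "optimizer": {"GR": "Adam", "GAN": "Adam", "WGAN": "RMSProp"},
    "learning_rate": {"Adam": 0.001, "RMSProp": 0.00005},
    "epochs": {"mbert": 400, "word_embeddings": 500},
    "input_dim": {"mbert": 768, "word_embeddings": 300},
    "auxiliary_languages": ["fr", "pt", "es", "ru", "de", "la"],
}


def candidate_configs(settings):
    """Enumerate every (lambda, I) pair (assumes joint grid search)."""
    return [
        {"lambda": lam, "warmup_iters": warmup}
        for lam, warmup in product(settings["lambda_grid"], settings["warmup_grid"])
    ]


configs = candidate_configs(SETTINGS)
```

Each candidate would be scored on the source language validation set, as stated above, and the best one used for the reported test results.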
Language Test. The goal of training the contextual encoder adversarially with unlabeled data from auxiliary languages is to encourage the encoder to capture more language-agnostic representations and fewer language-dependent features. To test whether the contextual encoders retain language information after adversarial training, we train a multi-layer perceptron (MLP) with softmax on top of the fixed contextual encoders to perform a 7-way classification task. 9 If a contextual encoder performs better in the language test, it indicates that the encoder retains more language-specific information.
Table 2: Cross-lingual transfer performances (UAS%/LAS%, excluding punctuation) of the SelfAtt-Graph parser (Ahmad et al., 2019) on the test sets. In column 1, languages are sorted by the word-ordering distance to English. (en-fr) and (en-ru) denote the source-auxiliary language pairs. ' †' indicates that the adversarially trained model results are statistically significantly better (by permutation test, p < 0.05) than the model trained only on the source language (en). Results show that the utilization of unlabeled auxiliary language corpora improves cross-lingual transfer performance significantly.
The results demonstrate that adversarial training with the auxiliary language identification task benefits cross-lingual transfer with a small performance drop on the source language. When multilingual word embeddings are employed, the performance improves significantly, with average UAS gains of 0.48 and 0.61 over the 29 languages when French and Russian are used as the auxiliary language, respectively. When a richer multilingual representation technique such as mBERT is employed, adversarial training can still improve cross-lingual transfer performance (0.21 and 0.54 UAS over the 29 languages when using French and Russian, respectively).
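The significance marker in the table caption relies on a permutation test. As a rough illustration of the idea, the sketch below implements an exact paired sign-flip permutation test; it is one common variant and an assumption, since the exact test and the unit of pairing used in the experiments are not described. The function name and the scores are hypothetical toy values, not reported results.

```python
from itertools import product
from typing import Sequence


def paired_permutation_p_value(baseline: Sequence[float],
                               adversarial: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired score differences."""
    if len(baseline) != len(adversarial) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(adversarial, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up paired scores for six evaluation units (illustration only).
baseline_scores = [60.0, 55.0, 70.0, 65.0, 50.0, 45.0]
adversarial_scores = [61.0, 56.0, 71.0, 66.0, 51.0, 46.0]
p_value = paired_permutation_p_value(baseline_scores, adversarial_scores)
```

Exhaustive enumeration is only feasible for a small number of pairs; with many pairs, random sampling of sign assignments is the usual substitute.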

Results and Analysis
Next, we apply adversarial training on the "RNN-Stack" parser and show the results in Table 3. Similar to the "SelfAtt-Graph" parser, the "RNN-Stack" parser achieves significant improvements in cross-lingual transfer from unsupervised language adaptation. We discuss our detailed experimental analysis in the following.

Impact of Adversarial Training
To understand the impact of different adversarial training types and objectives, we apply adversarial training at both the word and sentence level with gradient reversal (GR), GAN, and WGAN objectives. We provide the average cross-lingual transfer performances in Table 4 for different adversarial training setups. Among the adversarial training objectives, we observe that in most cases, the GAN objective results in better performance than the GR and WGAN objectives. Our finding is in contrast to Adel et al. (2018), where GR was reported to be the better objective. To further investigate, we perform the language test on the encoders trained via these two objectives. We find that the GR-trained encoders perform consistently better than the GAN-trained ones on the language identification task, showing that via GAN-based training, the encoders become more language-agnostic. In a comparison between GAN and WGAN, we notice that GAN-based training consistently performs better.
Comparing word- and sentence-level adversarial training, we observe that predicting language identity at the word level is slightly more useful for the "SelfAtt-Graph" model, while sentence-level adversarial training results in better performance for the "RNN-Stack" model. There is no clear dominant strategy.
In addition, we study the effect of using a linear classifier or a multi-layer perceptron (MLP) as the discriminator and find that the interaction between the encoder and the linear classifier results in improvements. 10

Adversarial vs. Multi-task Training
In section 3.1.1, we study the effect of learning language-agnostic representations by using auxiliary languages with adversarial training. An alternative way to leverage auxiliary language corpora is to encode language-specific information in the representation via multi-task learning. In the multi-task learning (MTL) setup, the model observes the same amount of data (both labeled and unlabeled) as the adversarially trained (AT) model. The only difference between the MTL and AT models is that in the MTL models, the contextual encoders are encouraged to capture language-dependent features, while in the AT models, they are trained to encode language-agnostic features.
The experimental results using multi-task learning in comparison with adversarial training are presented in Table 5. Interestingly, although the MTL objective seems to contradict adversarial learning, it has a positive effect on cross-lingual parsing, as the representations are learned with certain additional information from new (unlabeled) data. Using MTL, we sometimes observe improvements over the baseline parser, as indicated with the † sign, while the AT models consistently perform better than both the baseline and the MTL model (as shown in Columns 2-5 in Table 5). The comparisons on parsing performance do not reveal whether the contextual encoders learn to encode language-agnostic or language-dependent features. Therefore, we perform the language test with the MTL and AT (GAN-based) encoders, and the results are shown in Table 5, Columns 6-7. The results indicate that the MTL encoders consistently perform better than the AT encoders on the language test, which verifies our hypothesis that adversarial training motivates the contextual encoders to encode language-agnostic features.
10 This is a known issue in GAN training: when the discriminator becomes too strong, it fails to provide useful signals to the generator. In our case, the MLP discriminator predicts the language labels with higher accuracy and thus fails.
Table 3: Cross-lingual transfer results (UAS%/LAS%, excluding punctuation) of the RNN-Stack parser on the test sets. ' †' indicates that the adversarially trained model results are statistically significantly better (by permutation test, p < 0.05) than the model trained only on the source language (en).

Impact of Auxiliary Languages
To analyze the effects of the auxiliary languages in cross-language transfer via adversarial training, we perform experiments by pairing up 11 the source language (English) with six different languages (spanning Germanic, Romance, Slavic, and Latin language families) as the auxiliary language. The average cross-lingual transfer performances are presented in Table 6, and the results suggest that Russian (ru) and German (de) are better candidates for auxiliary languages. We then dive deeper into the effects of auxiliary languages, trying to understand whether auxiliary languages particularly benefit target languages that are closer to them 12 or from the same family. Intuitively, we would assume that when the auxiliary language has a smaller average distance to all the target languages, the cross-lingual transfer performance would be better. However, from the results in Table 6, we do not see such a pattern. … auxiliary languages we tested, but it is not among the better auxiliary languages.
We further zoom in on the cross-lingual transfer improvements for each language family, as shown in Table 7. We hypothesize that auxiliary languages are more helpful for target languages in the same family. The experimental results moderately correlate with our expectation. Specifically, the Germanic family benefits the most from employing German (de) as the auxiliary language; similarly, the Slavic family benefits the most from Russian (ru) as the auxiliary language (although German as the auxiliary language brings similar improvements). The Romance family is an exception because it benefits the least from using French (fr) as the auxiliary language. This may be due to the fact that French is too close to English and is thus less suitable as an auxiliary language.
Table 5: Comparison between adversarial training (AT) and multi-task learning (MTL) of the contextual encoders. Columns 2-5 show the parsing performances (UAS%/LAS%, excluding punctuation) on the auxiliary languages and the average over the 29 languages. Columns 6-7 present the accuracy (%) of the language label prediction test. ' †' indicates that the performance is higher than the baseline performance (shown in the 2nd column of …

Related Work
Unsupervised Cross-lingual Parsing. Unsupervised cross-lingual transfer for dependency parsing has been studied over the past few years (Agić et al., 2014; Ma and Xia, 2014; Xiao and Guo, 2014; Tiedemann, 2015; Guo et al., 2015; Aufrant et al., 2015; Rasooli and Collins, 2015; Duong et al., 2015; Schlichtkrull and Søgaard, 2017; Ahmad et al., 2019; Rasooli and Collins, 2019; He et al., 2019). Here, "unsupervised transfer" refers to the setting where a parsing model trained only on the source language is directly transferred to the target languages. In this work, we relax the setting by allowing unlabeled data from one or more auxiliary (helper) languages other than the source language. This setting has been explored in a few prior works. Cohen et al. (2011) learn a generative target language parser with unannotated target data as a linear interpolation of the source language parsers. Täckström et al. (2013) adopt unlabeled target language data and a learning method that can incorporate diverse knowledge sources through ambiguous labeling for transfer parsing. In comparison, we leverage unlabeled auxiliary language data to learn language-agnostic contextual representations to improve cross-lingual transfer.
Table 7: Average cross-lingual performance difference between the SelfAtt-Graph parser trained on the source (en) and an auxiliary (x) language and the SelfAtt-Graph parser trained only on English (en) (UAS%/LAS%, excluding punctuation). We use multilingual BERT in this set of experiments.
Multilingual Representation Learning. The basis of unsupervised cross-lingual parsing is that we can align the representations of different languages into the same space, at least at the word level. The recent development of bilingual or multilingual word embeddings provides us with such shared representations. We refer the readers to the surveys of Ruder et al. (2017) and Glavaš et al. (2019) for details. The main idea is that we can train a model on top of the source language embeddings, which are aligned to the same space as the target language embeddings, and thus all the model parameters can be directly shared across languages. During transfer to a target language, we simply replace the source language embeddings with the target language embeddings. This idea is further extended to learn multilingual contextualized word representations, for example, multilingual BERT (Devlin et al., 2019), which have been shown to be very effective for many cross-lingual transfer tasks (Wu and Dredze, 2019). In this work, we show that further improvements can be achieved by adapting the contextual encoders via unlabeled auxiliary languages even when the encoders are trained on top of multilingual BERT.
Adversarial Training. The concept of adversarial training via Generative Adversarial Networks (GANs) (Szegedy et al., 2014; Goodfellow et al., 2015) was initially introduced in computer vision for image classification and achieved enormous success in improving models' robustness to input images with perturbations. Later, many variants of GANs (Gulrajani et al., 2017) were proposed to improve their training stability. In NLP, adversarial training was first utilized for domain adaptation (Ganin et al., 2016). Since then, adversarial training has received increasing interest in the NLP community and has been applied to many NLP applications, including part-of-speech (POS) tagging (Gui et al., 2017; Yasunaga et al., 2018), dependency parsing (Sato et al., 2017), relation extraction (Wu et al., 2017), text classification (Miyato et al., 2017), and dialogue generation (Li et al., 2017).
In the context of cross-lingual NLP tasks, many recent works adopted adversarial training, such as in sequence tagging (Adel et al., 2018), text classification (Xu and Yang, 2017), word embedding induction (Lample et al., 2018), relation classification (Zou et al., 2018), opinion mining (Wang and Pan, 2018), and question-question similarity reranking (Joty et al., 2017). However, existing approaches only consider using the target language as the auxiliary language. It is unclear whether the language-invariant representations learned by previously proposed methods can perform well on a wide variety of unseen languages. To the best of our knowledge, we are the first to study the effects of language-agnostic representations on a broad spectrum of languages.

Conclusion
In this paper, we study learning language-invariant contextual encoders for cross-lingual transfer. Specifically, we leverage unlabeled sentences from auxiliary languages and adversarial training to induce language-agnostic encoders that improve the performance of cross-lingual dependency parsing. Experiments and analysis using English as the source language and six foreign languages as the auxiliary languages not only show improvements on cross-lingual dependency parsing, but also demonstrate that contextual encoders successfully learn not to capture language-dependent features through adversarial training. In the future, we plan to investigate the effectiveness of adversarial training for multi-source transfer to parsing and other cross-lingual NLP applications.