Adversarial Learning with Contextual Embeddings for Zero-resource Cross-lingual Classification and NER

Contextual word embeddings (e.g. GPT, BERT, ELMo, etc.) have demonstrated state-of-the-art performance on various NLP tasks. Recent work with the multilingual version of BERT has shown that the model performs surprisingly well in cross-lingual settings, even when only labeled English data is used to finetune the model. We improve upon multilingual BERT’s zero-resource cross-lingual performance via adversarial learning. We report the magnitude of the improvement on the multilingual MLDoc text classification and CoNLL 2002/2003 named entity recognition tasks. Furthermore, we show that language-adversarial training encourages BERT to align the embeddings of English documents and their translations, which may be the cause of the observed performance gains.


Introduction
Contextual word embeddings (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2019) have been successfully applied to various NLP tasks, including named entity recognition, document classification, and textual entailment. The multilingual version of BERT (which is trained on Wikipedia articles from 100 languages and equipped with a shared vocabulary of 110,000 wordpieces) has also demonstrated the ability to perform 'zero-resource' cross-lingual classification on the XNLI dataset (Conneau et al., 2018). Specifically, when multilingual BERT is finetuned for XNLI with English data alone, the model also gains the ability to handle the same task in other languages. We believe that this zero-resource transfer learning can be extended to other multilingual datasets.
In this work, we explore BERT's zero-resource performance on the multilingual MLDoc classification and CoNLL 2002/2003 NER tasks ('BERT' hereafter refers to multilingual BERT). We find that the baseline zero-resource performance of BERT exceeds the results reported in other work, even though cross-lingual resources (e.g. parallel text or dictionaries) are not used during BERT pretraining or finetuning. We apply adversarial learning to further improve upon this baseline, achieving state-of-the-art zero-resource results.
There are many recent approaches to zero-resource cross-lingual classification and NER, including adversarial learning (Chen et al., 2019; Kim et al., 2017; Xie et al., 2018; Joty et al., 2017), using a model pretrained on parallel text (Artetxe and Schwenk, 2018; Lu et al., 2018; Lample and Conneau, 2019) and self-training (Hajmohammadi et al., 2015). Due to the newness of the subject matter, the definition of 'zero-resource' varies somewhat from author to author. For the experiments that follow, 'zero-resource' means that, during model training, we do not use labels from non-English data, nor do we use human or machine-generated parallel text. Only labeled English text and unlabeled non-English text are used during training, and hyperparameters are selected using English evaluation sets.
Our contributions are the following:
• We demonstrate that the addition of a language-adversarial task during finetuning for multilingual BERT can significantly improve the zero-resource cross-lingual transfer performance.
• For both MLDoc classification and CoNLL NER, we find that, even without adversarial training, the baseline multilingual BERT performance can exceed previously published results on zero-resource performance.
• We show that adversarial techniques encourage BERT to align the representations of English documents and their translations. We speculate that this alignment causes the observed improvement in zero-resource performance.
Related Work

Adversarial Learning
Language-adversarial training (Zhang et al., 2017) was proposed for generating bilingual dictionaries without parallel data. This idea was extended to zero-resource cross-lingual tasks in NER (Kim et al., 2017; Xie et al., 2018) and text classification (Chen et al., 2019), where we would expect that language-adversarial techniques induce features that are language-independent.

Self-training Techniques
Self-training, where an initial model is used to generate labels on an unlabeled corpus for the purpose of domain or cross-lingual adaptation, was studied in the context of text classification (Hajmohammadi et al., 2015) and parsing (McClosky et al., 2006; Zeman and Resnik, 2008). A similar idea based on expectation-maximization, where the unobserved label is treated as a latent variable, has also been applied to cross-lingual text classification in Rigutini et al. (2005).

Translation as Pretraining
Artetxe and Schwenk (2018) and Lu et al. (2018) use the encoders from machine translation models as a starting point for task-specific finetuning, which permits various degrees of multilingual transfer. Lample and Conneau (2019) add a masked translation task to the BERT pretraining process and observe an improvement in the cross-lingual setting over using the monolingual masked text task alone.

Model Training
We present an overview of the adversarial training process in Figure 1. We used the pretrained cased multilingual BERT model as the initialization for all of our experiments. Note that the BERT output embeddings have 768 dimensions. We always use the labeled English data of each corpus. We use the non-English text portion (without the labels) for the adversarial training.
We formulate the adversarial task as a binary classification problem (i.e., English versus non-English). We add a language discriminator module which uses the BERT embeddings to classify whether the input sentence was written in English or the non-English language. We also add a generator loss which encourages BERT to produce embeddings that are difficult for the discriminator to classify correctly. In this way, the BERT model learns to generate embeddings that do not contain language-specific information.
The pseudocode for our procedure can be found in Algorithm 1. In the description that follows, we use a batch size of 1 for clarity.
For language-adversarial training for the classification task, we have three loss functions: the task-specific loss L_T, the generator loss L_G, and the discriminator loss L_D. Here, K is the number of classes for the task, W_T, b_T and w_D, b_D are the output projections for the task-specific loss and the discriminator respectively, y_T (dim: K × 1) is the one-hot vector representation of the task label, and y_A (dim: scalar) is the binary label for the adversarial task (i.e., 1 for English and 0 for non-English).
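The three losses described above can be written out as follows. The task loss is the cross-entropy that appears in Algorithm 1 (L_T ← −y_T · log p_T). The softmax and sigmoid output layers, the symbol \bar{h}_\theta(x) for the mean-pooled BERT embedding, and the label-flipped form of the generator loss are assumptions chosen to agree with the surrounding definitions; the text does not confirm these exact forms.

```latex
\begin{align}
p_T &= \operatorname{softmax}\!\left(W_T\,\bar{h}_\theta(x) + b_T\right) \\
p_D &= \sigma\!\left(w_D^{\top}\,\bar{h}_\theta(x) + b_D\right) \\
L_T &= -\,y_T \cdot \log p_T \\
L_D &= -\,y_A \log p_D - (1 - y_A)\log(1 - p_D) \\
L_G &= -\,(1 - y_A)\log p_D - y_A \log(1 - p_D)
\end{align}
```

Under this reading, L_G is the discriminator cross-entropy with the language label flipped, so lowering L_G with respect to θ pushes the embeddings toward the region where the discriminator is wrong.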
In the case of NER, the task-specific loss has an additional summation over the length of the sequence. Here, p(Y_t | x) (dim: K × 1) is the prediction for the t-th word, L is the number of words in the sentence, y_T (dim: K × L) is the matrix of one-hot entity labels, and h_θ(x)_t (dim: 768 × 1) refers to the BERT embedding of the t-th word.
The generator and discriminator losses remain the same for NER, and we continue to use the mean-pooled BERT embedding during adversarial training.
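A sketch of the NER task loss consistent with the description above: the per-word cross-entropy summed over the L words of the sentence. The softmax output layer and the notation y_{T,t} for the t-th column of y_T are assumptions, and the text does not say whether the sum is normalized by L.

```latex
\begin{align}
p(Y_t \mid x) &= \operatorname{softmax}\!\left(W_T\, h_\theta(x)_t + b_T\right) \\
L_T &= -\sum_{t=1}^{L} y_{T,t} \cdot \log p(Y_t \mid x)
\end{align}
```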
We then take the gradients with respect to the three losses and the relevant parameter subsets. The parameter subsets are θ_D = {w_D, b_D}, θ_T = {θ, W_T, b_T}, and θ_G = {θ}. We apply the gradient updates sequentially at a 1:1:1 ratio.
During BERT finetuning, the learning rates for the task loss, generator loss and discriminator loss were kept constant; we do not apply a learning rate decay.
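The sequential 1:1:1 updates over the three parameter subsets can be illustrated with a small, dependency-free sketch. Everything here is a stand-in rather than the authors' implementation: a linear map `theta` replaces BERT, plain SGD replaces Adam, the generator loss is assumed to be the discriminator cross-entropy with flipped labels, and the order of the discriminator and generator updates is assumed. The names `train_step`, `update_encoder`, and `x_en_adv` are hypothetical.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_tokens(tokens):
    # Average the token vectors of one document (mean pooling).
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def update_encoder(theta, dh, x_bar, lr):
    # Toy encoder: h_bar = theta @ x_bar, so dL/dtheta[i][j] = dh[i] * x_bar[j].
    for i, row in enumerate(theta):
        for j in range(len(row)):
            row[j] -= lr * dh[i] * x_bar[j]


def train_step(p, x_en, y_task, x_l, x_en_adv, lr_t, lr_d, lr_g):
    """Apply the task, discriminator and generator updates once each (1:1:1)."""
    theta, W_T, b_T = p["theta"], p["W_T"], p["b_T"]

    # Task update on labeled English text; touches theta_T = {theta, W_T, b_T}.
    x_bar = mean_tokens(x_en)
    h = matvec(theta, x_bar)
    p_T = softmax([z + b for z, b in zip(matvec(W_T, h), b_T)])
    loss_t = -math.log(p_T[y_task])
    dz = [q - (1.0 if k == y_task else 0.0) for k, q in enumerate(p_T)]
    dh = [sum(W_T[k][i] * dz[k] for k in range(len(dz))) for i in range(len(h))]
    for k, row in enumerate(W_T):
        for i in range(len(row)):
            row[i] -= lr_t * dz[k] * h[i]
        b_T[k] -= lr_t * dz[k]
    update_encoder(theta, dh, x_bar, lr_t)

    # Adversarial batch: y_A = 1 for English, 0 for the non-English language.
    batch = [(mean_tokens(x_en_adv), 1.0), (mean_tokens(x_l), 0.0)]

    # Discriminator update; touches theta_D = {w_D, b_D} only.
    loss_d, g_w, g_b = 0.0, [0.0] * len(p["w_D"]), 0.0
    for x_bar, y_a in batch:
        h = matvec(theta, x_bar)
        p_D = sigmoid(dot(p["w_D"], h) + p["b_D"])
        loss_d -= y_a * math.log(p_D) + (1.0 - y_a) * math.log(1.0 - p_D)
        g_w = [g + (p_D - y_a) * v for g, v in zip(g_w, h)]
        g_b += p_D - y_a
    p["w_D"] = [w - lr_d * g for w, g in zip(p["w_D"], g_w)]
    p["b_D"] -= lr_d * g_b

    # Generator update; touches theta_G = {theta} only. The labels are flipped
    # (assumption), so the encoder moves toward embeddings the discriminator
    # gets wrong. The gradient is outer(w_D, v) with v summed over the batch.
    loss_g, v = 0.0, [0.0] * len(batch[0][0])
    for x_bar, y_a in batch:
        p_D = sigmoid(dot(p["w_D"], matvec(theta, x_bar)) + p["b_D"])
        loss_g -= (1.0 - y_a) * math.log(p_D) + y_a * math.log(1.0 - p_D)
        v = [a + (p_D - (1.0 - y_a)) * b for a, b in zip(v, x_bar)]
    update_encoder(theta, p["w_D"], v, lr_g)

    return loss_t, loss_d, loss_g
```

Each loss only moves its own parameter subset: the discriminator step leaves `theta` untouched, and the generator step leaves `w_D` and `b_D` untouched. Setting `lr_t`, `lr_d`, and `lr_g` independently mirrors the use of a separate learning rate for each loss.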
All hyperparameters were tuned on the English dev sets only, and we use the Adam optimizer in all experiments. We report results based on the average of 4 training runs.

MLDoc Classification Results
We finetuned BERT on the English portion of the MLDoc corpus (Schwenk and Li, 2018). The MLDoc task is a 4-class classification problem, where the data is a class-balanced subset of the Reuters News RCV1 and RCV2 datasets.
We used the english.train.1000 dataset for the classification loss, which contains 1000 labeled documents. For language-adversarial training, we used the text portion of german.train.10000, french.train.10000, etc. without the labels.
Algorithm 1: Pseudocode for adversarial training on the multilingual text classification task. The batch size is set at 1 for clarity. The parameter subsets are θ_D = {w_D, b_D}, θ_T = {θ, W_T, b_T}, and θ_G = {θ}.
Input: pre-trained BERT model h_θ, data iterators for English and the non-English language L, learning rates η_D, η_G, η_T for each loss function, initializations for discriminator output projection w_D, b_D, task-specific output projection W_T, b_T, and BERT parameters θ
1: while not converged do
2:   x_En, y_T ← DataIterator(En)   ▷ get English text and task-specific labels
3:   h_En ← MeanPool(h_θ(x_En))
4:   …   ▷ compute the prediction for the task
5:   L_T ← −y_T · log p_T   ▷ compute task-specific loss
6:   θ, W_T, b_T += −η_T ∇_{θ_T} L_T   ▷ update model based on task-specific loss
7:   x_L, x_En ← DataIterator(L), DataIterator(En)   ▷ get non-English and English text
8:   …
9:   …   ▷ discriminator prediction on non-English text
10:  …   ▷ discriminator prediction on English text
11:  …

We used a learning rate of 2 × 10^−6 for the task loss, 2 × 10^−8 for the generator loss and 5 × 10^−5 for the discriminator loss.
In Table 1, we report the classification accuracy for all of the languages in MLDoc. Generally, adversarial training improves the accuracy across all languages, and the improvement is sometimes dramatic versus the BERT non-adversarial baseline.
In Figure 2, we plot the zero-resource German and Japanese test set accuracy as a function of the number of steps taken, with and without adversarial training. The plot shows that the variation in the test accuracy is reduced with adversarial training, which suggests that the cross-lingual performance is more consistent when adversarial training is applied. (We note that the batch size and learning rates are the same for all the languages in MLDoc, so the variation seen in Figure 2 is not affected by those factors.)

CoNLL NER Results
We finetuned BERT on the English portion of the CoNLL 2002/2003 NER corpus (Sang and De Meulder, 2003). We followed the text preprocessing in Devlin et al. (2019).
We used a learning rate of 6 × 10^−6 for the task loss, 6 × 10^−8 for the generator loss and 5 × 10^−4 for the discriminator loss.
In Table 2, we report the F1 scores for all of the CoNLL NER languages. When combined with adversarial learning, the BERT cross-lingual F1 scores increased for German over the non-adversarial baseline, and the scores remained largely the same for Spanish and Dutch. Regardless, the BERT zero-resource performance far exceeds the results published in previous work. Mayhew et al. (2017) and Ni et al. (2017) do use some cross-lingual resources (like bilingual dictionaries) in their experiments, but it appears that BERT with multilingual pretraining performs better, even though it does not have access to cross-lingual information.

Alignment of Embeddings for Parallel Documents
Table 3: Median cosine similarity between the mean-pooled BERT embeddings of MLDoc English documents and their translations, with and without language-adversarial training. The median cosine similarity increased with adversarial training for every language pair, which suggests that the adversarial loss encourages BERT to learn language-independent representations.
If language-adversarial training encourages language-independent features, then the English documents and their translations should be close in the embedding space. To examine this hypothesis, we take the English documents from the MLDoc training corpus and translate them into German, Spanish, French, etc. using Amazon Translate.
We construct the embeddings for each document using BERT models finetuned on MLDoc. We mean-pool each document embedding to create a single vector per document. We then calculate the cosine similarity between the embeddings for the English document and its translation. In Table 3, we observe that the median cosine similarity increases dramatically with adversarial training, which suggests that the embeddings became more language-independent.
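The similarity measurement described above (mean-pool each document, take the cosine similarity of each English/translation pair, then report the median) can be sketched as below. The function names are hypothetical, and the inputs are assumed to be per-token embedding vectors already produced by the finetuned model, given as plain Python lists.

```python
import math
from statistics import median


def mean_pool(token_embeddings):
    # Collapse a document's per-token vectors into one vector by averaging.
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_pair_similarity(english_docs, translated_docs):
    # english_docs[i] and translated_docs[i] are the token embeddings of an
    # English document and of its translation, respectively.
    sims = [
        cosine_similarity(mean_pool(en), mean_pool(tr))
        for en, tr in zip(english_docs, translated_docs)
    ]
    return median(sims)
```

The median is less sensitive than the mean to a few badly aligned document pairs, which may be one reason to prefer it when summarizing each language pair.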

Discussion
For many of the languages examined, we were able to improve on BERT's zero-resource cross-lingual performance on the MLDoc classification and CoNLL NER tasks. Language-adversarial training was generally effective, though the size of the effect appears to depend on the task. We observed that adversarial training moves the embeddings of English text and their non-English translations closer together, which may explain why it improves cross-lingual performance.
Future directions include adding the language-adversarial task during BERT pretraining on the multilingual Wikipedia corpus, which may further improve zero-resource performance, and finding better stopping criteria for zero-resource cross-lingual tasks besides using the English dev set.