Cross-lingual Alignment Methods for Multilingual BERT: A Comparative Study

Multilingual BERT (mBERT) has shown reasonable capability for zero-shot cross-lingual transfer when fine-tuned on downstream tasks. Since mBERT is not pre-trained with explicit cross-lingual supervision, transfer performance can be further improved by aligning mBERT with cross-lingual signal. Prior work proposes several approaches to align contextualised embeddings. In this paper we analyse how different forms of cross-lingual supervision and various alignment methods influence the transfer capability of mBERT in a zero-shot setting. Specifically, we compare parallel-corpus vs. dictionary-based supervision and rotational vs. fine-tuning based alignment methods. We evaluate the performance of the different alignment methodologies across eight languages on two tasks: Named Entity Recognition and Semantic Slot Filling. In addition, we propose a novel normalisation method which consistently improves the performance of rotation-based alignment, including a notable 3% F1 improvement for distant and typologically dissimilar languages. Importantly, we identify the biases of the alignment methods towards the type of task and the proximity of the transfer language. We also find that supervision from parallel corpora is generally superior to dictionary alignments.


Introduction
Multilingual BERT (mBERT) (Devlin et al., 2019) is the BERT architecture trained on data from 104 languages, where all languages are embedded in the same vector space. Due to its multilingual and contextual representation properties, mBERT has gained popularity in various multilingual and cross-lingual tasks (Karthikeyan et al., 2020; Wu and Dredze, 2019). In particular, it has demonstrated good zero-shot cross-lingual transfer performance on many downstream tasks, such as Document Classification, NLI, NER, POS tagging, and Dependency Parsing (Wu and Dredze, 2019), when the source and the target languages are similar.
* Work done during an internship at Amazon.
Many experiments (Ahmad et al., 2019) suggest that to achieve reasonable performance in the zero-shot setup, the source and the target languages need to share similar grammatical structure or lie in the same language family. In addition, since mBERT is not trained with explicit cross-lingual signal, its multilingual representations are less effective for languages with little lexical overlap (Patra et al., 2019). One branch of work is therefore dedicated to improving the multilingual properties of mBERT by aligning the embeddings of different languages with cross-lingual supervision.
Broadly, two methods have been proposed in prior work to induce cross-lingual signals in contextual embeddings: 1) Rotation Alignment, described in Section 2, aims at learning a linear rotation transformation to project source language embeddings into their respective locations in the target language space (Schuster et al., 2019b; Wang et al., 2019; Aldarmaki and Diab, 2019); 2) Fine-tuning Alignment, explained in Section 3, internally aligns language sub-spaces in mBERT by tuning its weights such that distances between embeddings of word translations decrease without losing the informativity of the embeddings (Cao et al., 2020). Additionally, two sources of cross-lingual signal have been considered in the literature to align languages: parallel corpora and bilingual dictionaries. While each alignment method and source of supervision has a variety of advantages and disadvantages, it is unclear how these affect the performance of the aligned spaces across languages and tasks.
In this paper, we empirically investigate the effect of these cross-lingual alignment methodologies and the applicable sources of cross-lingual supervision by evaluating their performance on zero-shot Named Entity Recognition (NER), a structured prediction task, and Semantic Slot-filling (SF), a semantic labelling task, across eight language pairs. The motivation for choosing these tasks is two-fold: 1. Prior work has already studied alignment methods on sentence-level tasks; Cao et al. (2020) show the effectiveness of mBERT alignment methods on XNLI (2018). 2. Word-level tasks do not benefit from more pre-training, unlike other language tasks that improve simply by supplementing with more pre-training data. In experiments over the XTREME benchmark, Hu et al. (2020) find that transfer performance improves across all tasks when multilingual language models are pre-trained with more data, with the sole exception of word-level tasks. They note that this indicates current deep pre-trained models do not fully exploit the pre-training data to transfer to word-level tasks. We believe that NER and Slot-filling are strong candidate tasks to assess alignment methods due to the limited cross-lingual transfer capacity of current models on these tasks.
To the authors' knowledge, this is the first paper comparing alignment methods for contextual embedding spaces, rotation vs. fine-tuning alignment, and two sources of cross-lingual supervision, dictionary vs. parallel corpus, on a set of tasks of structural and semantic nature over a wide range of languages. From the results, we find that parallel corpora are better suited for aligning contextual embeddings. In addition, we find that rotation alignment is more robust for the primarily structural NER downstream task, while fine-tuning alignment improves performance across semantic SF tasks. We also propose a novel normalisation procedure which consistently improves rotation alignment, motivated by the structure of the mBERT space and how languages are distributed across it. Finally, we characterise the effect of language proximity on transfer improvement for these alignment methods.
Rotation-based Alignment

Mikolov et al. (2013) proposed to learn a linear transformation W_{s→t} which would project an embedding in the source language e_s to its translation in the target language space e_t, by minimising the distances between the projected source embeddings and their corresponding target embeddings:

W_{s→t} = argmin_W ||W X_s − X_t||²  (1)

X_s and X_t are matrices of size d×K, where d is the dimensionality of the embeddings and K is the number of parallel words from word-aligned corpora, or word pairs from a bilingual dictionary between the source and target languages. Further work (Xing et al., 2015) demonstrated that restricting W to a purely rotational transform improves cross-lingual transfer across similar languages. The orthogonality assumption reduces Eq. (1) to the so-called Procrustes problem with the closed-form solution:

W_{s→t} = U V^T  (2)

where

U Σ V^T = SVD(X_t X_s^T)  (3)

and the SVD operator stands for Singular Value Decomposition.
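As a concrete illustration of Eqs. (1)-(3), the closed-form Procrustes solution can be computed with an off-the-shelf SVD. The following NumPy sketch is ours, not the authors' implementation; it follows the d×K convention above, with columns of X_s and X_t holding paired embeddings:

```python
import numpy as np

def procrustes_rotation(X_s: np.ndarray, X_t: np.ndarray) -> np.ndarray:
    """Solve the orthogonal Procrustes problem: the rotation W minimising
    ||W X_s - X_t||_F subject to W^T W = I (Eqs. 2-3).

    X_s, X_t: d x K matrices whose columns are embeddings of word
    translation pairs (source and target respectively).
    """
    # SVD of the cross-covariance between target and source embeddings
    U, _, Vt = np.linalg.svd(X_t @ X_s.T)
    return U @ Vt  # maps source embeddings into the target space

# Toy check: recover a known orthogonal map from noise-free pairs
rng = np.random.default_rng(0)
d, K = 4, 50
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
X_s = rng.standard_normal((d, K))
X_t = Q @ X_s
W = procrustes_rotation(X_s, X_t)
print(np.allclose(W, Q))  # True: the learnt rotation matches the true one
```

In practice X_s and X_t come from mBERT embeddings rather than synthetic data, but the algebra is identical.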

Language Centering Normalization
A purely rotational transformation can align two embedding spaces only if the two spaces are roughly isometric and are distributed about the same mean. In case the two embedding distributions are not centered around the same mean, meaning the two spaces have little overlap and are shifted by a translation offset in the space, they cannot be aligned solely through rotation.
Since the linear transformation W_{s→t} derived from solving the Procrustes problem only rotates the vector space, it assumes the embeddings of the two languages are zero-centered. However, Libovický et al. (2019) observe that language distributions in mBERT have distinct and separable centroids, and different language families have well-separated sub-spaces in the mBERT embedding vector space. To address this discrepancy, we propose a new normalisation mechanism which entails:

Step 1. Normalising the embeddings of both languages so that they have zero mean:

X̂_s = X_s − X̄_s,   X̂_t = X_t − X̄_t  (4)

where X̄_s and X̄_t are the centroids of the source and target embeddings X_s and X_t, and X̂_s and X̂_t are the mean-centered source and target language embeddings whose entries correspond to word translations. Next, X̂_s and X̂_t are used to compute the transformation matrix Ŵ_{s→t} by solving Eq. (2) and Eq. (3).
Step 2. When training a downstream task, the embedding of a source language word e_s needs to be re-centered, rotated and finally translated to the target language subspace to derive the projection e_{t*}:

e_{t*} = Ŵ_{s→t}(e_s − X̄_s) + X̄_t  (5)

This helps the task-specific model, particularly in the zero-shot setting, by projecting the source language task data to the same locality as the target language.
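The two normalisation steps can be sketched as follows. This is an illustrative NumPy sketch under our own naming (columns of X_s and X_t are assumed to hold the paired embeddings), not the paper's implementation:

```python
import numpy as np

def center_and_align(X_s, X_t):
    """Step 1: mean-center both spaces, then learn the Procrustes
    rotation (Eqs. 2-3) on the centered embeddings."""
    mu_s = X_s.mean(axis=1, keepdims=True)  # source language centroid
    mu_t = X_t.mean(axis=1, keepdims=True)  # target language centroid
    U, _, Vt = np.linalg.svd((X_t - mu_t) @ (X_s - mu_s).T)
    return U @ Vt, mu_s.ravel(), mu_t.ravel()

def project(e_s, W, mu_s, mu_t):
    """Step 2: re-center, rotate, then translate into the target subspace."""
    return W @ (e_s - mu_s) + mu_t

# Toy check: a target space that is a rotated, shifted copy of the source
rng = np.random.default_rng(1)
d, K = 4, 60
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal map
offset = rng.standard_normal((d, 1))              # translation between spaces
X_s = rng.standard_normal((d, K))
X_t = Q @ (X_s - X_s.mean(axis=1, keepdims=True)) + offset
W, mu_s, mu_t = center_and_align(X_s, X_t)
print(np.allclose(project(X_s[:, 0], W, mu_s, mu_t), X_t[:, 0]))  # True
```

Note that without the centering step, no pure rotation could absorb the translation `offset` between the two toy spaces.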

Supervision Signals for Rotation Alignment
In this section we describe how existing work utilises two different cross-lingual signals, bilingual dictionaries and parallel corpora, to supervise rotation alignment. Additionally, we analyse the advantages and disadvantages of the two choices.

Bilingual Dictionary Supervision
In order to utilise a bilingual dictionary to supervise the embedding alignment, each word in the dictionary needs to have a single representation. However, the same word can have many representations in the contextualised language model vector space depending on the context it occurs in. Schuster et al. (2019b) observe that the contextual embeddings of the same word form a tight cluster, a word cloud, and that the centroid of this word cloud is distinct and separable for individual words. They further propose that the centroid of a word cloud can be considered as the context-independent representation of a word, called an average word anchor. These word anchors are computed by averaging embeddings over all occurrences of a word in a monolingual corpus, where words occur in a variety of contexts. Formally, the mBERT embedding of a source language word s_m in context c_h is denoted as e_{s_m,c_h}. If this word occurs a total of p times in the monolingual corpus, that is in contexts c_1, c_2, ..., c_p, the anchor embedding A_{s_m} for word s_m across all the contexts is the average:

A_{s_m} = (1/p) Σ_{h=1}^{p} e_{s_m,c_h}  (6)

Average word anchor pairs (A^i_{s_m}, A^i_{t_m*}), where i is the mBERT layer, for all word pairs (s_m, t_m*) from the dictionary form the rows of matrices X^i_s and X^i_t respectively, which are then used to solve Eq. (2) and Eq. (3), resulting in an alignment transformation matrix W^i_{s→t}. However, there are limitations to this approach. Prior work found that the word cloud of a multi-sense word, such as "bank", which can mean either a financial institution or the edge of a river depending on the context, is further composed of clearly separable clusters, one for every word sense. Averaging over multiple contextual embeddings implies losing a certain degree of contextual information for both the source and target language words. Figure 1a visualises word anchor calculation and also highlights this limitation.
On the other hand, one advantage of this method is that bilingual dictionaries are available even for very low-resource languages.
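Anchor computation (Eq. 6) amounts to averaging every contextual embedding of a word over the corpus. A minimal illustrative sketch (our own helper, with toy 2-d vectors standing in for mBERT embeddings):

```python
from collections import defaultdict
import numpy as np

def average_anchors(tokens, embeddings):
    """Average all contextual embeddings of each word (Eq. 6).
    tokens: one word string per occurrence in the corpus;
    embeddings: matching (n_occurrences, d) array of contextual vectors."""
    groups = defaultdict(list)
    for tok, emb in zip(tokens, embeddings):
        groups[tok].append(emb)
    return {tok: np.mean(vecs, axis=0) for tok, vecs in groups.items()}

# "bank" occurs in two contexts; its anchor is the mean of both vectors,
# collapsing the two senses into one point (the limitation noted above)
tokens = ["bank", "river", "bank"]
embs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
anchors = average_anchors(tokens, embs)
print(anchors["bank"])  # [2. 0.]
```

If the two occurrences of "bank" carried different senses, the anchor [2, 0] would sit between the sense clusters, which is exactly the loss of contextual information described above.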

Parallel Corpus Supervision
Word-aligned parallel sentences can be utilised as a source of cross-lingual signal to align contextual embeddings (Aldarmaki and Diab, 2019; Wang et al., 2019). Given a parallel corpus, s_m and t_m* are aligned source and target language words appearing in contexts c_h and c_h*, respectively. The parallel word embedding matrices X^i_s and X^i_t for mBERT layer i are composed from the contextual embeddings e^i_{s_m,c_h} and e^i_{t_m*,c_h*} respectively, and are used to solve Eq. (2) and Eq. (3) to derive an alignment transformation matrix W^i_{s→t}. Figures 1a and 1b illustrate how parallel supervision is better suited to aligning contextual embeddings than dictionary supervision, where multiple senses of a word are compressed into a single word anchor. However, parallel corpora rarely come with word-alignment annotations; these are often generated automatically by off-the-shelf tools such as fast_align (Dyer et al., 2013), which can be noisy. It is worth noting that the word alignment error rate of an off-the-shelf tool drops as the number of parallel sentences increases, therefore parallel corpus supervision is favourable for languages where more parallel data is available.
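Aligners such as fast_align emit, per sentence pair, a line of `i-j` links (source token index i aligned to target token index j). The helper below is our own illustrative parser for turning such a line into supervision word pairs; it is not part of the fast_align distribution:

```python
def aligned_pairs(src_tokens, tgt_tokens, alignment_line):
    """Parse one line of fast_align-style output, where each link "i-j"
    aligns source token i to target token j, into word pairs whose
    contextual embeddings form matching rows of X_s and X_t."""
    pairs = []
    for link in alignment_line.split():
        i, j = (int(x) for x in link.split("-"))
        pairs.append((src_tokens[i], tgt_tokens[j]))
    return pairs

print(aligned_pairs(["the", "blue", "house"],
                    ["das", "blaue", "Haus"],
                    "0-0 1-1 2-2"))
# [('the', 'das'), ('blue', 'blaue'), ('house', 'Haus')]
```

Each resulting pair keeps its sentence context, so each occurrence contributes its own contextual embedding rather than a sense-averaged anchor.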

Fine-tuning Alignment with Parallel Corpora
Rotation alignment rests on the strong assumption that the two language spaces (or sub-spaces in the case of mBERT) are approximately isometric (Søgaard et al., 2018). Patra et al. (2019) reported that the geometry of language embeddings becomes dissimilar for distant languages, and the isometry assumption degrades the alignment performance in such cases. In addition, as explained in Section 2.1, rotation alignment alone cannot achieve an effective mapping when two language spaces have separate centroids. Therefore, we next consider existing work that non-linearly aligns two language spaces. Cao et al. (2020) proposed to directly align languages within the mBERT model through fine-tuning. The objective of the fine-tuning is to minimise the distance between the two contextual representations of an aligned word pair in the parallel corpus:

L^i_align = Σ_m ||e^i_{s_m,c_h} − e^i_{t_m*,c_h*}||²  (7)

However, fine-tuning with only the above objective would lead to losing the semantic information in mBERT learnt during pre-training, since a trivial solution to Eq. (7) is simply to make all the embeddings equal. To deal with this, Cao et al. (2020) also proposed a regularisation loss that does not allow the embedding of a source language word to stray too far away from its original location ē^i_{s_m} in the pre-trained mBERT model, namely:

L^i_reg = Σ_m ||e^i_{s_m,c_h} − ē^i_{s_m,c_h}||²  (8)

Note that ē^i_{s_m,c_h} is generated from a copy of the original pre-trained mBERT model whose parameters are kept frozen. The alignment and regularisation losses are combined and jointly optimised in order to align the two language sub-spaces while maintaining the informativity of the embeddings:

L = Σ_{i=n_s}^{n_e} (L^i_align + L^i_reg)  (9)

Here n_s to n_e is the range of mBERT layers aligned.
We experimented with two variants of the fine-tuning approach: 1) moving the target language embeddings towards the source language while keeping the source embeddings approximately fixed through the regularisation term in Eq. (8); 2) moving the source language embeddings towards the target space while keeping the target language space relatively fixed, in which case the regularisation loss changes to:

L^i_reg = Σ_m ||e^i_{t_m*,c_h*} − ē^i_{t_m*,c_h*}||²  (10)
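The interaction of the two loss terms can be sketched numerically. This NumPy sketch is ours and assumes an unweighted sum of the per-layer losses, as in Eq. (9); in the actual method the embeddings are produced by mBERT and the gradients update its weights:

```python
import numpy as np

def alignment_loss(E_s, E_t):
    """Eq. (7): squared distances between contextual embeddings of
    aligned word pairs (matching rows of E_s and E_t)."""
    return float(np.sum((E_s - E_t) ** 2))

def regularisation_loss(E, E_frozen):
    """Eqs. (8)/(10): keep embeddings near their positions in a frozen
    copy of pre-trained mBERT. Which side is regularised (source or
    target) selects the tgt->src or src->tgt variant."""
    return float(np.sum((E - E_frozen) ** 2))

def layer_loss(E_s, E_t, E_reg, E_reg_frozen):
    """One summand of Eq. (9) for a single mBERT layer."""
    return alignment_loss(E_s, E_t) + regularisation_loss(E_reg, E_reg_frozen)

# The trivial solution: collapsing all embeddings to one point zeroes the
# alignment loss but inflates the regularisation term, which is exactly
# what rules it out during joint optimisation.
E_s = np.zeros((3, 4))
E_t = np.zeros((3, 4))
E_frozen = np.ones((3, 4))
print(alignment_loss(E_s, E_t))           # 0.0
print(regularisation_loss(E_s, E_frozen)) # 12.0
```

Minimising the sum therefore trades off bringing translation pairs together against drifting from the pre-trained representations.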

Experimental Setup
In this section, we first describe the resources and implementation details of the alignment methods, followed by the zero-shot NER and SF tasks used to evaluate the alignments. In addition, we briefly explain the datasets used in the experiments.

Learning Alignments
Our baseline model is a pre-trained mBERT* with 12 transformer layers, 12 attention heads and 768 hidden dimensions, denoted as mBERT Baseline. When a word is tokenised into multiple subwords by the tokeniser, we average the corresponding subword embeddings to obtain an embedding for the word. (Abdelali et al., 2014). We obtain the contextual and average anchor embeddings described in Section 2.2.1 by passing the corpora described above through pre-trained mBERT. We use the bilingual dictionaries provided with the MUSE framework as the source for dictionary supervision. As for the parallel corpus supervision, since none of the collected parallel sentences contains word-level alignment information, we utilise fast_align (Dyer et al., 2013) to automatically derive word alignment signals. For the rotation alignment, we compute four independent transformation matrices, one for each of the last four transformer layers, similar to Wang et al. (2019). We use RotateAlign and NormRotateAlign to refer to the rotation alignment learnt without and with the proposed language centering normalisation, respectively. To be consistent, for the fine-tuning alignment we align the word representations in the last four transformer layers of the mBERT model, denoted as FineTuneAlign.
* Available for download at: https://github.com/google-research/bert/blob/master/multilingual.md
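The subword-to-word pooling mentioned above can be sketched as follows (an illustrative helper of ours; the `word_ids` mapping from subwords to words is assumed to come from the tokeniser):

```python
import numpy as np

def pool_subwords(subword_embs, word_ids):
    """Average the subword embeddings that belong to the same word.
    subword_embs: (n_subwords, d) array of mBERT subword embeddings;
    word_ids: for each subword, the index of the word it belongs to."""
    n_words = max(word_ids) + 1
    out = np.zeros((n_words, subword_embs.shape[1]))
    counts = np.zeros(n_words)
    for wid, emb in zip(word_ids, subword_embs):
        out[wid] += emb
        counts[wid] += 1
    return out / counts[:, None]

# e.g. "playing" split into "play" + "##ing": its word vector is their mean
embs = np.array([[2.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
word_vecs = pool_subwords(embs, [0, 0, 1])
print(word_vecs[0])  # [1. 1.]
```

The resulting word-level vectors are what the alignment matrices and fine-tuning losses operate on.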

Evaluation of the Alignments
We evaluate the learnt alignments using two downstream tasks: Named Entity Recognition (NER) and Semantic Slot Filling (SF), both of which aim to predict a label for each token in a sentence. NER is a more structural task with fewer entity types and involves less semantic understanding of the context compared to SF. Examples of the tasks can be found in Table 2.
We use the same model architecture and hyper-parameters as Wang et al. (2019): two BiLSTM layers followed by a CRF layer, where the learning rate is set to 10^-4 for European languages and 10^-5 for the other languages, determined on the validation set. In order to measure the effectiveness of a learnt alignment, all experiments are conducted in a zero-shot setting similar to Wang et al. (2019), where the source language data is first transformed to the target language space and then used to train a BiLSTM-CRF model. The target language validation set is used for hyper-parameter tuning and for reporting the evaluation results. For each experiment we report F1 scores averaged over 5 runs.

NER and SF Datasets
We use the following four families of datasets, each of which shares the same set of labels within a family. A summary of the datasets can be found in Table 1; example utterances and annotations are shown in Table 2. CoNLL-NER: This includes the CoNLL 2002 and 2003 NER benchmark tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) containing entity annotations for news articles in English, German, Spanish and Dutch. We also include in this family PioNER† (Ghukasyan et al., 2018), a manually annotated dataset in Armenian, which is typologically different from the other languages in this family. In this dataset family, target language data is sourced from local news articles and is not generated through translation from source data. ATIS-SF: ATIS (Price, 1990) is an English dataset containing conversational queries about flight booking. Upadhyay et al. (2018) manually translated a subset of the data into two languages, Turkish and Hindi, along with crowdsourced phrase-level annotations.
† PioNER data only has PER, LOC and ORG labels and does not contain MISC.

Results and Analysis
The evaluation results of each alignment method on the downstream NER and SF tasks are reported in Table 3 and Figure 2. In addition to the mBERT Baseline and for comparison purposes, we also list relevant results from the literature (Wu and Dredze, 2019; Wang et al., 2019; Upadhyay et al., 2018; Schuster et al., 2019a; Bellomaria et al., 2019) that have been evaluated on the same datasets. However, in the case of Thai, which is a distant language from English, RotateAlign does not improve performance over the mBERT Baseline. This suggests that the embedding spaces of Thai and English are structurally dissimilar.

Rotation Alignment with vs. without Language Centering Normalisation
Applying the proposed language centering normalisation in Section 2.1 before performing the rotation alignment, namely NormRotateAlign in Table 3, is found to further improve downstream performance across all tasks and languages. The improvement over RotateAlign is up to 3% absolute F1 for Thai, around 1% absolute for moderately closer languages like Hindi, Turkish and Armenian, and around 0.5% absolute F1 for closer target languages such as German. Note that Thai, which does not benefit from rotation alignment alone, improves by an average of 2.3 points after applying the normalisation. These results corroborate that language families that are further away from each other have more separable sub-spaces in the mBERT Baseline, and bringing the language distributions closer helps the downstream task's performance.

Parallel Corpus vs. Dictionary Supervision
Amongst the cases where RotateAlign improves performance over the mBERT Baseline, parallel-corpus supervised RotateAlign is superior to dictionary supervision, with the exception of Hindi. This could be explained by the fact that word anchors collapse multiple word senses, so the cross-lingual signal is poorer compared to parallel word alignments. This is in line with observations from prior work.

Rotation vs. Fine-tuning Alignment
From Table 3 and Figure 2 we can see that FineTuneAlign, explained in Section 3, improves performance over RotateAlign for the semantic task (SF), with the only exception of ATIS-Hindi.
On the other hand, FineTuneAlign underperforms RotateAlign for the structural task (NER), and in some cases even falls behind the mBERT Baseline. We notice no clear trend between FineTuneAlign src→tgt and FineTuneAlign tgt→src.

Table 3 (residue): transfer pairs en→de, en→nl, en→es, en→hy (CoNLL-NER); en→hi, en→tk (ATIS-SF); en→es, en→th (FB-SF); en→it (SNIPS-SF), with baselines from the literature such as mBERT (Wu and Dredze, 2019).

For ATIS-Turkish

FineTuneAlign src→tgt improves over the best rotation alignment, NormRotateAlign parallel, by 7.8% absolute F1, from 38.18 to 45.98. It significantly outperforms the mBERT Baseline by 24 points.

For FB-Thai
FineTuneAlign src→tgt surpasses NormRotateAlign dict by 8.39% absolute F1, from 12.38 to 20.77, 11 points higher than the mBERT Baseline. For FB-Spanish we observe an improvement from 74.73 to 80.90 (6% absolute) compared to RotateAlign, and similarly +6 points compared to the mBERT Baseline. For SNIPS-Italian, FineTuneAlign improves performance over NormRotateAlign from 77.87 to 80.21 (2.5 points) and is 3.5 points better than the mBERT Baseline.
All SF tasks considered are generated by translation from the source language data. This may indicate that the fine-tuning approach performs better than rotation-based methods for translated datasets, where there is high correlation between utterance structure of training data in source language and evaluation data in target language. On the other hand, rotation-based alignments generalise better when the downstream target sentence distribution is dissimilar from the source sentence distribution, as is the case for non-translated NER tasks.

Aligned Source Language vs. Target Language Training
FineTuneAlign src→tgt achieves a top F1 score of 80.21 on the SNIPS-Italian dataset, which is not far from the score of 83 from a BERT-based model trained on 1,400 manually-annotated Italian utterances (2019). Also, our best alignment score of 80.90 for FB-Spanish (FineTuneAlign src→tgt) surpasses the translate-train baseline (2019a), where the annotations are automatically inferred from an NMT model. This suggests that for closer target languages, fine-tuning based alignment is not far behind unaligned models trained on additional target language labelled examples. The performance improvement from fine-tuning alignment on translated datasets should not be attributed to superficial transfer of entity information from the source language. Evidence supporting this claim is the strong performance on the SNIPS Italian-SF dataset, which has been translated from the SNIPS dataset (Bellomaria et al., 2019) with English entities replaced by Italian entities collected from the Web during dataset preparation. Therefore, during validation, the model encountered utterances with similar structure but different entities, which shows that the improvement from fine-tuning alignment is largely independent of language-specific entity memorisation.
Related Work

Aldarmaki and Diab (2019) propose to align ELMo embeddings (Peters et al., 2018) with word-level and sentence-level alignments. They compare the aligned ELMo with static character-level embeddings under similar alignments. Cao et al. (2020) originally proposed fine-tuning alignment of mBERT language sub-spaces. They claim these methods are strictly stronger than rotation alignment methods, based solely on zero-shot experimentation on the XNLI task, which is generated through translation from the source language. On the contrary, we observe that fine-tuning does not improve performance across all tasks, particularly structural tasks, where utterance structure changes and there is a higher incidence of domain shift. This raises the question whether translated datasets are biased towards fine-tuning alignment, and whether such datasets are a good evaluation test-bed for general cross-lingual transfer. Wang et al. (2019) apply rotational alignment to mBERT and report results on the CoNLL NER tasks; however, the main focus of their work is on the overlap of static bilingual embeddings, and they do not extend a similar analysis to contextualised embeddings. In our work, drawing from the observations made by Libovický et al. (2019) on the distribution of languages in the mBERT space, we propose a normalisation mechanism to increase the overlap of two language distributions prior to computing the rotational alignment. Schuster et al. (2019b) originally proposed dictionary supervision to align ELMo with a rotational transform. They claim supervision from a dictionary is superior to using parallel word-aligned corpora; however, they do not substantiate this through comparative experiments.

Figure 2: Trend of improvement from various alignment methods. Rotation alignment improves performance for NER, while fine-tuning alignment is found to be better for SF tasks. Improvements increase initially with distance between source and target languages and diminish for distant languages.
We observe that parallel corpus supervision is stronger than dictionary supervision, possibly because it enables contextual alignment.

Conclusion
In this paper, we investigate cross-lingual alignment methods for multilingual BERT. We empirically evaluate their effect on zero-shot transfer on downstream tasks of two types, structural NER and semantic Slot-filling, across a set of diverse languages. Specifically, we compare rotation alignment and fine-tuning cross-lingual alignment, and we compare the effect of dictionary and parallel corpora supervision across all tasks. We also propose a novel normalisation technique that improves state-of-the-art performance on zero-shot NER and Semantic Slot-filling downstream tasks, motivated by how languages are distributed across the mBERT space. Our experimental settings cover four dataset families (one for NER and three for SF) across eight language pairs. Key findings of this paper are as follows: (1) rotation-based alignments show large performance improvements (up to +19% absolute for Turkish ATIS-SF) for moderately close languages, only a small improvement for very close target languages and no improvement for very distant languages; (2) our proposed normalisation, which centers language distributions prior to learning rotation maps, is consistently shown to improve rotation alignment across all tasks, particularly for Thai, by up to 3% absolute; (3) rotational alignments are more robust and generalise well for structural tasks such as NER, which may have higher utterance variability and domain shift; (4) supervision from parallel corpora generally leads to better alignment than dictionary supervision, since it offers the possibility of generating contextualised alignments; (5) fine-tuning alignment improves performance for semantic tasks such as slot-filling, where the source language data has minimal shift in utterance structure or domain from the target language data, and particularly improves performance for extremely distant languages (up to +8.39% absolute for Thai FB-SF) compared to rotation alignment; (6) for close languages and tasks with similar utterance
structure, zero-shot fine-tuning alignment is competitive versus unaligned models trained on additional annotated data in target language.
This work aims to pave the way for optimising language transfer capability in contextual multilingual models. In the future, we would like to further investigate patterns in the embedding space and apply alignment methods to specific regions of the multilingual hyperspace to obtain alignments better tailored to individual language pairs. We would also like to evaluate the zero-shot capabilities of alignments when applied to other language tasks.