Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

This paper investigates the problem of learning cross-lingual representations in a contextual space. We propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient approach to generate cross-lingual contextualized word embeddings based on publicly available pre-trained BERT models (Devlin et al., 2018). In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages. We demonstrate the effectiveness of this approach on zero-shot cross-lingual transfer parsing. Experiments show that our embeddings substantially outperform the previous state-of-the-art that uses static embeddings. We further compare our approach with XLM (Lample and Conneau, 2019), a recently proposed cross-lingual language model trained with massive parallel data, and achieve highly competitive results.


Introduction
One of the most promising directions for cross-lingual dependency parsing, which also remains a challenge, is to bridge the gap of lexical features. Prior works (Xiao and Guo, 2014; Guo et al., 2015) have shown that cross-lingual word embeddings can significantly improve transfer performance compared to delexicalized models (McDonald et al., 2011, 2013). These cross-lingual word embeddings are static in the sense that they do not change with the context; in this paper, we refer to them as static as opposed to contextualized ones. Recently, contextualized word embeddings derived from large-scale pre-trained language models (McCann et al., 2017; Peters et al., 2017, 2018; Devlin et al., 2018) have demonstrated dramatic superiority over traditional static word embeddings, establishing new state-of-the-art results in various monolingual NLP tasks (Ilić et al., 2018; Schuster et al., 2018). The success has also been recognized in dependency parsing (Che et al., 2018). The great potential of these contextualized embeddings has inspired us to extend their power to cross-lingual scenarios. Our code is released at https://github.com/WangYuxuan93/CLBT.

Figure 1: A toy illustration of the method, where contextualized embeddings of the word canal from Spanish are transformed to the semantic space of English.
Several recent works learn contextualized cross-lingual embeddings by training cross-lingual language models from scratch with parallel data as supervision, and have been shown to be effective in several downstream tasks (Schuster et al., 2018; Mulcaire et al., 2019; Lample and Conneau, 2019). These methods are typically resource-demanding and time-consuming. In this paper, we propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient off-line approach that learns a linear transformation from contextual word alignments. With CLBT, contextualized embeddings from pre-trained BERT models in different languages are projected into a shared semantic space. The learned transformation is then applied on top of the BERT encodings of each sentence, which are in turn fed as input to a parser.
Our approach utilizes the semantic equivalence in word alignments, and thus is supposed to be word sense-preserving. Figure 1 illustrates our approach, where contextualized embeddings of the Spanish word "canal" are transformed to the corresponding semantic space in English.
Experiments on the Universal Dependencies (UD) treebanks (v2.2) (Nivre et al., 2018) show that our approach substantially outperforms previous models that use static cross-lingual embeddings, with an absolute gain of 2.91% in averaged LAS. We further compare to XLM (Lample and Conneau, 2019), a recently proposed large-scale cross-lingual language model. Results demonstrate that our approach requires significantly less training data, fewer computing resources and less training time than XLM, yet achieves highly competitive results.

Related Work
Static cross-lingual embedding learning methods can be roughly categorized as on-line and off-line methods. Typically, on-line approaches integrate monolingual and cross-lingual objectives to learn cross-lingual word embeddings in a joint manner (Klementiev et al., 2012;Kočiský et al., 2014;Guo et al., 2016), while off-line approaches take pretrained monolingual word embeddings of different languages as input and retrofit them into a shared semantic space (Xing et al., 2015;Lample et al., 2018;Chen and Cardie, 2018).
Several approaches have been proposed recently to connect the rich expressiveness of contextualized word embeddings with cross-lingual transfer. Mulcaire et al. (2019) based their model on ELMo (Peters et al., 2018) and proposed a polyglot contextual representation model by capturing character-level information from multilingual data. Lample and Conneau (2019) adapted the objectives of BERT (Devlin et al., 2018) to incorporate cross-lingual supervision from parallel data to learn cross-lingual language models (XLMs), which have obtained state-of-the-art results on several cross-lingual tasks. Similar to our approach, Schuster et al. (2019) also aligned pre-trained contextualized word embeddings through a linear transformation in an off-line fashion. They used the averaged contextualized embeddings as an anchor for each word type, and learned a transformation in the anchor space. Our approach, however, learns this transformation directly in the contextual space, and hence is explicitly designed to be word sense-preserving.

Cross-Lingual BERT Transformation
This section describes our proposed approach, namely CLBT, to transform pre-trained monolingual contextualized embeddings to a shared semantic space.

Contextual Word Alignment
Traditional methods of learning static cross-lingual word embeddings have relied on various sources of supervision such as bilingual dictionaries (Lazaridou et al., 2015; Smith et al., 2017), parallel corpora (Guo et al., 2015) or online Google Translate (Mikolov et al., 2013; Xing et al., 2015). To learn contextualized cross-lingual word embeddings, however, we require supervision at the word token level (or context level) rather than the type level (i.e. dictionaries). Therefore, we assume a parallel corpus as our supervision, analogous to on-line methods such as XLM (Lample and Conneau, 2019).
In our approach, unsupervised bidirectional word alignment is first applied to the parallel corpus to obtain a set of aligned word pairs with their contexts, or contextual word pairs for short. For one-to-many and many-to-one alignments, we use the left-most aligned word, such that all the resulting word pairs are one-to-one. In practice, since WordPiece embeddings (Wu et al., 2016) are used in BERT, all the parallel sentences are tokenized using BERT's wordpiece vocabulary before being aligned.
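The left-most selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the alignment links are assumed to arrive in the common `i-j` (Pharaoh) text format, the helper names `parse_pharaoh` and `leftmost_one_to_one` are hypothetical, and the order in which one-to-many and many-to-one links are resolved is our own choice, since the text does not specify it.

```python
def parse_pharaoh(line):
    """Parse 'i-j' alignment links (assumed Pharaoh format) into (src, tgt) index pairs."""
    return [tuple(int(k) for k in link.split("-")) for link in line.split()]


def leftmost_one_to_one(links):
    """Reduce arbitrary alignment links to one-to-one pairs.

    Step 1: a source token aligned to several target tokens keeps the
            left-most (smallest index) target.
    Step 2: a target token still aligned to several source tokens keeps the
            left-most source.
    The ordering of the two steps is an assumption made for this sketch.
    """
    best_tgt = {}
    for s, t in links:
        if s not in best_tgt or t < best_tgt[s]:
            best_tgt[s] = t
    best_src = {}
    for s, t in best_tgt.items():
        if t not in best_src or s < best_src[t]:
            best_src[t] = s
    return sorted((s, t) for t, s in best_src.items())


# Source token 1 is aligned to targets 1 and 2; target 3 is aligned to sources 2 and 3.
links = parse_pharaoh("0-0 1-1 1-2 2-3 3-3")
pairs = leftmost_one_to_one(links)
```

Each surviving pair indexes one wordpiece on each side of a sentence pair, so it can be used directly to look up the two contextualized embeddings that form a training example.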

Off-Line Transformation
Given a set of contextual word pairs, their BERT representations $\{x_i, y_i\}_{i=1}^{n}$ can be easily obtained from pre-trained BERT models, where $x_i \in \mathbb{R}^{d_1}$ is the contextualized embedding of token $i$ in the target language, and $y_i \in \mathbb{R}^{d_2}$ is the representation of its alignment in the source language.
In our experiments, a parser is trained on source language data and applied directly to all the target languages. Therefore, we propose to project the embeddings of target languages to the space of the source language, instead of the opposite direction. Specifically, we aim at finding an appropriate linear transformation $W$, such that $W x_i$ approximates $y_i$. This can be achieved by solving the following optimization problem:
$$\min_{W} \sum_{i=1}^{n} \lVert W x_i - y_i \rVert^2$$
Previous works on static cross-lingual embeddings have shown that an orthogonal $W$ (i.e. $W^\top W = I$) is helpful for the word translation task (Xing et al., 2015). In this case, an analytical solution can be found through singular value decomposition (SVD) of $Y^\top X$:
$$W = U V^\top, \quad \text{where } U \Sigma V^\top = \mathrm{SVD}(Y^\top X)$$
Here $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times d}$ are the contextualized embedding matrices, where $n$ is the number of aligned contextual word pairs and $d$ is the dimension of the monolingual contextualized embeddings. Each pair of rows $(x_i, y_i)$ indicates an aligned contextual word pair.
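As a minimal sketch of the orthogonal case, the snippet below computes W = U Vᵀ from the SVD of YᵀX with NumPy and checks it on synthetic data. The function name `fit_orthogonal_map`, the random data and the dimensions are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


def fit_orthogonal_map(X, Y):
    """Return the orthogonal W minimizing sum_i ||W x_i - y_i||^2.

    X, Y: arrays of shape (n, d); row i of X is x_i and row i of Y is y_i.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)  # Y^T X = U diag(s) V^T
    return U @ Vt                      # W = U V^T


# Synthetic check: generate Y from X with a known orthogonal map Q.
rng = np.random.default_rng(0)
n, d = 200, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
X = rng.normal(size=(n, d))
Y = X @ Q.T                                   # row-wise: y_i = Q x_i

W = fit_orthogonal_map(X, Y)
projected = X @ W.T                           # row-wise: W x_i
```

Because the rows of X and Y are embeddings, applying W to every token of a sentence is a single matrix product, `X @ W.T`.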
Although this can be computed on CPUs within several minutes, the memory requirement grows with the amount of training data. Therefore, we present an approximate solution, where W is optimized with gradient descent (GD) and is not constrained to be orthogonal. This GD-based approach can be trained on a single GPU and typically converges in several hours.
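The unconstrained variant can be illustrated with a small NumPy sketch. It uses plain full-batch gradient descent on the mean squared error, which is a simplification chosen for clarity; the learning rate, step count, initialization and the function name `fit_linear_map_gd` are all assumptions, and a practical version would process mini-batches so that memory does not grow with the number of word pairs.

```python
import numpy as np


def fit_linear_map_gd(X, Y, lr=0.1, steps=500):
    """Minimize (1/n) * sum_i ||W x_i - y_i||^2 by full-batch gradient descent.

    X: (n, d1) target-language embeddings; Y: (n, d2) source-language embeddings.
    W has shape (d2, d1) and is not constrained to be orthogonal.
    """
    n, d1 = X.shape
    d2 = Y.shape[1]
    W = np.eye(d2, d1)                   # identity-like start (an assumption)
    for _ in range(steps):
        residual = X @ W.T - Y           # (n, d2); row i is W x_i - y_i
        grad = 2.0 * residual.T @ X / n  # gradient of the objective w.r.t. W
        W -= lr * grad
    return W


# Synthetic check with a known, non-orthogonal map between different dimensions.
rng = np.random.default_rng(0)
n, d1, d2 = 500, 6, 4
W_true = rng.normal(size=(d2, d1))
X = rng.normal(size=(n, d1))
Y = X @ W_true.T

W = fit_linear_map_gd(X, Y)
```

Unlike the SVD solution, this formulation does not require the two embedding spaces to have the same dimension.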
To validate the effectiveness of our approach in cross-lingual dependency parsing, we first obtain the CLBT embeddings with the proposed approach, and then use them as input to a modern graph-based neural parser (described in the next section), in place of the pre-trained static embeddings. Note that BERT produces embeddings at the wordpiece level, so we only use the left-most wordpiece embedding of each word.
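Selecting the left-most wordpiece of each word can be sketched as below. The sketch assumes BERT's usual WordPiece convention, in which continuation pieces carry a `##` prefix, and assumes special tokens such as [CLS] and [SEP] have already been stripped; the function name and the placeholder vectors are hypothetical.

```python
def first_wordpiece_indices(wordpieces):
    """Return the index of the left-most wordpiece of every word.

    Assumes continuation pieces start with '##' and that special tokens
    have already been removed from the sequence.
    """
    return [i for i, piece in enumerate(wordpieces) if not piece.startswith("##")]


pieces = ["we", "enjoy", "play", "##ing", "chess"]
word_starts = first_wordpiece_indices(pieces)

# One placeholder vector per wordpiece; keep only the rows of word-initial pieces.
piece_vectors = [[float(i)] for i in range(len(pieces))]
word_vectors = [piece_vectors[i] for i in word_starts]
```

The resulting list has one vector per word, which matches the word-level input a dependency parser expects.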

Data and Settings
In our experiments, the contextual word pairs are obtained from the Europarl corpora (Koehn, 2005) using the fast_align toolkit (Dyer et al., 2010). Only 10,000 sentence pairs are used for each target language. For the parsing datasets, we use the Universal Dependencies (UD) Treebanks (v2.2) (Nivre et al., 2018), following the settings of the previous state-of-the-art system (Ahmad et al., 2018). From the 31 languages they have analyzed, we select 18 whose Europarl data is publicly available. Statistics of the selected languages and treebanks can be found in the Appendix. We employ the Biaffine Graph-based Parser of Dozat and Manning (2017) and adopt their hyper-parameters for all of our models.
In all the experiments, English is used as the source language, and the other 17 languages as targets. The model is trained on the English treebank and applied directly to target languages with the transformed contextualized embeddings. We train our models using the Adam optimizer (Kingma and Ba, 2015), and most of them converge within a few thousand epochs in several hours. More implementation details are reported in the Appendix.

Baseline Systems
We compare our method with the following three baseline models:
• mBERT (contextualized). Embeddings generated by the mBERT model are directly used in the training and testing procedures.
• FT-SVD (Ahmad et al., 2018, off-line, static). The system of Ahmad et al. (2018), which aligns pre-trained FastText embeddings (Bojanowski et al., 2017) off-line to obtain cross-lingual static embeddings, and which represents the previous state of the art. We report the results from their paper for the RNNGraph model, which uses the same architecture as ours.
• XLM (Lample and Conneau, 2019, on-line, contextualized). A strong method which learns contextualized cross-lingual embeddings from scratch with cross-lingual data.
For the XLM model, we employ the XNLI-15 model they released (github.com/facebookresearch/XLM) to generate embeddings and apply them to cross-lingual dependency parsing in the same way as we do with our own model. We compare with XLM on the 4 languages covered by both works.

Comparison with Off-Line Methods
Results on the test sets are shown in Table 1 (UAS results are listed in the Appendix due to space limits). Languages are grouped by language families. Note that since we have no access to the parsed files of the FT-SVD model, we only report statistical significance tests between our methods and the mBERT model, which is highly comparable to the FT-SVD model on average. Overall, our approach with either SVD or GD outperforms both FT-SVD and mBERT by a substantial margin (+2.91% in averaged LAS), and GD turns out to be slightly better than SVD in most of the languages. When combined with FT-SVD, performance can be further improved by 0.33% in LAS for the GD method and 0.51% for SVD (see the Appendix for more details). Interestingly, the mBERT model, which is trained without any cross-lingual supervision but uses a shared multilingual wordpiece vocabulary, works surprisingly well in some languages, especially in those linguistically close to English. Similar observations have also been reported in other works (Pires et al., 2019; Wu and Dredze, 2019).

Comparison with On-Line Methods
Table 2 compares our approach with the cross-lingual language model pre-training (XLM) method (Lample and Conneau, 2019) on the 4 languages covered by both works. CLBT outperforms XLM in 3 out of the 4 languages but scores lower on German (de). The amount of training data used by each method is shown at the bottom of the table: the number of parallel sentences used by XLM ranges from 0.2 million (10 million tokens) for Bulgarian to 13.1 million (682 million tokens) for French. In comparison, only 10,000 parallel sentences (0.4 million tokens) are used for each language in CLBT, demonstrating the data efficiency of our approach. Moreover, given the efficiency in both data and training, CLBT can be readily scaled to new language pairs in hours.

Transformation of Cross-lingual BERT Embedding
In order to investigate the properties of contextualized representations before and after the linear transformation, we employ the SENSEVAL2 data (Edmonds and Cotton, 2001), where words from different languages are tagged by their word senses in different contexts. We take contextualized representations of the English word nature and its Spanish translation naturaleza in different contexts from pre-trained English and multilingual BERT respectively, and visualize their distributions in Figure 2(a), where we can observe obvious clustering of word senses. Specifically, words with sense nature-1 and naturaleza-1 mean the physical world, whereas nature-2 and naturaleza-2 mean inherent features. We then apply our GD-based method to the embeddings of naturaleza and depict the resulting cross-lingual embeddings in Figure 2(b). The distance between embeddings from English and Spanish is effectively reduced after the transformation. Moreover, embeddings of Spanish words move closer to those of English words with similar meanings, which indicates the effectiveness of our approach.

Effect of Training Data Size
We select several languages from each language family, and investigate the effect of the amount of training data on the performance of zero-shot cross-lingual dependency parsing. Specifically, we take the SVD-based approach, since it is faster than the GD-based one, and train transformation models with different amounts of parallel sentences from the Europarl dataset on each of the 13 selected languages. As shown in Figure 3, for most of the languages, the best performance is achieved with only 5000 parallel sentences. It is also worth noting that for most Germanic (e.g. German, Danish, Swedish and Dutch) and Romance (e.g. French, Italian, Spanish and Romanian) languages, which are typologically closer to English, a rather small training set of merely 100 sentences is capable of achieving comparable results.

Conclusion
We propose the Cross-Lingual BERT Transformation (CLBT) approach for contextualized cross-lingual embedding learning, which substantially outperforms the previous state of the art in zero-shot cross-lingual dependency parsing. By exploiting publicly available pre-trained BERT models, our approach provides a fast and data-efficient solution to learning cross-lingual contextualized embeddings. Compared to XLM, our method requires much less parallel data and less training time, yet achieves comparable performance.
For future work, we are interested in unsupervised cross-lingual alignment, inspired by prior success on static embeddings (Lample et al., 2018; Alvarez-Melis and Jaakkola, 2018), which demands a deeper understanding of the geometry of the multilingual contextualized embedding space.