Low-Resource Parsing with Crosslingual Contextualized Representations

Despite advances in dependency parsing, languages with small treebanks still present challenges. We assess recent approaches to multilingual contextual word representations (CWRs), and compare them for crosslingual transfer from a language with a large treebank to a language with a small or nonexistent treebank, by sharing parameters between languages in the parser itself. We experiment with a diverse selection of languages in both simulated and truly low-resource scenarios, and show that multilingual CWRs greatly facilitate low-resource dependency parsing even without crosslingual supervision such as dictionaries or parallel text. Furthermore, we examine the non-contextual part of the learned language models (which we call a “decontextual probe”) to demonstrate that polyglot language models better encode crosslingual lexical correspondence compared to aligned monolingual language models. This analysis provides further evidence that polyglot training is an effective approach to crosslingual transfer.


Introduction
Dependency parsing has achieved new states of the art using distributed word representations in neural networks, trained with large amounts of annotated data (Ma et al., 2018; Che et al., 2018). However, many languages are low-resource, with small or no treebanks, which presents a severe challenge in developing accurate parsing systems in those languages. One way to address this problem is with a crosslingual solution that makes use of a language with a large treebank and raw text in both languages. The hypothesis behind this approach is that, although each language is unique, different languages manifest similar characteristics (e.g., morphological, lexical, syntactic) which can be exploited by training a single polyglot model with data from multiple languages (Ammar, 2016).
* Equal contribution. Random order.
Recent work has extended contextual word representations (CWRs) multilingually either by training a polyglot language model (LM) on a mixture of data from multiple languages (joint training approach; Mulcaire et al., 2019; Lample and Conneau, 2019) or by aligning multiple monolingual language models crosslingually (retrofitting approach; Schuster et al., 2019; Aldarmaki and Diab, 2019). These multilingual representations have been shown to facilitate crosslingual transfer on several tasks, including Universal Dependencies parsing and natural language inference. In this work, we assess these two types of methods by using them for low-resource dependency parsing, and discover that the joint training approach substantially outperforms the retrofitting approach. We further apply multilingual CWRs produced by the joint training approach to diverse languages, and show that the approach remains effective in transfer between distant languages, though we find that phylogenetically related source languages are generally more helpful.
We hypothesize that joint polyglot training is more successful than retrofitting because it induces a degree of lexical correspondence between languages that the linear transformation used in retrofitting methods cannot capture. To test this hypothesis, we design a decontextual probe. We decontextualize CWRs into non-contextual word vectors that retain much of CWRs' task-performance benefit, and evaluate the crosslingual transferability of language models via word translation. In our decontextualization framework, we use a single LSTM cell without recurrence to obtain a context-independent vector, thereby allowing for a direct probe into the LSTM networks independent of a particular corpus. We show that decontextualized vectors from the joint training approach yield representations that score higher on a word translation task than the retrofitting approach or word type vectors such as fastText (Bojanowski et al., 2017). This finding provides evidence that polyglot language models encode crosslingual similarity, specifically crosslingual lexical correspondence, that a linear alignment between monolingual language models does not.

Models
We examine crosslingual solutions to low-resource dependency parsing, which make crucial use of multilingual CWRs. All models are implemented in AllenNLP, version 0.7.2, and the hyperparameters and training details are given in the appendix.

Multilingual CWRs
Prior methods to produce multilingual contextual word representations (CWRs) can be categorized into two major classes, which we call joint training and retrofitting. 1 The joint training approach trains a single polyglot language model (LM) on a mixture of texts in multiple languages (Mulcaire et al., 2019; Lample and Conneau, 2019; Devlin et al., 2019), 2 while the retrofitting approach trains separate LMs on each language and aligns the learned representations later (Schuster et al., 2019; Aldarmaki and Diab, 2019). We compare example approaches from these two classes using the same LM training data, and discover that the joint training approach generally yields better performance in low-resource dependency parsing, even without crosslingual supervision.
1 This term was originally used by Faruqui et al. (2015) to describe updates to word vectors, after estimating them from corpora, using semantic lexicons. We generalize it to capture the notion of a separate update to fit something other than the original data, applied after conventional training.
2 Multilingual BERT is documented in https://github.com/google-research/bert/blob/master/multilingual.md.

Retrofitting Approach
Following Schuster et al. (2019), we first train a bidirectional LM with two-layer LSTMs on top of character CNNs for each language (ELMo), and then align the monolingual LMs across languages. Denote the hidden state in the $j$th layer for word $i$ in context $c$ by $h^{(j)}_{i,c}$. We use a trainable weighted average of the three layers (character CNN and two LSTM layers) to compute the contextual representation $e_{i,c}$ for the word: $e_{i,c} = \sum_{j=0}^{2} \lambda_j h^{(j)}_{i,c}$. 3 In the first step, we compute an "anchor" $h^{(j)}_{i}$ for each word by averaging $h^{(j)}_{i,c}$ over all occurrences in an LM corpus. We then apply a standard dictionary-based technique 4 to create multilingual word embeddings (Mikolov et al., 2013; Conneau et al., 2018). In particular, suppose that we have a word-translation dictionary from source language $s$ to target language $t$. Let $H_s^{(j)}$ and $H_t^{(j)}$ be matrices whose columns are the anchors in the $j$th layer for the source and corresponding target words in the dictionary. For each layer $j$, find the linear transformation $W^{*(j)}$ such that
$$W^{*(j)} = \arg\min_{W} \left\| W H_s^{(j)} - H_t^{(j)} \right\|_F .$$
The linear transformations are then used to map the LM hidden states for the source language to the target LM space. Specifically, contextual representations for the source and target languages are computed by $\sum_{j=0}^{2} \lambda_j W^{*(j)} h^{(j)}_{i,c}$ and $\sum_{j=0}^{2} \lambda_j h^{(j)}_{i,c}$, respectively. We use publicly available dictionaries from Conneau et al. (2018) 5 and align all languages to the English LM space, again following Schuster et al. (2019).
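The dictionary-based alignment described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fit_alignment`, `fit_orthogonal_alignment`, `mix_layers`) are hypothetical, the anchors are random toy data, and the layer weights are fixed by hand rather than trained. The first function solves the unconstrained least-squares problem; the second shows the orthogonal Procrustes variant that is also common in the word-embedding alignment literature. Which of the two the paper used is not assumed here.

```python
import numpy as np

def fit_alignment(H_s, H_t):
    """Least-squares W minimizing ||W H_s - H_t||_F; columns are anchor vectors."""
    # W H_s = H_t  <=>  H_s^T W^T = H_t^T, which lstsq solves column by column.
    W_T, *_ = np.linalg.lstsq(H_s.T, H_t.T, rcond=None)
    return W_T.T

def fit_orthogonal_alignment(H_s, H_t):
    """Orthogonal Procrustes variant: same objective, W constrained to be orthogonal."""
    U, _, Vt = np.linalg.svd(H_t @ H_s.T)
    return U @ Vt

def mix_layers(layers, weights, maps=None):
    """Weighted sum over layers, optionally mapping each layer with its own W first."""
    if maps is None:
        maps = [np.eye(h.shape[0]) for h in layers]
    return sum(w * (W @ h) for w, W, h in zip(weights, maps, layers))

# Toy data (assumption): 40 dictionary pairs of 5-dimensional anchors for one layer,
# where the target anchors are an exact rotation Q of the source anchors.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
H_s = rng.normal(size=(5, 40))
H_t = Q @ H_s

W_ls = fit_alignment(H_s, H_t)
W_orth = fit_orthogonal_alignment(H_s, H_t)

# Source-side representation: map each layer, then mix. In practice each layer
# has its own W; the toy reuses one map for all three layers.
lam = [0.2, 0.3, 0.5]
h = [rng.normal(size=5) for _ in range(3)]
e_src = mix_layers(h, lam, maps=[W_ls] * 3)
print(np.allclose(W_ls, Q), np.allclose(W_orth, Q), e_src.shape)
```

On noiseless toy data both solvers recover the same map; on real anchors they generally differ, since only the Procrustes solution preserves norms and angles.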

Joint Training Approach
Another approach to multilingual CWRs is to train a single LM on multiple languages (Tsvetkov et al., 2016; Ragni et al., 2016; Östling and Tiedemann, 2017). We train a single bidirectional LM with character CNNs and two-layer LSTMs on multiple languages (Rosita, Mulcaire et al., 2019). We then use the polyglot LM to provide contextual representations. Similarly to the retrofitting approach, we represent word $i$ in context $c$ as a trainable weighted average of the hidden states in the trained polyglot LM: $\sum_{j=0}^{2} \lambda_j h^{(j)}_{i,c}$. In contrast to retrofitting, crosslinguality is learned implicitly by sharing all network parameters during LM training; no crosslingual dictionaries are used.
3 Schuster et al. (2019) only used the first LSTM layer, but we found a performance benefit from using all layers in preliminary results.
4 Conneau et al. (2018) developed an unsupervised alignment technique that does not require a dictionary. We found that their unsupervised alignment yielded substantially degraded performance in downstream parsing, in line with the findings of Schuster et al. (2019).
5 https://github.com/facebookresearch/MUSE#ground-truth-bilingual-dictionaries

Refinement after Joint Training
It is possible to combine the two approaches above; the alignment procedure used in the retrofitting approach can serve as a refinement step on top of an already-polyglot language model. We will see only a limited gain in parsing performance from this refinement in our experiments, suggesting that polyglot LMs are already producing high-quality multilingual CWRs even without crosslingual dictionary supervision.

FastText Baseline
We also compare the multilingual CWRs to a subword-based, non-contextual word embedding baseline. We train 300-dimensional word vectors on the same LM data using the fastText method (Bojanowski et al., 2017), and use the same bilingual dictionaries to align them (Conneau et al., 2018).

Dependency Parsers
We train polyglot parsers for multiple languages on top of multilingual CWRs. All parser parameters are shared between the source and target languages. Prior work suggests that sharing parameters between languages can alleviate the low-resource problem in syntactic parsing, but those experiments were limited to (relatively similar) European languages. Mulcaire et al. (2019) also include experiments with dependency parsing using polyglot contextual representations between two language pairs (English/Chinese and English/Arabic), but focus on high-resource tasks. Here we explore a wider range of languages, and analyze the particular efficacy of a crosslingual approach to dependency parsing in a low-resource setting.
We use a strong graph-based dependency parser with BiLSTM and biaffine attention, which is also used in related work (Schuster et al., 2019; Mulcaire et al., 2019). Crucially, our parser only takes as input word representations. Universal parts of speech have been shown useful for low-resource dependency parsing (Duong et al., 2015; Ahmad et al., 2019), but many realistic low-resource scenarios lack reliable part-of-speech taggers; here, we do not use parts of speech as input, and thus avoid the error-prone part-of-speech tagging pipeline. For the fastText baseline, word embeddings are not updated during training, to preserve crosslingual alignment.

Experiments
We first conduct a set of experiments to assess the efficacy of multilingual CWRs for low-resource dependency parsing.

Zero-Target Dependency Parsing
Following prior work on low-resource dependency parsing and crosslingual transfer (Zhang and Barzilay, 2015; Guo et al., 2015; Schuster et al., 2019), we conduct multi-source experiments on six languages (German, Spanish, French, Italian, Portuguese, and Swedish) from the Google universal dependency treebank version 2.0. We train language models on the six languages and English to produce multilingual CWRs. For each tested language, we train a polyglot parser with the multilingual CWRs on the five other languages and English, and apply the parser to the test data for the target language. Importantly, the parsing annotation scheme is shared among the seven languages. Our results will show that the joint training approach for CWRs substantially outperforms the retrofitting approach.

Diverse Low-Resource Parsing
The previous experiment compares the joint training and retrofitting approaches in low-resource dependency parsing only for relatively similar languages. In order to study its effectiveness more extensively, we apply the joint training approach to a more typologically diverse set of languages. We use five pairs of languages for "low-resource simulations," in which we reduce the size of a large treebank, and four languages for "true low-resource experiments," where only small UD treebanks are available, allowing us to compare to other work in the low-resource condition (Table 1). Following de Lhoneux et al. (2018), we selected these language pairs to represent linguistic diversity. For each target language, we produce multilingual CWRs by training a polyglot language model with its related language (e.g., Arabic and Hebrew) as well as English (e.g., Arabic and English). We then train a polyglot dependency parser on each language pair and assess the crosslingual transfer in terms of target parsing accuracy. Each pair of related languages shares features like word order, morphology, or script. For example, Arabic and Hebrew are similar in their rich transfixing morphology, and Dutch and German share most of their word order features. We chose Chinese and Japanese as an example of a language pair which does not share a language family but does share characters.
We chose Hungarian, Vietnamese, Uyghur, and Kazakh as true low-resource target languages because they had comparatively small amounts of annotated text in the UD corpus (Vietnamese: 1,400 sentences, 20,285 tokens; Hungarian: 910 sentences, 20,166 tokens; Uyghur: 1,656 sentences, 19,262 tokens; Kazakh: 31 sentences, 529 tokens), yet had convenient sources of text for LM pretraining (Zeman et al., 2018). 7 Other small treebanks exist, but in most cases another larger treebank exists for the same language, making domain adaptation a more likely option than crosslingual transfer. Also, recent work (Che et al., 2018) using contextual embeddings was top-ranked for most of these languages in the CoNLL 2018 shared task on UD parsing (Zeman et al., 2018). 8
7 The one exception is Uyghur, where we only have 3M words in the raw LM data from Zeman et al. (2018).
8 In Kazakh, Che et al. (2018) did not use CWRs due to the extremely small treebank size.
We use the same Universal Dependencies (UD) treebanks and train/development/test splits as the CoNLL 2018 shared task (Zeman et al., 2018). The annotation scheme is again shared across languages, which facilitates crosslingual transfer. For each triple of two related languages and English, we downsample training and development data to match the language with the smallest treebank size. This allows for fairer comparisons because within each triple, the source language for any parser will have the same amount of training data. We further downsample sentences from the target train/development data to simulate low-resource scenarios. The ratio of training and development data is kept 5:1 throughout the simulations, and we denote the number of sentences in training data by $|D_\tau|$. For testing, we use the CoNLL 2018 script on the gold word segmentations. For the truly low-resource languages, we also present results with word segmentations from the system outputs of Che et al. (2018).
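The downsampling procedure described above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the language labels, the use of uniform random sampling without replacement, and the integer stand-ins for sentences. The text does not specify how sentences were selected, only the resulting sizes and the 5:1 train/development ratio.

```python
import random

def downsample(sentences, n, seed=0):
    """Return n sentences sampled without replacement (all of them if fewer exist)."""
    rng = random.Random(seed)
    if n >= len(sentences):
        return list(sentences)
    return rng.sample(sentences, n)

def match_smallest(treebanks, seed=0):
    """Downsample every treebank in a triple to the size of the smallest one."""
    n = min(len(sents) for sents in treebanks.values())
    return {lang: downsample(sents, n, seed) for lang, sents in treebanks.items()}

def simulate_low_resource(train, dev, n_train, ratio=5, seed=0):
    """Downsample target train/dev data, keeping a train:dev ratio of ratio:1."""
    n_dev = max(1, n_train // ratio)
    return downsample(train, n_train, seed), downsample(dev, n_dev, seed + 1)

# Toy usage: integers stand in for sentences; labels and sizes are hypothetical.
triple = {"ara": list(range(600)), "heb": list(range(500)), "eng": list(range(1200))}
matched = match_smallest(triple)
small_train, small_dev = simulate_low_resource(
    matched["ara"], list(range(200)), n_train=100
)
print({lang: len(s) for lang, s in matched.items()}, len(small_train), len(small_dev))
```

In the setup described in the text, the matching step would be applied to the training and development splits separately before the target language is reduced further.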

Results and Discussion
In this section we describe the results of the various parsing experiments.

Zero-Target Parsing
Table 2 shows results on zero-target dependency parsing. First, we see that all CWRs greatly improve upon the fastText baseline. The joint training approach (Rosita), which uses no dictionaries, consistently outperforms the dictionary-dependent retrofitting approach (ELMos+Alignment). As discussed in the previous section, we can apply the alignment method to refine the already-polyglot Rosita using dictionaries. However, we observe a relatively limited gain in overall performance (74.5 vs. 73.9 LAS points), suggesting that Rosita (polyglot language model) is already developing useful multilingual CWRs for parsing without crosslingual supervision. Note that the degraded overall performance of our ELMo+Alignment compared to Schuster et al. [...] (Conneau et al., 2018). The absence of a dictionary yields much worse performance (69.2 vs. 73.1) in contrast with the joint training approach of Rosita, which also does not use a dictionary (73.9). We also present results using gold universal part of speech to compare to previous work in Table 3. We again see Rosita's effectiveness and a marginal benefit from refinement with dictionaries. It should also be noted that the reported results for French, Italian and German in Schuster et al. (2019) outperform all results from our controlled comparison; this may be due to the use of abundant LM training data. Nevertheless, joint training, with or without refinement, performs best on average in both gold and predicted POS settings.

Diverse Low-Resource Parsing
Low-Resource Simulations
Figure 1 shows simulated low-resource results. 11 Of greatest interest are the substantial improvements over monolingual parsers when adding English or related-language data. This improvement is consistent across languages and suggests that crosslingual transfer is a viable solution for a wide range of languages, even when (as in our case) language-specific tuning or annotated resources like parallel corpora or bilingual dictionaries are not available. See Figure 2 for a visualization of the differences in performance with varying training size. The polyglot advantage is minor when the target language treebank is large, but dramatic in the condition where the target language has only 100 sentences. The fastText approaches consistently underperform the language model approaches, but show the same pattern. In addition, related-language polyglot ("+rel.") outperforms English polyglot in most cases in the low-resource condition. The exceptions to this pattern are Italian (whose treebank is of a different genre from the Spanish one), and Japanese and Chinese, which differ significantly in morphology and word order. The CMN/JPN result suggests that such typological features influence the degree of crosslingual transfer more than orthographic properties like shared characters. This result in crosslingual transfer also mirrors the observation from prior work (Gerz et al., 2018) that typological features of the language are predictive of monolingual LM performance. The related-language improvement also vanishes in the full-data condition (Figure 2), implying that the importance of shared linguistic features can be overcome with sufficient annotated data.
11 A table with full details including different size simulations is provided in the appendix.
It is also noteworthy that variations in word order, such as the order of adjective and noun, do not appear to affect performance: Italian, Arabic, and others use a noun-adjective order while English uses an adjective-noun order, but their +ENG and +rel. results are comparable. The Croatian and Russian results are notable because of shared heritage but different scripts. Though Croatian uses the Latin alphabet and Russian uses Cyrillic, transfer in HRV+RUS is clearly more effective than in HRV+ENG (82.00 vs. 79.21 LAS points when $|D_\tau| = 100$). This suggests that character-based LMs can implicitly learn to transliterate between related languages with different scripts, even without parallel supervision.

Truly Low-Resource Languages
Finally we present "true low-resource" experiments for four languages in which little UD data is available (see Section 3.2). [...] in Hungarian, Vietnamese, and Kazakh. Consistent with our simulations, we see that training parsers with the target's related language is more effective than with the more distant language, English. It is particularly noteworthy that the Rosita models, which do not use a parallel corpus or dictionary, dramatically improve over the best previously reported result from Schuster et al. (2019) [...] (2018) derived from parallel text. This result further corroborates our finding that the joint training approach to multilingual CWRs is more effective than retrofitting monolingual LMs.

Comparison to Multilingual BERT Embeddings
We also evaluate the diverse low-resource language pairs using pretrained multilingual BERT (Devlin et al., 2019) as text embeddings (Figure 3). Here, the same language model (multilingual cased BERT, 12 covering 104 languages) is used for all parsers, with the only variation being in the training treebanks provided to each parser. Parsers are trained using the same hyperparameters and data as in Section 3.2. 13 There are two critical differences from our previous experiments: multilingual BERT is trained on much larger amounts of Wikipedia data compared to other LMs used in this work, and the WordPiece vocabulary (Wu et al., 2016) used in the cased multilingual BERT model has been shown to have a distribution skewed toward Latin alphabets (Ács, 2019). These results are thus not directly comparable to those in Figure 1; nevertheless, it is interesting to see that the results obtained with ELMo-like LMs are comparable to and in some cases better than results using a BERT model trained on over a hundred languages. Our results broadly fit with those of Pires et al. (2019), who found that multilingual BERT was useful for zero-shot crosslingual syntactic transfer. In particular, we find nearly no performance benefit from cross-script transfer using BERT in a language pair (English-Japanese) for which they reported poor performance in zero-shot transfer, contrary to our results using Rosita (Section 4.2).
12 Available at https://github.com/google-research/bert/
13 AllenNLP version 0.9.0 was used for these experiments.
Figure 3: LAS for UD parsing results in a simulated low-resource setting ($|D_\tau| = 100$) using multilingual BERT embeddings in place of Rosita. Cf. Figure 1. Legend: mono., +ENG, +rel.

Decontextual Probe
The previous section showed that joint polyglot training for multilingual CWRs is more successful than the retrofitting approach. We hypothesize that CWRs from joint training provide useful representations for parsers by inducing nonlinear similarity in the vector spaces of different languages that we cannot retrieve with a simple alignment of monolingual pretrained language models. In order to test this hypothesis, we conduct a decontextual probe consisting of two steps. The decontextualization step effectively distills CWRs into word type vectors, where each unique word is mapped to exactly one embedding regardless of the context. We then conduct linear transformation-based word translation (Mikolov et al., 2013) on the decontextualized vectors to quantify the degree of crosslingual similarity in the multilingual CWRs.

Decontextualization
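Recall from Section 2 that we produce CWRs from bidirectional LMs with character CNNs and two-layer LSTMs. We propose a method to remove the dependence on context $c$ for the two LSTM layers (the CNN layer is already context-independent by design). During LM training, the hidden states of each layer $h_t$ are computed by the standard LSTM equations, where $x_t$ is the input to the layer, $\sigma$ is the logistic sigmoid, and $\odot$ is the elementwise product:
$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
We produce contextless vectors from pretrained LMs by removing recursion in the computation (i.e., setting $h_{t-1}$ and $c_{t-1}$ to $0$):
$$\begin{aligned}
i_t &= \sigma(W_i x_t + b_i) \\
o_t &= \sigma(W_o x_t + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + b_c) \\
c_t &= i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
The forget gate multiplies $c_{t-1} = 0$ and therefore drops out. This method is fast to compute, as it does not require recurrent computation and only needs to see each word once. This way, each word is associated with a set of exactly three vectors from the three layers.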
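As a minimal sketch of the idea, the snippet below implements one textbook LSTM step and its decontextualized counterpart in NumPy, then checks that the two agree when the previous hidden and cell states are zero. The parameter names (`W_i`, `U_i`, `b_i`, and so on), the dimensions, and the random weights are assumptions for illustration; a pretrained ELMo-style LM may include details such as projection layers that this sketch omits.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One standard LSTM step; p maps names to weight matrices and bias vectors."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev + p["b_o"])
    g = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev + p["b_c"])
    c = f * c_prev + i * g
    return o * np.tanh(c), c

def decontextualized_step(x, p):
    """Same step with h_prev = c_prev = 0: the U terms and the forget gate vanish."""
    i = sigmoid(p["W_i"] @ x + p["b_i"])
    o = sigmoid(p["W_o"] @ x + p["b_o"])
    g = np.tanh(p["W_c"] @ x + p["b_c"])
    return o * np.tanh(i * g)

# Hypothetical toy parameters: input size 4, hidden size 3.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {}
for gate in "ifoc":
    params[f"W_{gate}"] = rng.normal(size=(d_h, d_in))
    params[f"U_{gate}"] = rng.normal(size=(d_h, d_h))
    params[f"b_{gate}"] = rng.normal(size=d_h)

x = rng.normal(size=d_in)
h_full, _ = lstm_step(x, np.zeros(d_h), np.zeros(d_h), params)
h_decon = decontextualized_step(x, params)
print(np.allclose(h_full, h_decon))
```

Because no state is carried between words, the decontextualized step can be applied to each vocabulary item once, which is consistent with the text's remark that the method only needs to see each word once.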

Performance of decontextualized vectors
We perform a brief experiment to find what information is successfully retained by the decontextualized vectors, by using them as inputs to three tasks (in a monolingual English setting, for simplicity). For Universal Dependencies (UD) parsing, semantic role labeling (SRL), and named entity recognition (NER), we use the standard train/development/test splits from UD English EWT (Zeman et al., 2018) and OntoNotes (Pradhan et al., 2013). Following Mulcaire et al. (2019), we use strong existing neural models for each task, including those of He et al. (2017) for SRL and Peters et al. (2017) for NER. Table 5 compares the decontextualized vectors with the original CWRs (ELMo) and the conventional word type vectors, GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017). In all three tasks, the decontextualized vectors substantially improve over fastText and GloVe vectors, and perform nearly on par with contextual ELMo. This suggests that while part of the advantage of CWRs is in the incorporation of context, they also benefit from rich context-independent representations present in deeper networks.
Table 6: Crosslingual alignment results (precision at 1) from decontextual probe. Layers 0, 1, and 2 denote the character CNN, first LSTM, and second LSTM layers in the language models respectively.

Word Translation Test
Given the decontextualized vectors from each layer of the bidirectional language models, we can measure the crosslingual lexical correspondence in the multilingual CWRs by performing word translation. Concretely, suppose that we have training and evaluation word translation pairs from the source to the target language. Using the same word alignment objective as discussed in Section 2.1, we find a linear transform by aligning the decontextualized vectors for the training source-target word pairs. Then, we apply this linear transform to the decontextualized vector for each source word in the evaluation pairs. The closest target vector is found using the cross-domain similarity local scaling (CSLS) measure (Conneau et al., 2018), which is designed to remedy the hubness problem (where a few "hub" points are nearest neighbors to many other points each) in word translation by normalizing the cosine similarity according to the degree of hubness.
We again take the dictionaries from Conneau et al. (2018) with the given train/test split, and always use English as the target language. For each language, we take all words that appear three times or more in our LM training data and compute decontextualized vectors for them. Word translation is evaluated by choosing the closest vector among the English decontextualized vectors.
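The retrieval step can be sketched as follows. CSLS scores a mapped source vector $x$ against a target vector $y$ as $2\cos(x, y) - r_T(x) - r_S(y)$, where $r_T(x)$ is the mean cosine similarity between $x$ and its $k$ nearest target neighbors and $r_S(y)$ is the analogous quantity for $y$ among the mapped source vectors. The function names, the default `k=10`, and the toy data below are assumptions for illustration; this is not the evaluation code used for Table 6.

```python
import numpy as np

def unit_rows(m):
    """L2-normalize each row so that dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def csls_scores(mapped_src, tgt, k=10):
    """CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y) for all source/target pairs."""
    cos = unit_rows(mapped_src) @ unit_rows(tgt).T  # shape (n_src, n_tgt)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    # r_T(x): mean cosine of each source vector to its k nearest target vectors.
    r_t = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # r_S(y): mean cosine of each target vector to its k nearest source vectors.
    r_s = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_t[:, None] - r_s[None, :]

def precision_at_1(mapped_src, tgt, gold, k=10):
    """gold[i] is the set of acceptable target row indices for source row i."""
    best = csls_scores(mapped_src, tgt, k).argmax(axis=1)
    return float(np.mean([int(b) in g for b, g in zip(best, gold)]))

# Toy data (assumption): 50 target vectors in 50 dimensions; the 20 mapped source
# vectors are lightly perturbed copies of the first 20 targets.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(50, 50))
mapped_src = tgt[:20] + 0.01 * rng.normal(size=(20, 50))
gold = [{i} for i in range(20)]
p_at_1 = precision_at_1(mapped_src, tgt, gold)
print(p_at_1)
```

Subtracting the two neighborhood terms penalizes target vectors that are close to many source vectors, which is how the measure counteracts hubness.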

Results
We present word translation results from our decontextual probe in Table 6. We see that the first LSTM layer generally achieves the best crosslingual alignment both in ELMos and Rosita. This finding mirrors recent studies on layerwise transferability; representations from the first LSTM layer in a language model are most transferable across a range of tasks (Liu et al., 2019). Our decontextual probe demonstrates that the first LSTM layer learns the most generalizable representations not only across tasks but also across languages. In all six languages, Rosita (joint LM training approach) outperforms ELMos (retrofitting approach) and the fastText vectors. This shows that for the polyglot (jointly trained) LMs, there is a preexisting similarity between languages' vector spaces beyond what a linear transform provides. The resulting language-agnostic representations lead to polyglot training's success in low-resource dependency parsing.

Further Related Work
In addition to the work mentioned above, much previous work has proposed techniques to transfer knowledge from a high-resource to a low-resource language for dependency parsing. Many of these methods essentially use a joint polyglot training setup, either lexicalized or delexicalized (e.g., McDonald et al., 2011; Cohen et al., 2011; Duong et al., 2015; Guo et al., 2016; Vilares et al., 2016; Falenska and Çetinoğlu, 2017, as well as many of the CoNLL 2017/2018 shared task participants, such as Lim and Poibeau (2017)). Some use typological information to facilitate crosslingual transfer (e.g., Naseem et al., 2012; Zhang and Barzilay, 2015; Wang and Eisner, 2016; Rasooli and Collins, 2017). Others use bitext (Zeman et al., 2018), manually-specified rules (Naseem et al., 2012), or surface statistics from gold universal part of speech (Wang and Eisner, 2018a,b) to map the source to target. The methods examined in this work to produce multilingual CWRs do not rely on such external information about the languages, and instead use relatively abundant LM data to learn crosslinguality that abstracts away from typological divergence.
Recent work has developed several probing methods for (monolingual) contextual representations (Liu et al., 2019;Hewitt and Manning, 2019;Tenney et al., 2019). Wada and Iwata (2018) showed that the (contextless) input and output word vectors in a polyglot word-based language model manifest a certain level of lexical correspondence between languages. Our decontextual probe demonstrated that the internal layers of polyglot language models capture crosslinguality and produce useful multilingual CWRs for downstream low-resource dependency parsing.

Conclusion
We assessed recent approaches to multilingual contextual word representations, and compared them in the context of low-resource dependency parsing. Our parsing results illustrate that a joint training approach for polyglot language models outperforms a retrofitting approach of aligning monolingual language models. Our decontextual probe showed that jointly trained LMs learn a better crosslingual lexical correspondence than the one produced by aligning monolingual language models or word type vectors. Our results provide a strong basis for multilingual representation learning and for further study of crosslingual transfer in a low-resource setting beyond dependency parsing.