Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations

We investigate whether off-the-shelf deep bidirectional sentence representations (Devlin et al., 2019) trained on a massively multilingual corpus (multilingual BERT) enable the development of an unsupervised universal dependency parser. This approach only leverages a mix of monolingual corpora in many languages and requires no translation data, making it applicable to low-resource languages. In our experiments we outperform the best CoNLL 2018 language-specific systems on all six of the shared task's truly low-resource languages while using a single system. However, we also find that (i) parsing accuracy still varies dramatically when changing the training languages and (ii) in some target languages zero-shot transfer fails under all tested conditions, raising concerns about the 'universality' of the whole approach.


Introduction
Pretrained sentence representations (Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018; Devlin et al., 2018) have recently set the new state of the art in many language understanding tasks (Wang et al., 2018). An appealing avenue for this line of work is to use a mix of training data in several languages and a shared subword vocabulary, leading to general-purpose multilingual representations. In turn, this opens the way to a number of promising cross-lingual transfer techniques that can address the lack of annotated data in the large majority of world languages.
In this paper, we investigate whether deep bidirectional sentence representations (Devlin et al., 2018) trained on a massively multilingual corpus (m-BERT) allow for the development of a universal dependency parser that is able to parse sentences in a diverse range of languages without receiving any supervision in those languages. Our parser is fully lexicalized, in contrast to a successful line of approaches based on delexicalized parsers (Zeman and Resnik, 2008; McDonald et al., 2011). Building on the delexicalized approach, previous work employed additional features such as typological properties (Naseem et al., 2012), syntactic embeddings (Duong et al., 2015), and cross-lingual word clusters (Täckström et al., 2012) to boost parsing performance. More recent work by Ammar et al. (2016) and Guo et al. (2016) requires translation data for projecting word embeddings into a shared multilingual space.
Among lexicalized systems in CoNLL18, the top system (Che et al., 2018) utilizes contextualized vectors from ELMo. However, they train a separate ELMo model for each language in the shared task. While their approach achieves the best LAS score on average, for low-resource languages the performance of their parser lags behind other systems that do not use pre-trained models (Zeman et al., 2018). By contrast, we build our dependency parser on top of general-purpose context-dependent word representations pretrained on a multilingual corpus. This approach does not require any translation data, making it applicable to truly low-resource languages (§3.3). While m-BERT's training objective is inherently monolingual (predict a word in a given language from its sentence context in that same language), we hypothesize that cross-lingual syntactic transfer occurs via the shared subword vocabulary and hidden layer parameters. Indeed, on the challenging task of universal dependency parsing from raw text, we outperform by a large margin the best CoNLL18 language-specific systems (Zeman et al., 2018) on the shared task's truly low-resource languages while using a single system.
The effectiveness of m-BERT for cross-lingual transfer of UD parsers has also been demonstrated in concurrent work by Wu and Dredze (2019) and Kondratyuk (2019). While the former utilizes only English as the training language, the latter trains on a concatenation of all available UD treebanks. We additionally experiment with three different sets of training languages beyond English-only and make interesting observations on the resulting large, and sometimes inexplicable, variation of performance among test languages.

Model
We use the representations produced by BERT (Devlin et al., 2018), a self-attentive deep bidirectional network (Vaswani et al., 2017) trained with a masked language model objective. Specifically, we use BERT's multilingual cased version, which was trained on the 100 languages with the largest available Wikipedias. Exponentially smoothed weighting was applied to prevent high-resource languages from dominating the training data, and a shared vocabulary of 110k WordPieces (Wu et al., 2016) was used.
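The exponentially smoothed weighting mentioned above can be sketched as follows. This is an illustrative implementation, not the authors' code, and the smoothing exponent `alpha=0.7` is an assumption (the exact value used for m-BERT is not stated in this paper):

```python
import numpy as np

def smoothed_sampling_probs(corpus_sizes, alpha=0.7):
    """Exponentially smooth language sampling probabilities.

    corpus_sizes: dict mapping language code -> corpus size in words.
    alpha < 1 flattens the distribution, boosting low-resource
    languages relative to their raw corpus share.
    NOTE: alpha=0.7 is a hypothetical choice for illustration.
    """
    langs = list(corpus_sizes)
    p = np.array([corpus_sizes[l] for l in langs], dtype=float)
    p /= p.sum()    # raw corpus proportions
    p = p ** alpha  # exponential smoothing
    p /= p.sum()    # renormalize to a probability distribution
    return dict(zip(langs, p))
```

With smoothing, a language holding under 1% of the raw data can receive several times that share of the sampling probability, which is the intended effect.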
For parsing we employ a modification of the graph-based dependency parser of Dozat and Manning (2016). We use deep biaffine attention to score arcs and their labels from head to dependent. While our label prediction model is similar to that of Dozat and Manning (2016), our arc prediction model is a globally normalized model which computes partition functions of non-projective dependency structures using Kirchhoff's Matrix-Tree Theorem (Koo et al., 2007).
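For intuition, biaffine arc scoring can be sketched as below. This is a minimal bilinear form, not the full deep biaffine parameterization of Dozat and Manning (2016), which additionally applies MLP projections to head and dependent vectors; all dimensions and names here are illustrative:

```python
import numpy as np

def biaffine_arc_scores(H_dep, H_head, U, u_head):
    """Minimal biaffine scoring sketch.

    H_dep:  (n, d) dependent representations
    H_head: (n, d) head representations
    U:      (d, d) bilinear weight matrix
    u_head: (d,)   weight vector for a head-only bias term
    Returns an (n, n) matrix S with S[h, c] = score of arc h -> c.
    """
    bilinear = H_head @ U @ H_dep.T         # S[h, c] = h_head^T U h_dep
    head_bias = (H_head @ u_head)[:, None]  # head prior, broadcast over c
    return bilinear + head_bias
```

The head-only bias term lets the model express that some tokens are more likely to be heads regardless of the dependent, a design choice taken from the original biaffine parser.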
Let x = w_1, w_2, ..., w_n be an input sentence of n tokens, which are given by the gold segmentation in training or by an automatic tokenizer in testing (§3.1). To obtain the m-BERT representation of x, we first obtain a sequence t = t_1, ..., t_m of m ≥ n subwords from x using the WordPiece algorithm. Then we feed t to m-BERT and extract the representations e_1, ..., e_m from the last layer.
If word w_i is tokenized into (t_j, ..., t_k), then the representation h_i of w_i is computed as the mean of (e_j, ..., e_k).
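This subword-to-word mean pooling is simple enough to sketch directly; the function below assumes embeddings are given as plain lists of floats and that each word's subword span (j, k) is known:

```python
def word_representations(subword_embs, word_spans):
    """Average subword embeddings into word representations.

    subword_embs: list of m vectors (one per WordPiece), each a list of floats.
    word_spans:   list of (j, k) inclusive index pairs, one per word:
                  word i covers subwords j..k.
    Returns one mean-pooled vector per word.
    """
    reps = []
    for j, k in word_spans:
        pieces = subword_embs[j:k + 1]
        # element-wise mean over the word's subword vectors
        reps.append([sum(dim) / len(pieces) for dim in zip(*pieces)])
    return reps
```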
The arc score s^(arc)[h, c] for head h and dependent c is computed similarly to Dozat and Manning (2016). The log probability of the dependency tree y of x is given by

log p(y | x) = Σ_{(h,c) ∈ y} s^(arc)[h, c] − log Z(x),

where Z(x) is the partition function over all non-projective trees. Our objective function for predicting dependency arcs is therefore globally normalized. We compute Z(x) via a matrix determinant (Koo et al., 2007). In our experiments, we find that training with the global objective is more stable if the score s^(arc)[h, c] is locally normalized such that Σ_h exp(s^(arc)[h, c]) = 1. During training, we update both m-BERT and parsing layer parameters.
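The determinant-based computation of Z(x) can be sketched with the single-root Matrix-Tree construction of Koo et al. (2007): build the graph Laplacian of the exponentiated arc scores and replace its first row with the root scores. The NumPy implementation below is illustrative, not the authors' code:

```python
import numpy as np

def log_partition(arc_scores, root_scores):
    """log Z(x) over non-projective dependency trees via the
    Matrix-Tree Theorem (single-root construction, Koo et al., 2007).

    arc_scores:  (n, n) log-scores, arc_scores[h, c] for head h -> child c.
    root_scores: (n,) log-scores for each token being the root.
    """
    A = np.exp(arc_scores)
    np.fill_diagonal(A, 0.0)            # no self-loop arcs
    L = -A                              # off-diagonal: -A[h, c]
    np.fill_diagonal(L, A.sum(axis=0))  # L[c, c] = sum over heads h of A[h, c]
    L[0, :] = np.exp(root_scores)       # replace first row with root weights
    sign, logdet = np.linalg.slogdet(L) # det is the sum of tree weights
    return logdet
```

Using `slogdet` rather than `det` avoids overflow when the exponentiated scores are large, which matters once real (unbounded) arc scores are plugged in.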

Experiments
While most previous work on parser transfer, including the closely related approach of Duong et al. (2015), relies on gold tokenization and POS tags, we adopt the more realistic scenario of parsing from raw text (Zeman et al., 2018), using the automatic sentence segmenter and tokenizer provided as baselines by the shared task organizers.

Data
We use the UDpipe-tokenized test data (Straka and Straková, 2017) and the CoNLL18 official script for evaluation. Gold tokenization is only used for the training data, while POS information is never used. All of our experiments are carried out on the Universal Dependencies (UD) corpus version 2.2 (Nivre et al., 2018) for a fair comparison with previous work.
While our sentence representations are always initialized from m-BERT, we experiment with four sets of parser training (i.e., fine-tuning) languages, namely: expEn, only English (200K words); expLatin, a mix of four Latin-script European languages: English, Italian, Norwegian, Czech (50K each, 200K in total); expSOV, a mix of two SOV languages: Hindi and Korean (100K each, 200K in total); and expMix, a larger mix of eight languages covering different language families and scripts: English, Italian, Norwegian, Czech, Russian, Hindi, Korean, Arabic (50K each, 400K in total). For high-resource languages that have more than one treebank, we choose the treebank with the best LAS score in CoNLL18 for training and the one with the lowest LAS score for zero-shot evaluation.

Training details
Similarly to Dozat et al. (2017), we use a neural network output size of 400 for arc prediction and 100 for label prediction. We use the Adam optimizer with learning rate 5e-6 to update the parameters of our models. The model is evaluated every 500 updates and we stop training if the LAS score does not improve for ten consecutive validations.
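The early-stopping criterion described above (halt when the validation LAS has not improved for ten consecutive evaluations) can be sketched as a small patience check; the function name and interface are illustrative:

```python
def should_stop(las_history, patience=10):
    """Return True when the most recent `patience` validation LAS
    scores all fail to beat the best score seen before them.

    las_history: list of LAS scores, one per validation (every 500
    updates in the setup described in the text).
    """
    if len(las_history) <= patience:
        return False  # not enough validations yet to judge
    best_before = max(las_history[:-patience])
    return max(las_history[-patience:]) <= best_before
```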

Results
To put our results into perspective, we report the accuracy of the best CoNLL18 system for each language and that of the Stanford system submitted to the same evaluation (Qi et al., 2018). The latter is also based on the deep biaffine parser of Dozat and Manning (2016); it does not use ensembles and was ranked 2nd on the official evaluation metric, LAS. Both of these parsers receive supervision in most of the languages, so a comparison to our parser is only fair for the low-resource languages where training data is not available (or negligible, i.e., less than 1K tokens).
Results for a subset of UD languages are presented in Table 1. Besides common European languages, we choose languages with writing scripts different from those present in the parser training data. We also include SOV (e.g., Korean, Persian) and VSO (e.g., Arabic, Breton) languages. Parser training languages for each experiment are highlighted in grey in Table 1.
In the high-resource setting, there is a considerable gap between zero-shot and supervised parsers, with Swedish as an exception (slightly better than Stanford's parser and 2 points below CoNLL18). By contrast, the benefit of multilingual transfer becomes evident in the low-resource setting. Here, most CoNLL18 systems, including Stanford's, use knowledge of each target language to customize the parser, e.g., to choose the optimal training language(s). Nevertheless, our single parser trained on the largest mix of languages (expMix) beats the best CoNLL18 language-specific systems on all six languages, even though three of these languages are not represented in m-BERT's training data. This result highlights the advantage of a multilingual pre-trained model in the truly low-resource scenario.
We notice the poor performance of our parser on spoken French in comparison to other European languages. While there is a sufficient amount of Wikipedia text for French, zero-shot parsing on a different domain apparently remains a challenge even with a large pre-trained model.

Analysis
By varying the set of parser training languages we analyze our results with respect to two factors: parser training language diversity and word order similarity.

Training language diversity
Increasing language diversity (expEn→expLatin and expLatin→expMix) leads to improvements in most test languages, even when the total amount of training data is fixed (expEn→expLatin). The only exceptions are the languages for which training data is reduced (English in expLatin) or becomes a smaller proportion of the total training data (Czech, Italian, Norwegian in expMix), which confirms previous findings (Ammar et al., 2016). Swedish and Upper Sorbian, being related to Norwegian and Czech respectively, also lose some accuracy in expMix. On the other hand, newly included languages (Czech, Italian, Norwegian in expLatin and Arabic, Hindi, Korean, Russian in expMix) show the biggest improvements, which was also expected.
More interestingly, some large gains are reported for languages that are unrelated to all training languages of expLatin. We hypothesize that such languages (Arabic, Armenian, Hungarian) may benefit from exposure of the parser to a more diverse set of word orders (§4.2). For instance, Arabic, being head-initial, is closer to Italian than to English in terms of word order.
Actual language relatedness does not always play a clear role: for instance, Upper Sorbian seems to benefit greatly from its closeness to Czech in expLatin and expMix, while Faroese (related to Norwegian) does not improve as much.
In summary, language diversity in training is clearly a great asset.However, there is a large variation in gains among test languages, for which language family relatedness can only offer a partial explanation.

Training language typology
Training on languages with similar typological features has been shown to benefit parsing of target languages in the delexicalized setup. In particular, word order similarities have proved useful for selecting source languages for parsing model transfer (Naseem et al., 2012; Duong et al., 2015).
Indeed, when Hindi and Korean are present in expSOV, we observe better LAS scores in various SOV languages (Japanese, Tamil, Urdu, Buryat); however, some other SOV languages (Persian and Armenian) perform much worse than in expLatin, showing that word order alone is not a reliable criterion for training language selection.
Given these observations, we construct our largest training data (expMix) by merging all the languages of expEn, expLatin, and expSOV and adding two more languages with diverse word order profiles for which large treebanks exist, namely Russian and Arabic.
Concurrently to this work, Lin et al. (2019) have proposed an automatic method to choose the optimal transfer languages for various tasks including parsing, based on a variety of typological as well as data-dependent features. We leave the adoption of their method to future work.

Towards explaining transfer performance
Even when keeping the training languages fixed, for instance in expMix, we observe a large variation of zero-shot parsing transfer accuracy among test languages, which does not often correlate with supervised parsing accuracy. As an attempt to explain this variation, we look at the overlap of the test vocabulary with (i) the parser's training data vocabulary τ and (ii) m-BERT's training data vocabulary. Because m-BERT uses a subword vocabulary that also includes characters, we resort to measuring the unsegmented word score η, the proportion of gold tokens that the WordPiece tokenizer leaves whole, where type_w(D) and token_w(D) are the sets of WordPiece types and tokens in dataset D respectively, and token_g(D) is the set of gold tokens in D before applying WordPieces. A higher η indicates a less segmented text.
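A token-level sketch of the unsegmented word score is given below. This is a reconstruction of the idea (the fraction of gold tokens left whole by the tokenizer); the paper's exact formula in terms of the type_w/token_w/token_g sets may differ, and the toy tokenizer in the test is purely hypothetical:

```python
def unsegmented_word_score(gold_tokens, wordpiece_of):
    """Sketch of the unsegmented word score eta.

    gold_tokens:  list of gold tokens in dataset D (token_g(D) as a list).
    wordpiece_of: function mapping a token to its list of WordPieces
                  (a stand-in for the real WordPiece tokenizer).
    Returns the fraction of gold tokens kept as a single piece;
    higher values indicate a less segmented text.
    """
    whole = sum(1 for t in gold_tokens if len(wordpiece_of(t)) == 1)
    return whole / len(gold_tokens)
```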
To account for typological features, we also plot the average syntactic similarity σ of each test language to the eight expMix training languages, as computed from the URIEL database (Littell et al., 2017); specifically, we compute 1 − d, where d is the precomputed syntactic distance in lang2vec. We observe a correlation between LAS, η and τ for test languages that have a relative in the training data, like Urdu and Hindi. For test languages that belong to a different family than all training languages, no correlation appears. A similar observation is also reported by Pires et al. (2019): namely, they find that the performance of cross-lingual named entity recognition with m-BERT is largely independent of vocabulary overlap.
Although typological features have been shown to be useful when incorporated into the parser (Naseem et al., 2012; Ammar et al., 2016), we do not find a clear correlation between σ and LAS in our setup. Thus none of our investigated factors can explain transfer performance in a systematic way.

Language outliers
While massively pre-trained language models promise a more inclusive future for NLP, we find it important to note that cross-lingual transfer performs very poorly for some languages.
For instance, in our experiments, Japanese and Vietnamese stand out as strikingly negative outliers. Wu and Dredze (2019) also report very low performance on Japanese in their zero-shot dependency parsing experiments. Lin et al. (2019) exclude Japanese from their parsing experiments altogether because of unstable results.
Japanese and Vietnamese are language isolates in an NLP sense, meaning that they do not enjoy the presence of a closely related language among the high-resource training languages. For this class of languages, transfer performance is overall very inconsistent and hard to explain. The case of Japanese is particularly interesting for its relation to Korean. Family relatedness between these two languages is very controversial, but their syntactic features are extremely similar. To put our parser in optimal transfer conditions, we perform one last experiment by training only on Korean (all available data) and testing on Japanese, and vice versa. As shown in Table 2, Japanese performance becomes even lower in this setup. We can also see that transferring in the opposite direction leads to a much better result, despite the fact that state-of-the-art supervised systems in these two languages achieve similar results (Japanese: 83.11, Korean: 86.92 by the best CoNLL18 systems). To rule out the impact of unsupervised sentence and token segmentation, which may perform particularly poorly on some languages, we retrain the parser with gold segmentation and find that it explains only a small part of the gap.
While Pires et al. (2019) hypothesize that word order is the main culprit for the poor zero-shot performance on Japanese when transferring a POS tagger from English, our experiments with Korean and Japanese show a different picture.

Conclusions
We have built a Universal Dependency parser on top of deep bidirectional sentence representations pre-trained on a massively multilingual corpus (m-BERT) without any need for parallel data, treebanks or other linguistic resources in the test languages.
Evaluated in the challenging scenario of parsing from raw text, our best parser trained on a mix of languages representing both language family and word order diversity outperforms the best CoNLL18 language-specific systems on the six truly low-resource languages presented at the shared task.
Our experiments show that language diversity in the training treebank is a great asset for transfer to low-resource languages. Moreover, the massively multilingual nature of m-BERT does not neutralize the impact of transfer languages on parsing accuracy, which is only partially explained by language relatedness and word order similarity.
Finally we have raised the issue of language outliers that perform very poorly in all our tested conditions and that, given our analysis, are unlikely to benefit even from automatic methods of transfer language selection (Lin et al., 2019).

Figure 1 :
Figure 1: Relationship between parsing accuracy (expMix), parser training-test vocabulary overlap τ, m-BERT unsegmented word score η, and average typological syntactic similarity σ. A purple bar indicates that no language of the same family is present in the training data. Languages in the training set of expMix are not shown.

Table 1 :
LAS scores of our parser in the raw text setup. Languages not in m-BERT's training corpus are marked with *. SVO and SOV languages are indicated by purple and green respectively. Stanford's and CoNLL18's best submitted systems are provided as representative state-of-the-art supervised systems. #TrWrds = total training data made available at CoNLL18. The amount of training data used in each experiment is specified in §3.1. Training languages for each experiment are highlighted in grey.

Table 2 :
LAS scores when transferring between Korean and Japanese in two tokenization conditions.