Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank

Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled \emph{and unlabeled} data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and target language varieties.


Introduction
Contextual word representations (CWRs) from pretrained language models have improved many NLP systems. Such language models include BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), which are conventionally "pretrained" on large unlabeled datasets before their internal representations are "finetuned" during supervised training on downstream tasks like parsing. However, many language varieties lack large annotated and even unannotated datasets, raising questions about the broad applicability of such data-hungry methods.
One exciting way to compensate for the lack of unlabeled data in low-resource language varieties is to finetune a large, multilingual language model that has been pretrained on the union of many languages' data (Devlin et al., 2019; Lample and Conneau, 2019). This enables the model to transfer some of what it learns from high-resource languages to low-resource ones, demonstrating benefits over monolingual methods in some cases (Conneau et al., 2020a; Tsai et al., 2019), though not always (Agerri et al., 2020; Rönnqvist et al., 2019).
Specifically, multilingual models face the transfer-dilution tradeoff (Conneau et al., 2020a): increasing the number of languages during pretraining improves positive crosslingual transfer but decreases the model capacity allocated to each language. Furthermore, such models are only pretrained on a finite amount of data and may lack exposure to specialized domains of certain languages or even entire low-resource language varieties. The result is a challenge for these language varieties, which must rely on positive transfer from a sufficient number of similar high-resource languages. Indeed, Wu and Dredze (2020) find that multilingual models often underperform monolingual baselines for such languages and question their off-the-shelf viability.
We take inspiration from previous work on domain adaptation, where general-purpose monolingual models have been effectively adapted to specialized domains through additional pretraining on domain-specific corpora (Gururangan et al., 2020). We hypothesize that we can improve the performance of multilingual models on low-resource language varieties analogously, through additional pretraining on language-specific corpora.
However, additional pretraining on more data in the target language does not ensure its full representation in the model's vocabulary, which is constructed to maximally represent the model's original pretraining data (Sennrich et al., 2016;Wu et al., 2016). Artetxe et al. (2020) find that target languages' representation in the vocabulary affects these models' transferability, suggesting that language varieties on the fringes of the vocabulary may not be sufficiently well-modeled. Can we incorporate vocabulary from the target language into multilingual models' existing alignment?
We introduce the use of additional language-specific pretraining for multilingual CWRs in a low-resource setting, before use in a downstream task; to better model language-specific tokens, we also augment the existing vocabulary with frequent tokens from the low-resource language (§2). Our experiments consider dependency parsing in four typologically diverse low-resource language varieties with different degrees of relatedness to a multilingual model's pretraining data (§3). Our results show that these methods consistently improve performance on each target variety, especially in the lowest-resource cases (§4). In doing so, we demonstrate the importance of accounting for the relationship between a multilingual model's pretraining data and the target language variety.
Because the pretraining-finetuning paradigm is now ubiquitous, many experimental findings for one task can inform work on other tasks. Thus, our findings on dependency parsing, whose annotated datasets cover many more low-resource language varieties than those of other NLP tasks, are expected to interest researchers and practitioners facing low-resource situations for other tasks. To this end, we make our code, data, and hyperparameters publicly available. 2

Overview
We are chiefly concerned with the adaptation of pretrained multilingual models to a target language by optimally using available data. As a case study, we use the multilingual cased BERT model (MBERT) of Devlin et al. (2019), a transformer-based (Vaswani et al., 2017) language model which has produced strong CWRs for many languages (Kondratyuk and Straka, 2019, inter alia). MBERT is pretrained on the 104 languages with the most Wikipedia data and encodes input tokens using a fixed wordpiece vocabulary (Wu et al., 2016) learned from this data. Low-resource languages are slightly oversampled in its pretraining data, but high-resource languages are still more prevalent, resulting in a language imbalance.

We observe that two types of target language varieties may be disadvantaged by this training scheme: the lowest-resource languages in MBERT's pretraining data (which we call Type 1); and unseen low-resource languages (Type 2). Although Type 1 languages are oversampled during training, they are still overshadowed by high-resource languages. Type 2 languages must rely purely on crosslingual vocabulary overlap. In both cases, the wordpieces that encode the input tokens in these languages may not fully capture the senses in which they are used, or they may be completely unseen. However, other low-resource varieties with more representation in MBERT's pretraining data (Type 0) may not be as disadvantaged. Optimally using MBERT in low-resource settings thus requires accounting for its limitations with respect to a target language variety.

Methods
We evaluate three methods of adapting MBERT to better model target language varieties.
Language-Adaptive Pretraining (LAPT) Under the assumption that language varieties function analogously to domains for MBERT, we adapt the domain-adaptive pretraining method of Gururangan et al. (2020) to a multilingual setting. With language-adaptive pretraining, MBERT is pretrained for additional epochs on monolingual data in the target language variety to improve the alignment of the wordpiece embeddings.
Vocabulary Augmentation (VA) To better model unseen or language-specific wordpieces, we explore performing LAPT after augmenting MBERT's vocabulary from a target language variety. We train a new wordpiece vocabulary on monolingual data in the target language, tokenize the monolingual data with the new vocabulary, and augment MBERT's vocabulary with the 99 most common wordpieces in the new vocabulary that replaced the "unknown" wordpiece token. Full details of this process are given in the Appendix.
Tiered Vocabulary Augmentation (TVA) We consider a variant of VA with a larger learning rate for the embeddings of the 99 new wordpieces than for the other parameters, to explore these embeddings' potential to be learned more thoroughly without overfitting the model's remaining parameters. (Wordpiece tokenization is done greedily based on a fixed vocabulary; the model returns a special "unknown" token for unseen characters and other subword units that cannot be represented by the vocabulary. MBERT's fixed-size vocabulary contains 99 tokens designated as "unused," whose representations were not updated during initial pretraining and can be repurposed for vocabulary augmentation without modifying the pretrained model.)
Learning rate details are given in the Appendix.
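The methods above all hinge on MBERT's greedy wordpiece tokenization, where subwords that cannot be represented by the fixed vocabulary collapse into a single "unknown" token. A minimal sketch of that tokenizer, with a toy vocabulary (not MBERT's actual one):

```python
# Greedy (longest-match-first) wordpiece tokenization. Continuation pieces
# carry a '##' prefix; unrepresentable words become a single "[UNK]" token.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split `word` into wordpieces drawn from `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal continuation
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # some character cannot be represented at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```

This illustrates why VA helps: each wordpiece added to the vocabulary can rescue whole families of words that previously fell through to "[UNK]".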

Evaluation
We perform evaluation on dependency parsing. Following Kondratyuk and Straka (2019), we take a weighted sum of the activations at each MBERT layer as the CWR for each token. We then pass the representations into the graph-based dependency parser of Dozat and Manning (2017). This parser, which is also used in related work (Kondratyuk and Straka, 2019; Mulcaire et al., 2019a; Schuster et al., 2019), uses a biaffine attention mechanism between word representations to score a parse tree.
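The layer-weighting scheme above can be sketched as a softmax-normalized scalar mix over layer activations. The sketch below uses toy 3-dimensional vectors; in practice the activations are 768-dimensional, and the per-layer scalars `s` and global scale `gamma` are learned jointly with the parser:

```python
import math

# Weighted sum over layers: gamma * sum_j softmax(s)_j * h_j, computed for
# a single token. `layer_activations` holds one vector per MBERT layer.
def scalar_mix(layer_activations, s, gamma=1.0):
    exps = [math.exp(x) for x in s]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over layer scalars
    dim = len(layer_activations[0])
    mixed = [0.0] * dim
    for w, h in zip(weights, layer_activations):
        for i in range(dim):
            mixed[i] += gamma * w * h[i]
    return mixed

layers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy activations from two layers
print(scalar_mix(layers, s=[0.0, 0.0]))      # equal weights -> [0.5, 0.5, 0.0]
```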

Experiments
We consider two variants of each MBERT method: one in which the pretrained CWRs are frozen; and one where they are further finetuned during parser training (FT). Following prior work involving these two variants (Beltagy et al., 2019), FT variants perform biaffine attention directly on the outputs of MBERT instead of first passing them through a BiLSTM, as in Dozat and Manning (2017). We perform additional pretraining for up to 20 epochs, selecting our final models based on average validation LAS downstream. Full training details are given in the Appendix. We report average scores and standard errors based on five random initializations. Code and data are publicly available (see footnote 2).

Languages and Datasets
We perform experiments on four typologically diverse low-resource languages: Irish (GA), Maltese (MT), Vietnamese (VI), and Singlish (Singapore Colloquial English; SING). Singlish is an English-based creole spoken in Singapore, which incorporates lexical and syntactic borrowings from other languages spoken in Singapore: Chinese, Malay, and Tamil. Wang et al. (2017) provide an extended motivation for evaluating on Singlish.
These language varieties are exemplars of the three types discussed in §2. MBERT is trained on the 104 largest Wikipedias, which include Irish and Vietnamese but exclude Maltese and Singlish. However, the Irish Wikipedia is several orders of magnitude smaller than the full Vietnamese one. So, we view Irish and Maltese as Type 1 and Type 2 language varieties, respectively. Though Singlish lacks its own Wikipedia and is likely not included in MBERT's pretraining data per se, its component languages (English, Chinese, Malay, and Tamil) are all well-represented in the data. We thus consider it to be a Type 0 variety along with Vietnamese.
Unlabeled Datasets Additional pretraining for Irish, Maltese, and Vietnamese uses unlabeled articles from Wikipedia. To simulate a truly low-resource setting for Vietnamese, we use a random sample of 5% of the articles. Singlish data is crawled from the SG Talk Forum online forum and provided by Wang et al. (2017). To ensure robust evaluation, we remove all sentences that appear in the labeled validation and test sets from the unlabeled data. Full details are provided in the Appendix.
Tab. 1 gives the average number of wordpieces per token and the number of tokens with unknown wordpieces in each of the unlabeled datasets, computed based on the original MBERT vocabulary. While the high number of wordpieces per token for Irish and Maltese may be due in part to morphological richness, it also suggests that these languages stand to benefit more from improved alignment of the wordpieces' embeddings. Furthermore, the higher rates of unknown wordpieces leave room for enhanced performance with an improved vocabulary.

Labeled Datasets For Singlish, we use the treebank of Wang et al. (2019), which we randomly partition into train (80%), validation (10%), and test (10%) sets. We use the provided gold word segmentation but no POS tag features.
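The two statistics reported in Tab. 1 are straightforward to compute given a wordpiece tokenizer. A sketch with a stand-in toy tokenizer (the real statistics use MBERT's tokenizer; the `known` set here is hypothetical):

```python
# Compute (average wordpieces per token, number of tokens containing an
# unknown wordpiece) over a corpus of whitespace-tokenized sentences.
def vocab_stats(corpus, tokenize):
    n_tokens, n_pieces, n_unk_tokens = 0, 0, 0
    for sentence in corpus:
        for token in sentence.split():
            pieces = tokenize(token)
            n_tokens += 1
            n_pieces += len(pieces)
            if "[UNK]" in pieces:
                n_unk_tokens += 1
    return n_pieces / n_tokens, n_unk_tokens

# Toy tokenizer: known words stay whole, everything else becomes [UNK].
known = {"the", "cat"}
def toy_tokenize(tok):
    return [tok] if tok in known else ["[UNK]"]

print(vocab_stats(["the cat qux"], toy_tokenize))  # (1.0, 1)
```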

Baselines
For each language, we evaluate the performance of MBERT in frozen and FT variants, without any adaptations. We additionally benchmark each method against strong prior work that represents conventional approaches for representing low-resource languages: static fastText embeddings (FASTT; Bojanowski et al., 2017), which can be learned effectively even on small datasets; and monolingual ELMo models (ELMO; Peters et al., 2018), a monolingual contextual approach. We choose ELMo over training a new BERT model because the high computational and data requirements of the latter make it unviable in a low-resource setting. Both baselines are trained on our unlabeled datasets.

Results and Discussion
Tab. 2 shows the performance of each of the method variants on the four Universal Dependencies datasets, with standard deviations from five different initializations. Our experiments demonstrate that additional language-specific pretraining results in more effective representations. LAPT consistently outperforms baselines, especially for Irish and Maltese, where overlap with the original pretraining data is low and frozen MBERT underperforms ELMO. This suggests that the insights of Gururangan et al. (2020) on additional pretraining for domain adaptation are also applicable to transferring multilingual models to low-resource languages, even without much additional data. (Our partition of the data is available at https://github.com/ethch18/parsing-mbert.)
LAPT with our vocabulary augmentation methods yields small but significant improvements over LAPT alone, especially for FT configurations and Type 1/2 languages. This demonstrates that accurate vocabulary modeling is important for improving representations in the target language, and that VA is an effective method for doing so while maintaining overall alignment. However, TVA rarely outperforms VA significantly, suggesting that accelerated learning of the new embeddings does not benefit the model overall.
The relative error reductions between unadapted MBERT and each of our methods correlate with each language variety's relationship to MBERT's pretraining data. Maltese (Type 2) improves by up to 39% and Irish (Type 1) by up to 15%, compared to 11% for Singlish and 5% for Vietnamese (both Type 0). While this trend is by no means comprehensive, it suggests that effective use of MBERT requires considering the target language variety.
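The relative error reductions above are computed from labeled attachment scores as (err_base - err_new) / err_base, with err = 100 - LAS. The scores in the example below are made-up illustrations, not numbers from Tab. 2:

```python
# Relative error reduction between a baseline and an improved parser,
# measured on LAS (labeled attachment score, out of 100).
def relative_error_reduction(las_base, las_new):
    err_base = 100.0 - las_base
    err_new = 100.0 - las_new
    return (err_base - err_new) / err_base

print(relative_error_reduction(80.0, 87.8))  # ~0.39, a 39% error reduction
```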
Our results thus support our hypotheses and give insight into the limitations of MBERT. Wordpieces appear in different contexts in different languages, and MBERT initially lacks enough exposure to wordpiece usage in Type 1/2 target languages to outperform baselines. However, increased exposure through additional language-specific pretraining can ameliorate this issue. Likewise, despite MBERT's attempt to balance its pretraining data, the existing vocabulary still favors languages that have been seen more. Augmenting the vocabulary can produce additional improvement for languages with greater proportions of unseen wordpieces. Overall, our findings are promising for low-resource language varieties, demonstrating that large improvements in performance are possible with the help of a little unlabeled data, and that the performance discrepancy of multilingual models for low-resource languages (Wu and Dredze, 2020) can be overcome.

Further Related Work
Our work builds on prior empirical studies on multilingual models, which probe the behavior and components of existing models to explain why they are effective (e.g., Cao et al., 2020). Unlike these studies, we primarily consider how to improve the performance of multilingual models for a given target language variety. Though our experiments do not directly probe the impact of vocabulary overlap, we contribute further evaluation of the importance of improved modeling of the target variety.
Recent work has also proposed additional pretraining for general-purpose language models, especially with respect to domain (Alsentzer et al., 2019; Chakrabarty et al., 2019; Gururangan et al., 2020; Han and Eisenstein, 2019; Howard and Ruder, 2018; Logeswaran et al., 2019; Sun et al., 2019). Lakew et al. (2018) and Zoph et al. (2016) perform additional training on parallel data to adapt bilingual translation models to unseen target languages, while Mueller et al. (2020) improve a polyglot task-specific model by finetuning on labeled monolingual data in the target variety. To the best of our knowledge, our work is the first to demonstrate the effectiveness of additional pretraining for massively multilingual language models toward a target low-resource language variety, using only unlabeled data in the target variety.

Conclusion
We explore additional language-specific pretraining and vocabulary augmentation for multilingual contextual word representations in low-resource settings and find them to be effective for dependency parsing, especially in the lowest-resource cases. Our results demonstrate the significance of the relationship between a multilingual model's pretraining data and a target language. We expect that our findings can benefit practitioners in low-resource settings, and our data, code, and models are publicly available to accelerate further study.

A.1 Vocabulary Augmentation

We choose the vocabulary size to minimize the number of unknown wordpieces while maintaining a similar wordpiece-per-token ratio as the original MBERT vocabulary. Empirically, we find a vocabulary size of 5000 to best meet these criteria. Then, we tokenize the unlabeled data using both the new and original vocabularies. We compare the tokenizations of each word and note cases where the new vocabulary yields a tokenization with fewer unknown wordpieces than the original one. We select the 99 most common wordpieces that occur in these cases and use them to fill the 99 unused slots in MBERT's vocabulary. For Singlish, 99 such wordpieces are not available; we fill the remaining slots with the most common wordpieces in the new vocabulary.
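The selection step described above can be sketched as follows. The `orig_tokenize` and `new_tokenize` callables stand in for real wordpiece tokenizers, and the toy Maltese-flavored example is hypothetical:

```python
from collections import Counter

# Keep words where the new vocabulary yields fewer "[UNK]" pieces than the
# original one, then take the most common new wordpieces from those cases
# to fill the 99 unused slots in MBERT's vocabulary.
def select_augmentation_pieces(words, orig_tokenize, new_tokenize, n_slots=99):
    counts = Counter()
    for word in words:
        orig_pieces = orig_tokenize(word)
        new_pieces = new_tokenize(word)
        if new_pieces.count("[UNK]") < orig_pieces.count("[UNK]"):
            counts.update(p for p in new_pieces if p != "[UNK]")
    return [piece for piece, _ in counts.most_common(n_slots)]

# Toy example: the original vocabulary cannot represent 'ġ', the new one can.
orig = lambda w: ["[UNK]"] if "ġ" in w else [w]
new = lambda w: ["ġ", "##bira"] if w == "ġbira" else [w]
print(select_augmentation_pieces(["ġbira", "ġbira", "kelma"], orig, new))
# ['ġ', '##bira']
```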
Tab. 3 gives a comparison of the number of tokens with unknown wordpieces under the original and augmented MBERT vocabularies. The augmented vocabulary significantly decreases the number of unknowns, resulting in a specific embedding for most of the wordpieces.

A.2 Data Extraction and Preprocessing
In this section, we detail the steps used to obtain the pretraining data. After dataset-specific preprocessing, all datasets are tokenized with the multilingual spaCy tokenizer. We then generate pretraining shards in a format accepted by MBERT using scripts provided by Devlin et al. (2019) and the parameters listed in Tab. 8, which include artificially augmenting each dataset five times by masking different words with a probability of 0.15. Statistics for labeled datasets, which we use without modification, are provided in Tab. 4.
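The duplication-and-masking step above can be sketched as follows; the function and parameter names are illustrative, not those of the original shard-generation scripts:

```python
import random

# Duplicate a sentence `dupe_factor` times, independently masking each token
# with probability `mask_prob` in each copy, so every copy carries a
# different masking pattern.
def make_masked_copies(tokens, dupe_factor=5, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    copies = []
    for _ in range(dupe_factor):
        copies.append([
            "[MASK]" if rng.random() < mask_prob else tok
            for tok in tokens
        ])
    return copies

copies = make_masked_copies("il qtates jieklu l ħut".split())
print(len(copies))  # 5 differently masked copies of the same sentence
```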

Wikipedia Data
We draw data from the newest available Wikipedia dump (https://dumps.wikimedia.org/) for each language at the time it was obtained: October 20, 2019 (Irish) and January 1, 2020 (Maltese, Vietnamese). We use WikiExtractor (https://github.com/attardi/wikiextractor) to extract the article text, split sentences at periods, and remove the following items:
• Document start and end lines
• Article titles and section headers
• Categories
• HTML content (e.g., <br>)
Articles are kept contiguous. The full Vietnamese Wikipedia consists of nearly 6.5 million sentences (141 million tokens); to simulate a truly low-resource setting, we randomly select 5% of the articles without replacement to use in our pretraining.

Singlish Data Beginning with the raw crawled sentences from Wang et al. (2017), we remove any sentences that appear verbatim in the validation or test sets of either their original treebank or our partition. Furthermore, we remove any sentences with fewer than five tokens or more than 50 tokens, as we observe that a large proportion of these sentences are either nonsensical or extended quotes from Standard English. We note that this dataset is non-contiguous: most sentences do not appear in a larger context.
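The filtering described above can be sketched as a single pass over the crawled sentences. Measuring length in whitespace tokens is an assumption here, as are the example sentences:

```python
# Drop sentences that appear verbatim in the held-out (validation/test)
# labeled data, plus sentences with fewer than 5 or more than 50 tokens.
def filter_unlabeled(sentences, held_out, min_len=5, max_len=50):
    held = set(held_out)
    kept = []
    for sent in sentences:
        n = len(sent.split())
        if sent in held or n < min_len or n > max_len:
            continue
        kept.append(sent)
    return kept

raw = ["so shiok the laksa here lah", "too short",
       "dev sentence appears here also ok"]
held_out = ["dev sentence appears here also ok"]
print(filter_unlabeled(raw, held_out))  # keeps only the first sentence
```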

A.3 Training Procedure
During pretraining, we use the original implementation of Devlin et al. (2019) but modify it to optimize based only on the masked language modeling (MLM) loss. Although Devlin et al. (2019) also trained on a next sentence prediction (NSP) loss, subsequent work has found joint optimization of NSP and MLM to be less effective than MLM alone (K et al., 2020; Lample and Conneau, 2019; Liu et al., 2019). Furthermore, in certain low-resource language varieties, fully contiguous data may not be available, rendering the NSP task ill-posed. We perform additional pretraining for up to 20 epochs, selecting our final model based on average validation LAS downstream.
Following prior work on parsing with MBERT (Kondratyuk and Straka, 2019), parsers are trained with an inverse square root learning rate decay and linear warmup, and with gradual unfreezing and discriminative finetuning of the layers. These models are trained for up to 200 epochs with early stopping based on validation performance. All parsers are implemented in AllenNLP, version 0.9.0 (Gardner et al., 2018).
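The schedule above can be sketched as one common formulation of inverse square root decay with linear warmup; the exact AllenNLP parameterization may differ, and `base_lr` and `warmup` values are illustrative:

```python
# Learning rate at a given step: linear warmup to `base_lr` over `warmup`
# steps, then inverse square root decay thereafter.
def lr_at_step(step, base_lr=1e-3, warmup=1000):
    if step < warmup:
        return base_lr * step / warmup       # linear warmup
    return base_lr * (warmup / step) ** 0.5  # inverse sqrt decay

print(lr_at_step(500))   # halfway through warmup: 0.0005
print(lr_at_step(4000))  # decayed back to 0.0005 by step 4000
```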
Tab. 8 gives all hyperparameters kept constant during MBERT pretraining and parser training. The values for these hyperparameters largely reflect the defaults or recommendations specified in the implementations we used. For instance, the base learning rate for LAPT, VA, and TVA reflects recommendations in the code of Devlin et al. (2019), and the TVA embedding learning rate is equal to the learning rate used in the original pretraining of MBERT. Due to the large number of parameters in MBERT, large batch sizes are sometimes infeasible. We reduce the batch size until training is able to complete successfully on our GPU. In cases where a hyperparameter assignment yields exploding gradients and/or trends toward an infinite loss, we rerun the experiment to yield a feasible initialization.

A.5 TVA Revision
In April 2022, GitHub user thnkinbtfly reported that the original implementation of TVA contained an error that caused it to be equivalent to that of VA. The main paper has been updated to report corrected results and best epochs for all TVA configurations. Original results for all other configurations are unchanged. We preserve the full set of original results in Tab. 9 and the original best epochs for TVA configurations in Tab. 7.
The original experimental conclusions about additional language-specific pretraining and vocabulary augmentation at large are unaffected: both methods are still effective as described. Similarly, observations about the correlation between relative error reduction and the target language variety's relationship to MBERT's pretraining data are unchanged.
One finding on the details of vocabulary augmentation is affected: rather than slightly benefiting learning of the newly added wordpieces, TVA does not consistently improve over VA. Our observations in §2 and §4 have been updated accordingly.