Multilingual Constituency Parsing with Self-Attention and Pre-Training

We show that constituency parsing benefits from unsupervised pre-training across a variety of languages and a range of pre-training conditions. We first compare the benefits of no pre-training, fastText, ELMo, and BERT for English and find that BERT outperforms ELMo, in large part due to increased model capacity, whereas ELMo in turn outperforms the non-contextual fastText embeddings. We also find that pre-training is beneficial across all 11 languages tested; however, large model sizes (more than 100 million parameters) make it computationally expensive to train separate models for each language. To address this shortcoming, we show that joint multilingual pre-training and fine-tuning allows sharing all but a small number of parameters between ten languages in the final model. The 10x reduction in model size compared to fine-tuning one model per language causes only a 3.2% relative error increase in aggregate. We further explore the idea of joint fine-tuning and show that it gives low-resource languages a way to benefit from the larger datasets of other languages. Finally, we demonstrate new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1).


Introduction
There has recently been rapid progress in developing contextual word representations that improve accuracy across a range of natural language tasks (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). In our earlier work (Kitaev and Klein, 2018), we showed that such representations are helpful for constituency parsing. However, those results only considered the LSTM-based ELMo representations (Peters et al., 2018), and only for the English language. We now extend this work, showing that a purely self-attentive approach also works by substituting BERT (Devlin et al., 2018). We further demonstrate that pre-training and self-attention are effective across languages by applying our parsing architecture to ten additional languages.
Our parser code and trained models for 11 languages are publicly available at https://github.com/nikitakit/self-attentive-parser.

Model
Our parser, as described in Kitaev and Klein (2018), accepts as input a sequence of vectors corresponding to words in a sentence, transforms these representations using one or more self-attention layers, and finally uses these representations to output a parse tree. We incorporate BERT by taking the token representations from the last layer of a BERT model and projecting them to 512 dimensions (the default size used by our parser) using a learned projection matrix. While our parser operates on vectors aligned to words in a sentence, BERT associates vectors with sub-word units based on WordPiece tokenization (Wu et al., 2016). We bridge this difference by retaining only the BERT vector corresponding to the last sub-word unit of each word in the sentence. We briefly experimented with alternatives, such as using only the first sub-word instead, but did not find that this choice had a substantial effect on English parsing accuracy.
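The sub-word-to-word alignment described above can be sketched in a few lines. This is an illustrative implementation, not the released parser code; it assumes the standard BERT convention in which continuation pieces carry a "##" prefix, and the helper name is hypothetical:

```python
def last_subword_indices(words, wordpieces):
    """For each word, return the index of its final WordPiece token.

    Assumes the BERT convention where continuation pieces are
    prefixed with '##' (e.g. "parsing" -> ["pars", "##ing"]).
    Only the BERT vectors at these indices are kept and projected
    to the parser's 512-dimensional input space.
    """
    indices = []
    i = 0
    for _word in words:
        # advance past this word's pieces; the word ends just before
        # the next piece that does NOT start with '##'
        i += 1
        while i < len(wordpieces) and wordpieces[i].startswith("##"):
            i += 1
        indices.append(i - 1)
    return indices

words = ["constituency", "parsing"]
pieces = ["constituency", "pars", "##ing"]
print(last_subword_indices(words, pieces))  # [0, 2]
```

Selecting the vectors at these indices yields exactly one representation per word, which the learned projection matrix then maps to 512 dimensions.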
The fact that additional layers are applied to the output of BERT, which itself uses a self-attentive architecture, may at first seem redundant, but there are important differences between these two portions of the architecture. The extra layers on top of BERT use word-based tokenization instead of sub-words, apply the factored version of self-attention proposed in Kitaev and Klein (2018), and are randomly initialized rather than pre-trained. We found that omitting these additional layers and using the BERT vectors directly hurt parsing accuracy.
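The overall arrangement can be illustrated with a minimal NumPy sketch: project the per-word BERT outputs, then stack randomly-initialized attention layers on top. For brevity this shows plain single-head dot-product attention rather than the factored content/position variant the parser actually uses, and all dimensions and layer counts here are illustrative:

```python
import numpy as np

def self_attention_layer(x, rng):
    """One randomly-initialized single-head self-attention layer.

    x has shape (num_words, d). Plain dot-product attention is shown
    only to illustrate the extra randomly-initialized layers stacked
    on top of the pre-trained BERT encoder; the parser's layers use
    a factored content/position variant instead.
    """
    n, d = x.shape
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v  # residual connection

rng = np.random.default_rng(0)
bert_out = rng.standard_normal((7, 768))          # one vector per word
proj = rng.standard_normal((768, 512)) / np.sqrt(768)
h = bert_out @ proj                               # project to parser size
for _ in range(2):                                # extra word-level layers
    h = self_attention_layer(h, rng)
print(h.shape)  # (7, 512)
```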
We also extend the parser to predict part-of-speech tags in addition to constituent labels, a feature we include based on feedback from users of our previous parser. Tags are predicted using a small feed-forward network (with only one ReLU nonlinearity) after the final layer of self-attention. This differs slightly from Joshi et al. (2018), where tags are predicted from span representations instead. The tagging head is trained jointly with the parser by adding an auxiliary softmax cross-entropy loss, averaged over all words present in a given batch. We train our parser with a learning rate of 5 × 10⁻⁵ and batch size 32, with BERT parameters fine-tuned as part of training. All other hyperparameters are unchanged from Kitaev and Klein (2018) and Devlin et al. (2018).
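The tagging head is simple enough to sketch directly: a feed-forward network with a single ReLU, followed by softmax cross-entropy averaged over words. The hidden size and tag count below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def tagging_loss(h, tags, w1, b1, w2, b2):
    """Auxiliary POS-tagging loss: a one-ReLU feed-forward head on
    the final self-attention layer's word representations, with
    softmax cross-entropy averaged over all words in the batch.

    h: (num_words, d) word vectors; tags: (num_words,) gold tag ids.
    """
    hidden = np.maximum(0.0, h @ w1 + b1)           # single ReLU
    logits = hidden @ w2 + b2                       # (num_words, num_tags)
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(tags)), tags].mean()

rng = np.random.default_rng(0)
d, d_hidden, num_tags, n = 512, 256, 45, 6          # illustrative sizes
h = rng.standard_normal((n, d))
w1 = rng.standard_normal((d, d_hidden)) * 0.02
w2 = rng.standard_normal((d_hidden, num_tags)) * 0.02
tags = rng.integers(0, num_tags, size=n)
loss = tagging_loss(h, tags, w1, np.zeros(d_hidden), w2, np.zeros(num_tags))
print(float(loss))
```

In training, this scalar would simply be added to the parser's span loss so both objectives share the encoder.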

Comparison of Pre-Training Methods
In this section, we compare parsers using BERT, ELMo, and no pre-training (training from scratch on treebank data alone). Our comparison of the different methods for English is shown in Table 1. BERT with the "base" hyperparameter settings (12 layers, 12 attention heads per layer, and 768-dimensional hidden vectors) performs comparably or slightly better than ELMo (95.32 vs. 95.21 F1), while a larger version of BERT (24 layers, 16 attention heads per layer, and 1024-dimensional hidden vectors) leads to better parsing accuracy (95.70 F1). These results show that both the LSTM-based architecture of ELMo and the self-attentive architecture of BERT are viable for parsing, and that pre-training benefits from high model capacity. We did not observe a sizable difference between a version of BERT that converts all text to lowercase and a version that retains case information.
We found that pre-training on English alone outperformed multilingual pre-training given the same model capacity, but the relative decrease in error rate was less than 6% (95.24 vs. 94.97 F1). This is a promising result because it supports the idea of using joint multilingual pre-training as a way to provide support for many languages in a resource-efficient manner.

We also conduct a control experiment to tease apart the benefits of the BERT architecture and training setup from the effects of the data used for pre-training. We originally attempted to use a randomly-initialized BERT model, but found that it would not train effectively within the range of hyperparameters we tried. Instead, we trained an English parser using a version of BERT that was pre-trained on the Chinese Wikipedia. Neither the pre-training domain nor the subword vocabulary is a good fit for the target task; however, English does occur sporadically throughout the Chinese Wikipedia, and the model can represent English text losslessly: all English letters are present in its subword vocabulary, so in the worst case it will decompose an English word into its individual letters. We found that this model achieved performance comparable to a version of our parser designed to be trained on treebank data alone (93.57 vs. 93.61 F1). This result suggests that even when the pre-training data is a poor fit for the target domain, fine-tuning can still produce results comparable to purely supervised training starting with randomly-initialized parameters.
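The letter-level fallback described above follows directly from how WordPiece tokenization works: greedy longest-match-first segmentation bottoms out at single characters when no longer subword matches. The sketch below uses a toy vocabulary (an assumption for illustration, not the actual BERT vocabulary) containing only single letters, mimicking the Chinese-BERT case for English words:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization.

    If no multi-character subword matches, the word decomposes into
    single characters; hence any vocabulary containing all English
    letters can represent English text losslessly.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece prefix
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # unreachable if every letter is in vocab
        start = end
    return pieces

# toy vocabulary with no English subwords beyond single letters
letters = set("abcdefghijklmnopqrstuvwxyz")
vocab = letters | {"##" + c for c in letters}
print(wordpiece("parser", vocab))  # ['p', '##a', '##r', '##s', '##e', '##r']
```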

Results
We train and evaluate our model on treebanks for eleven languages: English (see Table 2), the nine languages represented in the SPMRL 2013/2014 shared tasks (Seddah et al., 2013) (see Table 3), and Chinese (see Table 4). For each of these languages, our parser obtains a higher F1 score than any past system we are aware of. The English and Chinese parsers use monolingual pre-training, while the parsers for the nine SPMRL languages use joint multilingual pre-training.

Conclusion
The remarkable effectiveness of unsupervised pre-training of vector representations of language suggests that future advances in this area can continue to improve the ability of machine learning methods to model syntax (as well as other aspects of language). At the same time, syntactic annotations remain a useful tool due to their interpretability, and we hope that our parsing software may be of use to others.