Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic evaluations makes it difficult for practitioners to choose between them. In this work, we conduct an extensive evaluation comparing non-contextual subword embeddings, namely FastText and BPEmb, and a contextual representation method, namely BERT, on multilingual named entity recognition and part-of-speech tagging. We find that overall, a combination of BERT, BPEmb, and character representations works best across languages and tasks. A more detailed analysis reveals different strengths and weaknesses: Multilingual BERT performs well in medium- to high-resource languages, but is outperformed by non-contextual subword embeddings in a low-resource setting.


Introduction
Rare and unknown words pose a difficult challenge for embedding methods that rely on seeing a word frequently during training (Bullinaria and Levy, 2007; Luong et al., 2013). Subword segmentation methods avoid this problem by assuming a word's meaning can be inferred from the meaning of its parts. Linguistically motivated subword approaches first split words into morphemes and then represent word meaning by composing morpheme embeddings (Luong et al., 2013). More recently, character-ngram approaches (Luong and Manning, 2016; Bojanowski et al., 2017) and Byte Pair Encoding (BPE) (Sennrich et al., 2016) have grown in popularity, likely due to their computational simplicity and language-agnosticity. 1
* Work done while at HITS.
1 While language-agnostic, these approaches are not language-independent. See Appendix B for a discussion.
Sequence tagging with subwords. Subword information has long been recognized as an important feature in sequence tagging tasks such as named entity recognition (NER) and part-of-speech (POS) tagging. For example, the suffix -ly often indicates adverbs in English POS tagging, and English NER may exploit that professions often end in suffixes like -ist (journalist, cyclist) or companies in suffixes like -tech or -soft. In early systems, these observations were operationalized with manually compiled lists of such word endings or with character-ngram features (Nadeau and Sekine, 2007). Since the advent of neural sequence tagging (Graves, 2012; Huang et al., 2015), the predominant way of incorporating character-level subword information is learning embeddings for each character in a word, which are then composed into a fixed-size representation using a character-CNN (Chiu and Nichols, 2016) or character-RNN (char-RNN) (Lample et al., 2016).
Moving beyond single characters, pretrained subword representations such as FastText, BPEmb, and those provided by BERT (see §2) have become available.
While there now exist several pretrained subword representations in many languages, a practitioner faced with these options has a simple question: Which subword embeddings should I use?
In this work, we answer this question for multilingual named entity recognition and part-of-speech tagging and make the following contributions:
• We present a large-scale evaluation of multilingual subword representations on two sequence tagging tasks;
• We find that subword vocabulary size matters and give recommendations for choosing it;
• We find that different methods have different strengths: Monolingual BPEmb works best in medium- and high-resource settings, multilingual non-contextual subword embeddings are best in low-resource languages, while multilingual BERT gives good or best results across languages.

Subword Embeddings
We now introduce the three kinds of multilingual subword embeddings compared in our evaluation: FastText and BPEmb are collections of pretrained, monolingual, non-contextual subword embeddings available in many languages, while BERT provides contextual subword embeddings for many languages in a single pretrained language model with a vocabulary shared among all languages. Table 1 shows examples of the subword segmentations these methods produce.

FastText: Character-ngram Embeddings
FastText (Bojanowski et al., 2017) represents a word w as the sum of the learned embeddings $z_g$ of its constituent character n-grams g and, in the case of in-vocabulary words, an embedding $z_w$ of the word itself: $\vec{w} = z_w + \sum_{g \in G_w} z_g$, where $G_w$ is the set of all constituent character n-grams for $3 \le n \le 6$. Bojanowski et al. provide embeddings trained on Wikipedia editions in 294 languages.
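The composition above can be illustrated with a minimal sketch. It is not the FastText implementation: the functions `char_ngrams` and `compose` and the toy embedding dictionaries are hypothetical, and wrapping the word in the boundary markers `<` and `>` follows the FastText paper rather than anything stated in this section. The released models also hash n-grams into a fixed number of buckets; a plain dictionary stands in for that here.

```python
from typing import Dict, List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return the distinct character n-grams of `word` for n_min <= n <= n_max."""
    # Boundary markers (an assumption taken from the FastText paper) distinguish
    # prefixes and suffixes from word-internal n-grams.
    marked = f"<{word}>"
    grams = [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]
    # G_w is defined as a set, so drop duplicates while keeping a stable order.
    return list(dict.fromkeys(grams))


def compose(
    word: str,
    word_vecs: Dict[str, List[float]],
    ngram_vecs: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Sum the word embedding z_w (if in vocabulary) and all known n-gram embeddings z_g."""
    # Out-of-vocabulary words start from a zero vector, i.e. they have no z_w term.
    vec = list(word_vecs.get(word, [0.0] * dim))
    for gram in char_ngrams(word):
        z_g = ngram_vecs.get(gram)
        if z_g is not None:
            vec = [a + b for a, b in zip(vec, z_g)]
    return vec
```

Because the n-gram terms do not depend on the word being in the vocabulary, an unseen word still receives a representation as long as some of its n-grams are known.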

BPEmb: Byte-Pair Embeddings
Byte Pair Encoding (BPE) is an unsupervised segmentation method which operates by iteratively merging frequent pairs of adjacent symbols into new symbols. E.g., when applied to English text, BPE merges the characters h and e into the new byte-pair symbol he, then the pair consisting of the character t and the byte-pair symbol he into the new symbol the and so on. These merge operations are learned from a large background corpus. The set of byte-pair symbols learned in this fashion is called the BPE vocabulary.
Applying BPE, i.e. iteratively performing learned merge operations, segments a text into subwords (see BPE segmentations for vocabulary sizes vs1000 to vs100000 in Table 1). By employing an embedding algorithm, e.g. GloVe (Pennington et al., 2014), to train embeddings on such a subword-segmented text, one obtains embeddings for all byte-pair symbols in the BPE vocabulary. In this work, we evaluate BPEmb (Heinzerling and Strube, 2018), a collection of byte-pair embeddings trained on Wikipedia editions in 275 languages. 3
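The merge-learning and segmentation steps described above can be sketched as follows. This is a simplified illustration, not the tooling used to build BPEmb: the function names are hypothetical, words are treated as isolated character sequences without an end-of-word marker, and ties between equally frequent pairs are broken alphabetically, which is an assumption of this sketch.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Sequence[str], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge operations from a list of words."""
    # Every word starts out as a sequence of single characters.
    words = Counter(tuple(word) for word in corpus)
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties are broken alphabetically (sketch assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged_words: Counter = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def apply_bpe(word: str, merges: List[Pair]) -> List[str]:
    """Segment `word` by replaying the learned merges in the order they were learned."""
    symbols: Tuple[str, ...] = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

On a toy corpus in which "he" and "the" are frequent, the first two learned merges reproduce the example from the text (h + e, then t + he). The number of merges determines the vocabulary size and hence how finely unseen words are split.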

BERT: Contextual Subword Embeddings
One of the drawbacks of the subword embeddings introduced above, and of pretrained word embeddings in general, is their lack of context. For example, with a non-contextual representation, the embedding of the word play will be the same both in the phrase a play by Shakespeare and the phrase to play Chess, even though play in the first phrase is a noun with a distinctly different meaning than the verb play in the second phrase. Contextual word representations (Dai and Le, 2015;Melamud et al., 2016;Ramachandran et al., 2017;Peters et al., 2018;Radford et al., 2018;Howard and Ruder, 2018) overcome this shortcoming via pretrained language models.
Instead of representing a word or subword by a lookup of a learned embedding, which is the same regardless of context, a contextual representation is obtained by encoding the word in context using a neural language model (Bengio et al., 2003). Neural language models typically employ a sequence encoder such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017). In such a model, each word or subword in the input sequence is encoded into a vector representation. With a bidirectional LSTM, this representation is influenced by its left and right context through state updates when encoding the sequence from left to right and from right to left. With a Transformer, context influences a word's or subword's representation via an attention mechanism (Bahdanau et al., 2015).
In this work we evaluate BERT (Devlin et al., 2019), a Transformer-based pretrained language model operating on subwords similar to BPE (see last row in Table 1). We choose BERT among the pretrained language models mentioned above since it is the only one for which a multilingual version is publicly available. Multilingual BERT has been trained on the 104 largest Wikipedia editions, so that, in contrast to FastText and BPEmb, many low-resource languages are not supported.
(Nivre et al., 2016). These annotations take the form of language-universal POS tags (Petrov et al., 2012), such as noun, verb, adjective, determiner, and numeral.

Sequence Tagging Architecture
Our sequence tagging architecture is depicted in Figure 1. The architecture is modular and allows encoding text using one or more subword embedding methods. The model receives a sequence of tokens as input, here Magnus Carlsen played.
After subword segmentation and an embedding lookup, subword embeddings are encoded with an encoder specific to the respective subword method. For BERT, this is a pretrained Transformer, which is finetuned during training. For all other methods we train bidirectional LSTMs. Depending on the particular subword method, input tokens are segmented into different subwords.
Here, BERT splits Carlsen into two subwords, resulting in two encoder states for this token, while BPEmb with an LSTM encoder splits this word into three. FastText (not depicted) and character RNNs yield one encoder state per token. To match subword representations with the tokenization of the gold data, we arbitrarily select the encoder state corresponding to the first subword in each token. A meta-LSTM combines the token representations produced by each encoder before classification.
Decoding the sequence of a neural model's pre-classification states with a conditional random field (CRF) (Lafferty et al., 2001) has been shown to improve NER performance by 0.7 to 1.8 F1 points (Ma and Hovy, 2016; Reimers and Gurevych, 2017) on a benchmark dataset. In our preliminary experiments on WikiAnn, CRFs considerably increased training time but did not show consistent improvements across languages. Since our study involves a large number of experiments comparing several subword representations with cross-validation in over 250 languages, we omit the CRF in order to reduce model training time.
Implementation details. Our sequence tagging architecture is implemented in PyTorch (Paszke et al., 2017). All model hyper-parameters for a given subword representation are tuned in preliminary experiments on development sets and then kept the same for all languages (see Appendix D). For many low-resource languages, WikiAnn provides only a few hundred instances with skewed entity type distributions. In order to mitigate the impact of variance from random train-dev-test splits in such cases, we report averages of n-fold cross-validation runs, with n=10 for low-resource, n=5 for medium-resource, and n=3 for high-resource languages. For experiments involving FastText, we precompute a 300d embedding for each word and update embeddings during training. We use BERT in a finetuning setting, that is, we start training with a pretrained model and then update that model's weights by backpropagating through all of BERT's layers. Finetuning is computationally more expensive, but gives better results than feature extraction, i.e. using one or more of BERT's layers for classification without finetuning (Devlin et al., 2019). For BPEmb, we use 100d embeddings and choose the best BPE vocabulary size as described in the next subsection.
Figure 2: The best BPE vocabulary size varies with dataset size. For each of the different vocabulary sizes, the box plot shows means and quartiles of the dataset sizes for which this vocabulary size is optimal, according to the NER F1 score on the respective development set in WikiAnn. E.g., the bottom, pink box records the sizes of the datasets (languages) for which BPE vocabulary size 1000 was best, and the top, blue box the dataset sizes for which vocabulary size 100k was best.
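The alignment between subword encoder states and gold tokens described in this section (keep the state of the first subword of each token) can be sketched as below. This is an illustrative sketch in plain Python rather than the PyTorch implementation used in the experiments; the function names and the toy segmenter are hypothetical, and encoder states are represented as plain lists.

```python
from typing import Callable, List, Sequence, Tuple


def segment_with_alignment(
    tokens: Sequence[str],
    segment: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Segment each token and record the position of its first subword."""
    subwords: List[str] = []
    first_index: List[int] = []
    for token in tokens:
        pieces = segment(token)
        if not pieces:
            raise ValueError(f"segmenter returned no subwords for {token!r}")
        first_index.append(len(subwords))  # position of this token's first subword
        subwords.extend(pieces)
    return subwords, first_index


def select_first_subword_states(
    states: Sequence[List[float]],
    first_index: Sequence[int],
) -> List[List[float]]:
    """Keep one encoder state per token: the state of its first subword."""
    return [states[i] for i in first_index]


# Hypothetical segmentations for illustration only; a real subword model
# may split these tokens differently.
TOY_SEGMENTS = {"Carlsen": ["Carl", "##sen"]}


def toy_segment(token: str) -> List[str]:
    return TOY_SEGMENTS.get(token, [token])
```

With the tokens Magnus, Carlsen, played and the toy segmenter, four subword states are reduced to three token-level states, so the output length matches the gold tag sequence.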

Tuning BPE
In subword segmentation with BPE, performing only a small number of byte-pair merge operations results in a small vocabulary. This leads to oversegmentation, i.e., words are split into many short subwords (see BPE vs1000 in Table 1). With more merge operations, both the vocabulary size and the average subword length increase. As the byte-pair vocabulary grows larger, it adds symbols corresponding to frequent words, so that such words are no longer split into subwords. Note, for example, that the common English preposition against is not split even with the smallest vocabulary size, or that played is split into the stem play and suffix ed with a vocabulary of size 1000, but is not split with larger vocabulary sizes. The choice of vocabulary size involves a tradeoff. On the one hand, a small vocabulary re-  Conversely, a larger BPE vocabulary tends to yield longer, more meaningful subwords, so that subword composition becomes easier, or in the case of frequent words even unnecessary, in downstream applications. However, a larger vocabulary also requires a larger text corpus for pre-training good embeddings for all symbols in the vocabulary. Furthermore, a larger vocabulary size requires more annotated data for training larger neural models and increases training time.
Since the optimal BPE vocabulary size for a given dataset and a given language is not a priori clear, we determine this hyper-parameter empirically. To do so, we train NER models with varying BPE vocabulary sizes 8 for each language and record the best vocabulary size on the language's development set as a function of dataset size (Figure 2). This data shows that larger vocabulary sizes are better for high-resource languages with more training data, and smaller vocabulary sizes are better for low-resource languages with smaller datasets. In all experiments involving byte-pair embeddings, we choose the BPE vocabulary size for the given language according to this data. 9

NER with FastText and BPEmb
In this section, we evaluate FastText and BPEmb on NER in 265 languages. As a baseline, we compare to Pan et al. (Table 3). Averaged over all languages, FastText performs 4.1 F1 points worse than this baseline. BPEmb is on par overall, with higher scores for medium- and high-resource languages, but a worse F1 score on low-resource languages. BPEmb combined with character embeddings (+char) yields the overall highest scores for medium- and high-resource languages among monolingual methods.
Word shape. When training word embeddings, lowercasing is a common preprocessing step (Pennington et al., 2014) that on the one hand reduces vocabulary size, but on the other loses information in writing systems with a distinction between upper and lower case letters. As a more expressive alternative to restoring case information via a binary feature indicating capitalized or lowercased words (Curran and Clark, 2003), word shapes (Collins, 2002; Finkel et al., 2005) map   Table 3) yields similar improvements as character embeddings.
8 We perform experiments with vocabulary sizes in {1000, 3000, 5000, 10000, 25000, 50000, 100000}.
9 The procedure for selecting BPE vocabulary size is given in Appendix C.
Since capitalization is not important in all languages, we heuristically decide whether shape embeddings should be added for a given language or not. We define the capitalization ratio of a language as the ratio of upper case characters among all characters in a written sample. As Figure 3 shows, capitalization ratios vary between languages, with shape embeddings tending to be more beneficial in languages with higher ratios. By thresholding on the capitalization ratio, we only add shape embeddings for languages with a high ratio (+someshape). This leads to an overall higher average F1 score of 85.3 among monolingual models, due to improved performance (81.9 vs. 81.5) on low-resource languages.
One NER model for 265 languages. The reduction in vocabulary size achieved by BPE is a crucial advantage in neural machine translation (Johnson et al., 2017) and other tasks which involve the costly operation of taking a softmax over the entire output vocabulary (see Morin and Bengio, 2005; Li et al., 2019). BPE vocabulary sizes between 8k and 64k are common in neural machine translation. Multilingual BERT operates on a subword vocabulary of size 100k which is shared among 104 languages. Even with shared symbols among languages, this allots at best only a few thousand byte-pair symbols to each language. Given that sequence tagging does not involve taking a softmax over the vocabulary, much larger vocabulary sizes are feasible, and as §3.2 shows, a larger BPE vocabulary is better when enough training data is available. To study the effect of a large BPE vocabulary size in a multilingual setting, we train BPE models and byte-pair embeddings with subword vocabularies of up to 1000k BPE symbols, which are shared among all languages in our evaluation. The shared BPE vocabulary and corresponding byte-pair embeddings allow training a single NER model for all 265 languages.
To do so, we first encode WikiAnn in all languages using the shared BPE vocabulary and then train a single multilingual NER model in the same fashion as a monolingual model. As the vocabulary size has a large effect on the distribution of BPE symbol lengths (Figure 4, also see §3.2) and model quality, we determine this hyper-parameter empirically (Table 4). To reduce the disparity between dataset sizes of different languages, and to keep training time short, we limit training data to a maximum of 3000 instances per language. Results for this multilingual model (MultiBPEmb) with shared character embeddings (+char) and without further finetuning (-finetune) show a strong improvement in low-resource languages (89.7 vs. 81.9 with +someshape), while performance degrades drastically on high-resource languages. Since the 188 low-resource languages in WikiAnn are typologically and genealogically diverse, the improvement suggests that low-resource languages not only profit from cross-lingual transfer from similar languages (Cotterell and Heigold, 2017), but that multilingual training brings other benefits as well. In multilingual training, certain aspects of the task at hand, such as tag distribution and BIO constraints, have to be learned only once, while they have to be learned separately for each language in monolingual training. Furthermore, multilingual training may prevent overfitting to biases in small monolingual datasets, such as skewed tag distributions. A visualization of the multilingual subword embedding space (Figure 5) gives evidence for this view. Before training, distinct clusters of subword embeddings from the same language are visible. After training, some of these clusters are more spread out and show more overlap, which indicates that some embeddings from different languages appear to have moved "closer together", as one would expect embeddings of semantically related words to do. However, the overall structure of the embedding space remains largely unchanged. The model maintains language-specific subspaces and does not appear to create an interlingual semantic space which could facilitate cross-lingual transfer.
Figure 5: Shared multilingual byte-pair embedding space pretrained (left) and after NER model training (right), 2-d UMAP projection (McInnes et al., 2018). As there is no 1-to-1 correspondence between BPE symbols and languages in a shared multilingual vocabulary, it is not possible to color BPE symbols by language. Instead, we color symbols by Unicode code point. This yields a coloring in which, for example, BPE symbols consisting of characters from the Latin alphabet are green (large cluster in the center), symbols in Cyrillic script blue (large cluster at 11 o'clock), and symbols in Arabic script purple (cluster at 5 o'clock). Best viewed in color.
Having trained a multilingual model on all languages, we can further train this model on a single language (Table 3, +finetune). This finetuning further improves performance, giving the best overall score (91.4) and an 8.8 point improvement over Pan et al. on low-resource languages (90.4 vs. 81.6). These results show that multilingual training followed by monolingual finetuning is an effective method for low-resource sequence tagging.
Table 5 shows NER results on the intersection of languages supported by all methods in our evaluation. As in §3.3, FastText performs worst overall, monolingual BPEmb with character embeddings performs best on high-resource languages (93.6 F1), and multilingual BPEmb best on low-resource languages (91.1). Multilingual BERT outperforms the Pan17 baseline and shows strong results in comparison to monolingual BPEmb. The combination of multilingual BERT, monolingual BPEmb, and character embeddings is best overall (92.0) among models trained only on monolingual NER data. However, this ensemble of contextual and non-contextual subword embeddings is inferior to MultiBPEmb (93.2), which was first trained on multilingual data from all languages collectively, and then separately finetuned to each language. Score distributions and detailed NER results for each language and method are shown in Appendix E and Appendix F.

POS Tagging in 27 Languages
We perform POS tagging experiments in the 21 high-resource (Table 6) and 6 low-resource languages (

Limitations and Conclusions
Limitations. While extensive, our evaluation is not without limitations. Throughout this study, we have used a Wikipedia edition in a given language as a sample of that language. The degree to which this sample is representative varies, and low-resource Wikipedias in particular contain large fractions of "foreign" text and noise, which propagates into embeddings and datasets.
Our evaluation did not include other subword representations, most notably ELMo (Peters et al., 2018) and contextual string embeddings (Akbik et al., 2018), since, even though they are language-agnostic in principle, pretrained models are only available in a few languages.
Conclusions. We have presented a large-scale study of contextual and non-contextual subword embeddings, in which we trained monolingual and multilingual NER models in 265 languages and POS-tagging models in 27 languages. BPE vocabulary size has a large effect on model quality, both in monolingual settings and with a large vocabulary shared among 265 languages. As a rule of thumb, a smaller vocabulary size is better for small datasets and larger vocabulary sizes are better for larger datasets. Large improvements over monolingual training showed that low-resource languages benefit from multilingual model training with shared subword embeddings. Such improvements are likely not solely caused by cross-lingual transfer, but also by the prevention of overfitting and mitigation of noise in small monolingual datasets. Monolingual finetuning of a multilingual model improves performance in almost all cases (compare -finetune and +finetune columns in Table 9 in Appendix F). For high-resource languages, we found that monolingual embeddings and monolingual training perform better than multilingual approaches with a shared vocabulary. This is likely due to the fact that a high-resource language provides large background corpora for learning good embeddings of a large vocabulary and also provides so much training data for the task at hand that little additional information can be gained from training data in other languages. Our experiments also show that even a large multilingual contextual model like BERT benefits from character embeddings and additional monolingual embeddings.
Finally, while asking the reader to bear the above limitations in mind, we make the following practical recommendations for multilingual sequence tagging with subword representations:
• Choose the largest feasible subword vocabulary size when a large amount of data is available.
• Choose smaller subword vocabulary sizes in low-resource settings.
• Multilingual BERT is a robust choice across tasks and languages if the computational requirements can be met.
• With limited computational resources, use small monolingual, non-contextual representations, such as BPEmb combined with character embeddings.
• Combine different subword representations for better results.
• In low-resource scenarios, first perform multilingual pretraining with a shared subword vocabulary, then finetune to the language of interest.
scores (middle) and each language's dataset size (bottom). Languages are sorted from left to right from highest to lowest tag distribution entropy. That is, the NER tags in WikiAnn for the language in question are well-balanced for higher-ranked languages on the left and become more skewed for lower-ranked languages towards the right. Pan et al. achieve NER F1 scores up to 100 percent on some languages, which can be explained by the highly skewed, i.e. low-entropy, tag distribution in these languages (compare F1 scores >99% in middle subfigure with skewed tag distributions in top subfigure). Better balance, i.e. higher entropy, of tag distribution tends to be found in languages for which WikiAnn provides more data (compare top and bottom subfigures).
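The tag distribution entropy used for sorting languages can be computed as in the following sketch. The function name is hypothetical, and measuring entropy in bits (logarithm base 2) is an assumption, since the base is not stated here.

```python
import math
from collections import Counter
from typing import Iterable


def tag_entropy(tags: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical tag distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tags observed: treat the entropy as zero
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

A dataset in which four tags occur equally often has an entropy of 2 bits, while a dataset containing a single tag has an entropy of 0; skewed distributions fall in between.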

B BPE and character-ngrams are not language-independent
Some methods proposed in NLP are unjustifiably claimed to be language-independent (Bender, 2011). Subword segmentation with BPE or character-ngrams is language-agnostic, i.e., such a segmentation can be applied to any sequence of symbols, regardless of the language or meaning of these symbols. However, BPE and character-ngrams are based on the assumption that meaningful subwords consist of adjacent characters, such as the suffix -ed indicating past tense in English or the copular negation nai in Japanese. This assumption does not hold in languages with non-concatenative morphology. For example, Semitic roots in languages such as Arabic and Hebrew are patterns of discontinuous sequences of consonants which form words by insertion of vowels and other consonants. For instance, words related to writing are derived from the root k-t-b: kataba "he wrote" or kitab "book". BPE and character-ngrams are not suited to efficiently capture such patterns of non-adjacent characters, and hence are not language-independent.
C Procedure for selecting the best BPE vocabulary size
We determine the best BPE vocabulary size for each language according to the following procedure.
1. For each language l in the set of all languages L and each BPE vocabulary size v ∈ V, run n-fold cross-validation with each fold comprising a random split into training, development, and test set. 12
2. Find the best BPE vocabulary size $v_l$ for each language, according to the mean evaluation score on the development set of each cross-validation fold.
3. Determine the dataset size, measured in number of instances $N_l$, for each language.
4. For each vocabulary size v, compute the median number of training instances of the languages for which v gives the maximum evaluation score on the development set, i.e. $N_v = \mathrm{median}(\{N_l \mid l \in L,\ v_l = v\})$.
5. Given a language with dataset size $N_l$, the best BPE vocabulary size $\hat{v}_l$ is the one whose $N_v$ is closest to $N_l$: $\hat{v}_l = \arg\min_{v \in V} |N_v - N_l|$.
12 V = {1000, 3000, 5000, 10000, 25000, 50000, 100000} in our experiments.
Table 9: Per-language NER F1 scores on WikiAnn.
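Steps 3 to 5 of the procedure in Appendix C can be sketched as follows, assuming that the cross-validation of steps 1 and 2 has already produced the best vocabulary size per language. The function names and input dictionaries are hypothetical, and breaking ties in favor of the smaller vocabulary is an assumption of this sketch.

```python
from statistics import median
from typing import Dict, List


def median_size_per_vocab(
    best_vocab: Dict[str, int],
    dataset_size: Dict[str, int],
) -> Dict[int, float]:
    """Step 4: median dataset size N_v of the languages for which v was best."""
    sizes: Dict[int, List[int]] = {}
    for lang, vocab in best_vocab.items():
        sizes.setdefault(vocab, []).append(dataset_size[lang])
    return {vocab: median(ns) for vocab, ns in sizes.items()}


def choose_vocab_size(n_l: int, n_v: Dict[int, float]) -> int:
    """Step 5: the vocabulary size whose N_v is closest to the dataset size N_l."""
    # Ties are broken in favor of the smaller vocabulary (sketch assumption).
    return min(n_v, key=lambda vocab: (abs(n_v[vocab] - n_l), vocab))
```

For example, if vocabulary size 1000 was best for languages with 100 and 300 instances and size 100000 for languages with 10000, 20000, and 50000 instances, a new language with 500 instances would be assigned size 1000 and one with 15000 instances size 100000.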