Byte Pair Encoding is Suboptimal for Language Model Pretraining

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.


Introduction
Large transformers (Vaswani et al., 2017) pretrained with variants of a language modeling objective, such as BERT (Devlin et al., 2019), have proven their effectiveness at flexibly transferring to a variety of domains and tasks. One design decision that makes them particularly adaptable is their graceful handling of the open vocabulary problem through subword tokenization. Subword tokenization, popularized in the neural machine translation literature (Sennrich et al., 2016; Vaswani et al., 2017; Wu et al., 2016), produces tokens at multiple levels of granularity, from individual characters to full words. As a result, rare words are broken down into a collection of subword units, bottoming out in characters in the worst case.
Critically, a pretrained language model's subword vocabulary cannot be altered: any downstream application of these models must tokenize input or generate output using the original subword vocabulary, making the choice of tokenization a particularly significant decision.
A variety of subword tokenization methods have seen use in pretrained language models. BERT uses the WordPiece method (Schuster and Nakajima, 2012), a language-modeling based variant of BPE; T5 (Raffel et al., 2019) uses character-level BPE; GPT-2 (Radford et al., 2019) and RoBERTa use BPE over raw bytes instead of unicode characters; XLNet (Yang et al., 2019) and ALBERT (Lan et al., 2019) use the SentencePiece library (Kudo and Richardson, 2018), which implements both BPE and unigram language model tokenization, but neither clarifies which of these methods was chosen. The effects of tokenization are not examined in a reported experiment in any of the above works except one, whose authors note that WordPiece gave a small advantage over BPE in their preliminary investigation. In the machine translation literature, Kudo (2018) introduced the unigram language model tokenization method and found it comparable in performance to BPE. Domingo et al. (2018) performed further experiments to investigate the effects of tokenization on neural machine translation, but used a shared BPE vocabulary across all experiments. Gallé (2019) examined algorithms in the BPE family, but did not compare them to unigram language modeling.
In this work, we characterize the space of proposed subword tokenization algorithms and analyze the differences between the two methods with publicly available implementations: BPE (merging tokens based on bigram frequency) and unigram language modeling (pruning tokens based on unigram LM perplexity). While the vocabularies resulting from these schemes are heavily overlapping, we compare each method to reference morphological segmentations and find that the unigram LM method produces tokens better aligned with morphology. To understand whether this more natural tokenization leads to improved performance, we pretrain separate language models using the RoBERTa objective with each tokenization for both English and Japanese, two typologically distant languages. On downstream tasks, we find a performance gap across tasks and languages, with the unigram LM method providing an improvement over BPE of up to 10% in our Japanese QA experiments, indicating the benefits of adopting this technique in the context of language model pretraining.

Algorithms
Subword tokenization algorithms consist of two components: a vocabulary construction procedure, which takes a corpus of text and returns a vocabulary with the desired size, and a tokenization procedure, which takes the built vocabulary and applies it to new text, returning a sequence of tokens. In theory, these two steps can be independent, although for the algorithms we examine the tokenization procedure is tightly coupled to the vocabulary construction procedure.
A BPE vocabulary is constructed as follows:

Algorithm 1 Byte-pair encoding (Sennrich et al., 2016; Gage, 1994)
procedure BPE(corpus D, vocabulary size k)
    V ← all unique characters in D
    while |V| < k do
        t_L, t_R ← most frequent token bigram in D
        t_NEW ← t_L + t_R    ▷ make new token
        V ← V + [t_NEW]
        Replace each occurrence of t_L, t_R in D with t_NEW
    end while
    return V
end procedure

BPE tokenization takes the vocabulary V containing ordered merges and applies them to new text in the same order as they occurred during vocabulary construction.
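The BPE construction loop can be sketched in a few lines of Python. This is a toy illustration with our own function and variable names, not the reference implementation; it omits the end-of-word marker and byte-level handling used by real tokenizers:

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    """Toy BPE vocabulary construction: repeatedly merge the most
    frequent adjacent token pair until the vocabulary is full."""
    # Each word starts as a tuple of characters, weighted by frequency.
    words = Counter(tuple(w) for w in corpus)
    vocab = {c for w in words for c in w}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token bigrams across the corpus.
        pairs = Counter()
        for w, n in words.items():
            for pair in zip(w, w[1:]):
                pairs[pair] += n
        if not pairs:
            break
        (l, r), _ = pairs.most_common(1)[0]
        merged = l + r
        merges.append((l, r))
        vocab.add(merged)
        # Replace every occurrence of the pair with the merged token.
        new_words = Counter()
        for w, n in words.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == l and w[i + 1] == r:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words[tuple(out)] += n
        words = new_words
    return vocab, merges
```

On the classic toy corpus of "low"/"lower"/"newest"/"widest", the first merges recover frequent units such as "es" and "est", illustrating how frequency alone drives the greedy construction.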
The WordPiece algorithm (Schuster and Nakajima, 2012), used to construct BERT's vocabulary, closely resembles BPE. However, instead of merging the most frequent token bigram, each potential merge is scored based on the likelihood of an n-gram language model trained on a version of the corpus incorporating that merge. Schuster and Nakajima (2012) note that the process of estimating language model parameters for every potential merge is prohibitive, so they employ aggressive heuristics to reduce the number of potential merges considered. As their implementation is not public, we are unable to make a comparison to this method.
The unigram LM method (Kudo, 2018), in contrast to the bottom-up construction process of BPE and WordPiece, begins with a superset of the final vocabulary, pruning it to the desired size:

Algorithm 2 Unigram LM tokenization (Kudo, 2018)
procedure UnigramLM(corpus D, vocabulary size k)
    V ← all substrings of words in D occurring more than once
    while |V| > k do
        Fit unigram LM θ to D
        for t ∈ V do
            L_t ← p_θ(D) − p_θ′(D)    ▷ estimate token 'loss', where θ′ is the LM without token t
        end for
        Remove the tokens with the smallest loss from V, retaining all single characters
    end while
    Fit final unigram LM θ to D
    return V, θ
end procedure

Unigram LM tokenization takes the vocabulary V and unigram LM parameters θ and performs Viterbi inference to decode the segmentation with maximum likelihood under θ. This method is similar to Morfessor's unsupervised segmentation (Creutz and Lagus, 2005) without its informed prior over token length.
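The Viterbi decoding step can be sketched in Python. This is a simplified illustration with our own names and a hand-built log-probability table; real implementations such as SentencePiece decode over a lattice with additional handling for unknown characters:

```python
import math

def viterbi_tokenize(text, logprob):
    """Return the segmentation of `text` with maximum total
    log-likelihood under a unigram LM given as a dict of token
    log-probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best score of any segmentation of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # start index of the last token in that segmentation
    for i in range(1, n + 1):
        for j in range(i):
            tok = text[j:i]
            if tok in logprob and best[j] + logprob[tok] > best[i]:
                best[i] = best[j] + logprob[tok]
                back[i] = j
    # Follow back-pointers to recover the token sequence.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]
```

For example, with a vocabulary in which "un" and "related" are more probable than the competing pieces, "unrelated" decodes as ["un", "related"].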
Figure 1: Example tokenizations. The character '▁' is a word boundary marker; the Japanese example translates as "Magnetism is classified in various ways." BPE merges common tokens, such as English inflectional suffixes and Japanese particles, into their neighbors even when the resulting unit is not semantically meaningful.

In the course of our experiments we did not observe a major difference in speed between the two algorithms. Both require similar amounts of time to construct a vocabulary, and both have a negligible impact on overall model inference latency.

Morphology
In Figure 1 we illustrate the differences in tokenization output between BPE and the unigram LM method. We observe that the unigram LM method produces subword units that qualitatively align with morphology much better than those produced by BPE. In particular, we note that the unigram LM method recovers common affixes such as -ly, -s, pre-, and tri-, while BPE does not, instead absorbing them into adjacent units (-cles) while also producing meaningless single-character units.
This trend is supported by counts over our pretraining corpus: we observe that recognizable affixes appear much more frequently in the unigram LM tokenization than in the BPE tokenization. As the BPE tokenization is constructed greedily according to frequency, common affixes (and punctuation) are frequently absorbed into other tokens. We see in Figure 2a that the unigram LM tokenization tends to have longer subword units than BPE. This is closer to the length distribution of gold-standard English morphs, which have a mean length of approximately 6 characters (Creutz and Lindén, 2004).

Comparison with morphological segmenters
In Table 3, we further corroborate these observations by performing a quantitative evaluation of the degree to which each unsupervised segmentation algorithm aligns with morphological baselines for each language. For English, we produce gold surface allomorph boundaries from the CELEX2 lexical database (Baayen et al., 1995) in the manner of Creutz and Lindén (2004). We then compare each algorithm's subword unit boundaries with gold morpheme boundaries for words with 2 or more morphemes, weighted by their frequency in English Wikipedia. For Japanese, we compare subword tokenizations of Japanese Wikipedia sentences to morphological reference tokenizations produced using the MeCab morphological analysis and tokenization tool (Kudo, 2006) using version 2.3.0 of the UniDic dictionary (Den et al., 2007).
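The comparison against gold segmentations amounts to a standard boundary precision/recall computation. The sketch below (our own naming, a simplification of the evaluation described above) scores one word at a time; corpus-level scores would additionally weight each word by its frequency:

```python
def boundary_prf(predicted, gold):
    """Boundary precision, recall, and F1 between two segmentations of
    the same word, given as token lists that concatenate to the same
    string."""
    def boundaries(tokens):
        # Internal boundary positions, as character offsets into the word.
        offsets, pos = set(), 0
        for t in tokens[:-1]:
            pos += len(t)
            offsets.add(pos)
        return offsets

    p, g = boundaries(predicted), boundaries(gold)
    tp = len(p & g)  # boundaries the segmenter placed correctly
    prec = tp / len(p) if p else 1.0
    rec = tp / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For instance, scoring the segmentation ["un", "happiness"] against a gold analysis ["un", "happi", "ness"] yields perfect precision but recall of 0.5, reflecting the under-segmentation of common derived words noted below.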
We find that for both languages, the segmentations produced by the unigram LM method correspond more closely to the morphological references, confirming our qualitative analysis. On English data, both unsupervised methods exhibit low boundary recall; we attribute this to the fact that they represent many common words with underlying derivational morphology as single tokens, although for BPE this is compounded by effects we discuss in Section 3.2.
The ability of the unigram LM method to recover the morphological structure of the text without explicit supervision aligns with the main findings of Creutz and Lagus (2005), who successfully use maximum-a-posteriori unigram language models to perform unsupervised morphological segmentation of English and Finnish.

Vocabulary Allocation
By surfacing subword units that align with morphology, the unigram LM tokenization provides the opportunity for the model to learn composable subword embeddings. If an affix reliably signals a linguistic feature, rather than needing to store that information redundantly across the embeddings of many tokens containing the affix, the model can store it in just the embedding of the affix.
These results suggest that the unigram LM method may allocate its vocabulary more economically. We note in Figure 2b that both vocabularies contain a "dead zone" of tokens whose frequency is much lower than the rest of the vocabulary. This is largely the result of the presence of a number of very uncommon characters, including Chinese and Japanese kanji, in the training corpus. In the BPE tokenization, however, this effect is exacerbated, with the dead zone containing about 1500 more entries as a result of the tendency of its vocabulary construction process to produce intermediate "junk" tokens. For example, in the case where three tokens almost always occur as a group, in order to merge them into a single token, BPE must first merge one pair before incorporating the third token; this leaves an intermediate token in the vocabulary that will only occur rarely on its own. Additionally, tokens that appear in many contexts, such as inflectional affixes (-s, -ed), will tend to merge with many adjacent units due to their frequency. However, these merges lead to embedding redundancy, as these affixes usually have the same linguistic function in every context. Since the unigram LM method selects tokens during vocabulary construction using a global optimization procedure, it does not produce junk tokens; this property also allows it to avoid merging frequent tokens with their neighbors too aggressively.
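The intermediate "junk" token effect is easy to reproduce with a toy greedy merge loop (illustrative code with hypothetical names, operating on a single token sequence rather than a real corpus):

```python
from collections import Counter

def greedy_merges(seq, n_merges):
    """Apply BPE-style greedy merges to a token sequence and return the
    final sequence plus every token added to the vocabulary along the way."""
    added = []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        (l, r), _ = pairs.most_common(1)[0]
        added.append(l + r)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == l and seq[i + 1] == r:
                out.append(l + r)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, added

# "a b c" always occurs as a group: to merge it into one token, BPE must
# first create the intermediate token "ab", which stays in the vocabulary
# even though it never occurs on its own in the final corpus.
seq, added = greedy_merges("a b c a b c a b c".split(), 2)
```

After two merges the vocabulary contains both "ab" and "abc", but the tokenized corpus consists only of "abc": the "ab" entry is exactly the kind of rarely-used intermediate token described above.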
Japanese vocabulary comparisons are included in Appendix B.

Table 4: Fine-tuning results. Metrics are averaged across 5 fine-tuning seeds with standard deviations indicated by ±; due to computational constraints we did not pretrain more than once per tokenization. We include fine-tuning results for a transformer with a comparable architecture, BERT-BASE, for reference, although we note that a direct comparison cannot be made because BERT-BASE uses both a larger pretraining corpus and a larger subword vocabulary.

Downstream Task Experiments
In order to make a fair experimental comparison between these two methods on downstream tasks, we do not use an existing pretrained language model like BERT, but instead train our own language models from scratch, controlling for the data, training objective, and optimization procedure. We pretrain four transformer masked language models using the architecture and training objective of RoBERTa-BASE, using the reference fairseq implementation. Two are pretrained on the text of English Wikipedia, comprising ∼3B tokens under either tokenization. The other two are pretrained on the text of Japanese Wikipedia, comprising ∼0.6B tokens. In each pair, one model is pretrained on the BPE tokenization of the corpus, and the other on the unigram LM tokenization, each with a vocabulary of 20,000 tokens. Hyperparameters are listed in Appendix A. We subsequently fine-tune each of the pretrained English models on the SQuAD question-answering task (Rajpurkar et al., 2016), the MNLI textual entailment task (Williams et al., 2018), and the English portion of the CoNLL 2003 named-entity recognition shared task (Tjong Kim Sang and De Meulder, 2003). We fine-tune the Japanese models on the Japanese minimal-answer subset of the TyDi question-answering task (Clark et al., 2020). We base our fine-tuning implementations on those of the transformers toolkit (Wolf et al., 2019).
The results of our fine-tuning experiments are presented in Table 4. We show that fine-tuning models pretrained with unigram LM tokenization produces better performance than fine-tuning models pretrained with BPE tokenization for all tasks. These results suggest that the higher morphological plausibility of the unigram LM tokenization may translate into better downstream task performance as well. Larger performance gaps are evident on SQuAD and MNLI, but the largest gap appears on Japanese TyDi. Differences in pretraining may be more evident in this setting due to the fact that the Japanese portion of the TyDi training split only contains ∼5k examples, compared to the ∼88k examples available for fine-tuning on SQuAD. Additionally, written Japanese does not feature whitespace between words, so it is possible for tokenizations to differ in word boundary placement as well as subword segmentation.

Conclusion
In this work we show that the choice of input encoding makes a difference in how well pretrained language models are able to perform end tasks. This indicates that tokenization encodes a surprising amount of inductive bias, and we suggest that unigram LM tokenization may be the better choice for development of future pretrained models.