CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.


Introduction
Pre-trained language representations from Transformers (Vaswani et al., 2017) have arguably become the most popular choice for building NLP systems. Among all such models, BERT (Devlin et al., 2019) has probably been the most successful, spawning a large number of improved variants (Lan et al., 2019; Zhang et al., 2019; Clark et al., 2020). As a result, many recent language representation models inherited BERT's subword tokenization system, which relies on a predefined set of wordpieces, supposedly striking a good balance between the flexibility of characters and the efficiency of full words.
While current research mostly focuses on improving language representations for the default "general domain", there seems to be a growing interest in building suitable word embeddings for more specialized domains (El Boukkouri et al., 2019; Si et al., 2019; Elwany et al., 2019). However, with the growing complexity of recent representation models, the default trend seems to favor re-training general-domain models on specialized corpora rather than building models from scratch with a specialized vocabulary (e.g., BlueBERT (Peng et al., 2019) and BioBERT (Lee et al., 2020)). While these methods undeniably produce good models, a few questions remain: How suitable are the predefined general-domain vocabularies when used in the context of specialized domains (e.g., the medical domain)? Is it better to train specialized models with specialized subword units? Do we induce any biases by training specialized models with general-domain wordpieces?
In this paper, we propose CharacterBERT, a possible solution for avoiding any biases that may come from the use of a predefined wordpiece vocabulary, and an effort to return to conceptually simpler word-level models. This new variant does not rely on wordpieces but instead consults the characters of each token to build representations, similarly to previous word-level open-vocabulary systems (Luong and Manning, 2016; Kim et al., 2016; Jozefowicz et al., 2016). In practice, we replace BERT's wordpiece embedding layer with ELMo's (Peters et al., 2018) Character-CNN module while keeping the rest of the architecture untouched. As a result, CharacterBERT is able to produce word-level contextualized representations and does not require a wordpiece vocabulary. Furthermore, this new model seems better suited than vanilla BERT for training specialized models, as evidenced by an evaluation on multiple tasks from the medical domain. Finally, as expected from a character-based system, CharacterBERT also appears to be more robust to noise and misspellings. To the best of our knowledge, this is the first work that replaces BERT's wordpiece system with a word-level character-based system.
Our contributions are the following:
• We provide preliminary evidence that general-domain wordpiece vocabularies are not suitable for specialized domain applications.
• We propose CharacterBERT, a new variant of BERT that produces word-level contextual representations by consulting characters.
• We evaluate CharacterBERT on multiple specialized medical tasks and show that it outperforms BERT without requiring a wordpiece vocabulary.
• We exhibit signs of improved robustness to noise and misspellings in favor of CharacterBERT.
• We enable the reproducibility of our experiments by sharing our pre-training and fine-tuning code. Furthermore, we also share our pre-trained representation models to benefit the NLP community.
This work has only focused on the English language and the medical (clinical and biomedical) domain. The generalization to other languages and specialized domains is left to future work.

General-Domain Wordpieces in Specialized Domains
Since many specialized versions of BERT come from re-training the original model on a set of specialized texts, we carry out a couple of preliminary experiments to gauge the effect of using a general-domain wordpiece vocabulary in a specialized domain. Here we focus on the medical domain, for which we learn a new wordpiece vocabulary using MIMIC-III clinical notes (Johnson et al., 2016) and PMC OA biomedical article abstracts. We then process a sample (1M tokens) of the medical corpus with either the medical vocabulary or BERT's original vocabulary and examine the difference. Looking at the frequency of splitting an unknown token into multiple wordpieces (cf. Figure 1), we see that the medical vocabulary produces fewer wordpieces overall than the general version, both at the occurrence and type levels. Moreover, we see that ≈ 13% of occurrences are left whole by the medical vocabulary, since they are already part of it, but are decomposed into two or more wordpieces by the general vocabulary.
When looking closer at the quality of the produced wordpieces (cf. Table 1), we see that in addition to producing fewer subwords, the specialized vocabulary also seems to produce more meaningful units (e.g., "choledoch" and "olithiasis"). These preliminary analyses show that the choice of a vocabulary affects the quality of the tokenization, which may in turn induce biases in downstream applications of the representation model. To avoid such biases, and in an effort to return to more convenient and conceptually simpler word-level models, we propose CharacterBERT, a wordpiece-free variant of BERT.
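To make the effect of the vocabulary concrete, the following is a minimal sketch of greedy longest-match-first wordpiece splitting, the matching strategy commonly used with wordpiece vocabularies. Both vocabularies below are toy sets invented for illustration: the medical one contains the two units quoted above, while the general one is hypothetical and does not reproduce the actual output of BERT's vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies (assumptions made for this example only).
general_vocab = {"cho", "##led", "##och", "##oli", "##thi", "##asi", "##s"}
medical_vocab = {"choledoch", "##olithiasis"}

word = "choledocholithiasis"
general_pieces = wordpiece_tokenize(word, general_vocab)  # 7 pieces
medical_pieces = wordpiece_tokenize(word, medical_vocab)  # 2 pieces
print(general_pieces)
print(medical_pieces)
```

With these toy sets, the same word becomes seven short fragments under the general vocabulary and two longer, more interpretable units under the medical one.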

CharacterBERT
CharacterBERT is similar in every way to vanilla BERT but uses a different method to construct initial context-independent representations: while the original model consults its vocabulary to split unknown tokens into multiple wordpieces and then embeds each unit independently using a wordpiece embedding matrix, CharacterBERT uses a Character-CNN module (Peters et al., 2018; Jozefowicz et al., 2016) which consults the characters of a token to produce a single representation (see Figure 2).
Figure 2: Comparison of the context-independent representation systems in BERT and CharacterBERT. In this illustration, BERT splits the word "Apple" into two wordpieces then embeds each unit separately. CharacterBERT produces a single embedding for "Apple" by consulting its sequence of characters.

Character-CNN: Building Word Representations From Characters
We use the Character-CNN that is implemented as part of ELMo's architecture. This module constructs context-independent token representations through the following steps:
1. Each token is converted into a sequence of characters with a maximum sequence length of 50.
2. A lookup is performed for each character, producing a sequence of 16-d embeddings.
3. The character embedding sequence is fed to multiple 1-d CNNs (LeCun et al., 1989) with different filters. The output of each CNN is then max-pooled across the character sequence and concatenated with other CNN outputs to produce a single representation.
4. The CNN representation then goes through two Highway layers (Srivastava et al., 2015) that apply non-linearities with residual connections before being projected down to a final embedding size which we chose to be coherent with BERT's 768-dimensional wordpiece representations.
As with BERT, we add the token embedding (here, the Character-CNN representation) to position and segment embeddings before feeding the resulting context-independent representation to several Transformer layers. Since CharacterBERT does not split tokens into wordpieces, each input token is assigned a single final contextual representation by the model.
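The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not ELMo's implementation: the filter widths and counts, the 12-d output (standing in for 768), the use of UTF-8 bytes as characters, and the random weights are all choices made for this example, and biases as well as word-boundary markers are omitted.

```python
import math
import random

rng = random.Random(0)

CHAR_DIM, MAX_CHARS = 16, 50        # values stated in the text
PAD = 256                           # extra index used for padding (assumption)
FILTERS = [(1, 4), (2, 4), (3, 8)]  # (width, count): toy values (assumption)
OUT_DIM = 12                        # toy stand-in for the 768-d output


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


char_table = rand_matrix(257, CHAR_DIM)  # 256 byte values + padding
conv_weights = [rand_matrix(n, width * CHAR_DIM) for width, n in FILTERS]
cnn_dim = sum(n for _, n in FILTERS)
highways = [(rand_matrix(cnn_dim, cnn_dim), rand_matrix(cnn_dim, cnn_dim))
            for _ in range(2)]
projection = rand_matrix(OUT_DIM, cnn_dim)


def embed_token(token):
    # Step 1: token -> fixed-length sequence of character (byte) ids.
    ids = list(token.encode("utf-8"))[:MAX_CHARS]
    ids += [PAD] * (MAX_CHARS - len(ids))
    # Step 2: look up a 16-d embedding for each character.
    chars = [char_table[i] for i in ids]
    # Step 3: 1-d convolutions, max-pooled over positions, then concatenated.
    features = []
    for (width, _), weights in zip(FILTERS, conv_weights):
        windows = [sum(chars[p:p + width], [])
                   for p in range(MAX_CHARS - width + 1)]
        outputs = [matvec(weights, window) for window in windows]
        features += [max(0.0, max(column)) for column in zip(*outputs)]
    # Step 4: two highway layers (gated mix of transform and input),
    # followed by a linear projection to the final embedding size.
    for w_transform, w_gate in highways:
        transform = [max(0.0, t) for t in matvec(w_transform, features)]
        gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_gate, features)]
        features = [g * t + (1.0 - g) * x
                    for g, t, x in zip(gate, transform, features)]
    return matvec(projection, features)


vector = embed_token("Apple")
print(len(vector))
```

Whatever the length or spelling of the input token, the function returns one fixed-size vector, which is what allows the model to assign a single representation to each word.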

Pre-training Procedure
Like BERT, our model is pre-trained on two tasks: a Masked Language Modelling task (MLM) and a Next Sentence Prediction task (NSP). The only difference lies in the MLM task where instead of predicting single wordpieces, we predict entire words. This natural consequence of handling words instead of wordpieces is somewhat related to recent work on Whole Word Masking which has been shown to improve the quality of BERT models 8 (Cui et al., 2019).

Experiments
We compare BERT and CharacterBERT on multiple medical tasks to evaluate the impact of using a Character-CNN module instead of wordpieces. In an attempt to dissociate this impact from any other effects that may be related to training models in our own specific settings, we train each CharacterBERT model alongside a BERT counterpart in the exact same conditions.

Model Settings
We base our models on the "base-uncased" version of BERT, which uses 12 Transformer layers with 12 attention heads and produces 768-d representations from uncased texts. This version has ≈ 109.5M parameters and the corresponding CharacterBERT architecture has ≈ 104.6M parameters. It is interesting to note that using a Character-CNN actually results in a smaller overall model despite using a seemingly complex character module. This is because BERT's wordpiece matrix has ≈ 30K × 768-d vectors while CharacterBERT uses smaller 16-d character embeddings with mostly small-sized CNNs.
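The size comparison can be checked with a few lines of arithmetic. In this sketch, 30,522 is the vocabulary size of the publicly released BERT (base, uncased), which the text rounds to ≈ 30K; the implied size of the Character-CNN is a rough estimate that assumes the two models differ only in their token embedding layer.

```python
HIDDEN = 768
WORDPIECE_VOCAB = 30_522  # released BERT (base, uncased) vocabulary size

# Parameters in BERT's wordpiece embedding matrix.
wordpiece_matrix = WORDPIECE_VOCAB * HIDDEN

# Totals reported in the text (approximate).
bert_total = 109_500_000
character_bert_total = 104_600_000

# Assumption: everything except the token embedding layer is identical,
# so the gap between the totals equals (wordpiece matrix - Character-CNN).
implied_char_cnn = wordpiece_matrix - (bert_total - character_bert_total)

print(f"wordpiece matrix:      {wordpiece_matrix:,}")
print(f"implied Character-CNN: {implied_char_cnn:,}")
```

Under that assumption the wordpiece matrix holds about 23.4M parameters and the Character-CNN about 18.5M, which accounts for the ≈ 4.9M gap between the two models.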
We pre-train four different models to simulate the usual situation where BERT is first pre-trained on a general corpus before being re-trained on a set of specialized texts:
BERT general: a general-domain model obtained by pre-training BERT on a general corpus. It uses the same architecture and wordpiece vocabulary as BERT (base, uncased).
CharacterBERT general: a general-domain model obtained by training CharacterBERT on a general corpus. Besides the Character-CNN, it uses the same architecture as BERT general.
BERT medical: a medical model obtained by re-training BERT general on a medical corpus.
CharacterBERT medical: a medical model obtained by re-training CharacterBERT general on a medical corpus. This is the Character-CNN analog of BERT medical.

Corpora
The original BERT was pre-trained on English Wikipedia and BooksCorpus (Zhu et al., 2015). Since the latter is not publicly available anymore, we replace it with OpenWebText (Gokaslan and Cohen, 2019) to train our general-domain models. We also build a specialized corpus from MIMIC-III and PMC OA abstracts to train our medical-domain models (see Table 2).

Pre-training Setup
We train each model using 16 Tesla V100-SXM2-16GB GPUs, following the implementation and parameters of the NVIDIA codebase. Each complete pre-training phase consists of two steps:
Step 1: 3,519 updates with a batch size of 8,192 and a learning rate of 6 × 10⁻³ on sequences of size 128.
Step 2: 782 updates with a batch size of 4,096 and a learning rate of 4 × 10⁻³ on sequences of size 512.
All models are optimized using LAMB (You et al., 2019) with a warm-up rate and weight decay of 0.01.
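As a rough sense of scale, the two-step schedule can be written down as a small configuration and used to count how many sequences each phase consumes. The class and field names below are hypothetical, and the batch sizes are assumed to be global (summed over all GPUs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainStep:
    updates: int
    batch_size: int       # assumed to be the global batch size
    learning_rate: float
    seq_len: int

    @property
    def sequences(self):
        return self.updates * self.batch_size

    @property
    def max_tokens(self):
        # Upper bound: shorter sequences are padded up to seq_len.
        return self.sequences * self.seq_len


STEPS = [
    PretrainStep(updates=3519, batch_size=8192, learning_rate=6e-3, seq_len=128),
    PretrainStep(updates=782, batch_size=4096, learning_rate=4e-3, seq_len=512),
]

total_sequences = sum(step.sequences for step in STEPS)
total_max_tokens = sum(step.max_tokens for step in STEPS)
print(f"{total_sequences:,} sequences, at most {total_max_tokens:,} token positions")
```

Under these assumptions, a full pre-training run sees about 32M sequences, most of them in the short-sequence first step.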

Tasks
All models are evaluated on five medical tasks after adding task-specific layers (Devlin et al., 2019).

Natural Language Inference
We also evaluate on the clinical natural language inference task MEDNLI (Romanov and Shivade, 2018) that aims to classify sentence pairs into three categories: CONTRADICTION, ENTAILMENT, and NEUTRAL.

Sentence Similarity
Finally, we also evaluate our models on the clinical sentence similarity task ClinicalSTS (Wang et al., 2018a) from BioCreative/OHNLP Challenge 2018, Task 2 (Wang et al., 2018b). The goal here is to produce similarity scores for sentence pairs that correlate with the gold standard.

Relation Classification
We provide examples for each task in Figure 3 and report the number of examples in Table 3.

Evaluation Setup
Given all the pre-trained models, the evaluation tasks, and a set of random seeds i ∈ {1, ..., 10}:
1. We choose a pre-trained model, an evaluation task, and a random seed i, then run 15 training epochs with batches of size 32.
2. At each epoch, we evaluate the model on a validation set that is either given or computed as 20% of the training set. According to the validation performance, we save the best model.
3. After completing all training epochs, we load the best model and evaluate it on the test set.
4. We repeat the whole process for all seeds to compute a final performance as mean ± std.
In addition to being useful for measuring model variability, fine-tuning 10 versions for each model also enables us to build ensembles. In fact, by using a majority voting strategy, we are able to combine the predictions from each seed into a single ensemble model. In practice we do not use all seeds at once: we exclude a single seed, build an ensemble, then repeat this process to get 10 ensembles for each model setting, which can be used to compute a final ensemble performance as mean ± std. All fine-tuning experiments are run on a single Tesla V100-PCIE-32GB and are optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3e-5, a warm-up ratio of 0.1, and a weight decay of 0.1.
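The leave-one-seed-out ensembling described above can be sketched as follows for classification tasks. The function names, the toy labels, and the tie-breaking rule (the label seen first in seed order wins) are assumptions made for this example; only four seeds are used to keep it short.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(predictions):
    """Combine per-model label lists (same length) into one label list.

    Ties are broken in favor of the label seen first in seed order.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def leave_one_out_ensembles(seed_predictions):
    """Yield one ensemble per excluded seed."""
    for held_out in range(len(seed_predictions)):
        kept = seed_predictions[:held_out] + seed_predictions[held_out + 1:]
        yield majority_vote(kept)


def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy predictions from four fine-tuned seeds on four test examples.
gold = ["E", "N", "C", "E"]
seeds = [
    ["E", "N", "C", "N"],
    ["E", "C", "C", "E"],
    ["N", "N", "C", "E"],
    ["E", "N", "E", "E"],
]

single_scores = [accuracy(s, gold) for s in seeds]
ensemble_scores = [accuracy(e, gold) for e in leave_one_out_ensembles(seeds)]
print(f"single:   {mean(single_scores):.2f} ± {pstdev(single_scores):.2f}")
print(f"ensemble: {mean(ensemble_scores):.2f} ± {pstdev(ensemble_scores):.2f}")
```

In this toy case every individual seed makes one mistake, yet each ensemble recovers the gold labels, since the seeds err on different examples.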

Pre-training
Using the setup detailed in Section 4.2.2, pre-training a single model through Steps 1 and 2 takes around 26.5 hours for BERT and 55 hours for CharacterBERT, even though both architectures have about the same number of parameters. This large gap in pre-training speed is partly due to the Character-CNN being slower to train, as it is more complex than the original wordpiece embedding matrix. However, the main reason for the slower pre-training is that we cannot use a specific trick during Masked Language Modelling: BERT shares the parameters of its wordpiece embedding matrix with the MLM output layer, which allows it to train faster. In our case, since we do not use wordpieces, we build a temporary vocabulary from the top 100K tokens in the training corpus and use them as targets for MLM. We expect that improved pre-training speed can be achieved using Noise Contrastive Estimation (Mnih and Kavukcuoglu, 2013) or similar methods. However, such optimizations are left for future work.
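A temporary word-level target vocabulary of this kind can be sketched as follows. The function names are hypothetical, and mapping out-of-vocabulary words to a dedicated unknown index is an assumption of this example rather than a detail confirmed by the text; a vocabulary of size 3 stands in for the top 100K tokens.

```python
from collections import Counter


def build_mlm_vocab(corpus_tokens, size, unk="[UNK]"):
    """Map the `size` most frequent tokens to output-layer indices."""
    vocab = {unk: 0}
    for token, _ in Counter(corpus_tokens).most_common(size):
        vocab[token] = len(vocab)
    return vocab


def mlm_targets(tokens, masked_positions, vocab, unk="[UNK]"):
    """Return the output index to predict for each masked word."""
    return [vocab.get(tokens[i], vocab[unk]) for i in masked_positions]


corpus = "the patient the patient the patient the fever fever cough".split()
vocab = build_mlm_vocab(corpus, size=3)

sentence = ["the", "cough", "fever"]
targets = mlm_targets(sentence, masked_positions=[1, 2], vocab=vocab)
print(vocab)
print(targets)
```

Because the output layer now has its own word-level vocabulary, it cannot reuse the input embedding parameters, which is the tying trick that BERT relies on.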

Fine-tuning
In addition to pre-training speed, we also report the fine-tuning speed both at training and inference time. Figure 4 shows that CharacterBERT is at much less of a disadvantage when it comes to fine-tuning (19% slower on average instead of 108%). Moreover, in the specific case of the DDI task, CharacterBERT is actually 14% faster than BERT. This exceptional behavior may be due to the presence of many domain-specific terms that are split into multiple wordpieces, thus increasing the input size with BERT. In fact, since our model works at the word level, the input size is stable and data batches may be processed faster than with BERT. At inference time, CharacterBERT is slightly faster than BERT as the Character-CNN is not as slow during inference as it is during optimization.

Reproducing Vanilla Models
We report the performance of BERT (base, uncased) as well as BlueBERT (base, uncased) (Peng et al., 2019), a medical model pre-trained on MIMIC-III and PubMed abstracts. Including these results allows us to evaluate the quality of our pre-training procedure. Figure 5 shows that BERT general performs slightly worse than the original BERT despite using exactly the same architecture. However, this difference is small and can be attributed either to the different general-domain corpora (OpenWebText instead of BooksCorpus) or to differences in pre-training parameters (number of updates, batch size, etc.). Moreover, we see that BERT medical performs at the same level as BlueBERT, sometimes outperforming the latter substantially (≈ +4 F1 on ChemProt), which allows us to safely assume that our pre-training procedure is correct.

Ensembles and Model Selection
We can see from Figure 5 that ensembles (orange bars) clearly improve over single models (blue bars). While not surprising per se, it is worth noting that these ensembles were produced using a naive majority voting strategy which can easily be applied as a post-processing step. Moreover, we see that the test results of the best validation model (red symbol) are always below those of the ensembles. Finally, we note that ensembles have substantially lower variances compared to single models, which makes them more reliable for comparisons.
Figure 5: Comparison of pre-trained models when fine-tuned on several medical tasks. For each model, the test performance of 10 random seeds is expressed as mean ± std and is shown in blue for single models and orange for ensembles. The performance of the best validation seed is shown in red.

BERT vs. CharacterBERT: How Significant Is the Difference?
Figure 5 shows that CharacterBERT often improves over a vanilla BERT. In particular, our medical model improves over the ensemble performance of BERT medical by ≈ 1.5 points on ChemProt, ≈ 2 points on DDI, and ≈ 0.5 points on MEDNLI and i2b2. However, we see that CharacterBERT medical performs worse than BERT in the specific case of ClinicalSTS and suffers from a surprisingly high variance. Since the ClinicalSTS dataset is also very small compared to the other datasets, these results should be taken with care even if the difference with BERT seems to be significant according to Figure 6. Results with general-domain models seem to also be in favor of CharacterBERT. However, these differences may not be substantial.
To provide a more rigorous evaluation of the statistical significance of our results, we perform Almost Stochastic Order tests (ASO) (Dror et al., 2019) for each pair of models. ASO tests aim to determine whether a stochastic order exists between two models based on their respective sets of evaluation scores. In practice, given the 10 single model scores of two chosen models A and B, the method computes a test-specific value that indicates how far model A is from being significantly better than model B. This distance is equal to 0 when model A is stochastically dominant over model B, 1 when B is dominant over A, and 0.5 when no order can be determined. Figure 6 shows these values for all model pairs on each task. Looking at the average significance matrix, we can see that CharacterBERT general improves over its BERT counterpart (cell [d,c]). Moreover, we see that the overall best model is CharacterBERT medical as evidenced by the bottom blue row (cells [f,a] to [f,e]), which further validates that our model indeed improves over vanilla BERT.

Robustness to Noise and Misspellings
We want to investigate whether CharacterBERT is more robust to noise and misspellings than BERT. For that purpose, we create noisy versions of the MEDNLI corpus where, given a noise level of X%, each token is transformed, with the same probability, into a misspelled version through character-level edits such as removing or adding a character.
Figure 7 shows the results for BERT medical and CharacterBERT medical with various noise levels. We see that the latter is indeed more robust to misspellings, as evidenced by the slower decrease in performance compared to BERT. In particular, when a noise level of 40% is applied to the test set only, CharacterBERT is ≈ 5 F1 higher than BERT whereas the original difference between the two models was < 1 F1. Experiments adding noise to all splits show that both models can learn to be more robust; however, CharacterBERT remains at an advantage.
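A noise-injection procedure of this kind can be sketched as follows. Only character removal and insertion are covered here, any further edit operations are left out, and the function names, the equal split between the two edits, and the lowercase-letter alphabet are assumptions made for this example.

```python
import random
import string


def misspell(token, rng):
    """Apply one random character-level edit: remove or add a character."""
    if len(token) > 1 and rng.random() < 0.5:
        i = rng.randrange(len(token))
        return token[:i] + token[i + 1:]          # remove one character
    i = rng.randrange(len(token) + 1)
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i:]  # add one


def add_noise(tokens, noise_level, rng):
    """Misspell each token independently with probability `noise_level`."""
    return [misspell(t, rng) if rng.random() < noise_level else t
            for t in tokens]


rng = random.Random(0)
tokens = "the patient denies chest pain or shortness of breath".split()
clean = add_noise(tokens, 0.0, rng)   # noise level 0%: nothing changes
noisy = add_noise(tokens, 0.4, rng)   # noise level 40%
print(" ".join(noisy))
```

A wordpiece model will typically split such misspelled tokens into unfamiliar fragments, whereas a character-based encoder still sees most of the original character sequence.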

Discussion and Future Work
Overall, CharacterBERT seems to either perform at the same level or improve over BERT. This is especially true for the specialized versions and is further validated by the ASO tests. The new variant also seems to be more robust to misspellings while at the same time producing word-level open-vocabulary representations. This improved robustness is desirable since BERT seems to be sensitive to misspellings (Pruthi et al., 2019; Sun et al., 2020). On the downside, CharacterBERT is slower to pre-train, although not as slow to fine-tune and even slightly faster at inference time. Future work may apply a Character-CNN to recent Transformer-based models (Lan et al., 2019), optimize the pre-training architecture to improve its speed, or explore any other advantages of a character-level system over wordpieces.

Conclusion
The overall strategy when building specialized versions of BERT seems to be re-training the original model on a specialized corpus. This implies keeping a general-domain wordpiece vocabulary that may not be suited for the domain of interest. Our main contribution is CharacterBERT, a variant of BERT that drops the wordpiece system altogether in favor of a Character-CNN. This module represents tokens by consulting their characters, allowing our model to produce word-level open-vocabulary representations. We evaluate CharacterBERT and show that it globally outperforms BERT when specialized for the medical domain while at the same time being more robust to misspellings.