Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words

The BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, several research challenges remain poorly addressed for the domain-specific corpora found in industry. In this paper, we highlight these problems through detailed experiments that analyze the attention scores and dynamic word embeddings of the BERT-Base-Uncased model. Our experiments led to three findings: 1) the largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit, which can lead to suboptimal tokenization choices, 2) the semantic meaning of a vocabulary word deteriorates when it occurs as a substring in an Out-Of-Vocabulary (OOV) word, and 3) minor misspellings in words are inadequately handled. We believe that tackling these challenges will significantly help the domain adaptation aspect of BERT.


Introduction
BERT is one of the prominent models used for a variety of NLP tasks. With the Masked Language Model (MLM) method, it has been successful at leveraging bidirectionality while training the language model. The BERT-Base-Uncased model has 12 encoder layers, with each layer consisting of 12 self-attention heads. The word representations are context-dependent, 768-dimensional dynamic embeddings. To leverage what such pre-trained networks have learned, fine-tuning is commonly done when building NLP applications in industry. The BERT-Base-Uncased vocabulary has a size of 30522 with only 994 unused slots (in comparison, BERT-Base-Cased has only 101 unused slots). While the unused slots in the vocabulary can be used to include domain-specific words, the representations of these words have to be fine-tuned on a domain-specific corpus before they can be utilized. Hence, it is essential that the tokenization algorithm handles domain-specific OOV words well.
BERT relies on the WordPiece algorithm (Schuster and Nakajima, 2012) to create the vocabulary, choosing the sub-word units that maximise the language model likelihood. However, tokenization with this vocabulary is not done semantically. This leads to poor tokenization and a loss of semantic information when dealing with OOV words in domain-centric downstream tasks. While the largest substring tokenization problem can be alleviated to a large extent by integrating recent algorithms such as BPE-Dropout (Provilkov et al., 2019) or SentencePiece (Kudo and Richardson, 2018), which use frequency and/or language model based tokenization, the remaining aforementioned challenges still persist. In the following section, we discuss these challenges with respect to two categories: Tokenization and Sub-word representations.

Experiments
We performed the experiments with the pretrained BERT-Base-Uncased model without any domain-specific fine-tuning, as the examples were chosen to highlight the challenges with BERT across various domains. The dynamic embeddings of the 12th encoder layer were used for all the analysis tasks. If a word was OOV, the average of the embeddings of its sub-word units was taken as its embedding. Otherwise, the embedding of the word was used as it is. In all cases, we ignore the [CLS] and [SEP] embeddings while computing the embedding of a particular word. The ## counterpart of a word is the word prefixed with ##. For example, the ## counterpart of the word active is ##active. We also made a subtle change to the tokenizer so that it leaves untouched any word beginning with ##.
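The embedding rule described above can be illustrated with a minimal sketch. The function names and the 3-dimensional toy vectors are assumptions made for illustration; in the experiments the vectors would be the 768-dimensional outputs of the 12th encoder layer.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_embedding(tokens, token_vectors, special=("[CLS]", "[SEP]")):
    """Average the vectors of all tokens of one word, skipping [CLS] and [SEP]."""
    kept = [vec for tok, vec in zip(tokens, token_vectors) if tok not in special]
    return mean_vector(kept)

# Toy vectors (invented for illustration, not real BERT outputs).
tokens = ["[CLS]", "##sat", "##ura", "##ted", "[SEP]"]
vectors = [
    [9.0, 9.0, 9.0],  # [CLS], ignored
    [1.0, 0.0, 0.0],  # ##sat
    [0.0, 1.0, 0.0],  # ##ura
    [1.0, 1.0, 0.0],  # ##ted
    [9.0, 9.0, 9.0],  # [SEP], ignored
]
oov_embedding = word_embedding(tokens, vectors)
print(oov_embedding)
```

For an in-vocab word the same function applies unchanged, since the average of a single vector is the vector itself.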

Tokenization problems
BERT always picks the largest substring from the left that is in-vocab at every sub-word unit of the tokenized output. While this performs reasonably well for words where the root (or stem) is suffixed, prefixed words are vulnerable to poor tokenization.
Taking deconstructed, deactivated and unequal as examples, even though the vocabulary had the prefixes de and un as well as the words constructed, activated and equal, the tokenizer chose the substrings deco, dea and une (see Table 1). In comparison, since SentencePiece is a likelihood-based tokenization algorithm, it generated better tokenizations (deconstructed: de, con, struct, ed; deactivated: de, activated; unequal: unequal). We believe that if the BERT tokenizer correctly separated the prefixes while the model was being trained, the model could learn better representations for the prefixes as well as the sub-word units, since the attention mechanism would understand the influence of the different categories of prefixes. Further, Section 2.2 shows how poor tokenization can lead to weaker semantic representations for a word.
Domain-specific corpora often contain a large amount of jargon that is frequently misspelled. Taking the in-vocab word cabbage as an example, ccabbage, cababge and cabbagee were chosen as the misspelled versions. The cosine similarities of cabbage with ccabbage, cababge and cabbagee were 0.33, 0.44 and 0.63 respectively. To verify that the low cosine similarity scores for the misspelled versions were not due to a lack of surrounding context, we checked the cosine similarity score between cabbage and onion (in-vocab) and found it to be 0.88.
To analyze the extent of this problem and the impact of the position of the error in the word on the tokenization, we chose the TOEFL-Spell corpus, which contains over 6000 common spelling errors. We took the intersection of the words in the TOEFL-Spell corpus and the words in the BERT vocabulary. The corpus was segregated depending on whether the spelling error occurred within the starting 33% of the letters of the word, in the middle or at the end. As we can see in Table 2, because of the largest substring problem in the BERT tokenizer, a spelling error earlier in the word is more harmful, as it makes the subsequent sub-word tokenization choices suboptimal.
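The greedy behaviour discussed above can be reproduced with a small longest-match-first tokenizer. This is a sketch under stated assumptions: TOY_VOCAB is a hypothetical vocabulary, and the continuation pieces such as ##nstructed are invented so that the toy run terminates; they are not BERT's actual output.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first tokenization of a single word.

    At each position the longest in-vocab substring starting there is taken.
    Pieces that do not start the word are looked up with a '##' prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk]  # no piece matches at this position
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (not the real BERT vocabulary).
TOY_VOCAB = {
    "de", "deco", "dea", "un", "une",
    "constructed", "activated", "equal",
    "##constructed", "##activated", "##equal",
    "##nstructed", "##ctivated", "##qual",
}

for w in ("deconstructed", "deactivated", "unequal"):
    print(w, wordpiece_tokenize(w, TOY_VOCAB))

# Removing the longer prefix lets the greedy match fall back to 'de'.
print(wordpiece_tokenize("deconstructed", TOY_VOCAB - {"deco"}))
```

In this toy setting the prefix de is only selected once the longer match deco is removed, even though de and ##constructed are both available, which is the behaviour the examples above illustrate.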
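One way to segregate misspellings by error position is sketched below. The 0.33 cutoff comes from the text; the 0.67 cutoff between "middle" and "end", the use of the first differing character as the error position, and the function names are assumptions.

```python
def first_error_index(correct, misspelled):
    """Index of the first character at which the two spellings differ.

    If one string is a prefix of the other, the length of the shorter
    string is returned. The two strings are assumed to differ.
    """
    for i, (a, b) in enumerate(zip(correct, misspelled)):
        if a != b:
            return i
    return min(len(correct), len(misspelled))

def error_region(correct, misspelled, start_cut=0.33, end_cut=0.67):
    """Classify the error position as 'start', 'middle' or 'end'.

    The position is measured as a fraction of the correct word's length.
    end_cut is an assumed boundary, not taken from the source.
    """
    fraction = first_error_index(correct, misspelled) / len(correct)
    if fraction < start_cut:
        return "start"
    if fraction < end_cut:
        return "middle"
    return "end"

for wrong in ("ccabbage", "cababge", "cabbagee"):
    print(wrong, error_region("cabbage", wrong))
```

With these assumed cutoffs, the three misspellings of cabbage used earlier fall into the start, middle and end groups respectively.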

Semantic meaning deterioration from sub-word representations
For a model to handle OOV words well, it should learn strong representations of a word's constituents. While OOV words that begin with an in-vocab root (or stem) retain their semantic meaning when tokenized, they become vulnerable in other cases, as the root (or stem) is broken down into smaller constituent sub-word units.
To see how BERT handles this, we created two sets of words of length 4, 5 and 6 from the vocabulary: 1) words whose ## counterparts were OOV and 2) words whose ## counterparts were in-vocab. We chose these particular lengths because a word shorter than 4 letters would commonly be found as a sub-word across many words, while a word longer than 6 letters would rarely be found as a sub-word. The cosine similarity between a word and its ## counterpart was computed (see Table 3). This problem is not a concern when the ## counterpart is in-vocab, as the average cosine similarity was around 0.66, which can be improved if a sentence context is supplied. However, when the ## counterpart is not part of the vocabulary, the average cosine similarity drops to a low value of 0.48, which is difficult for the network to recover from.
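The construction of the two word sets can be sketched as follows. The toy vocabulary is hypothetical, and restricting the candidates to purely alphabetic entries is an assumption not stated in the text.

```python
def split_by_counterpart(vocab, min_len=4, max_len=6):
    """Split whole words of the given lengths by whether '##' + word is in-vocab.

    Returns (oov_counterpart_words, in_vocab_counterpart_words), both sorted.
    """
    words = [
        w for w in vocab
        if not w.startswith("##")
        and w.isalpha()  # assumed filter: drops entries such as [CLS]
        and min_len <= len(w) <= max_len
    ]
    oov = sorted(w for w in words if "##" + w not in vocab)
    in_vocab = sorted(w for w in words if "##" + w in vocab)
    return oov, in_vocab

# Hypothetical toy vocabulary (membership does not reflect the real BERT vocabulary).
TOY_VOCAB = {
    "[CLS]", "fat", "##fat", "pork", "onion", "equal", "##equal",
    "active", "##active", "cabbage",
}

oov_set, in_vocab_set = split_by_counterpart(TOY_VOCAB)
print("## counterpart OOV:     ", oov_set)
print("## counterpart in-vocab:", in_vocab_set)
```

Here fat and cabbage are excluded by the length limits, and the remaining words are split by a simple set lookup of their ## counterparts.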
Table 4: Difference in inward and outward attention scores between saturated and ##sat ##ura ##ted.

Table 5:
Word | Nearest neighbours
saturated | bacon, nutrition, cereal, obesity, flour, tobacco, humidity, mustard, cigarettes, vitamin
##sat ##ura ##ted | destruction, egypt, erosion, malaria, morphology, concussion, organ, topography, aroused, sample

To further analyze this problem, we compared the embedding of the word unsaturated (OOV, although un and saturated are in-vocab) with that of saturated. The cosine similarity between unsaturated and saturated was only 0.30. In comparison, the cosine similarity between un saturated and saturated is 0.81. To verify that this low similarity was being caused by the poor representation learning of the constituents of saturated, we compared the average embedding of ##sat, ##ura, ##ted (since unsaturated was tokenized into sub-word units) with saturated and found their cosine similarity to be only 0.35. Further, to rule out the possibility that it was caused by a lack of surrounding context, we compared the average embedding of ##sat, ##ura, ##ted with saturated as found in the following sentences: pork has saturated fat and pork has ##sat ##ura ##ted fat.
The cosine similarity even in this case was found to be only 0.57. For the above sentences, we wanted to see the impact of this problem by analyzing the attention scores in each encoder layer. The multi-head (12 heads) attention score matrix across the 12 encoder layers is of size 12 x 12 x 6 x 6 for the first sentence and 12 x 12 x 8 x 8 for the second sentence. Within each layer, we averaged the attention scores across the 12 heads. This resulted in attention score matrices of size 12 x 6 x 6 and 12 x 8 x 8 for the two sentences respectively. We wanted to observe the inward influence of other words on saturated as well as the outward influence of saturated on the other words in both sentences. For the first sentence, the attention score matrix was hence reduced to a size of 12 x 5 x 5. In the second sentence, we averaged the inward and outward influence for ##sat ##ura ##ted, leading to a reduced matrix of size 12 x 5 x 5. The two matrices were then subtracted to see the difference in the inward and outward influences for the word saturated (see Table 4).
Since the difference was taken, a positive value means that saturated as found in the first sentence had a larger inward or outward attention influence than in the second sentence. Clark et al. (2019) previously showed that a large number of attention heads in the early layers of BERT put >50% of their attention on the previous and next tokens. As we can see in Table 4, the values in bold show a significant difference in attention scores, especially for the neighbouring words of saturated, which we believe has caused the loss of semantic meaning between saturated and its tokenized form ##sat, ##ura, ##ted. The inward attention visualization for fat in encoder layer 1, attention head 3, generated using BertViz (Vig, 2019), can be seen in Figure 1. Further, we checked the top 10 nearest neighbours by cosine similarity in the BERT-Base-Uncased vocabulary (using their dynamic embeddings) for the embeddings of saturated and ##sat ##ura ##ted from the above sentences. We found that while saturated as found in the first sentence had semantically similar neighbours, its occurrence in the second sentence had neighbours with completely irrelevant semantic meanings (see Table 5). This confirmed that such a challenge can lead to cascading problems in the network.
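The head averaging and the merging of the three sub-word positions can be sketched for a single layer as below. The function names and the uniform toy attention values are assumptions. The sketch merges the 8 token positions of the second sentence into 6; how the source arrives at its final 12 x 5 x 5 shape is not spelled out, so that step is not reproduced here.

```python
def average_heads(layer_attention):
    """Average a [heads][n][n] attention tensor over heads, giving [n][n]."""
    num_heads = len(layer_attention)
    n = len(layer_attention[0])
    return [
        [sum(head[i][j] for head in layer_attention) / num_heads for j in range(n)]
        for i in range(n)
    ]

def merge_span(matrix, start, end):
    """Average rows and columns start..end-1 into a single row and column."""
    width = end - start
    merged_row = [sum(col) / width for col in zip(*matrix[start:end])]
    rows = matrix[:start] + [merged_row] + matrix[end:]
    return [row[:start] + [sum(row[start:end]) / width] + row[end:] for row in rows]

def difference(a, b):
    """Element-wise a - b for two matrices of the same shape."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Toy uniform attention for one layer with 2 heads (real BERT-Base has 12).
# Sentence 2 tokens: [CLS] pork has ##sat ##ura ##ted fat [SEP] -> 8 positions.
heads_sentence_2 = [[[1.0 / 8] * 8 for _ in range(8)] for _ in range(2)]
# Sentence 1 tokens: [CLS] pork has saturated fat [SEP] -> 6 positions.
heads_sentence_1 = [[[1.0 / 6] * 6 for _ in range(6)] for _ in range(2)]

layer_1 = average_heads(heads_sentence_1)
layer_2 = merge_span(average_heads(heads_sentence_2), 3, 6)  # merge ##sat ##ura ##ted
delta = difference(layer_1, layer_2)
print(len(delta), len(delta[0]))
```

After merging, both sentences have one row and one column per word, so the two matrices can be subtracted position by position.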
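The nearest-neighbour lookup can be sketched as a ranking by cosine similarity. The 2-dimensional vectors below are invented for illustration and are not BERT embeddings; only the ranking procedure is the point.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(query, vocab_embeddings, k=10):
    """Return the k vocabulary words most cosine-similar to the query vector."""
    scored = [
        (cosine_similarity(query, vector), word)
        for word, vector in vocab_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]

# Invented toy embeddings (hypothetical values).
TOY_EMBEDDINGS = {
    "bacon": [0.9, 0.1],
    "flour": [0.8, 0.3],
    "egypt": [0.1, 0.9],
    "erosion": [0.2, 0.8],
}

print(nearest_neighbours([1.0, 0.2], TOY_EMBEDDINGS, k=2))
```

Running the same lookup once with the embedding of saturated and once with the averaged embedding of ##sat ##ura ##ted is what allows the two neighbour lists to be compared.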

Conclusion
In this paper, we highlighted various challenges in the BERT model which, if solved, could significantly boost the model's accuracy, especially in domain-specific applications. These challenges stem mainly from BERT lacking a semantic tokenization algorithm and from the loss of semantic information in its sub-word representations in OOV scenarios.