Robust Backed-off Estimation of Out-of-Vocabulary Embeddings

Out-of-vocabulary (OOV) words cause serious problems when solving natural language tasks with neural networks. Existing approaches to this problem resort to subwords, which are shorter and more ambiguous units than words, and represent an OOV word with a bag of subwords. In this study, inspired by the processes by which new words are created from known words, we propose a robust method for estimating OOV word embeddings by referring to pre-trained embeddings of known words whose surfaces are similar to the target OOV words. We collect known words by segmenting the OOV word and by approximate string matching, and we then aggregate their pre-trained embeddings. Experimental results show that the obtained OOV word embeddings improve not only word similarity tasks but also downstream tasks in the Twitter and biomedical domains, where OOV words often appear, even when the computed OOV embeddings are integrated into a strong BERT-based baseline.


Introduction
The dynamic nature of language and the limited size of training data require neural network models to handle out-of-vocabulary (OOV) words that are absent from the training data. Models thus use a single UNK embedding shared among diverse OOV words or break those OOV words into semantically ambiguous subwords (or even characters), which leads to poor task performance (Peng et al., 2019; Sato et al., 2020).
To solve this problem, several approaches (Pinter et al., 2017; Zhao et al., 2018; Sasaki et al., 2019) learn subword embeddings from pre-trained embeddings and then use these subword embeddings for computing OOV word embeddings (§ 2). However, the embeddings computed by these approaches are subject to the noisiness and ambiguity of intermediate subwords. For example, the GloVe.840B embedding of "higher" is closer to that of "high" than to the embedding of "higher" reconstructed from its subwords by BoS (Zhao et al., 2018), owing to the ambiguous subword er>, as shown in Table 1.

Table 1: Cosine similarity between the GloVe.840B embedding of "higher" and related embeddings: the subword embeddings and the reconstructed embedding of "higher" computed by BoS (Zhao et al., 2018), and the GloVe.840B embedding of "high."

                         BoS                                 GloVe
  (sub)word    highe   <high    high     er>    higher       high
  cosine        48.4    34.2    20.1    −7.8      36.5       69.8
Contextual word embeddings such as BERT (Devlin et al., 2019) can mitigate subword ambiguity by considering context. However, it has been reported that adversarial typos can degrade a BERT model that uses subword tokenization (Pruthi et al., 2019; Sun et al., 2020). Subword meanings also change across domains, making domain adaptation difficult (Sato et al., 2020). These problems are more critical in the processing of noisy text (Wang et al., 2020; Niu et al., 2020).
To solve the above problems, we propose directly leveraging word-based pre-trained embeddings to compute OOV embeddings (Figure 1) (§ 3). Inspired by the two major processes for creating words, compounding and derivation, our method dynamically extracts words with pre-trained embeddings whose surfaces are similar to the target OOV word. Our method then aggregates the pre-trained word embeddings on the basis of similarity scores over the known words calculated by a character-level CNN encoder. We further integrate this method into a BERT-based fine-tuning model. In essence, the proposed method computes OOV embeddings directly from pre-trained word embeddings, whereas the existing methods compute them indirectly via subwords.
To investigate the performance of the proposed method against baseline approaches, we conduct both intrinsic and extrinsic evaluations of OOV word embeddings (§ 4). In the intrinsic evaluation, we examine how well the proposed method induces embeddings for rare words on the CARD benchmark (Pilehvar et al., 2018) and for misspelled words on the TOEFL dataset (Flor et al., 2019). In the extrinsic evaluation, we demonstrate the effectiveness of the computed OOV word embeddings in two downstream tasks, named entity recognition (NER) and part-of-speech (POS) tagging, for the Twitter and biomedical domains, where OOV words frequently appear. We finally evaluate the BERT-based fine-tuning model with our method on these tasks and under adversarial perturbations (Sun et al., 2020).
The contributions of this work are as follows.
• We propose a robust backed-off approach for estimating OOV word embeddings, inspired by two processes for creating words: compounding and derivation.
• We confirm by intrinsic and extrinsic evaluations that the proposed method outperforms subword-based methods in computing OOV word embeddings.
• We demonstrate that the proposed extension to BERT boosts the performance of BERT on all but one POS dataset and improves robustness to adversarial perturbations on a sentiment dataset.

Related work
Existing approaches for leveraging surface information in computing OOV word embeddings basically learn the embeddings of characters or subwords so as to reconstruct pre-trained word embeddings, and then use the learned embeddings to compute embeddings for OOV words (Pinter et al., 2017; Zhao et al., 2018; Sasaki et al., 2019). Zhao et al. (2018) proposed Bag-of-Subwords (BoS) to reconstruct pre-trained word embeddings from bag-of-character n-grams in the same way as fastText (Bojanowski et al., 2017). Sasaki et al. (2019) extended BoS to reduce the number of embedding vectors and introduced a self-attention mechanism into the aggregation of subword embeddings. However, these methods compute embeddings via ambiguous character or subword embeddings, which degrades the quality of embeddings for target OOV words, as we confirm in § 4.
Other approaches utilize the embeddings of the known words around a target OOV word as its contextual information (Lazaridou et al., 2017; Khodak et al., 2018; Schick and Schütze, 2019; Hu et al., 2019). Schick and Schütze (2020) reported that this type of approach can improve BERT's (Devlin et al., 2019) understanding of rare words. Notably, in the approaches that utilize both surface and context information, the surface-based embeddings are the same as those of Zhao et al. (2018). These approaches can have difficulties in representing misspelled words or spelling variations when only a small number of contexts are available in a text corpus.
Several approaches utilize external data such as a knowledge base (Bahdanau et al., 2018; Yang et al., 2019; Yao et al., 2019). These approaches successfully impute OOV word embeddings with a convolutional graph neural network (Yang et al., 2019) or with spectral embeddings derived from an affinity matrix of entities (Yao et al., 2019). However, they can have difficulties in representing OOV words that do not exist in the external data and are hardly applicable to misspelled words.
Recently, contextualized word embeddings such as BERT (Devlin et al., 2019) have mitigated the problem of subword ambiguity by dynamically inferring the meanings of OOV words from their contexts. However, several researchers have reported that BERT remains brittle to misspellings (Pruthi et al., 2019; Sun et al., 2020), rare words (Schick and Schütze, 2020), and out-of-domain samples (Park et al., 2019). Pre-trained word embeddings are reported to be more effective in these cases and in morphological tasks such as entity typing and NER.
We therefore improve not only models based on pre-trained word embeddings but also the brittle subword-based BERT. Our approach will broaden the application range of neural-network models.

Robust backed-off estimation of OOV word embeddings
In this section, we describe our method of computing embeddings for target OOV words by using a weighted sum of pre-trained word embeddings. Specifically, we calculate the weights over known words from similarity scores derived on the basis of their surface information (Figure 1).
In what follows, we first describe how to retrieve known words that have surfaces similar to a target OOV word (§ 3.1), and we next describe how to aggregate their pre-trained embeddings on the basis of similarity scores calculated through a neural network surface encoder (§ 3.2). We then describe how to integrate our method into the BERT-based fine-tuning model (Devlin et al., 2019) (§ 3.3).

Efficient back-off to known words
We first describe two methods for efficiently extracting known words with surfaces similar to a target OOV word: (i) segmentation of the target OOV word by reference to known words and (ii) approximate string matching, which extracts known words with surfaces similar to the OOV word. These components are inspired by the two major processes for creating words from existing words, namely compounding and derivation; we back off OOV words to known words so as to rewind and replay the processes by which words are created.
In this paper, we assume that word embeddings have already been trained on a large corpus with an unsupervised method such as GloVe (Pennington et al., 2014). Backing off to these known words can alleviate the ambiguity of subwords because word-level pre-trained embeddings can be expected to be less polysemous than subword embeddings. Moreover, we do not update the word-level pre-trained embeddings when training on the reconstruction task described below. We then dynamically calculate the embeddings for OOV words in the same continuous space as the known words.
Segmentation by known words Inspired by the compounding of words, as in German nouns (e.g., "Kinder|garten") and chemical compounds (e.g., "dichloro|difluoro|methane"), the first approach extracts known words contained in the target OOV word. Using known words and characters as the vocabulary, we first enumerate all possible segmentations of the OOV word as a lattice using dynamic programming and then choose the segmentation with the smallest number of words and characters. We then extract n_seg known words in order of length from the chosen segmentation, assuming longer words to be less ambiguous.
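The segmentation step can be sketched as follows. This is a minimal illustration under our own assumptions about tie-breaking (fewest segments, then keep the longest known words); the function and variable names are ours, not from the released code.

```python
def segment_by_known_words(oov_word, vocab, n_seg):
    """Segment an OOV word into known words (single characters as a fallback)
    and return up to n_seg known words, longest first."""
    n = len(oov_word)
    best = [None] * (n + 1)  # best[i] = (num_segments, segments) for oov_word[:i]
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = oov_word[j:i]
            # known words and single characters form the segmentation vocabulary
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                candidate = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or candidate[0] < best[i][0]:
                    best[i] = candidate
    segments = best[n][1]
    known = [s for s in segments if s in vocab]  # keep only known words
    return sorted(known, key=len, reverse=True)[:n_seg]

# e.g., segment_by_known_words("dichlorodifluoromethane", glove_vocab, n_seg=7)
# would return pieces such as "dichloro", "difluoro", and "methane" if they are
# in the (hypothetical) glove_vocab.
```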
Approximate string matching Inspired by the derivation of words (e.g., "ignore/ignorance"), the second approach extracts known words whose surfaces are similar to the target OOV word by approximate string matching. As a measure of the surface distance between the target OOV word and known words, we use string similarity coefficients that view a word as a bag of character n-grams; such coefficients have been used for fast approximate string matching. We search for n_approx known words in order of string similarity to the OOV word. Approximate string matching is robust to subtle spelling variations and can extract a word with the correct spelling in most cases.
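A brute-force sketch of this retrieval with the Jaccard coefficient over character 3-grams, the configuration that performed best in our experiments (§ 4.1). The experiments use the SimString library for efficient matching; the boundary padding and the function names below are our own illustration.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with boundary padding."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}

def approx_match(oov_word, vocab, n_approx, n=3):
    """Return the n_approx known words most similar to the OOV word under
    the Jaccard coefficient over character n-grams (naive linear scan)."""
    q = char_ngrams(oov_word, n)

    def jaccard(word):
        g = char_ngrams(word, n)
        return len(q & g) / len(q | g)

    return sorted(vocab, key=jaccard, reverse=True)[:n_approx]

# e.g., approx_match("amphotercin", glove_vocab, n_approx=10) should rank the
# correctly spelled "amphotericin" near the top.
```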

Aggregation of pre-trained word embeddings for known words
Next, we describe the calculation of OOV word embeddings from the known words w_i^seg and w_i^approx that are extracted for the target OOV word q by segmentation by known words and by approximate string matching, respectively. The words extracted by these methods can contain undesired words whose meanings are far from that of the target OOV word. We therefore calculate more accurate similarity scores for the extracted words through a neural network surface encoder. By computing surface representations v_q and v_{w_i^k} of the target OOV word q and the extracted known words w_i^k (k ∈ {seg, approx}) through a character-level CNN (Zhang et al., 2015), we calculate the similarity score s_{q,w_i^k} between the OOV word q and each known word w_i^k with a learnable parameter W^k. We then aggregate the pre-trained word embeddings e_{w_i^k} on the basis of the similarity scores to calculate the OOV embedding ê_q, where θ^k are learnable parameters and s_{q,w^k} denotes the vector of similarities s_{q,w_i^k} between q and each known word w_i^k (i = 1, ..., n_seg or n_approx).

We use the cosine similarity to an oracle embedding e_q as the reconstruction objective; as the oracle embedding of each word, we use its pre-trained embedding (GloVe.840B in our experiments). W^k, θ^k, and the parameters of the character-level CNN are updated with this objective. Although there are other ways of designing W^k, such as using the same matrix for words of the same length, we leave this for future work.
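The aggregation can be sketched in PyTorch as below. The concrete choices here, a bilinear form for the similarity score and softmax-normalized weights, are our assumptions for illustration, as are the class and argument names; the character-level CNN encoder is assumed to be given.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackoffOOVEmbedder(nn.Module):
    """Aggregates pre-trained embeddings of known words retrieved by the
    segmentation (seg) and approximate matching (approx) modules."""

    def __init__(self, surface_encoder, surface_dim, n_modules=2):
        super().__init__()
        self.encoder = surface_encoder  # character-level CNN: word -> R^surface_dim
        self.W = nn.Parameter(torch.randn(n_modules, surface_dim, surface_dim) * 0.01)
        self.theta = nn.Parameter(torch.ones(n_modules))  # mixing weights for seg/approx

    def forward(self, oov_word, retrieved):
        # retrieved: per module, a pair (known_words, embeddings) where the
        # embeddings [n_k, d] come from the fixed pre-trained table
        v_q = self.encoder(oov_word)                              # [surface_dim]
        oov_embedding = 0.0
        for k, (words, embeddings) in enumerate(retrieved):
            v_w = torch.stack([self.encoder(w) for w in words])   # [n_k, surface_dim]
            scores = v_w @ (self.W[k] @ v_q)                      # similarity scores, [n_k]
            weights = F.softmax(scores, dim=0)                    # normalize over known words
            oov_embedding = oov_embedding + self.theta[k] * (weights @ embeddings)
        return oov_embedding                                      # [d]
```

Training then maximizes the cosine similarity between the output ê_q and the GloVe.840B embedding of an in-vocabulary word treated as OOV (see § 4.1 for the training data).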

Extension to BERT-based fine-tuning
We finally integrate the proposed method with contextualized word embeddings, namely BERT (Devlin et al., 2019). Although BERT works effectively in practice, its subword-based modeling is known to be brittle when the input contains misspellings (§ 2). For example, typos in informative words can significantly change the set of subwords, which severely damages subword-based modeling (e.g., "robustness" → <robust|ness>, "robusntess" → <rob|us|ntes|s>). We thus utilize the OOV embeddings computed by our method to enhance BERT: we extend the pre-trained BERT to refer to OOV embeddings during fine-tuning. We first tokenize each word into subwords with a BERT tokenizer (Wolf et al., 2019). For each subword, the embedding of the word containing that subword is added to BERT's pre-trained subword embedding through a learnable transformation. Here, e_subword ∈ R^m is BERT's subword embedding, e_word ∈ R^n is the pre-trained or OOV word embedding computed with our method, and W_1 ∈ R^{m×n} and W_2 ∈ R^n are learnable parameters of the transformation, updated during fine-tuning. For example, when "robusntess" is tokenized into <rob|us|ntes|s>, the input embedding of <rob is calculated from BERT's subword embedding of <rob (e_subword) and our embedding of "robusntess" (e_word). Finally, we feed the subword embeddings computed in this way into the BERT model and calculate the output labels. In the fine-tuning process, we update the parameters of the BERT model, including its embedding layers, while fixing the word embeddings computed with the proposed method. This integration is applicable to neural networks other than the BERT-based fine-tuning model.
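A minimal PyTorch sketch of this injection. The exact form of the combination is our assumption for illustration (an element-wise gate W_2 on the word embedding followed by a projection W_1, added to the subword embedding); the module name is ours, and the surrounding BERT plumbing is omitted.

```python
import torch
import torch.nn as nn

class WordIntoSubword(nn.Module):
    """Adds a (fixed) word-level embedding to each of its subword embeddings."""

    def __init__(self, subword_dim, word_dim):
        super().__init__()
        self.W1 = nn.Linear(word_dim, subword_dim, bias=False)  # W_1 in R^{m x n}
        self.W2 = nn.Parameter(torch.ones(word_dim))            # W_2 in R^n

    def forward(self, subword_emb, word_emb):
        # subword_emb: [num_subwords, m]; word_emb: [n], kept fixed in fine-tuning
        injected = self.W1(self.W2 * word_emb)                  # [m]
        return subword_emb + injected                           # broadcast over subwords
```

The resulting vectors replace BERT's input embeddings for the corresponding subwords; only W_1, W_2, and BERT's own parameters are updated during fine-tuning.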

Experiments
We evaluated the performance of the OOV word embeddings calculated by the proposed method. We performed intrinsic evaluations on benchmark datasets (§ 4.1) and extrinsic evaluations on two downstream tasks, named entity recognition (NER) and part-of-speech (POS) tagging (§ 4.2). We then conducted experiments on the extension to a BERT-based fine-tuning model (§ 4.3 and § 4.4).
We used GloVe.840B embeddings (2.2M vocabulary size) as the pre-trained embeddings (known words), following Sasaki et al. (2019). In all the experiments, we used PyTorch (v1.0.1) as the core framework and regarded words that are absent from GloVe.840B as OOV words. For the extrinsic evaluations (§ 4.2, § 4.3, and § 4.4), the reported numbers are the medians of five trials.

Intrinsic evaluations: CARD and TOEFL
We evaluated the performance of OOV embeddings through similarity estimation for rare words and misspelled words. As baselines, we evaluated BoS (Zhao et al., 2018) and KVQ-FH (F = 1M, H = 0.5M) (Sasaki et al., 2019), both introduced in § 2. Although these studies focus on replacing the embeddings of infrequent words as well as OOV words, we here focus on replacing only those of OOV words. In addition to these subword-based baselines, we used a simple baseline (Simple back-off) that backs off an OOV word to the most orthographically similar known word; this is a special case of the proposed method with n_seg = 0 and n_approx = 1.

Datasets and experimental settings
CARD-660 (Pilehvar et al., 2018) (hereafter, CARD) is a rare-word benchmark that consists of word pairs annotated with similarity scores. CARD contains 660 word pairs over 1306 words (including duplicates), collected from various domains such as computer science, social media, and medicine. Among these word pairs, 289 pairs contain OOV words that are absent from GloVe.840B. We calculated the cosine similarity between the embeddings of the two words in each pair and evaluated Spearman's correlation coefficient ρ between the cosine similarities and the annotated similarities. We also calculated the correlation for the pairs containing OOV words.
The TOEFL-Spell dataset (Flor et al., 2019) (hereafter, TOEFL) contains examples of misspellings appearing in a corpus of essays written by English language learners. We extracted 1514 pairs of a correct, known word and a misspelled, OOV word for evaluation. We computed the embedding of each misspelled word and evaluated its cosine similarity to the gold embedding of the correct word. TOEFL thus lets us evaluate robustness against misspellings that occur in practice, rather than against synthetic adversarial perturbations.
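For concreteness, the two evaluation metrics can be computed as below (a sketch using NumPy and SciPy; the function names are ours). Pairs for which a method cannot compute an OOV embedding receive a similarity of zero, following the convention described in the results.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def card_spearman(pairs, gold_scores, embed):
    """CARD: Spearman's rho between predicted and annotated pair similarities.
    embed(word) returns a vector, or None if no embedding can be computed."""
    preds = []
    for w1, w2 in pairs:
        e1, e2 = embed(w1), embed(w2)
        preds.append(cosine(e1, e2) if e1 is not None and e2 is not None else 0.0)
    return spearmanr(preds, gold_scores).correlation

def toefl_cosine(pairs, embed, gold_embeddings):
    """TOEFL: mean cosine between the embedding computed for a misspelling
    and the gold embedding of its correct spelling."""
    return float(np.mean([cosine(embed(misspelled), gold_embeddings[correct])
                          for correct, misspelled in pairs]))
```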
To train the subword-based baselines and the proposed method, we randomly sampled 10^5 words with frequency f (10^3 < f < 10^5) from the GloVe.840B embeddings, using the word frequencies available at https://github.com/losyer/compact_reconstruction. We then sampled 10^3 of these embeddings as the development data and used the remaining embeddings to optimize the parameters of the proposed method. We adopted Adam (Kingma and Ba, 2015) with a learning rate of 10^-3 as the optimizer. We set the gradient clipping to 1, the dropout rate to 0.3, the number of epochs to 50, and the batch size to 1000, and we chose the model at the epoch that achieved the maximum total cosine similarity between the GloVe.840B embeddings and the induced embeddings for the target words in the development data.
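A minimal training-loop sketch under these settings, assuming the model and retrieval from the earlier sketches and a hypothetical data loader over the sampled GloVe words:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    for batch in train_loader:  # (word, retrieved known words, GloVe embedding)
        optimizer.zero_grad()
        losses = []
        for word, retrieved, gold_embedding in batch:
            pred = model(word, retrieved)  # computed OOV embedding
            losses.append(1.0 - F.cosine_similarity(pred, gold_embedding, dim=0))
        loss = torch.stack(losses).mean()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping of 1
        optimizer.step()
    # model selection: keep the epoch with the highest total cosine similarity
    # on the development words (not shown)
```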
To find orthographically similar words with the proposed method and Simple back-off, we ran Simple back-off with various string similarity measures (the Dice, cosine, Jaccard, and overlap coefficients, implemented with SimString, https://github.com/chokkan/simstring) and n-gram sizes (1 ≤ n ≤ 7); the Jaccard measure over character 3-grams performed best on the development set. In computing the surface representations v_q and v_{w_i^k} (§ 3.2) through the character-level CNN of the proposed method, we set the dimension to 100 and used convolution windows of 1, 3, 5, and 7 characters. We then searched for the best-performing hyper-parameters n_seg and n_approx (1 ≤ n_* ≤ 10) of the proposed method on the development set and obtained n_seg = 7 and n_approx = 10.

Table 2: Results of the intrinsic evaluations: Spearman's ρ on CARD (ALL and OOV pairs) and cosine similarity on TOEFL (OOV pairs).

                                     CARD (ρ)        TOEFL (cos)
                                     ALL    OOV      OOV
  GloVe (Pennington et al., 2014)    27.3   -        -
  BoS (Zhao et al., 2018)            40.7   17.2     33.4
  KVQ-FH (Sasaki et al., 2019)

Results

Table 2 shows the results of the intrinsic evaluations. ALL indicates the performance on all word pairs, while OOV indicates the performance only on pairs that contain an OOV word. Following Yang et al. (2019), we regarded the cosine similarity of a word pair as zero when a method could not compute an embedding for an OOV word in the pair. The correlation coefficients for the known word pairs of the CARD dataset were 55.5 for all methods. We observed that the proposed method outperformed all the baselines, except for Simple back-off on the TOEFL dataset; we consider the Simple back-off baseline to be tailored to misspellings. Notably, compared with the subword-based methods, the proposed method was robust against the misspelled words in the TOEFL dataset, which demonstrates the risk of relying on subword embeddings.

We conducted a qualitative analysis of our two modules, segmentation by known words (hereafter SEG) and approximate string matching (hereafter APPROX), described in § 3.1. SEG successfully handled compound words such as "horse|cloth" and "boat|master," while APPROX successfully handled "aeolipile." The meanings of these words can be inferred from their surfaces; we believe that our method could successfully compute their embeddings by relating them to known words. Both modules failed to handle "covfefe" (a misspelling of "coverage") and proper nouns such as "Kobani" (a place) and "AccuRay" (a company). From these observations, the proposed method is considered to be less effective for such words, as well as for acronyms, whose meanings are difficult to predict from their surfaces.

Sensitivity to hyper-parameters
To investigate the impact of the two modules, segmentation by known words and approximate string matching (§ 3.1), we repeated the intrinsic evaluations with various values of n_seg and n_approx. Figure 2 depicts the performance of the proposed method for different n_seg and n_approx (0 ≤ n_* ≤ 10); larger values of n_seg and n_approx take shorter and less superficially similar known words into account. The results show that segmentation by known words tended to benefit the compound words in CARD, whereas approximate string matching tended to benefit the misspellings in TOEFL. This is reasonable because we can guess the meaning of a compound word constructively and the meaning of a misspelled word by considering words with close spellings.

Extrinsic evaluation: NER and POS tagging

Datasets and experimental settings
We evaluated whether the computed embeddings capture semantic and morphosyntactic information through NER and POS tagging in domains where many OOV words appear. Table 3 summarizes the datasets used in the extrinsic evaluations; OOV% represents the OOV word rate in each dataset. For each dataset, we used the standard split into training, development, and test sets. Following Derczynski et al. (2013), we used the training data of T-POS when training for DCU. We adopted the Bi-LSTM-CRF (Lample et al., 2016) for NER and the Bi-LSTM tagger (Pinter et al., 2017) for POS tagging, and measured performance in terms of entity-level F1 score and classification accuracy, respectively. Both taggers used two bidirectional LSTM layers with hidden size 200. In training both taggers, we adopted Adam with a learning rate of 10^-3 as the optimizer. We set the gradient clipping to 1, the dropout rate to 0.5, and the number of epochs to 50; the batch size was 500 for the biomedical NER datasets and 50 for the Twitter POS and NER datasets. We then chose the model at the epoch that achieved the best performance on the development data.
When training the taggers, we fixed their embedding layers to the pre-trained embeddings or to the OOV embeddings computed by each method, except for the shared OOV embedding in Single-UNK. Since HiCE (Hu et al., 2019) uses an external corpus [here, WikiText-103 (Merity et al., 2017)] as contexts for training, we trained all the methods for computing OOV embeddings on the subset of the pre-trained embeddings used in the intrinsic evaluations whose words have contexts available in WikiText-103. In training HiCE, we set the hidden size to 300, the intermediate hidden size to 600, the number of self-attention layers to 2, the number of self-attention heads to 10, the dropout rate of the context encoder and multi-context aggregator to 0.3, the maximum number of contexts to 10, and the context window size to 10. With the trained HiCE, we computed the embeddings for OOV words from their contexts in the training/test data of the target task and WikiText-103.

Results
Tables 4 and 5 list the results for the two tasks.
Here, ALL shows the overall performance, and OOV indicates the performance only on words or entities that are absent from the training data and have no GloVe embeddings; we exclude words seen in the training data because the models can handle them regardless of the quality of their embeddings. The proposed method outperformed the baselines for OOV words, except on RARE-NER and MULTI-NER. This result confirms the effectiveness of the proposed method in computing OOV embeddings. The proposed method exhibited additive improvements over HiCE. The tendency of the surface-based methods to outperform the purely context-based baseline HiCE suggests that the LSTM models for the downstream tasks already capture the contextual information.
Examples Table 6 shows examples of the system outputs. The proposed method successfully predicted word embeddings for two OOV words, "neuraminidases" (a plural form of "neuraminidase") and "amphotercin" (a misspelling of "amphotericin"), whereas BoS was strongly influenced by the subwords contained in the OOV words. As shown in the last example, however, the proposed method wrongly related "chromone" (a chemical compound) to "chromosome," which lies outside the correct domain. This example suggests the limitations of a surface-based approach that does not utilize context information.

Extrinsic evaluation: extension to BERT
To investigate the effectiveness of integrating our OOV embeddings into BERT, we conducted experiments on the downstream tasks. We used BERT with a token classification head on top (Wolf et al., 2019) for POS tagging and the same model with a CRF layer for NER. In integrating the surface-based embeddings into BERT (§ 3.3), we fixed the surface-based embeddings and fine-tuned the WordPiece subword embeddings and the model parameters of BERT. We adopted Adam with a learning rate of 5 × 10^-5 as the optimizer, set the number of epochs to 20 and the batch size to 16, and set the other hyper-parameters to the same values as Wolf et al. (2019).
To ensure that the number of tokens did not exceed the maximum input length of BERT, we used only sentences with 100 words or fewer for all datasets when fine-tuning the BERT model.

Tables 7 and 8 compare the BERT models with and without the proposed extension on POS tagging and NER. Our extension to BERT improved the overall performance on all datasets except DCU; all the improvements were significant (permutation test, http://cr.fvcrc.i.nagoya-u.ac.jp/sasano/test/permutation.html) except for RARE-NER and ARK. Although our extension was sometimes harmful in recognizing entities that contain OOV words (RARE-NER, BC4CHEMD, BC5CDR, and NCBI-DISEASE), it still improved the overall performance. We consider this to be because our extension helps BERT utilize contexts with OOV words to classify known words.
Examples Table 9 shows examples of the system outputs. The proposed method successfully predicted word embeddings for two OOV words, "renallo" (a person) and "dimethylarsenic" (a chemical compound), while the BERT tokenizer broke these words into short pieces, which might have led to the incorrect labels. As shown in the last example, "antipurine," however, the proposed method was influenced by words sharing the prefix <anti and predicted a wrong embedding that does not fit the context.

Extrinsic evaluation: adversarial typos
Finally, we evaluated the robustness of the extension to BERT-based fine-tuning against adversarial attacks. We used adversarial keyboard typos (Pruthi et al., 2019; Sun et al., 2020) to simulate natural perturbations that occur in the real world.
We generate adversarial perturbations via keyboard typos as follows. First, we selected the words to be edited according to the max-grad policy (Sun et al., 2020): we computed the gradients of a pre-trained BERT classifier and selected words in descending order of gradient magnitude. Second, we explored four types of subtle character-level edits for each selected word: (i) swapping two adjacent characters in the word, (ii) removing a character from the word, (iii) replacing a character with an adjacent character on the keyboard, and (iv) inserting an adjacent character on the keyboard before a character in the word. Following Pruthi et al. (2019), we did not edit stop words or words with fewer than three characters. To reduce the computational cost, we randomly limited the number of typo candidates to N. In this paper, we set the hyper-parameters to K ∈ {0, 1, 3, 5}, where K is the number of words edited in a text, and N = 10.
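A sketch of the candidate generation for a single word; the keyboard adjacency map here is truncated and is our own illustration rather than the exact map used in the experiments.

```python
import random

# simplified QWERTY adjacency (illustration only; a full map covers every key)
ADJACENT = {"a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
            "r": "edft", "o": "iklp", "n": "bhjm", "t": "rfgy"}

def typo_candidates(word):
    """The four character-level edits: swap, delete, substitute with an
    adjacent key, and insert an adjacent key before a character."""
    cands = set()
    for i in range(len(word)):
        if i + 1 < len(word):                       # (i) swap adjacent characters
            cands.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        cands.add(word[:i] + word[i + 1:])          # (ii) remove a character
        for c in ADJACENT.get(word[i], ""):
            cands.add(word[:i] + c + word[i + 1:])  # (iii) replace with adjacent key
            cands.add(word[:i] + c + word[i:])      # (iv) insert adjacent key before
    cands.discard(word)
    return list(cands)

def sample_typos(word, n_candidates=10):
    """Randomly keep at most n_candidates typos to bound the search (N = 10)."""
    cands = typo_candidates(word)
    return random.sample(cands, min(n_candidates, len(cands)))
```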
We used movie reviews from the Stanford Sentiment Treebank (SST) (Socher et al., 2013), which consists of 8544 movie reviews. Using only the positive and negative reviews, we trained a BERT model with a sequence classification head on top (Wolf et al., 2019), against which we generated the adversarial examples described above. We evaluated the accuracy of the BERT-based fine-tuning model and of a word-based LSTM model, each with and without the embeddings computed by the proposed method. For comparison, we also evaluated a variant of our extension to BERT that assigns a zero vector, instead of an embedding computed by the proposed method, to OOV words (+GloVe). As with the LSTM tagger in § 4.2, we used two bidirectional LSTM layers with hidden size 200. In training both models, we adopted Adam with a learning rate of 5 × 10^-5 for the BERT model and 10^-3 for the LSTM model, and set the gradient clipping to 1, the dropout rate to 0.5, the number of epochs to 10, and the batch size to 16.

Table 10 shows the results for the sentiment classification task; #tokens per word indicates the average number of subwords per word when the words are tokenized with the BERT tokenizer. Although BERT outperformed the other models without any perturbations (K = 0), its performance degraded as K and #tokens increased, which is consistent with Sun et al. (2020). Moreover, the proposed method mitigated this performance degradation for both the BERT and LSTM models.

Conclusion
In this paper, inspired by two major processes for creating words, we proposed a method for computing OOV word embeddings by learning the similarities between a target OOV word and known words, and we integrated the method into BERT. We conducted intrinsic and extrinsic evaluations and confirmed that the proposed method mimics the pre-trained word embeddings of OOV words more successfully than existing subword-based methods, which suffer from the noisiness and ambiguity of intermediate subwords. The proposed method boosted the performance of BERT and equipped BERT with robustness to adversarial typos. We will release all code to promote the reproducibility of our results. In future work, we will investigate better ways of integrating our method into BERT.