Few-Shot Representation Learning for Out-Of-Vocabulary Words

Existing approaches for learning word embeddings often assume that each word occurs sufficiently often in the corpus, so that its representation can be accurately estimated from its contexts. However, in real-world scenarios, out-of-vocabulary (OOV) words that do not appear in the training corpus emerge frequently. Learning accurate representations of these words from only a few observations, in order to augment a pre-trained embedding, is a challenging research problem. In this paper, we formulate the learning of OOV embeddings as a few-shot regression problem: we fit a representation function that predicts an oracle embedding vector (defined as the embedding trained with abundant observations) from limited contexts. Specifically, we propose a novel hierarchical attention network-based embedding framework to serve as the neural regression function, in which the context information of a word is encoded and aggregated from K observations. Furthermore, we propose to use Model-Agnostic Meta-Learning (MAML) to adapt the learned model to a new corpus quickly and robustly. Experiments show that the proposed approach significantly outperforms existing methods in constructing accurate embeddings for OOV words and improves downstream tasks when the embeddings are utilized.


Introduction
Distributional word embedding models aim to assign each word a low-dimensional vector representing its semantic meaning. These embedding models have been used as key components in natural language processing systems. To learn such embeddings, existing approaches such as skip-gram models (Mikolov et al., 2013) resort to an auxiliary task of predicting the context words (the words surrounding the target word). These embeddings have been shown to capture syntactic and semantic relations between words.
Despite this success, an essential issue arises: most existing embedding techniques assume the availability of abundant observations of each word in the training corpus. When a word occurs only a few times during training (i.e., in the few-shot setting), the corresponding embedding vector is not accurate (Cohn et al., 2017). In the extreme case, some words are not observed at all when training the embedding; these are known as out-of-vocabulary (OOV) words. Such words are often rare and might occur only a few times in the testing corpus. The insufficient observations therefore prevent existing context-based word embedding models from inferring accurate OOV embeddings. This leads us to the following research problem: How can we learn accurate embedding vectors for OOV words at inference time by observing their usage only a few times?
Existing approaches for dealing with OOV words can be categorized into two groups. The first group of methods derives embedding vectors of OOV words from their morphological information (Bojanowski et al., 2017; Kim et al., 2016; Pinter et al., 2017). This type of approach is limited when the meaning of a word cannot be inferred from its subunits (e.g., names, such as Vladimir). The second group of approaches attempts to learn to embed an OOV word from a few examples. In prior studies (Cohn et al., 2017; Herbelot and Baroni, 2017), these demonstrating examples are treated as a small corpus and are used to fine-tune OOV embeddings. Unfortunately, fine-tuning with just a few examples usually leads to overfitting. In another work (Khodak et al., 2018), a simple linear function is used to infer the embedding of an OOV word by aggregating the embeddings of its context words in the examples. However, simple linear averaging can fail to capture the complex semantics and relationships of an OOV word from its contexts.
Unlike the existing approaches mentioned above, humans can infer the meaning of a word from a more comprehensive understanding of its contexts and morphology. Given an OOV word with a few example sentences, humans are capable of understanding the semantics of each sentence and then aggregating multiple sentences to estimate the meaning of the word. In addition, humans can combine the context information with sub-word or other morphological forms to obtain a better estimate of the target word. Inspired by this, we propose an attention-based hierarchical context encoder (HiCE), which can leverage both sentence examples and morphological information. Specifically, the proposed model adopts multi-head self-attention to integrate information extracted from multiple contexts, and the morphological information can be easily integrated through a character-level CNN encoder.
In order to train HiCE to effectively predict the embedding of an OOV word from just a few examples, we introduce an episode-based few-shot learning framework. In each episode, we treat a word with abundant observations as if it were an OOV word, and we use the embedding trained with these observations as its oracle embedding. Then, the HiCE model is asked to predict the word's oracle embedding using only K randomly sampled observations of the word as well as its morphological information. This training scheme simulates the real scenarios in which OOV words occur during inference, while still giving us access to their oracle embeddings as the learning target. Furthermore, OOV words may occur in a new corpus whose domain or linguistic usage differs from that of the main training corpus. To deal with this issue, we propose to adopt Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to assist the fast and robust adaptation of a pre-trained HiCE model, which allows HiCE to better infer the embeddings of OOV words in a new domain by starting from a promising initialization.
We conduct comprehensive experiments based on both intrinsic and extrinsic embedding evaluation. Intrinsic evaluation on the Chimera benchmark dataset demonstrates that the proposed method, HiCE, can effectively utilize context information and outperform baseline algorithms. For example, HiCE achieves a 9.3% relative improvement in Spearman correlation over the state-of-the-art approach, à la carte, in the 6-shot learning case. Furthermore, with experiments on extrinsic evaluation, we show that our proposed method can benefit downstream tasks, such as named entity recognition and part-of-speech tagging, and outperform existing methods significantly.
The contributions of this work can be summarized as follows.
• We formulate OOV word embedding learning as a K-shot regression problem and propose a simulated episode-based training scheme to predict oracle embeddings.
• We propose an attention-based hierarchical context encoder (HiCE) to encode and aggregate both context and sub-word information.
• We further incorporate MAML for fast adaptation of the learned model to a new corpus by bridging the semantic gap.
• We conduct experiments on multiple tasks, and through quantitative and qualitative analysis, we demonstrate the effectiveness of the proposed method in fast representation learning of OOV words for downstream tasks.

The Approach
In this section, we first formalize the problem of OOV embedding learning as a few-shot regression problem. Then, we present our embedding prediction model, a hierarchical context encoder (HiCE) for capturing the semantics of context as well as morphological features. Finally, we adopt a state-of-the-art meta-learning algorithm, MAML, for fast and robust adaptation to a new corpus.

The Few-Shot Regression Framework
Problem formulation We consider a training corpus $D_T$ and a given word embedding learning algorithm (e.g., Word2Vec) that yields a learned word embedding for each word $w$, denoted as $T_w \in \mathbb{R}^d$. Our goal is to infer embeddings for OOV words that are not observed in the training corpus $D_T$, based on a new testing corpus $D_N$. $D_N$ is usually much smaller than $D_T$, and the OOV words might occur only a few times in $D_N$, so it is difficult to learn their embeddings directly from $D_N$. Our solution is to learn a neural regression function $F_\theta(\cdot)$, parameterized by $\theta$, on $D_T$. The function $F_\theta(\cdot)$ takes both the few contexts and the morphological features of an OOV word as input and outputs its approximate embedding vector. The output embedding is expected to be close to the word's "oracle" embedding vector, i.e., the one that would be learned from plenty of observations.
To mimic the real scenarios of handling OOV words, we formalize the training of this model in a few-shot regression framework, where the model is asked to predict an OOV word embedding from just a few examples demonstrating its usage. The neural regression function $F_\theta(\cdot)$ is trained on $D_T$, where we pick $N$ words $\{w_t\}_{t=1}^N$ with sufficient observations as the target words and use their embeddings $\{T_{w_t}\}_{t=1}^N$ as oracle embeddings. For each target word $w_t$, we denote by $S_t$ the set of all sentences in $D_T$ containing $w_t$. It is worth noting that we exclude words with insufficient observations from the target words, because their embeddings are likely to be noisy estimates in the first place.
In order to train the neural regression function $F_\theta(\cdot)$, we form episodes of few-shot learning tasks. In each episode, we randomly sample $K$ sentences from $S_t$ and mask out $w_t$ in these sentences to construct a masked supporting context set $S_t^K = \{s_{t,k}\}_{k=1}^K$, where $s_{t,k}$ denotes the $k$-th masked sentence for target word $w_t$. We also use the character sequence of $w_t$ as features, denoted as $C_t$. Based on these two types of features, the model $F_\theta$ is trained to predict the oracle embedding. In this paper, we choose cosine similarity as the proximity metric, due to its popularity as an indicator of the semantic similarity between word vectors. The training objective is as follows.
$$\hat{\theta} = \arg\max_{\theta} \sum_{w_t} \sum_{S_t^K \sim S_t} \cos\left(F_\theta(S_t^K, C_t),\, T_{w_t}\right) \qquad (1)$$
where $S_t^K \sim S_t$ means that the $K$ sentences containing target word $w_t$ are randomly sampled from all the sentences containing $w_t$. Once the model $F_{\hat{\theta}}$ is trained (based on $D_T$), it can be used to predict the embeddings of OOV words in $D_N$ by taking all sentences containing these OOV words and their character sequences as input.
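The episode construction and the cosine objective described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the mask token `<unk>`, the function names, and the `predict` callable standing in for the regression function are all hypothetical, and the loss is written as negative cosine similarity so that minimizing it corresponds to maximizing the objective.

```python
import random
import numpy as np

MASK = "<unk>"  # hypothetical placeholder token for the masked target word


def build_episode(target, sentences, k, rng):
    """Sample k tokenized sentences containing `target` and mask the target."""
    sampled = rng.sample(sentences, k)
    return [[MASK if tok == target else tok for tok in sent] for sent in sampled]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def episode_loss(predict, contexts, chars, oracle):
    """Negative cosine similarity between the prediction and the oracle."""
    return -cosine(predict(contexts, chars), oracle)


# Toy usage: "cello" plays the role of a frequent word treated as OOV.
rng = random.Random(0)
sentences = [
    ["the", "cello", "played", "softly"],
    ["she", "tuned", "her", "cello"],
    ["a", "cello", "has", "four", "strings"],
]
contexts = build_episode("cello", sentences, k=2, rng=rng)
oracle = np.array([1.0, 0.0, 0.0])
# Stand-in predictor (hypothetical): a real model would read the contexts.
dummy_predict = lambda ctx, chars: np.array([2.0, 0.0, 0.0])
loss = episode_loss(dummy_predict, contexts, list("cello"), oracle)
```

Because cosine similarity ignores vector length, the stand-in predictor above attains the best possible loss even though its output is twice as long as the oracle vector.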

Hierarchical Context Encoding (HiCE)
Here we detail the design of the neural regression function $F_\theta(\cdot)$. Based on the previous discussion, $F_\theta(\cdot)$ should be able to analyze the complex semantics of context, to aggregate multiple pieces of context information for comprehensive embedding prediction, and to incorporate morphological features. These three requirements cannot be fulfilled using simple models such as linear aggregation (Khodak et al., 2018).
Recent progress in contextualized word representation learning (Peters et al., 2018; Devlin et al.) has shown that it is possible to learn a deep model that captures richer language-specific semantic and syntactic knowledge purely from self-supervised objectives. Motivated by these results, we propose a hierarchical context encoding (HiCE) architecture that extracts and aggregates information from contexts and into which morphological features can be easily incorporated. Using HiCE as $F_\theta(\cdot)$, a more sophisticated model for processing and aggregating contexts and morphology can be learned to infer OOV embeddings.

Self-Attention Encoding Block Our proposed HiCE is mainly based on the self-attention encoding block proposed by Vaswani et al. (2017). Each encoding block consists of a self-attention layer and a point-wise, fully connected layer. Such an encoding block can enrich the interaction of the sequence input and effectively extract both local and global information.
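Self-attention (SA) is a variant of the attention mechanism in which a sequence attends to itself. For each head $i$ of the attention output, we first transform the sequence input matrix $x$ into query, key and value matrices by a set of three different linear projections $W_i^Q$, $W_i^K$ and $W_i^V$. Next we calculate the matrix product $x W_i^Q (x W_i^K)^\top$, then scale it by the square root of the dimension of the sequence input, $1/\sqrt{d_x}$, to get the mutual attention matrix of the sequence. Finally we aggregate the value matrices using the calculated attention matrix, and get $a_{\text{self},i}$ as the self-attention vector for head $i$:
$$a_{\text{self},i} = \mathrm{softmax}\left(\frac{x W_i^Q (x W_i^K)^\top}{\sqrt{d_x}}\right) x W_i^V$$
By concatenating the self-attention vectors of all heads, followed by a linear projection $W^O$, we have $\mathrm{SA}(x)$ with $h$ heads in total, which can represent different aspects of the mutual relationships of the sequence $x$:
$$\mathrm{SA}(x) = \mathrm{Concat}(a_{\text{self},1}, \ldots, a_{\text{self},h})\, W^O$$
The self-attention layer is followed by a fully connected feed-forward network (FFN), which applies a non-linear transformation to each position of the sequence input $x$.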
For both SA and FFN, we apply residual connection (He et al., 2016) followed by layer normalization (Ba et al., 2016). Such a design can help the overall model to achieve faster convergence and better generalization.
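A rough NumPy sketch of one such encoding block is given below, assuming softmax-normalized attention, a ReLU non-linearity in the FFN (as in Vaswani et al., 2017), and layer normalization without learned gain and bias. The parameter layout, shapes, and function names are illustrative choices rather than details taken from the paper.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Layer normalization without learned gain/bias (a simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)


def self_attention(x, Wq, Wk, Wv, Wo):
    """x: (n, d_x); Wq, Wk, Wv: (h, d_x, d_h); Wo: (h * d_h, d_x)."""
    d_x = x.shape[-1]
    heads = []
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        q, k, v = x @ q_proj, x @ k_proj, x @ v_proj
        attn = softmax(q @ k.T / np.sqrt(d_x))  # (n, n) mutual attention
        heads.append(attn @ v)                  # (n, d_h) output of one head
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, project


def encoding_block(x, params):
    """SA and FFN, each wrapped in a residual connection plus layer norm."""
    x = layer_norm(x + self_attention(x, *params["attn"]))
    W1, b1, W2, b2 = params["ffn"]
    ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU is an assumption
    return layer_norm(x + ffn)


# Toy usage with random parameters.
rng = np.random.default_rng(0)
n, d_x, h, d_h, d_ff = 5, 8, 2, 4, 16
params = {
    "attn": [rng.normal(size=(h, d_x, d_h)) for _ in range(3)]
    + [rng.normal(size=(h * d_h, d_x))],
    "ffn": [rng.normal(size=(d_x, d_ff)), np.zeros(d_ff),
            rng.normal(size=(d_ff, d_x)), np.zeros(d_x)],
}
x = rng.normal(size=(n, d_x))
out = encoding_block(x, params)
```

The output keeps the input shape `(n, d_x)`, which is what allows such blocks to be stacked or reused at both levels of a hierarchy.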
In addition, it is necessary to incorporate position information for a sequence. Although it is feasible to encode such information using positional encoding, our experiments have shown that this leads to poor performance in our case. Therefore, we adopt a more straightforward position-wise attention, multiplying the representation at position $pos$ by a positional attention scalar $a_{pos}$. In this way, the model can distinguish the importance of different relative locations in a sequence.
HiCE Architecture As illustrated in Figure 1, HiCE consists of two major layers: the Context Encoder and the Multi-Context Aggregator.
For each given word $w_t$ and its $K$ masked supporting context set $S_t^K = \{s_{t,1}, s_{t,2}, \ldots, s_{t,K}\}$, a lower-level Context Encoder ($E$) takes each sentence $s_{t,k}$ as input, applies position-wise attention and a self-attention encoding block, and outputs an encoded context embedding $E(s_{t,k})$. On top of it, a Multi-Context Aggregator combines the multiple encoded contexts, i.e., $E(s_{t,1}), E(s_{t,2}), \ldots, E(s_{t,K})$, by another self-attention encoding block. Because the order of contexts can be arbitrary and should not influence the aggregation, we do not apply position-wise attention in the Multi-Context Aggregator.
Furthermore, the morphological features can be encoded using a character-level CNN following Kim et al. (2016), and the result can be concatenated with the output of the Multi-Context Aggregator. Thus, our model can leverage both the contexts and morphological information to infer OOV embeddings.
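The two-level data flow can be sketched as below. This is a heavily simplified illustration under several assumptions that are not taken from the paper: a parameter-free single-head attention stands in for the full encoding block, mean pooling turns an encoded sequence into one vector, all sentences share the same length, the character-level CNN output is replaced by a given feature vector `morph`, and a final linear map `W_out` produces the embedding.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attend(x):
    """Parameter-free single-head self-attention (stand-in for a full block)."""
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x


def context_encoder(sentence, a_pos):
    """sentence: (n, d) word vectors; a_pos: (n,) position-wise scalars."""
    weighted = sentence * a_pos[:, None]   # position-wise attention
    return attend(weighted).mean(axis=0)   # mean pooling is an assumption


def hice(contexts, a_pos, morph, W_out):
    """contexts: list of K (n, d) arrays; morph: (m,) character features."""
    encoded = np.stack([context_encoder(s, a_pos) for s in contexts])  # (K, d)
    aggregated = attend(encoded).mean(axis=0)  # no positional weighting here
    return np.concatenate([aggregated, morph]) @ W_out


# Toy usage with random inputs.
rng = np.random.default_rng(0)
K, n, d, m = 3, 6, 4, 2
contexts = [rng.normal(size=(n, d)) for _ in range(K)]
a_pos = rng.uniform(0.5, 1.5, size=n)
morph = rng.normal(size=m)
W_out = rng.normal(size=(d + m, d))
emb = hice(contexts, a_pos, morph, W_out)
emb_reordered = hice(contexts[::-1], a_pos, morph, W_out)
```

In this sketch the aggregator applies no positional weighting, so reordering the contexts leaves the predicted embedding unchanged, which matches the stated requirement that context order should not influence the aggregation.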

Fast and Robust Adaptation with MAML
So far, we directly apply the learned neural regression function $F_{\hat{\theta}}$ trained on $D_T$ to OOV words in $D_N$. This can be problematic when there is a linguistic and semantic gap between $D_T$ and $D_N$. For example, words with the same form but in different domains (Sarma et al., 2018) or at different times (Hamilton et al., 2016) can have different semantic meanings. Therefore, to further improve the performance, we aim to adapt the learned neural regression function $F_{\hat{\theta}}$ from $D_T$ to the new corpus $D_N$. A naïve way to do so is to directly fine-tune the model on $D_N$. However, in most cases, the new corpus $D_N$ has far less data than $D_T$, and thus directly fine-tuning on insufficient data can be sub-optimal and prone to overfitting.
To address this issue, we adopt Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to achieve fast and robust adaptation. Instead of simply fine-tuning $F_{\hat{\theta}}$ on $D_N$, MAML provides a way of learning to fine-tune. That is, the model is first trained on $D_T$ to get a more promising initialization, from which fine-tuning the model on $D_N$ with just a few examples can generalize well.
More specifically, in each training episode, we first conduct gradient descent using sufficient data in $D_T$ to learn an updated weight $\theta^*$. For simplicity, we use $L$ to denote the loss function of our objective (1). The update process is:
$$\theta^* = \theta - \alpha \nabla_\theta L_{D_T}(\theta)$$
We then treat $\theta^*$ as an initialized weight to optimize $\theta$ on the limited data in $D_N$. The final update in each training episode can be written as follows.
$$\theta' = \theta - \beta \nabla_\theta L_{D_N}(\theta^*) = \theta - \beta \nabla_\theta L_{D_N}\left(\theta - \alpha \nabla_\theta L_{D_T}(\theta)\right) \qquad (2)$$
where both $\alpha$ and $\beta$ are learning-rate hyper-parameters of the two stages. The above optimization can be conducted with stochastic gradient descent (SGD). In this way, the knowledge learned from $D_T$ can provide a good initial representation that can be effectively fine-tuned with a few examples in $D_N$, thus achieving fast and robust adaptation. Note that the technique presented here is a simplified variant of the original MAML: the original considers many tasks, whereas our case has just two, i.e., a source task ($D_T$) and a target task ($D_N$). If embeddings need to be trained for multiple domains simultaneously, our approach can also be extended to deal with multiple $D_T$ and $D_N$.
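To make the two-stage update concrete, the sketch below applies it to a toy linear regression model, chosen only because its gradient and Hessian have closed forms; the mean-squared-error loss stands in for the HiCE loss $L$ and is not the loss used in the paper. The outer gradient is taken through the inner step via the chain rule, using $\partial\theta^*/\partial\theta = I - \alpha H_{D_T}$, where $H_{D_T}$ is the Hessian of the source loss. All data and names are synthetic and hypothetical.

```python
import numpy as np


def loss(theta, X, y):
    """Mean squared error of a linear model, standing in for the loss L."""
    r = X @ theta - y
    return 0.5 * float(r @ r) / len(y)


def grad(theta, X, y):
    """Gradient of `loss` with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)


def maml_step(theta, source, target, alpha, beta):
    """One episode: inner step on D_T, outer step on D_N through theta*."""
    X_t, y_t = source
    X_n, y_n = target
    theta_star = theta - alpha * grad(theta, X_t, y_t)
    # Chain rule: d(theta*)/d(theta) = I - alpha * H_T (H_T: source Hessian).
    hessian_t = X_t.T @ X_t / len(y_t)
    jacobian = np.eye(len(theta)) - alpha * hessian_t
    meta_grad = jacobian.T @ grad(theta_star, X_n, y_n)
    return theta - beta * meta_grad, meta_grad


# Toy usage: a large source set and a small, slightly shifted target set.
rng = np.random.default_rng(0)
X_t = rng.normal(size=(50, 3))
y_t = X_t @ np.array([1.0, -2.0, 0.5])
X_n = rng.normal(size=(5, 3))
y_n = X_n @ np.array([1.2, -1.8, 0.4])
source, target = (X_t, y_t), (X_n, y_n)
theta = np.zeros(3)
alpha, beta = 0.1, 0.05
new_theta, meta_grad = maml_step(theta, source, target, alpha, beta)
```

For a neural model such as HiCE, the same gradient would normally be obtained by automatic differentiation through the inner update rather than from an explicit Hessian.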

Experiments
In this section, we present two types of experiments to evaluate the effectiveness of the proposed HiCE model. One is an intrinsic evaluation on a benchmark dataset, and the other is an extrinsic evaluation on two downstream tasks: (1) named entity recognition and (2) part-of-speech tagging.

Experimental Settings
As mentioned above, our approach assumes an initial embedding $T$ trained on an existing corpus $D_T$. As all the baseline models learn embeddings from Wikipedia, we train HiCE on WikiText-103 (Merity et al., 2017) with the initial embedding provided by Herbelot and Baroni (2017). WikiText-103 contains 103 million words extracted from a selected set of articles. From WikiText-103, we select words with an occurrence count larger than 16 as training words. Then, we collect the masked supporting context set $S_t$ for each training word $w_t$ together with its oracle embedding $T_{w_t}$, and split the collected words into a training set and a validation set. We then train the HiCE model in the previously introduced episode-based K-shot learning setting, and select the best hyper-parameters and model using the validation set. After we obtain the trained HiCE model, we can either directly use it to infer the embedding vectors for OOV words in a new corpus $D_N$, or conduct adaptation on $D_N$ using the MAML algorithm as shown in Eq. (2).
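The data preparation steps in this paragraph might look roughly like the following sketch. The function names, the validation fraction, and the requirement that a training word already has an entry in the initial embedding vocabulary are illustrative assumptions; only the occurrence threshold of 16 comes from the text.

```python
import random
from collections import Counter, defaultdict


def select_training_words(sentences, vocab, min_count=16):
    """Words occurring more than `min_count` times that have an embedding."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return sorted(w for w, c in counts.items() if c > min_count and w in vocab)


def collect_contexts(sentences, words):
    """Map each selected word to all sentences that contain it."""
    wanted = set(words)
    contexts = defaultdict(list)
    for sent in sentences:
        for w in wanted.intersection(sent):
            contexts[w].append(sent)
    return contexts


def split_words(words, valid_fraction=0.1, seed=0):
    """Shuffle and split words into (train, validation); fraction is assumed."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy usage on a tiny tokenized corpus.
sentences = [["a", "b"]] * 17 + [["a", "c"]] * 3
vocab = {"a", "b", "c"}
words = select_training_words(sentences, vocab)
contexts = collect_contexts(sentences, words)
train_words, valid_words = split_words(words)
```

Masking of the target word is left to episode construction, so the same stored sentences can be reused for every word they contain.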

Baseline Methods
We compare HiCE with the following baseline models for learning OOV word embeddings.
• Additive: This method (Lazaridou et al., 2017) is a purely non-parametric algorithm that averages the word embeddings of the masked supporting contexts $S_t$. Specifically:
$$e_t^{\text{additive}} = \frac{1}{|S_t|} \sum_{c \in S_t} \frac{1}{|c|} \sum_{w \in c} T_w$$
This approach can also be augmented by removing the stop words beforehand.
• nonce2vec: This algorithm (Herbelot and Baroni, 2017) is a modification of the original gensim Word2Vec implementation, augmented by a better initialization with the additive vector, higher learning rates, a larger context window, etc. We directly used their open-source implementation.
• à la carte: This algorithm (Khodak et al., 2018) is based on an additive model, followed by a linear transformation $A$ that can be learned through an auxiliary regression task. Specifically:
$$e_t^{\text{ALC}} = A\, e_t^{\text{additive}}$$
We conduct experiments by using their open-source implementation.
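The two averaging-based baselines can be illustrated with the short sketch below. It assumes that the additive estimate is the mean, over context sentences, of each sentence's mean word vector, and that the linear transformation is fitted by ordinary least squares on (additive vector, oracle vector) pairs; it does not reproduce the open-source implementations mentioned above, and all names and toy vectors are hypothetical.

```python
import numpy as np


def additive(contexts, emb, stop_words=frozenset()):
    """Average of per-sentence mean embeddings over the supporting contexts."""
    sent_means = []
    for sent in contexts:
        vecs = [emb[w] for w in sent if w in emb and w not in stop_words]
        if vecs:
            sent_means.append(np.mean(vecs, axis=0))
    return np.mean(sent_means, axis=0)


def fit_linear_transform(additive_vecs, oracle_vecs):
    """Least-squares A such that A @ additive_vec approximates the oracle."""
    # Solve additive_vecs @ A.T = oracle_vecs for A.T, then transpose.
    A_t, *_ = np.linalg.lstsq(additive_vecs, oracle_vecs, rcond=None)
    return A_t.T


# Toy usage with made-up 2-dimensional embeddings.
emb = {
    "the": np.array([1.0, 1.0]),
    "horn": np.array([1.0, 0.0]),
    "plays": np.array([0.0, 1.0]),
}
contexts = [["the", "horn"], ["horn", "plays"]]
plain = additive(contexts, emb)
no_stop = additive(contexts, emb, stop_words={"the"})

rng = np.random.default_rng(0)
true_A = rng.normal(size=(3, 3))
U = rng.normal(size=(10, 3))   # additive vectors of frequent words
V = U @ true_A.T               # their oracle embeddings (synthetic)
A = fit_linear_transform(U, V)
```

On this toy input, removing the stop word "the" changes the estimate, which shows why stop-word filtering can matter for a plain average.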

Intrinsic Evaluation: Evaluate OOV Embeddings on the Chimera Benchmark
First, we evaluate HiCE on Chimera (Lazaridou et al., 2017), a widely used benchmark dataset for evaluating word embedding for OOV words.
Dataset The Chimera dataset simulates the situation in which an embedding model faces an OOV word in a real-world application. For each OOV word (denoted as a "chimera"), a few example sentences (2, 4, or 6) are provided. The dataset also provides a set of probing words and the human-annotated similarity between the probing words and the OOV words. To evaluate the performance of a learned embedding, Lazaridou et al. (2017) use Spearman correlation to measure the agreement between the human annotations and the machine-generated results.
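As an illustration of this evaluation protocol, the sketch below scores one OOV embedding by correlating its cosine similarities to the probing words with the human ratings. It assumes that the model-side similarity is cosine similarity and implements Spearman correlation directly in NumPy (average ranks for ties) to stay dependency-free; the function names and toy vectors are hypothetical.

```python
import numpy as np


def rank(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(rank(a), rank(b))[0, 1])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def chimera_score(oov_vec, probe_vecs, human_ratings):
    """Agreement between model similarities and human similarity ratings."""
    model_sims = [cosine(oov_vec, p) for p in probe_vecs]
    return spearman(model_sims, human_ratings)


# Toy usage: three probing words with made-up ratings.
oov_vec = np.array([1.0, 0.0])
probe_vecs = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]
score = chimera_score(oov_vec, probe_vecs, human_ratings=[3.0, 2.0, 1.0])
```

Since only ranks matter, the score is unaffected by the scale on which the human ratings were collected.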
Experimental Results Table 1 lists the performance of HiCE and the baselines with different numbers of context sentences. In particular, our method (HiCE+Morph+MAML) achieves the best performance among all methods under most settings. Compared with the current state-of-the-art method, à la carte, the relative improvements (i.e., the performance difference divided by the baseline performance) of HiCE are 4.0%, 5.4% and 9.3% for 2-, 4- and 6-shot learning, respectively. We also compare our results with those of the oracle embedding, which is the embedding trained on $D_T$ and used as ground truth to train HiCE. This result can be regarded as an upper bound. As is shown, when the number of context sentences ($K$) is relatively large (i.e., $K = 6$), the performance of HiCE is on a par with the upper bound (Oracle Embedding), and the relative performance difference is merely 2.7%. This indicates the significance of using an advanced aggregation model. Furthermore, we conduct an ablation study to analyze the effect of morphological features. By comparing HiCE with and without Morph, we can see that morphological features are helpful when the number of context sentences is relatively small (i.e., 2 and 4 shot). This is because morphological information does not rely on context sentences and can give a good estimate when contexts are limited. However, in the 6-shot setting, their performance does not differ significantly.
In addition, we analyze the effect of MAML by comparing HiCE with and without MAML. We can see that adapting with MAML improves the performance when the number of context sentences is relatively large (i.e., 4 and 6 shot), as it can mitigate the semantic gap between the source corpus $D_T$ and the target corpus $D_N$, which helps the model better capture the context semantics in the target corpus. We also evaluate the effect of MAML by comparing it with fine-tuning. The results show that directly fine-tuning on the target corpus can lead to extremely poor performance, due to the insufficiency of data. In contrast, adapting with MAML can leverage the source corpus's information as regularization to avoid overfitting.

Extrinsic Evaluation: Evaluate OOV Embeddings on Downstream Tasks
To illustrate the effectiveness of our proposed method in dealing with OOV words, we evaluate the resulting embeddings on two downstream tasks: (1) named entity recognition (NER) and (2) part-of-speech (POS) tagging.
Named Entity Recognition NER is a semantic task whose goal is to extract named entities from a sentence. Recent approaches to NER take word embeddings as input and leverage their semantic information to annotate named entities. Therefore, high-quality word embeddings have a great impact on the NER system. We consider the following two corpora, which contain abundant OOV words, to mimic the real situation of OOV problems.
• Rare-NER: This NER dataset (Derczynski et al., 2017) focuses on unusual, previously unseen entities in the context of emerging discussions, which are mostly OOV words.
• Bio-NER: The JNLPBA 2004 Bio-entity recognition dataset (Collier and Kim, 2004) focuses on technical terms in the biology domain, which also contain many OOV words.
Both datasets use entity-level F1-score as the evaluation metric. We use WikiText-103 as $D_T$ and these datasets as $D_N$. We select all the OOV words in each dataset and extract their context sentences. Then, we train different versions of OOV embeddings based on the proposed approaches and the baseline models. Finally, the inferred embeddings are used in an NER system based on the Bi-LSTM-CRF (Lample et al., 2016) architecture to predict named entities on the test set. We posit that higher-quality OOV embeddings result in better downstream task performance.
As we mainly focus on the quality of OOV word embeddings, we construct the test set by selecting sentences which have at least one OOV word. In this way, the test performance will largely depend on the quality of the OOV word embeddings. After the pre-processing, Rare-NER dataset contains 6,445 OOV words and 247 test sentences, while Bio-NER contains 11,748 OOV words and 2,181 test sentences. Therefore, Rare-NER has a high ratio of OOV words per sentence.
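The pre-processing described here, identifying OOV words, gathering their context sentences, and keeping only test sentences with at least one OOV word, could be sketched as follows. The function names and the toy vocabulary are hypothetical, and tokenization is assumed to have been done already.

```python
from collections import defaultdict


def find_oov_words(sentences, vocab):
    """All tokens in the new corpus that are missing from the vocabulary."""
    return {tok for sent in sentences for tok in sent if tok not in vocab}


def oov_contexts(sentences, oov_words):
    """Map each OOV word to the sentences in which it occurs."""
    contexts = defaultdict(list)
    for sent in sentences:
        for tok in set(sent) & oov_words:
            contexts[tok].append(sent)
    return contexts


def select_oov_sentences(sentences, oov_words):
    """Keep only sentences that contain at least one OOV word."""
    return [s for s in sentences if any(tok in oov_words for tok in s)]


# Toy usage with made-up tokens.
vocab = {"the", "is", "here", "cat", "sat", "a", "and"}
corpus = [
    ["the", "zorp", "is", "here"],
    ["the", "cat", "sat"],
    ["a", "blick", "and", "a", "zorp"],
]
oov = find_oov_words(corpus, vocab)
contexts = oov_contexts(corpus, oov)
test_sentences = select_oov_sentences(corpus, oov)
```

Restricting the test set this way means every evaluated sentence depends on at least one inferred embedding, so differences between OOV methods are not diluted by sentences they cannot affect.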
Part-of-Speech Tagging Besides NER, we evaluate the syntactic information encoded in HiCE through the lens of part-of-speech (POS) tagging, a standard task whose goal is to identify which grammatical group a word belongs to. We consider the Twitter social media POS dataset (Ritter et al., 2011), which contains many OOV entities. The dataset comprises 15,971 English sentences collected from Twitter in 2011. Each token is manually tagged with one of 48 grammatical groups, consisting of the Penn Treebank tag set and several Twitter-specific classes. The performance of a tagging system is measured by accuracy. Similar to the previous setting, we use different updating algorithms to learn the embeddings of OOV words in this dataset, and report the test accuracy of the learned Bi-LSTM-CRF tagger for each. The dataset contains 1,256 OOV words and 282 test sentences.
Table 2 shows the results on the downstream tasks. HiCE outperforms the baselines in all the settings. Compared to the best baseline, à la carte, the relative improvements are 12.4%, 2.9% and 5.1% for Rare-NER, Bio-NER, and Twitter POS, respectively. As mentioned above, the ratio of OOV words in Rare-NER is high. As a result, all the systems perform worse on Rare-NER than on Bio-NER, while HiCE achieves its largest improvement over the baselines on this dataset. Besides, our baseline embedding is trained on a Wikipedia corpus (WikiText-103), which is quite different from the biomedical and social media domains. The experiment demonstrates that HiCE trained on $D_T$ is already able to leverage general language knowledge that transfers across domains, and that adaptation with MAML can further reduce the domain gap and enhance the performance.

Qualitative Evaluation of HiCE
To illustrate how HiCE extracts and aggregates information from multiple context sentences, we visualize the attention weights over words and contexts. We show an example in Figure 2, where we choose four sentences from the Chimera dataset, with "clarinet" (a woodwind instrument) as the OOV word. From the attention weights over words, we can see that HiCE puts high attention on words that are related to instruments, such as "horns", "instruments", "flows", etc. From the attention weights over contexts, we can see that HiCE assigns the fourth sentence the lowest context attention; in that sentence, the instrument-related word "trumpet" is distant from the target placeholder, making it harder to infer the meaning from this context. This shows that HiCE indeed distinguishes important words and contexts from uninformative ones.
Furthermore, we conduct a case study to show how well the inferred embeddings for OOV words capture their semantic meaning. We randomly pick three OOV words with 6 context sentences from the Chimera benchmark and use Additive, FastText and HiCE to infer their embeddings. Next, we find the top-5 most similar words, i.e., those with the highest cosine similarity. As shown in Table 3, the Additive method only yields embeddings near neutral words such as "the" and "and", and cannot capture the specific semantics of different words. FastText can find words with similar subwords, but these often have totally different meanings. For example, for the OOV word "scooter" (a motor vehicle), FastText finds "cooter" as the most similar word, which looks similar at the character level but actually means a river turtle. Our proposed HiCE, however, can capture the true semantic meaning of the OOV words. For example, it finds "cars" and "motorhomes" (both vehicles) for "scooter", and finds "piano" and "orchestral" (both related to instruments) for "cello". This case study shows that HiCE can infer high-quality embeddings for OOV words.
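The nearest-neighbor lookup used in this case study can be sketched as below: normalize the vocabulary embeddings and the query, take dot products, and keep the k largest. The function name and the toy vocabulary are hypothetical and unrelated to the entries of Table 3.

```python
import numpy as np


def top_k_similar(query, vocab_words, vocab_matrix, k=5):
    """Return the k vocabulary words with the highest cosine similarity."""
    q = query / np.linalg.norm(query)
    m = vocab_matrix / np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every word
    best = np.argsort(-sims)[:k]      # indices of the k largest similarities
    return [(vocab_words[i], float(sims[i])) for i in best]


# Toy usage with made-up 2-dimensional embeddings.
vocab_words = ["cars", "piano", "the"]
vocab_matrix = np.array([[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
inferred = np.array([1.0, 0.0])  # stand-in for an inferred OOV embedding
neighbors = top_k_similar(inferred, vocab_words, vocab_matrix, k=2)
```

Normalizing once and using a single matrix-vector product keeps the lookup cheap even for large vocabularies.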

Related Work
OOV Word Embedding Previous studies of handling OOV words were mainly based on two types of information: 1) context information and 2) morphology features.
The first family of approaches follows the distributional hypothesis (Firth, 1957) to infer the meaning of a target word based on its context. If sufficient observations are given, simply applying existing word embedding techniques (e.g., word2vec) can already learn to embed OOV words. However, in real scenarios, an OOV word usually occurs only a few times in the new corpus, which limits the quality of the updated embedding (Lazaridou et al., 2017; Herbelot and Baroni, 2017). Several alternatives have been proposed in the literature. Lazaridou et al. (2017) proposed the additive method, which uses the average embedding of the context words as the embedding of the target word. Herbelot and Baroni (2017) extended the skip-gram model to nonce2vec by initializing with the additive embedding and using a higher learning rate and a larger window size. Khodak et al. (2018) introduced à la carte, which augments the additive method with a linear transformation of the context embedding.
The second family of approaches utilizes the morphology of words (e.g., morphemes, character n-grams and characters) to construct embedding vectors of unseen words based on sub-word information. For example, Luong et al. (2013) proposed a morphology-aware word embedding technique that processes a sequence of morphemes with a recurrent neural network. Bojanowski et al. (2017) extended the skip-gram model by assigning an embedding vector to every character n-gram and representing each word as the sum of its n-grams. Pinter et al. (2017) proposed MIMICK to induce word embeddings from character features with a bi-LSTM model. Although these approaches demonstrate reasonable performance, they rely mainly on morphological structure and cannot handle some special types of words, such as transliterations, entity names, or technical terms.
Our approach utilizes both pieces of information for an accurate estimation of OOV embeddings. To leverage limited context information, we apply a complex model in contrast to the linear transformation used in the past, and learn to embed in a few-shot setting. We also show that incorporating morphological features can further enhance the model when the context is extremely limited (i.e., only two or four sentences).
Few-shot learning The paradigm of learning new tasks from a few labelled observations, referred to as few-shot learning, has received significant attention. Early studies attempt to transfer knowledge learned from tasks with sufficient training data to new tasks. They mainly follow a pre-train then fine-tune paradigm (Donahue et al., 2014; Bengio, 2012; Zoph et al., 2016). More recently, meta-learning has been proposed and achieves strong performance on various few-shot learning tasks. The intuition of meta-learning is to learn generic knowledge on a variety of learning tasks, such that the model can be adapted to learn a new task with only a few training samples. Approaches to meta-learning can be categorized by the type of knowledge they learn. (1) Learn a metric function that embeds data in the same class closer to each other, including Matching Networks (Vinyals et al., 2016) and Prototypical Networks (Snell et al., 2017). The nature of metric learning makes it specific to classification problems. (2) Learn a learning policy that can quickly adapt to new concepts, including a better weight initialization as in MAML (Finn et al., 2017) and a better optimizer (Ravi and Larochelle, 2017). This line of research is more general and can be applied to different learning paradigms, including both classification and regression.
There have been emerging research studies that apply the above meta-learning algorithms to NLP tasks, including language modelling (Vinyals et al., 2016), text classification, machine translation (Gu et al., 2018), and relation learning (Xiong et al., 2018; Gao et al., 2019). In this paper, we propose to formulate OOV word representation learning as a few-shot regression problem. We first show that pre-training on a given corpus can solve the problem to some extent. To further mitigate the semantic gap between the given corpus and a new corpus, we adopt model-agnostic meta-learning (MAML) (Finn et al., 2017) to quickly adapt the pre-trained model to the new corpus.

Contextualized Embedding The HiCE architecture is related to contextualized representation learning (Peters et al., 2018; Devlin et al.). However, the goal of those models is to obtain a contextualized embedding based on a given sentence, with word or sub-word embeddings as input. In contrast, our work utilizes multiple contexts to learn OOV embeddings, a research direction that is orthogonal to theirs. In addition, the OOV embeddings learned by our model can serve as inputs to ELMo and BERT, helping them to deal with OOV words.

Conclusion
We studied the problem of learning accurate embeddings for out-of-vocabulary words and using them to augment a pre-trained embedding from only a few observations. We formulated the problem as a K-shot regression problem and proposed a hierarchical context encoder (HiCE) architecture that learns to predict the oracle OOV embedding by aggregating only K contexts and morphological features. We further adopted MAML for fast and robust adaptation to mitigate the semantic gap between corpora. Experiments on both a benchmark corpus and downstream tasks demonstrate the superiority of HiCE over existing approaches.