Memory, Show the Way: Memory-Based Few-Shot Word Representation Learning

Distributional semantic models (DSMs) generally require many examples of a word to learn a high-quality representation. This is in stark contrast with humans, who can guess the meaning of a word from only one or a few referents. In this paper, we propose Mem2Vec, a memory-based embedding learning method capable of acquiring high-quality word representations from fairly limited context. Our method directly adapts the representations produced by a DSM, using a long-term memory to guide its guess about a novel word. Built on a pre-trained embedding space, the proposed method delivers impressive performance on two challenging few-shot word similarity tasks. Embeddings learned with our method also lead to considerable improvements over strong baselines on NER and sentiment classification.


Introduction
Humans can learn a new word quickly from minimal exposure to its context, as in the following example: The Labrador runs happily towards me, barking and wagging its tail.
Even if this is the first time one hears of a Labrador, we can easily guess that it is an animal, or more specifically a dog, since it runs, barks and has a tail. Such ability to efficiently acquire representations from small data, namely fast mapping, is thought to be a hallmark of human intelligence that a cognitively plausible agent should strive to reach (Xu and Tenenbaum, 2007; Lake et al., 2015).
However, most distributional semantic models (DSMs), the mainstream of text representation learning in NLP, do not fare well on tiny data (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Wang et al., 2016). Even after learning many words, they still need sufficient examples to acquire a high-quality representation for a novel word. This not only constitutes a blow to DSMs' cognitive plausibility but also limits their practical use in NLP, because plentiful data is not always available, especially in domain-specific tasks. Even when a large corpus is at hand, its low-frequency words still outnumber the highly frequent ones, following the Zipfian distribution of natural language.
Given the above reasons, it is desirable to build a word embedding method capable of acquiring high-quality representations from limited contexts, i.e., few-shot word representation learning. We take lessons from hypothesis constraint (HC) theory to achieve this goal. HC is an influential account of human fast mapping (Xu and Tenenbaum, 2007). It holds that people learn a new word by eliminating incorrect hypotheses about the word's meaning, based on the usage of the target word and prior knowledge of context words. This is instructive to us, since embedding a word in a high-dimensional vector space also faces nearly unlimited candidate hypotheses (Wang et al., 2018). General DSMs cannot handle these candidates efficiently, so they fall back on multiple context examples to find the path, while we propose to let a memory show the way. We augment a DSM with a long-term memory to transfer knowledge from a large general-domain corpus and adapt the representation learning on the small text. In the context of HC theory, we directly constrain the hypothesis a DSM makes about the target word, using both its current usage and prior knowledge acquired from a large corpus. Experiments show our method makes educated guesses about a novel word efficiently from fairly limited examples, just as humans do in fast mapping.
It is worth noting that our emphasis on few-shot word representation learning does not mean all words, frequent or rare, need to be learned in the few-shot way. DSMs already do well at learning frequent words from large corpora. We augment a DSM with an external memory for few-shot representation learning, under the assumption that gradual acquisition of frequent words plus fast learning of rare words together make an integrated word representation learning scheme. Our ultimate goal is certainly an all-round architecture that learns text representations from any amount of data. We believe Mem2Vec, which bridges word representation learning from big data to small text, will be a building block of that ideal architecture. The primary contribution of this work is a memory-augmented word embedding model with a fast adaptation mechanism, capable of learning representations efficiently from tiny data. Experimental results show that the proposed Mem2Vec learns high-quality target word representations both from a single informative sentence and from a few casual sentences as context. To show its performance in downstream applications, Mem2Vec is used to pre-train embeddings for three NER tasks, where it also surpasses strong baselines. Since our model transfers from a general-domain corpus to a small target text, it is likely to face the problem of domain shift. Mem2Vec is impressively competent at tackling domain shift, as demonstrated in a series of cross-domain sentiment classification tasks.

Related Work
Rare Word Embedding Acquiring representations for rare words has long been a well-known challenge of natural language understanding (Herbelot and Baroni, 2017). Khodak et al. (2018) learn a linear transformation with pre-trained word vectors and linear regression, which can be efficiently adapted for novel words. Lazaridou et al. (2017) directly sum the context embeddings of a novel word as its representation, based on a pre-trained embedding space. Though not explicitly stated, their idea actually matches the HC theory; however, they constrain the hypothesis solely within the current context of the target word, which we think is not enough. We constrain the hypothesis with both memory and context. Another strand of solutions relies on auxiliary information, such as morphological structure (Luong et al., 2013; Kisselew et al., 2015) and external knowledge (Long et al., 2017). Lazaridou et al. (2013) derive morphologically complex words from sub-word parts with phrase composition methods. Ling et al. (2015) read the characters of a rare word with a bidirectional LSTM to deal with the open vocabulary problem in language modeling and NER. Hill et al. (2016) learn an embedding of a dictionary definition to match the pre-trained headword vector, while Weissenborn (2017) refines word embeddings with explicit background knowledge from a commonsense knowledge base. Different from this strand of work, our method does not fall back on auxiliary information. We acquire knowledge from a large unlabeled general-domain corpus, which is widely available.
Cross-Domain Word Embedding The knowledge accumulation phase of our model aims to learn an embedding space from a large general-domain corpus. This is partially in line with work on cross-domain word embeddings. Among these works, one strand of approaches hypothesizes that a word frequent in multiple domains should mean nearly the same across these domains. Bollegala et al. (2015) call such words pivots, share their embeddings across domains and use them to predict the surrounding non-pivots. Yang et al. (2017) selectively incorporate source-domain information into target-domain word embeddings with a word-frequency-based regularization. These pivot-based methods have delivered improvements on sentiment analysis and NER. However, they have the defect that only a limited set of target-domain words benefits from the knowledge transfer.
Memory-Based Meta Learning Memory-augmented neural networks (MANNs) are widely used in different tasks for efficient recall of experience and fast adaptation to new knowledge (Bahdanau et al., 2014; Merity et al., 2017; Miller et al., 2016; Grave et al., 2017; Sprechmann et al., 2018). Intuitively, meta-learning, which aims to train a model that quickly adapts to a new task, should benefit from memory architectures, and empirically it does (Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016; Munkhdalai and Yu, 2017; Kaiser et al., 2017). The memory we use is closely related to that of Kaiser et al. (2017), but with three major differences. First, they retrieve only the single nearest neighbor from the memory, while we retrieve an average of the k nearest neighbors weighted by how well they match the current context. Second, they focus on supervised learning and have no fast adaptation mechanism for acquiring representations. Third, they update the memory according to whether the returned value is strictly identical to the target. However, synonyms are common in natural language text, so we take a softer criterion and update the memory according to the vector similarity between the addressed value and the target word embedding.

Methods
Our model, in brief, is a neural-network-based DSM augmented by a long-term memory. As shown in Fig. 1, it operates in two consecutive phases, first accumulating knowledge and then doing fast adaptation on new words, just as the human learning process goes. In the knowledge accumulation phase, we train the memory-augmented DSM to learn a semantic space. We also accumulate similar contexts of target words in the memory, gradually forming "prototype" representations. The pre-trained embedding space and the saved prototypes constitute the acquired knowledge. The fast adaptation phase occurs when we need to learn a new word from minimal context. In this phase, we directly combine the context embedding with content retrieved from the memory to form the target word representation.
In the following sections, we first introduce the memory architecture and content-based addressing. We then detail how exactly our model runs in the knowledge accumulation and fast adaptation phases, respectively.

Memory Addressing
M is a non-parametric key-value memory which stores a key-value pair (k_i, v_i) in each memory slot i. Inspired by Kaiser et al. (2017), we keep an additional vector A tracking the ages of the slots; all ages are initially zero. The whole memory thus looks like M = (K_{m x ks}, V_m, A_m), where m denotes the memory size and ks the key vector size. Given a normalized query q, its nearest neighbor in M is defined as any of the keys that maximize the cosine similarity with q:

    n_1 = argmax_i cos(q, K[i])                                    (1)

During training, a query to the memory M searches for the k nearest neighbors, a natural extension of (1):

    (n_1, ..., n_k) = k-argmax_i cos(q, K[i])                      (2)

We take an average weighted by how well the addressed memory slots match the query:

    w_i = exp(cos(q, K[n_i]) / T) / sum_j exp(cos(q, K[n_j]) / T)
    R_K = sum_i w_i K[n_i],   R_V = sum_i w_i V[n_i]               (3)

This is in effect a dot-product attention over the k nearest neighbors; R_K and R_V are the final outputs of the memory. Note that the softmax in (3) uses a temperature T, normally set to 1; a higher value of T produces a softer probability distribution. We set different temperatures in the knowledge accumulation and fast adaptation phases, as detailed in the following subsections.
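As a concrete illustration, the addressing step can be sketched in a few lines of NumPy (a minimal sketch; array shapes and function names are our own, not the authors' implementation):

```python
import numpy as np

def address_memory(q, K, V, k=5, T=1.0):
    """Content-based addressing sketch: return an attention-weighted average
    of the k memory slots whose keys are nearest (by cosine similarity) to
    the query. q: (d,) query; K: (m, d) keys; V: (m, d) values; T: softmax
    temperature."""
    K_norm = K / np.linalg.norm(K, axis=1, keepdims=True)
    q_norm = q / np.linalg.norm(q)
    sims = K_norm @ q_norm                 # cosine similarity to every key
    idx = np.argsort(-sims)[:k]            # indices of the k nearest keys
    logits = sims[idx] / T                 # temperature-scaled scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                           # softmax attention weights
    return w @ K[idx], w @ V[idx], idx     # R_K, R_V, addressed slots
```

A higher T spreads the attention over all k neighbors, while a very low T collapses it onto the single nearest key, recovering the hard nearest-neighbor lookup of Kaiser et al. (2017).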

Knowledge Accumulation
Given a target word embedding t_j with its context embedding c_j as input, we query the memory with c_j as in (1)-(3) and retrieve R_K, R_V. We hope the semantic relation between the current example and the content retrieved from memory is consistent from contexts to target words, i.e., that t_j matches the retrieved value R_V to the same degree that c_j matches the retrieved key R_K; this yields a consistency loss L_m. We also hope the target word representation fully incorporates context information and stays far from negative examples, so we inherit the skip-gram negative sampling loss of Mikolov et al. (2013):

    L_s = - sum_{(t_j, c_j) in D} #(t_j, c_j) [ log sigma(c_j . t_j) + sum_i E_{p_i ~ P(t_j)} log sigma(-p_i . t_j) ]

where D denotes the whole corpus and #(t_j, c_j) counts how many times the target and the context word co-occur. Each p_i is a negative sample drawn from the distribution P(t_j), as Mikolov et al. (2013) do. We minimize the sum of the two losses:

    L = L_m + L_s

Memory Update. After each query, we update a memory slot according to how frequently the key is addressed and how useful the addressed value is. The update is done piecewise according to the similarity between the addressed values (V[n_1], ..., V[n_k]) and the target word. For every addressed value V[n_i] whose similarity to the target word exceeds a threshold beta, we update only its corresponding key, taking a renormalized average of the current key and the query:

    K[n_i] <- (q + K[n_i]) / ||q + K[n_i]||

Otherwise, no addressed value correlates strongly enough with the target, and we instead choose the memory slot n' with maximum age and overwrite the items stored in it:

    K[n'] <- q,   V[n'] <- t_j

The age of each updated slot is reset to zero, while all non-updated slots have their age incremented by 1. Memory updated in this way gradually accumulates similar contexts of a word into the same slot, in other words forming the prototype representation of that word.
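The update rule can be sketched as follows (our own reading of the procedure; the threshold beta, the slot layout and the helper names are assumptions, not the authors' code):

```python
import numpy as np

def update_memory(q, t, K, V, A, idx, beta=0.5):
    """Memory-update sketch. q: normalized query (context embedding),
    t: target word embedding, K/V: key/value matrices, A: integer age
    vector, idx: slots addressed by the query, beta: similarity threshold."""
    t_norm = t / np.linalg.norm(t)
    V_norm = V[idx] / np.linalg.norm(V[idx], axis=1, keepdims=True)
    hit = (V_norm @ t_norm) > beta           # which addressed values match t
    touched = list(np.asarray(idx)[hit])
    for i in touched:
        merged = q + K[i]                    # fold the query into the key,
        K[i] = merged / np.linalg.norm(merged)  # forming a context prototype
    if not touched:
        oldest = int(np.argmax(A))           # no match: overwrite oldest slot
        K[oldest], V[oldest] = q, t
        touched = [oldest]
    A += 1                                   # every slot ages by one ...
    A[touched] = 0                           # ... except the updated ones
    return K, V, A
```

The soft criterion (cosine similarity above beta rather than strict equality) is what lets synonymous targets share a prototype slot, as discussed in the related-work comparison with Kaiser et al. (2017).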

Fast Adaptation
Now we show how to poll the memory to efficiently learn a new word representation from limited context. This is where the hypothesis constraint takes place. Specifically, given a new word embedding t*_j to be learned and its context embedding c*_j, we retrieve the memory relevant to c*_j as in (2)-(3) and get R*_K. We then adapt the context embedding with the retrieved memory to form the target word representation:

    t*_j = alpha * c*_j + (1 - alpha) * R*_K

where alpha can be tuned as a hyper-parameter or learned with regression models. We also tried incorporating R*_V, but the aggregated prototype R*_K consistently performs better. We pay additional attention here to the softmax temperature T, since it conditions how the model "treats" the retrieved memory. Contexts are fairly limited in the few-shot case, so how the retrieved memory is treated crucially affects the quality of the learned representation. A higher temperature leads to a softer attention distribution, meaning the model is more likely to draw on all retrieved contents; a lower temperature makes the model focus on the memory slots most similar to the query. We predict that a slightly higher temperature will generally be better in the fast adaptation phase: the HC theory points out that hypotheses are not strictly mutually exclusive but overlap with each other, which corresponds to the higher-temperature condition. We test this in the experiments.
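The adaptation step itself is small; the sketch below assumes a simple convex combination of context and memory (our reading of the text; the exact combination form and the treatment of alpha may differ from the authors' choice):

```python
import numpy as np

def fast_adapt(c_star, R_K, alpha=0.5):
    """Form the novel word's representation by interpolating its context
    embedding c_star with the retrieved memory prototype R_K; alpha is a
    tunable mixing weight."""
    t_star = alpha * c_star + (1.0 - alpha) * R_K
    return t_star / np.linalg.norm(t_star)   # keep it on the unit sphere
```

With alpha = 1 this degrades to a pure context-based representation, as in the SUM baseline of Lazaridou et al. (2017); with alpha = 0 the representation is the memory prototype alone.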

Few-shot Word Similarity Tasks
We test the proposed method on two few-shot word similarity tasks; Fig. 2 gives examples of both. In the following subsections we introduce the datasets in detail and report the performance of the tested methods on these tasks.

Tasks and Datasets
Definitional Nonce Task We evaluate on the Definitional Nonce dataset (Herbelot and Baroni, 2017) to simulate the process in which a competent speaker learns a novel word from one informative sentence. The dataset includes 1,000 target words, 700 for training and 300 for testing. Each target word corresponds to a single context sentence extracted from its Wikipedia definition. All context sentences have been manually checked to be definitional enough to describe the corresponding target words. After tuning parameters on the training data, the model is required to learn the target word representations from the provided contexts in the test set. Learned representations are assessed by their similarity to ground-truth vectors produced with exposure to the whole corpus. Following Herbelot and Baroni (2017)'s settings, we use the Reciprocal Rank (RR) of the ground-truth vector among all nearest neighbors of the learned representation for fair comparison of different methods. The mean RR over all test instances in the dataset is the final score.

Figure 2: Examples of the Nonce Definition and Chimera tasks. Nonce Definition provided context: "___ international inc is an american multinational conglomerate company that produces a variety of commercial and consumer products engineering services and aerospace systems for a wide variety of customers from private consumers to major corporations and governments"; ground-truth word: Honeywell. Chimera (L2) provided contexts: "1. Canned sardines and ____ between two slices of whole meal bread and thinly spread Flora Original. 2. Erm, ____, low fat dairy products, incidents of heart disease for those who have an olive oil rich diet."; probe words: rhubarb, onion, pear, strawberry, limousine, cushion; human responses: 2.57, 4.43, 3.86, 3.71, 1.43, 2.14.
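The mean reciprocal rank used here can be computed as follows (a minimal sketch; shapes and names are ours):

```python
import numpy as np

def mean_reciprocal_rank(learned, gold_idx, vocab):
    """Mean RR of each ground-truth vector among the nearest neighbors of
    the corresponding learned vector, by cosine similarity.
    learned: (n, d) learned representations; gold_idx: (n,) indices of the
    ground-truth words; vocab: (|V|, d) all pre-trained word vectors."""
    vocab_n = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    rr = []
    for vec, g in zip(learned, gold_idx):
        sims = vocab_n @ (vec / np.linalg.norm(vec))
        # 1-based rank = 1 + number of vocabulary items scoring above gold
        rank = 1 + int((sims > sims[g]).sum())
        rr.append(1.0 / rank)
    return float(np.mean(rr))
```

Note that with a 200,000-word vocabulary the rank denominator can be very large, which is why the absolute MRR values reported later look small.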
Chimera Task Our second evaluation, on the Chimera dataset (Lazaridou et al., 2017), simulates the case where a speaker learns a new word from a more casual multi-sentence context, not as highly informative as the definitions in the Nonce dataset. There are three sub-tasks in Chimera: L2, L4 and L6, providing 2, 4 and 6 sentences respectively as context for each of the 330 instances in the dataset. The tested model needs to learn the target word representation from the provided contexts. The similarity between the learned embedding and each of the probe words is measured and compared to human judgments by Spearman correlation. The final score is the average Spearman correlation across all test pairs.
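The Spearman score can be computed as the Pearson correlation of ranks (a sketch assuming no tied values; real evaluation code should handle ties, e.g. via average ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two score lists, computed as the
    Pearson correlation of their ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Here x would hold the model's cosine similarities between the learned chimera embedding and the probe words, and y the corresponding human ratings.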

Baselines
Our model is compared to several baselines: Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), SUM (Lazaridou et al., 2017) and N2V (Herbelot and Baroni, 2017). GloVe and Word2Vec are representatives of traditional DSMs; with them we test how traditional DSMs perform in few-shot representation learning without any additional mechanism for small data. SUM and N2V are proposed especially for learning from small corpora. They adapt Word2Vec's skip-gram structure for incremental learning, show improvements on the Chimera dataset, and partially match the HC theory on which Mem2Vec is based. Note that several rare-word learning methods (Long et al., 2017; Xu et al., 2014; Lazaridou et al., 2013) that rely on auxiliary information do not apply to most of our task settings. In the Nonce and Chimera tasks, the context for target word learning is strictly limited for fair comparison, so external knowledge is banned. And the target word, as shown in Fig. 2, is just a slot which provides no morphological hints, so sub-word methods are also excluded.
Both the above baselines and the proposed Mem2Vec use a dump of Wikipedia to learn a fundamental semantic space. To be specific, N2V and SUM use embeddings pre-trained by Word2Vec, while Mem2Vec acquires its prior knowledge, all from that Wikipedia corpus. We calculate correlations with the similarity ratings in the MEN and SimLex datasets to verify that the pre-trained semantic space is ready for use.

Results
Nonce Definition Task We show the results of the Nonce Definition task in Table 1. Before analyzing the results, we note that the MRR achieved by the tested models seems quite low. This is not odd, since matching the ground-truth word in a vocabulary of 210,512 sets a potentially very large denominator in the reciprocal-rank calculation. Our model achieves an MRR of 0.05416, corresponding to a median rank of 518 for the true vector among the challenging two hundred thousand neighbors, and surpasses all the baselines. N2V and SUM also deliver satisfactory performance, with N2V working better. The naive Word2Vec and GloVe totally fail in the Nonce task, supporting the importance of adapting traditional DSMs for few-shot word representation learning.

Chimera Task The results on the three Chimera tasks are also shown in Table 1. Mem2Vec outperforms the baselines in all three context-length settings. SUM performs steadily well from the Nonce to the Chimera task, suggesting the effectiveness of constraining the hypothesis space with contexts. But the continuous improvement of Mem2Vec over SUM confirms the advantage of our model, which combines "global" semantic information from the memory with the local contexts. N2V also works here but not as well as in the Nonce task, probably because the contexts in Chimera are not as informative. Such a performance drop may indicate N2V's limited scalability to downstream NLP tasks, since not all real-world texts are as informative as those in the Nonce Definition task. We test this speculation with the NER tasks in Section 5.

Memory Parameter Analysis
We are interested in how the two key parameters of the memory, the softmax temperature T and the memory size, influence the quality of learned representations. We use the Nonce Definition task as the testbed. While studying the influence of one parameter, the other parameters are fixed. We run the model three times for each candidate value and take the average score as the final result.

Softmax Temperature Fig. 3 (left) shows task performance under different softmax temperatures in semi-log coordinates. We are a little surprised to find that the curve is roughly bell-shaped and that a mid-high temperature leads to the best performance. A mid-high T means the model gives neither too large nor too small weights to particular retrieved memories. This meets the HC theory's account of how humans weight the constrained hypotheses: we neither trust a single hypothesis alone nor treat all hypotheses equally. The experiments show that a similar principle applies to our model.

Memory Size Fig. 3 (right) shows task performance under different memory sizes. We find that increasing the memory size does lead to improved performance, but the improvements become minor once the memory size exceeds 20,000. We attribute this to the fact that we do not save specific examples; we accumulate similar contexts together in the memory to form prototypes. At retrieval time, prototypes can be combined in different ways to represent multiple examples, so a smaller memory can work nearly as well as a bigger one.

Extrinsic Tasks
We hope the learned representations not only perform well on word similarity tasks but also transfer to downstream NLP tasks. NER on domain-specific datasets is an ideal benchmark: named entities in these datasets are relatively low in frequency and not well covered by general-domain corpora, making them tough for a traditional DSM to learn. Besides, when transferring from a general-domain corpus to a small target text, domain shift is a highly possible issue. We test whether Mem2Vec can tackle domain shift with a series of cross-domain sentiment analysis tasks.

Tasks and datasets
Domain-Specific NER We use the BioNLP11species (Kim et al., 2011), AnatEM (Pyysalo and Ananiadou, 2013) and NCBI-disease (Dogan et al., 2014) datasets, respectively from the taxonomy, anatomy and pathology literature. We train embeddings with the tested methods to initialize the recognizer, whose performance then demonstrates whether the tested models learn representations well for rare words. Cross-Domain Sentiment Classification Cross-domain sentiment classification on the Amazon Review dataset (Blitzer et al., 2007) is chosen as a benchmark. This dataset includes reviews from four product categories: books, DVDs, kitchen and electronics, suitable for the cross-domain setting. Using one category as the source domain and another as the target, we get 12 pairs for experiments. We train the classifier with source-domain data and test it directly on the target domain, using the pre-trained embeddings as input features. Note that through this task we also test how Mem2Vec performs when transferring from a small text, since in all the above experiments we learn prior knowledge from a large corpus.

Baselines
In addition to the four baselines considered in the word similarity tasks, we also compare with DAREP (Bollegala et al., 2015) and CRE (Yang et al., 2017) in the NER and sentiment classification tasks. They are both pivot-based methods for cross-domain embedding learning which fare well in some downstream tasks. Besides, we introduce SCL (Blitzer et al., 2006), a well-cited cross-domain sentiment analysis method, as a baseline for the sentiment classification task only. For NER, we use embeddings pre-trained by the tested methods as the only input features for an LSTM-CRF recognition model (Lample et al., 2016). We simply mix the Wikipedia corpus with a dump of PubMed as our source corpus. Note that N2V and SUM cannot be used directly to pre-train embeddings for downstream tasks, since they focus on novel word learning. We thus explicitly treat words occurring fewer than 5 times as rare words and the others as frequent words: N2V and SUM learn the frequent words with Word2Vec and the rare words in their own way. This setting also applies to the sentiment classification task.
For sentiment classification, we use a multilayer perceptron (MLP) as the classifier, with one hidden layer of 400 nodes, ReLU activation and a softmax output function.

Named Entity Recognition

Table 2 shows the results of domain-specific named entity recognition. Used to pre-train embeddings, Mem2Vec achieves a higher F1-score than all the baselines. It surpasses CRE and DAREP, which bring only slight improvements over Word2Vec. CRE and DAREP both rely on words with clear co-occurrence patterns in the source and target domains as pivots for cross-domain transfer. This indicates the advantage of Mem2Vec over traditional word-frequency-based methods in fast mapping cases where the word co-occurrence pattern is not clear.
Our improvements over N2V and SUM are more obvious than in the two word similarity tasks. This again affirms our speculation that constraining the hypothesis solely with the context is not enough. In the NER setting, the context of one named entity is likely to be filled with other named entities which are also low in frequency. Directly summing the context as SUM does, or risking a larger window size as N2V does, may lead to over-fitting. In contrast, every training step of our method incorporates relevant information from all the experienced examples stored in the memory, alleviating the danger of learning representations that overfit the local contexts.
In addition, it is worth noting that parameter tuning for N2V is no picnic. In our experiments, the original settings of a high learning rate, large window size and short iteration span do not lead to satisfactory performance. More conservative parameter selection gets N2V back on track but departs from its fast-mapping intention.

Sentiment Classification

Fig. 4 shows the results of Amazon Review sentiment classification. Mem2Vec delivers impressive performance, beating all the baselines, including CRE and DAREP, in 10 of the total 12 pairs. This demonstrates the advantage of the memory as a transfer medium over the word-frequency-based transfer of CRE and DAREP. Still, CRE and DAREP remain strong baselines in the cross-domain task, surpassing SCL by a large margin. N2V and SUM are built for learning representations from small data, but they do not consider the possible domain discrepancy when using pre-trained embeddings on the target small text, so they do not bring much improvement over Word2Vec and GloVe. This also reminds us that to put few-shot word representation learning methods into practical use, domain shift should be properly addressed.

Conclusion
We presented an integrated representation learning scheme which gradually learns from a big corpus and quickly adapts on tiny data. It accumulates knowledge with a long-term memory to adapt the representation learning of a novel word in the few-shot case. Such adaptation constrains the "guess" of a DSM about the novel word according to the most relevant representation learning experience, inspired by the hypothesis constraint theory of fast mapping. Experiments show the proposed method learns high-quality representations from both highly informative and less definitional contexts of limited size. Embeddings pre-trained with our model also lead to improvements in named entity recognition and sentiment analysis.
This work is our effort towards an ideal word representation learning scheme which learns from any amount of data. In future work, we will explore more effective memory addressing and updating approaches to boost few-shot representation learning. We believe not all examples are equally important and worth memorizing; learning to memorize core examples should alleviate the data hunger of representation learning methods.