Second-order contexts from lexical substitutes for few-shot learning of word representations

There is a growing awareness of the need to handle rare and unseen words in word representation modelling. In this paper, we focus on few-shot learning of emerging concepts that fully exploits only a few available contexts. We introduce a substitute-based context representation technique that can be applied on top of an existing word embedding space. Previous context-based approaches to modelling unseen words only consider bag-of-words first-order contexts, whereas our method aggregates contexts as second-order substitutes that are produced by a sequence-aware sentence completion model. We experimented with three tasks that test the modelling of emerging concepts. We found that these tasks place different emphasis on first-order and second-order contexts, and that our substitute-based method achieves superior performance on naturally-occurring contexts from corpora.


Introduction
As language vocabulary follows a Zipfian distribution, we expect to encounter a large number of rare and unseen words no matter how large the training corpus is. The effective handling of such words is thus crucial for Natural Language Processing (NLP).
Attempts to learn rare and unseen word representations can be categorized into the following three approaches: (1) constructing target word embeddings from subword components (Pinter et al., 2017; Bojanowski et al., 2017), (2) leveraging definitions or relational structures from external resources such as WordNet (Bahdanau et al., 2017; Pilehvar and Collier, 2017), and (3) modelling the target word from the few available contexts. Our paper falls into the last approach.
We demonstrate improvements in performance by employing an alternative context representation, second-order lexical substitutes, as opposed to the traditional bag-of-words context representation. In line with previous research in this area, we evaluate our methodology on three tasks that measure the quality of unseen word representations induced from contexts (Lazaridou et al., 2017; Herbelot and Baroni, 2017; Khodak et al., 2018). Our results reveal that the three tasks involve different types of contexts, which place different emphasis on first-order or second-order contexts. Our second-order substitute-based method achieves the best performance for modelling rare words in natural contexts from corpora. In the tasks in which both first-order and second-order contexts are important, an ensemble of these two types of contexts yields superior performance.


Related work


First-order context
The most naive way of inducing a new word representation from contexts is to simply take the average of the context word embeddings that co-occur with the target word in a sentence. With stop words removed, this simple method has proven to be a strong baseline, as shown in Lazaridou et al. (2017) and Herbelot and Baroni (2017). A potential improvement over the simple additive baseline is to weight context words with ISF (inverse sentence frequency). We follow the definition of ISF in Samardzhiev et al. (2018) and implement it as a baseline model in our study. More recently, Khodak et al. (2018) learn a transformation matrix to reconstruct pre-trained word embeddings, which essentially learns to highlight informative dimensions. Along a different line, Herbelot and Baroni (2017) adopt a high learning rate and a dedicated processing strategy for new words, but this requires the contexts that come at the beginning of training to be maximally informative.
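The additive and ISF-weighted baselines described above can be sketched as follows. This is a minimal illustration with numpy; the log-based ISF form is an assumption standing in for the exact definition of Samardzhiev et al. (2018), and all names are illustrative:

```python
import numpy as np

def isf_weights(corpus_sentences, words):
    # Inverse sentence frequency, analogous to IDF computed over sentences
    # (assumed log form; not necessarily the exact published definition).
    n = len(corpus_sentences)
    sf = {w: sum(w in s for s in corpus_sentences) for w in words}
    return {w: np.log(n / max(sf[w], 1)) for w in words}

def induce_first_order(context_words, embeddings, weights=None):
    # Average the context word vectors, optionally ISF-weighted.
    vecs, ws = [], []
    for w in context_words:
        if w in embeddings:  # skip OOV and (pre-removed) stop words
            vecs.append(embeddings[w])
            ws.append(1.0 if weights is None else weights.get(w, 0.0))
    if not vecs:
        return None
    return np.average(np.array(vecs), axis=0, weights=ws)
```

With `weights=None` this reduces to the plain additive baseline; passing the ISF dictionary gives the weighted variant.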
Recent work implements a memory-augmented word embedding model (Sun et al., 2018); however, our system shows comparable or superior performance on the two intrinsic tasks that they use (Table 1 below and Table 1 of their paper).

Second-order substitute-based context
An alternative to a bag-of-words representation is a second-order substitute vector generated by a language model for the target word's slot. For example, we can represent the context 'It is a ___ move.' as a substitute vector [big 0.35, good 0.28, bold 0.05, ...], with the numbers indicating the fitness weight of each substitute in the context (Yatbaz et al., 2012; Melamud et al., 2015). Melamud et al. (2016) later introduced context2vec, which trains both context and word embeddings in a setup similar to CBOW (Mikolov et al., 2013), except that the context is represented with a bidirectional LSTM rather than as a bag of words. In this way, context2vec captures sequence information in the context and is able to produce high-quality substitutes for a sentence-completion task, while overcoming the sparseness issues of previous substitute-based approaches. Kobayashi et al. (2017) fine-tune this context2vec representation to compute entity representations in a discourse for the language modelling task.
A related application of second-order substitutes is word sense induction. Baskaya et al. (2013) represent contexts as second-order substitutes and apply co-occurrence modelling on top of pairs of instance ids and substitutes. Alagić et al. (2018) propose a method similar to ours and show that second-order lexical substitutes and first-order contexts complement each other in word sense induction. Our paper provides alternative evidence for the use of lexical substitutes in the setting of rare word modelling, with an analysis of the effect of different contexts.

Proposed Method
In this paper, we make a simple modification to previous work by representing the context of an unseen word as the weighted sum of lexical substitute vectors in a continuous embedding space such as the word2vec space. This can be seen as a post-processing technique applied to an existing embedding space. The substitutes and their fitness scores are generated by context2vec. Compared with the context2vec representation itself, our method isolates the effect of the second-order substitutes and can be applied on top of an existing pre-trained embedding space. For each context, we generate the top N most likely substitutes at the slot of the unseen word by computing the nearest neighbours of the context2vec context representation (from experiments on the training sets of the tasks, as there are no development sets, we found N=20 to be optimal). We then compute the centroid of these substitutes in our base word embedding space, weighted by each substitute's fitness, i.e. its cosine similarity to the context representation. Let ContextVec be the context representation produced by context2vec (symbols in bold indicate vectors), S be the set of the top 20 substitute target word vectors produced by context2vec, S' be the same 20 substitutes looked up in our base word embedding space, and f(S_i) be the normalized fitness score of S_i as defined in equation 1:

f(S_i) = cos(ContextVec, S_i) / Σ_j cos(ContextVec, S_j)    (1)

The substitute-based context (SC), and thus the unseen word representation for this context, is defined in equation 2:

SC = Σ_i f(S_i) · S'_i    (2)

If the unseen word occurs multiple times, we average the unseen word representations across the multiple contexts.
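The two equations can be sketched as follows, assuming the substitutes arrive as (word, context2vec target vector) pairs; the function and variable names are illustrative, not taken from any released implementation:

```python
import numpy as np

def substitute_context(context_vec, substitutes, base_emb):
    # substitutes: list of (word, context2vec target vector) for the
    # top-N slot fillers; base_emb: the pre-trained base embedding space.
    fits, vecs = [], []
    for word, c2v_vec in substitutes:
        if word not in base_emb:
            continue
        # Fitness = cosine(context representation, substitute), eq. (1).
        cos = np.dot(context_vec, c2v_vec) / (
            np.linalg.norm(context_vec) * np.linalg.norm(c2v_vec))
        fits.append(cos)
        vecs.append(base_emb[word])  # look the substitute up in the base space
    fits = np.array(fits)
    fits = fits / fits.sum()         # normalized fitness f(S_i)
    # SC = sum_i f(S_i) * S'_i, eq. (2): weighted centroid in the base space.
    return (fits[:, None] * np.array(vecs)).sum(axis=0)
```

Averaging the returned SC vectors over a word's available contexts gives the final unseen word representation.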
To directly compare with previous studies, we take the word2vec embedding model and the 1.6B-token Wikipedia training corpus provided by Herbelot and Baroni (2017), both for our substitute-based method and for training context2vec. Model parameters for training context2vec, listed in Appendix A, are tuned on the training sets of the intrinsic tasks, as there are no development sets.

The definitional Nonce dataset (Nonce)
Nonce is introduced in Herbelot and Baroni (2017) as a task that challenges models to reconstruct target word embeddings from single Wikipedia definitions. The quality of the representations is evaluated by measuring how close they are to the original word embeddings trained from the whole Wikipedia corpus. Following Herbelot and Baroni (2017), we report in the Nonce columns of Table 1 the mean reciprocal rank (MRR) and median rank (Med. Rank) of the gold vector (trained from the whole Wikipedia corpus) in the ranked list of nearest neighbours of the induced representation over the 300 test cases. We see strong performance from first-order context representations, especially the a la carte method. Manual inspection shows that definitions are designed to be maximally informative, with many synonyms, hypernyms or words semantically related to the target word in the context, and first-order context models can easily exploit this information. Also, the sequential context around the target word in a definition may not reflect the contexts in which the target word is typically used in a corpus. The good performance of first-order context models is therefore to be expected. Furthermore, the Nonce task tests how well a model reconstructs the original embedding but does not probe the semantic properties or relations captured in the induced word representations. A la carte is thus especially suitable for this task, as it has been explicitly trained to match the original embedding. However, we demonstrate in the following experiments that the superior performance of a la carte may not always transfer to other tasks.
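The ranking metrics above can be sketched as follows (an illustrative implementation, not the official evaluation script):

```python
import numpy as np

def rank_of_gold(induced, gold_word, vocab_emb):
    # Rank of the gold vector among the nearest neighbours of the
    # induced representation, by cosine similarity over the vocabulary.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = {w: cos(induced, v) for w, v in vocab_emb.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)
    return ranked.index(gold_word) + 1  # 1-indexed rank

def mrr_and_median(ranks):
    # Mean reciprocal rank and median rank over all test cases.
    return np.mean([1.0 / r for r in ranks]), float(np.median(ranks))
```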

The Chimera dataset (Chimera)
In the Chimera dataset, Lazaridou et al. (2017) introduce unseen novel concepts (chimeras), each of which is formed by combining two related nouns (for example, buffalo and elephant). Each novel concept is accompanied by 2, 4 or 6 natural contexts that originally belong to the related nouns. The model needs to induce a representation for these novel concepts from the contexts. The quality of the representations is evaluated by similarity judgments against probe words. Following Herbelot and Baroni (2017) and Lazaridou et al. (2017), we report in the Chimera columns of Table 1 the average Spearman rank coefficients against human annotations for the 110 test cases in each sentence condition.

Table 2: Nearest neighbours produced by the additive ISF and substitutes approaches for the Chimera concept elephant bison in the context 'but his pleasure soon turns to distress when he sees that a baby is stuck in the mud and drowning.' (from the Chimeras dataset)

Additive ISF    substitutes
drowning        civet
drown           tapir
drowns          langur
shoos           crocodile
undresses       opossum
We observe that the additive ISF model turns out to be the strongest of the first-order context models, outperforming all the other previously-reported results. We see immediate improvement when we represent the context as substitutes in the 6-sentence condition. We see further improvement when combining additive ISF (first-order) and substitutes (second-order contexts), which yields the best performance in the 2-sentence and 6-sentence conditions. The positive effect of the ensemble of first-order and second-order contexts shows that the two types of context capture complementary information in this task. This is especially due to the fact that the contexts were controlled for informativeness so as to have different degrees of overlap with feature norms. Therefore at least some, but not all, contexts will have a high bag-of-words overlap with features that are semantically related to the concepts (Lazaridou et al., 2017). These contexts will easily benefit from first-order contexts alone. However, for the other contexts, where there is little or no overlap with feature norms among the context words, it is the contextual sequence, and thus the second-order context, that gives the most information about the target word. We show such an example with the nearest neighbours of the representations induced by our substitutes model and additive ISF in Table 2. We can see that while the additive ISF representation is easily affected by unrelated words in the sentence, the substitutes approach has at least identified that the target word is likely to be a kind of animal.
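This section does not spell out the ensemble mechanics; one simple way to combine the two context types, assumed here purely for illustration, is to average the unit-normalized first-order and second-order vectors so that neither representation dominates by norm:

```python
import numpy as np

def ensemble(first_order_vec, second_order_vec):
    # Illustrative ensemble (an assumption, not necessarily the paper's
    # exact combination): average the two unit-normalized induced vectors.
    a = np.asarray(first_order_vec, dtype=float)
    b = np.asarray(second_order_vec, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return (a + b) / 2.0
```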

The Contextual Rare Words dataset (CRW)
The CRW dataset (Khodak et al., 2018) consists of word pairs in which, for each pair, the second word is the rare word and is accompanied by 255 contexts. We follow the experimental setup in Khodak et al. (2018) and use their pre-trained vectors, trained on a subcorpus that does not contain any of the rare words from the dataset. This subcorpus is also used to train the context2vec model that generates substitutes. As in Khodak et al. (2018), we randomly sample 2, 4, ..., 128 contexts as separate conditions over 100 trials, and use these contexts to predict the rare word representations. In each trial, cosine similarity is computed between the rare word representation induced from the sampled contexts and the pre-trained embedding of the other word in the pair. The cosine similarity of each pair is compared against similarity judgments from human annotations. The average Spearman rank coefficients against human annotations across the trials are reported in Figure 1. Standard deviations are reported in Appendix B.
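The per-trial evaluation can be sketched as follows (a simplified Spearman correlation without tie correction; `induce` is a stand-in for any of the context-based induction methods, and all names are illustrative):

```python
import numpy as np

def spearman(x, y):
    # Spearman rank correlation for distinct values (no tie correction).
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

def crw_trial(pairs, induce, frequent_emb, human_scores):
    # pairs: list of (frequent_word, rare_word); induce(rare_word) returns
    # the vector induced from the k sampled contexts of the rare word.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    model = [cos(frequent_emb[f], induce(r)) for f, r in pairs]
    return spearman(model, human_scores)
```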
We see dramatic improvement from the substitutes method over all the other methods, including the previous state-of-the-art a la carte, on this dataset, which consists of naturally-occurring contexts of rare words from corpora. This result suggests that, in natural contexts, sequence information, rather than a bag of words, plays the more important role in predicting a target word's meaning.
We also notice that applying second-order information to the word2vec space consistently outperforms context2vec alone, even though context2vec generates the second-order substitutes. We suspect that this is because the context representation induced by context2vec is more syntactically oriented, whereas the tasks in our study mainly test semantic relations. We confirm this assumption by following Herbelot and Baroni (2017) in testing the target word embeddings produced by context2vec on the MEN dataset (Bruni et al., 2014). We find that context2vec (Spearman ρ = 0.65) correlates less with human semantic relatedness judgments than word2vec (Spearman ρ = 0.75) on this dataset. Isolating the second-order information from context2vec and applying it to the word2vec space as an external constraint effectively preserves the semantic relations present in word2vec, while providing a paradigmatic view that finds a syntactically and semantically appropriate position for the rare word.

Conclusion
To conclude, our paper teases apart the effect of second-order context by proposing a simple second-order substitute-based method that can post-process and improve an existing embedding space. Our substitute-based method achieves state-of-the-art performance when modelling emerging concepts in natural contexts from corpora. This is not surprising, as the substitutes carry rich linguistic constraints from their surrounding contextual sequences to inform the word representation. We plan to investigate whether second-order information is also a key element in the success of recently-proposed language model embeddings (Peters et al., 2018; Devlin et al., 2018), for example by testing whether the performance of these contextualized embeddings correlates more with the first-order context representation or the second-order substitute context across the tasks in this study. However, further research is needed to find ways to bring the type-level and token-level representations of these contextualized embeddings into the same space for these tasks.
Also, as we found that definitions seem to exhibit different properties from natural contexts in corpora, it may be advisable to model definitions and corpora contexts differently. An aspect that we did not cover in this paper is the morphological information from target words. As contexts, definitions and subword information can provide complementary information (Schick and Schütze, 2019), in future work, we plan to leverage subwords, contexts and definitions together in modelling rare or unseen words.