Learning to Learn to Disambiguate: Meta-Learning for Few-Shot Word Sense Disambiguation

The success of deep learning methods hinges on the availability of large training datasets annotated for the task of interest. In contrast to human intelligence, these methods lack versatility and struggle to learn and adapt quickly to new tasks where labeled data is scarce. Meta-learning aims to solve this problem by training a model on a large number of few-shot tasks, with the objective of learning new tasks quickly from a small number of examples. In this paper, we propose a meta-learning framework for few-shot word sense disambiguation (WSD), where the goal is to learn to disambiguate unseen words from only a few labeled instances. Meta-learning approaches have so far typically been tested in an N-way, K-shot classification setting, where each task has N classes with K examples per class. Owing to its nature, WSD deviates from this controlled setup and requires the models to handle a large number of highly unbalanced classes. We extend several popular meta-learning approaches to this scenario, and analyze their strengths and weaknesses in this challenging new setting.


Introduction
Natural language is inherently ambiguous, with many words having a range of possible meanings. Word sense disambiguation (WSD) is a core task in natural language understanding, where the goal is to associate words with their correct contextual meaning from a pre-defined sense inventory. WSD has been shown to improve downstream tasks such as machine translation (Chan et al., 2007) and information retrieval (Zhong and Ng, 2012). However, it is considered an AI-complete problem (Navigli, 2009): it requires an intricate understanding of language as well as real-world knowledge.
Approaches to WSD typically rely on (semi-)supervised learning (Zhong and Ng, 2010; Melamud et al., 2016; Kågebäck and Salomonsson, 2016; Yuan et al., 2016) or are knowledge-based (Lesk, 1986; Agirre et al., 2014; Moro et al., 2014). While supervised methods generally outperform the knowledge-based ones (Raganato et al., 2017a), they require data manually annotated with word senses, which is expensive to produce at a large scale. These methods also tend to learn a classification model for each word independently, and hence may perform poorly on words that have a limited amount of annotated data. Yet, alternatives that involve a single supervised model for all words (Raganato et al., 2017b) still do not adequately solve the problem for rare words (Kumar et al., 2019).
Humans, on the other hand, have a remarkable ability to learn from just a handful of examples (Lake et al., 2015). This inspired researchers to investigate techniques that would enable machine learning models to do the same. One such approach is transfer learning (Caruana, 1993), which aims to improve the models' data efficiency by transferring features between tasks. However, it still fails to generalize to new tasks in the absence of a considerable amount of task-specific data for fine-tuning (Yogatama et al., 2019). Meta-learning, known as learning to learn (Schmidhuber, 1987;Bengio et al., 1991;Thrun and Pratt, 1998), is an alternative paradigm that draws on past experience in order to learn and adapt to new tasks quickly: the model is trained on a number of related tasks such that it can solve unseen tasks using only a small number of training examples. A typical meta-learning setup consists of two components: a learner that adapts to each task from its small training data; and a meta-learner that guides the learner by acquiring knowledge that is common across all tasks.
In this paper, we present the first meta-learning approach to WSD. We propose models that learn to rapidly disambiguate new words from only a few labeled examples. Owing to its nature, WSD exhibits inter-word dependencies within sentences, has a large number of classes, and inevitable class imbalances; all of which present new challenges compared to the controlled setup in most current meta-learning approaches. To address these challenges we extend three popular meta-learning algorithms to this task: prototypical networks (Snell et al., 2017), model-agnostic meta-learning (MAML) (Finn et al., 2017) and a hybrid of the two, ProtoMAML (Triantafillou et al., 2020). We investigate meta-learning using three underlying model architectures, namely recurrent networks, multi-layer perceptrons (MLP) and transformers (Vaswani et al., 2017), and experiment with a varying number of sentences available for task-specific fine-tuning. We evaluate the models' rapid adaptation ability by testing on a set of new, unseen words, thus demonstrating their ability to learn new word senses from a small number of examples.
Since there are no few-shot WSD benchmarks available, we create a few-shot version of a publicly available WSD dataset. We release our code as well as the scripts used to generate our few-shot data setup to facilitate further research (https://github.com/Nithin-Holla/MetaWSD).


Related Work

Meta-learning

In contrast to "traditional" machine learning approaches, meta-learning involves a different paradigm known as episodic learning. The training and test sets in meta-learning are referred to as the meta-training set (D_meta-train) and the meta-test set (D_meta-test) respectively. Both sets consist of episodes rather than individual data points. Each episode constitutes a task T_i, comprising a small number of training examples for adaptation, the support set D^(i)_support, and a separate set of examples for evaluation, the query set D^(i)_query. A typical setup for meta-learning is the balanced N-way, K-shot setting, where each episode has N classes with K examples per class in its support set.
Meta-learning algorithms are broadly categorized into three types: metric-based (Koch et al., 2015; Vinyals et al., 2016; Sung et al., 2018; Snell et al., 2017), model-based (Santoro et al., 2016; Munkhdalai and Yu, 2017), and optimization-based (Ravi and Larochelle, 2017; Finn et al., 2017; Nichol et al., 2018). Metric-based methods first embed the examples in each episode into a high-dimensional space, typically using a neural network. Next, they obtain the probability distribution over labels for all the query examples based on a kernel function that measures similarity with the support examples. Model-based approaches aim to achieve rapid learning directly through their architectures; they typically employ external memory so as to remember key examples encountered in the past. Optimization-based approaches explicitly include generalizability in their objective function and optimize for it. In this paper, we experiment with metric-based and optimization-based approaches, as well as a hybrid of the two.

Meta-learning in NLP
Meta-learning in NLP is still in its nascent stages. Gu et al. (2018) apply meta-learning to neural machine translation: they meta-train on translating high-resource languages to English and meta-test on translating low-resource languages to English. Obamuyide and Vlachos (2019b) use meta-learning for relation classification, and Obamuyide and Vlachos (2019a) for relation extraction in a lifelong learning setting. Chen et al. (2019) consider relation learning and apply meta-learning to few-shot link prediction in knowledge graphs. Dou et al. (2019) perform meta-training on certain high-resource tasks from the GLUE benchmark and meta-test on certain low-resource tasks from the same benchmark. Bansal et al. (2019) propose a softmax parameter generator component that can enable a varying number of classes in the meta-training tasks. They choose the tasks in GLUE along with SNLI (Bowman et al., 2015) for meta-training, and use entity typing, relation classification, sentiment classification, text categorization, and scientific NLI as the test tasks. Meta-learning has also been explored for few-shot text classification (Yu et al., 2018; Geng et al., 2019; Jiang et al., 2018; Sun et al., 2019). Wu et al. (2019) employ meta-reinforcement learning techniques for multi-label classification, with experiments on entity typing and text classification. Hu et al. (2019) use meta-learning to learn representations of out-of-vocabulary words, framing it as a regression task.

Task and Dataset
We treat WSD as a few-shot word-level classification problem, where a sense is assigned to a word given its sentential context. As different words may have a different number of senses and sentences may have multiple ambiguous words, the standard setting of N-way, K-shot classification does not hold in our case. Specifically, different episodes can have a different number of classes and a varying number of examples per class, a setting that is more realistic (Triantafillou et al., 2020).
Dataset We use the SemCor corpus (Miller et al., 1994), manually annotated with senses from the New Oxford American Dictionary by Yuan et al. (2016). With 37,176 annotated sentences, this is one of the largest sense-annotated English corpora. We group the sentences in the corpus according to which word is to be disambiguated, and then randomly divide the words into disjoint meta-training, meta-validation and meta-test sets with a 60:15:25 split. A sentence may have multiple occurrences of the same word, in which case we make predictions for all of them. We consider four different settings with the support set size |S| = 4, 8, 16 and 32 sentences. The number of distinct words in the meta-training / meta-test sets is 985/270, 985/259, 799/197 and 580/129 respectively. The detailed statistics of the resulting dataset are shown in Appendix A.1.
Training episodes In the meta-training set, both the support and query sets have the same number of sentences. Our initial experiments using one word per episode during meta-training yielded poor results due to an insufficient number of episodes. To overcome this problem and design a suitable meta-training setup, we instead create episodes with multiple annotated words in them. Specifically, each episode consists of r sampled words {z_j}_{j=1}^{r} and min(⌊|S|/r⌋, ν(z_j)) senses for each of those words, where ν(z_j) is the number of senses of word z_j. Each task in the meta-training set is therefore the disambiguation of r words between up to |S| senses. We set r = 2 for |S| = 4 and r = 4 for the rest. Sentences containing these senses are then sampled for the support and query sets such that the classes are as balanced as possible. For example, for |S| = 8, we first choose 4 words and 2 senses for each, and then sample one sentence for each word-sense pair. The labels for the senses are shuffled across episodes, i.e., one sense can have a different label when sampled in another episode. This is key in meta-learning as it prevents memorization (Yin et al., 2020). The advantage of our approach for constructing meta-training episodes is that it allows for generating a combinatorially large number of tasks. Herein, we use a total of 10,000 meta-training episodes.
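The word-and-sense sampling step above can be sketched as follows; this is a simplified illustration that omits the subsequent sentence sampling and class balancing, and the `sense_inventory` mapping is an assumed stand-in for the real corpus structures.

```python
import random

def sample_training_episode(sense_inventory, support_size, r, rng):
    """Sketch of meta-training episode construction: pick r words, up to
    floor(|S| / r) senses per word, and assign episode-local class labels
    that are reshuffled every episode to prevent label memorization."""
    words = rng.sample(sorted(sense_inventory), r)
    chosen = []                              # (word, sense) pairs for this episode
    per_word = support_size // r             # floor(|S| / r)
    for w in words:
        senses = sense_inventory[w]
        k = min(per_word, len(senses))       # min(floor(|S|/r), nu(w))
        chosen.extend((w, s) for s in rng.sample(senses, k))
    # Episode-local label shuffling: the same sense can map to a
    # different label in another episode.
    labels = list(range(len(chosen)))
    rng.shuffle(labels)
    return {pair: lab for pair, lab in zip(chosen, labels)}
```

Because words, senses and label assignments are all resampled per episode, the number of distinct tasks that can be generated this way is combinatorially large.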
Evaluation episodes For the meta-validation and meta-test sets, each episode corresponds to the task of disambiguating a single word. While splitting the sentences into support and query sets, we ensure that senses in the query set are present in the support set. Furthermore, we only consider words with two or more senses in their query set. The distribution of episodes across different number of senses is shown in Appendix A.1. Note that, unlike the meta-training tasks, our meta-test tasks represent a natural data distribution, therefore allowing us to test our models in a realistic setting.

Methods
Our models consist of three components: an encoder that takes the words in a sentence as input and produces a contextualized representation for each of them, a hidden linear layer that projects these representations to another space, and an output linear layer that produces the probability distribution over senses. The encoder and the hidden layer are shared across all tasks; we denote this block as f_θ with shared parameters θ. The output layer is randomly initialized for each task T_i (i.e., episode); we denote it as g_{φ_i} with parameters φ_i. θ is meta-learned, whereas φ_i is learned independently for each task.
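The three-component structure can be sketched as a forward pass in numpy; this is an illustration only, in which the real encoder (GRU/ELMo/BERT) is replaced by precomputed contextual embeddings, and the parameter shapes are hypothetical.

```python
import numpy as np

def forward(embeddings, theta, phi):
    """Minimal sketch of the model: (stand-in) encoder output
    `embeddings`, a shared hidden layer f_theta, and a task-specific
    output layer g_phi producing a distribution over this task's senses."""
    W_h, b_h = theta                 # shared across all tasks (meta-learned)
    W_o, b_o = phi                   # re-initialized per task/episode
    hidden = np.tanh(embeddings @ W_h + b_h)   # shared linear layer
    logits = hidden @ W_o + b_o                # task-specific output layer
    # Softmax over this episode's senses.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```

Only `theta` persists across episodes; `phi` is discarded and re-created whenever a new task arrives.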

Model Architectures
We experiment with three different architectures: GloVe+GRU, a bidirectional GRU encoder over pre-trained GloVe word embeddings; ELMo+MLP, a multi-layer perceptron over pre-trained contextual ELMo embeddings; and BERT, a pre-trained transformer encoder.

Meta-learning Methods
Prototypical Networks Proposed by Snell et al. (2017), prototypical networks are a metric-based approach. An embedding network f_θ parameterized by θ is used to produce a prototype vector for every class as the mean vector of the embeddings of all the support data points for that class. If S_c denotes the subset of the support set containing examples from class c ∈ C, the prototype µ_c is:

µ_c = (1 / |S_c|) Σ_{(x, y) ∈ S_c} f_θ(x)

Given a distance function defined on the embedding space, the distribution over classes for a query point is calculated as a softmax over the negative distances to the class prototypes.
We generate the prototypes (one per sense) from the output of the shared block f_θ for the support examples. Instead of using g_{φ_i}, we obtain the probability distribution for the query examples based on the distance function. Parameters θ are updated after every episode using the Adam optimizer (Kingma and Ba, 2015):

θ ← θ − β ∇_θ L^q_{T_i}

where L^q_{T_i} is the cross-entropy loss on the query set and β is the meta learning rate.
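The prototype computation and query classification above can be sketched in numpy; this is a sketch of the generic prototypical-network forward pass under the squared Euclidean distance, not our full training code.

```python
import numpy as np

def prototypes(support_emb, support_y, n_classes):
    """Class prototypes: the mean embedding of each class's support points."""
    return np.stack([support_emb[support_y == c].mean(axis=0)
                     for c in range(n_classes)])

def proto_probs(query_emb, protos):
    """Distribution over classes for each query point: softmax over
    negative squared Euclidean distances to the class prototypes."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    z = -d2 - (-d2).max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```

No task-specific parameters are learned here: classification of a new word's senses reduces to averaging support embeddings and measuring distances.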
Model-Agnostic Meta-Learning (MAML) MAML (Finn et al., 2017) is an optimization-based approach designed for the N-way, K-shot classification setting. The goal of optimization is to train a model's initial parameters such that it can perform well on a new task after only a few gradient steps on a small amount of data. Tasks are drawn from a distribution p(T). The model's parameters are adapted from θ to a task T_i using gradient descent on D^(i)_support to yield θ_i. This step is referred to as inner-loop optimization. With m gradient steps, the update is:

θ_i = U^m_{T_i}(θ)

where U is an optimizer such as SGD, α is the inner-loop learning rate and L^s_{T_i} is the loss for the task computed on D^(i)_support. The meta-objective is to have f_{θ_i} generalize well across tasks from p(T):

min_θ Σ_{T_i ∼ p(T)} L^q_{T_i}(f_{θ_i})

where the loss L^q_{T_i} is computed on D^(i)_query. The meta-optimization, or outer-loop optimization, performs the update with the outer-loop learning rate β:

θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L^q_{T_i}(f_{θ_i})

This involves computing second-order gradients, i.e., the backward pass works through the inner-loop update step above, a computationally expensive process. Finn et al. (2017) propose a first-order approximation, called FOMAML, which computes the gradients with respect to θ_i rather than θ. The outer-loop optimization step thus reduces to:

θ ← θ − β Σ_{T_i ∼ p(T)} ∇_{θ_i} L^q_{T_i}(f_{θ_i})

[Figure 1: model architecture with a bidirectional GRU shared encoder, a shared linear layer, and a task-specific output layer.]

FOMAML does not generalize outside the N-way, K-shot setting, since it assumes a fixed number of classes across tasks. We therefore extend it with output layer parameters φ_i that are adapted per task. During the inner loop for each task, the optimization is performed as follows:

θ_i ← θ_i − α ∇_{θ_i} L^s_{T_i}
φ_i ← φ_i − γ ∇_{φ_i} L^s_{T_i}

where α and γ are the learning rates for the shared block and the output layer respectively. We introduce different learning rates because the output layer is randomly initialized per task and thus needs to learn aggressively, whereas the shared block already has past information and can thus learn more slowly.
We refer to α as the learner learning rate and γ as the output learning rate. The outer-loop optimization uses Adam:

θ ← θ − β Σ_i ∇_{θ_i} L^q_{T_i}

where the gradients of L^q_{T_i} are computed with respect to θ_i, β is the meta learning rate, and the sum over i ranges over all tasks in the batch.
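The extended inner loop with its two learning rates can be sketched as follows. This is an illustrative numpy implementation with manual gradients in which a linear model with softmax cross-entropy stands in for the full network; the real models are trained with PyTorch and the higher package.

```python
import numpy as np

def inner_loop(theta, x_s, y_s, alpha, gamma, n_classes, steps, rng):
    """Sketch of the extended (FO)MAML inner loop: the shared block starts
    from the meta-learned theta and adapts with learner rate alpha, while a
    freshly initialized task-specific output layer phi adapts with a
    (typically larger) output rate gamma."""
    W_h = theta.copy()                                        # shared block
    W_o = rng.normal(scale=0.01, size=(W_h.shape[1], n_classes))  # phi, fresh per task
    for _ in range(steps):
        h = x_s @ W_h
        logits = h @ W_o
        # Softmax cross-entropy gradient w.r.t. logits.
        z = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(y_s)), y_s] -= 1.0
        p /= len(y_s)
        grad_Wo = h.T @ p
        grad_Wh = x_s.T @ (p @ W_o.T)
        W_o -= gamma * grad_Wo    # aggressive rate for the fresh output layer
        W_h -= alpha * grad_Wh    # slower rate for the pre-trained shared block
    return W_h, W_o
```

In FOMAML, the outer loop would then take the query-set gradient with respect to the adapted `W_h` (i.e., θ_i) and apply it to θ directly.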
ProtoMAML Snell et al. (2017) show that, with the squared Euclidean distance metric, prototypical networks are equivalent to a linear model with the following parameters: w_c = 2µ_c and b_c = −µ_c^T µ_c, where w_c and b_c are the weights and bias for the output unit corresponding to class c. Triantafillou et al. (2020) combine the strengths of prototypical networks and MAML by initializing the final layer of the classifier in each episode with these prototypical-network-equivalent weights and biases and continuing to learn with MAML, a hybrid approach referred to as ProtoMAML. Similarly, using FOMAML yields ProtoFOMAML. While updating θ, they allow the gradients to flow through the linear layer initialization.
We construct the prototypes from the output of f_θ for the support examples. The parameters φ_i are initialized as described above. Learning then proceeds as in (FO)MAML, the only difference being that γ need not be as high owing to the good initialization. Proto(FO)MAML thus supports a varying number of classes per task.
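The prototype-based output-layer initialization can be sketched in numpy; this illustrates the w_c = 2µ_c, b_c = −µ_c^T µ_c construction and is not our full training code.

```python
import numpy as np

def protomaml_init(protos):
    """ProtoMAML output-layer initialization: w_c = 2 * mu_c and
    b_c = -mu_c . mu_c, so that the initial linear layer reproduces
    prototypical-network predictions under squared Euclidean distance."""
    W = 2.0 * protos                   # (n_classes, dim)
    b = -(protos ** 2).sum(axis=1)     # (n_classes,)
    return W, b
```

The equivalence follows from expanding −||x − µ_c||² = −x·x + 2µ_c·x − µ_c·µ_c: the −x·x term is constant across classes and cancels in the softmax, leaving exactly the linear layer above.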

Baseline Methods
Majority-sense baseline This baseline always predicts the most frequent sense in the support set. Hereafter, we refer to it as MajoritySenseBaseline.
Nearest neighbor classifier This model predicts the sense of a query instance as the sense of its nearest neighbor from the support set in terms of cosine distance. We perform nearest neighbor matching with the ELMo embeddings of the words as well as with their BERT outputs but not with GloVe embeddings since they are the same for all senses. We refer to this baseline as NearestNeighbor.
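The NearestNeighbor baseline can be sketched in a few lines of numpy; the embeddings here stand in for the ELMo or BERT outputs of the target words.

```python
import numpy as np

def nn_predict(query_emb, support_emb, support_y):
    """1-nearest-neighbour baseline under cosine distance: each query
    word receives the sense label of its closest support embedding."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    s = support_emb / np.linalg.norm(support_emb, axis=1, keepdims=True)
    sim = q @ s.T                      # pairwise cosine similarities
    return support_y[sim.argmax(axis=1)]
```

Note that this baseline is only meaningful with contextual embeddings: static GloVe vectors are identical for every occurrence of a word, so all senses would collapse onto the same point.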
Non-episodic training This is a single model that is trained on all tasks without any distinction between them: it merges the support and query sets, and is trained using mini-batching. The output layer is thus not task-dependent, and the number of output units is equal to the total number of senses in the dataset. The softmax at the output layer is taken only over the relevant classes within the mini-batch. Instead of a φ_i per task, we now have a single φ. During training, the parameters are updated per mini-batch as:

θ ← θ − α ∇_θ L,  φ ← φ − α ∇_φ L

where L is the loss on the mini-batch and α is the learning rate. During the meta-testing phase, we independently fine-tune the trained model on the support set of each task (in an episodic fashion) as follows:

θ ← θ − α ∇_θ L^s_{T_i},  φ ← φ − γ ∇_φ L^s_{T_i}

where the loss is computed on the support examples, α is the learner learning rate as before, and γ is the output learning rate. We refer to this model as NE-Baseline.
Episodic fine-tuning baseline For each of the meta-learning methods, we include a variant that only performs meta-testing, starting from a randomly initialized model. It is equivalent to training from scratch on the support examples of each episode. We prepend the prefix EF- to denote these variants.

Experimental setup
We use the meta-validation set to choose the best hyperparameters for the models. The chosen evaluation metric is the average of the macro F1 scores across all words in the meta-validation set. We report the same metric on the meta-test set. We employ early stopping, terminating training if the metric does not improve over two epochs. The size of the hidden state in GloVe+GRU is 256, and the size of the shared linear layer is 64, 256 and 192 for GloVe+GRU, ELMo+MLP and BERT respectively. The shared linear layer's activation function is tanh for GloVe+GRU, and ReLU for ELMo+MLP and BERT. For FOMAML, ProtoFOMAML and ProtoMAML, the batch size is set to 16 tasks. The output layer for these is initialized anew in every episode, whereas in NE-Baseline it has a fixed number of 5,612 units. We use the higher package (Grefenstette et al., 2019) to implement the MAML variants.

Results
In Table 1, we report macro F1 scores averaged over all words in the meta-test set. We report the mean and standard deviation from five independent runs. We note that the results are not directly comparable across |S| setups as, by their formulation, they involve different meta-test episodes.
GloVe+GRU All meta-learning methods perform better than their EF counterparts, indicating successful learning from the meta-training set. FOMAML fails to outperform NE-Baseline as well as the EF versions of the other meta-learning methods when |S| = 8, 16, 32. Interestingly, solely meta-testing is often better than NE-Baseline, which shows that the latter does not effectively transfer knowledge from the meta-training set. ProtoNet is the best-performing model (except when |S| = 8), with ProtoMAML a close second.
ELMo+MLP The scores for NearestNeighbor, NE-Baseline and the EF methods are higher compared to the GloVe-based models, which can be attributed to the input embeddings being contextual. ProtoNet and ProtoFOMAML still improve over their EF counterparts due to meta-training. ProtoFOMAML outperforms the other methods for all |S|, while FOMAML is comparatively weak. The models are also relatively stable, as indicated by the low standard deviations across runs.
Effect of second-order gradients We further experiment with ProtoMAML, including second-order gradients. In Table 2, we report its F1 scores alongside ProtoNet and ProtoFOMAML. For BERT, we train ProtoMAML while fine-tuning only the top layer and only for one inner-loop update step due to its high computational cost. We also train an equivalent ProtoFOMAML variant for a fair comparison. We observe that ProtoMAML obtains scores similar to ProtoFOMAML in most cases, indicating the effectiveness of the first-order approximation. ProtoFOMAML achieves higher scores than ProtoMAML in some cases, perhaps due to an overfitting effect induced by the latter. In light of these results, we argue that the first-order ProtoFOMAML suffices for this task.

Analysis
Effect of number of episodes We first investigate whether using more meta-training episodes always translates to higher performance. We plot the average macro F1 score for one of our high-scoring models, ProtoNet with BERT, as the number of meta-training episodes increases (Figure 2). The shaded region shows one standard deviation from the mean, obtained over five runs. Different |S| setups reach their peaks at different data sizes; however, overall, the largest gains come with a minimum of around 8,000 episodes.
Effect of number of senses To investigate the variation in performance with the number of senses, in Figure 3 we plot the macro F1 scores obtained from ProtoNet with BERT, averaged over words with a given number of senses in the meta-test set. We see a trend where the score decreases as the number of senses increases. Words with more senses seem to benefit from a larger support set. For a word with 8 senses, the |S| = 32 case is roughly a 4-shot problem, whereas it is roughly a 2-shot and a 1-shot problem for |S| = 16 and |S| = 8 respectively. In this view, the disambiguation of words with many senses improves with |S| due to an increase in the effective number of shots.
Challenging cases Based on the 10 words that obtain the lowest macro F1 scores with ProtoNet with GloVe+GRU (Appendix A.4), we find that verbs are the most challenging words to disambiguate without the advantage of pre-trained models, and that their disambiguation improves as |S| increases.

Discussion
Our results demonstrate that meta-learning outperforms the corresponding models trained in a non-episodic fashion when applied in a few-shot learning setting, a finding consistent across all |S| setups. Using the BERT-based models, we obtain up to 72% average macro F1 score with as few as 4 examples, and closely approach the reported state-of-the-art performance (not a direct comparison due to different data splits) with |S| = {16, 32}.
The success of meta-learning is particularly evident with GloVe+GRU. GloVe embeddings are sense-agnostic and yet, ProtoNet, ProtoFOMAML and ProtoMAML approach the performance of some ELMo-based models, which enjoy the benefit of contextualization via large-scale pretraining.
Although contextualized representations from ELMo and BERT already contain information relevant to our task, integrating them into a meta-learning framework allows these models to substantially improve performance. To illustrate the advantage that meta-learning brings, we provide example t-SNE visualizations (van der Maaten and Hinton, 2008) of the original ELMo embeddings and those generated by ProtoNet based on ELMo (Figure 4). The representations from ProtoNet are more accurately clustered with respect to the senses than the original ELMo representations. ProtoNet thus effectively learns to disambiguate new words, i.e., to separate the senses into clusters, thereby improving upon the ELMo embeddings. We provide further t-SNE visualizations in Appendix A.6.
The success of ProtoNet and ProtoFOMAML can in part be attributed to the nature of the problem: WSD lends itself well to modeling approaches based on similarity (Navigli, 2009; Peters et al., 2018). Their relative ranking, however, depends on the architecture and the value of |S|. ELMo+MLP has the simplest architecture, and ProtoFOMAML, an optimization-based method, performs best with it. For GloVe+GRU and BERT, which are more complex architectures, lower-shot settings benefit from ProtoFOMAML and higher-shot settings from ProtoNet. The reasons for this effect, however, remain to be investigated in future work.
Our experiments further highlight the weakness of FOMAML when applied beyond the N -way, K-shot setting. This may be due to the fact that the number of "new" output parameters in each episode is much greater than the number of support examples. Informed output layer initialization in Proto(FO)MAML is therefore important for effective learning in such scenarios. A similar problem with FOMAML is also pointed out by Bansal et al. (2019), who design a differentiable parameter generator for the output layer.

Conclusion
Few-shot learning is a key capability for AI to reach human-like performance, and the development of meta-learning methods is a promising step in this direction. We demonstrated the ability of meta-learning to disambiguate new words when only a handful of labeled examples are available. Given the data scarcity in WSD and the need for few-shot model adaptation to specific domains, we believe that meta-learning can yield a more general and widely applicable disambiguation model than traditional approaches. Interesting avenues to explore further would be a generalization of our models to disambiguation in different domains, to a multilingual scenario, or to an altogether different task.


A.1 Dataset statistics

We report the number of words, the number of episodes, the total number of unique sentences and the average number of senses for the meta-training, meta-validation and meta-test sets for each of the four setups with different |S| in Table 3. Additionally, in Figures 5 and 6, we present bar plots of the number of meta-test episodes for different numbers of senses in the meta-test support and query sets respectively. The number of episodes drops quite sharply as the number of senses increases. In each episode, only words with at most |S| senses are considered, so that all of them can be accommodated in the support set.

A.2 Hyperparameters
We performed hyperparameter tuning for all the models under the |S| = 16 setting. We obtain the best hyperparameters on the basis of the average macro F1 score on the meta-validation set.
We trained the models with five seeds (42 to 46) and recorded the mean of the metric from the five runs to identify the best hyperparameters. For |S| = 4, 8, 32, we chose the best hyperparameters obtained from this tuning. We employed early stopping with a patience of 2 epochs, i.e., we stop meta-training if the validation metric does not improve over 2 epochs. Tuning over all the hyperparameters of our models is prohibitively expensive; hence, for some of the hyperparameters we chose a fixed value. The size of the shared linear layer is 64, 256 and 192 for the GloVe+GRU, ELMo+MLP and BERT models respectively. The shared linear layer's activation function is tanh for GloVe+GRU and ReLU for ELMo+MLP and BERT. For FOMAML, ProtoFOMAML and ProtoMAML, the batch size is set to 16 tasks. For the BERT models, we perform learning rate warm-up for 100 steps followed by a constant rate. For GloVe+GRU and ELMo+MLP, we decay the learning rate by half every 500 steps. We also experimented with two types of regularization, dropout for the inner-loop updates and weight decay for the outer-loop updates, but both of them yielded a drop in performance.

A.3 Training times
We train all our models on TitanRTX GPUs. Our model architectures vary in the total number of trainable parameters, and thus so does the time taken to train each of them. The number of meta-learned parameters θ is as follows:

• GloVe+GRU: 889,920
• ELMo+MLP: 262,404
• BERT: 107,867,328

To give an idea of how long it takes to train the models, we provide the approximate time taken for one epoch in the |S| = 16 setup in Table 6. The training time is slightly lower for |S| = 4, 8 and slightly higher for |S| = 32. The training time for ProtoMAML with GloVe+GRU is extremely long because second-order derivatives for RNNs with the cuDNN backend are not supported in PyTorch, and hence cuDNN had to be disabled.

A.4 Challenging cases
In Table 4, we present 10 words with the lowest macro F1 scores (in increasing order of the score) obtained from ProtoNet with GloVe+GRU. We perform the analysis on this model to investigate challenging cases without the contextualization advantage offered by ELMo and BERT. For |S| = 4, 8, 16, many words in the list have predominantly verb senses, showing that they are more challenging to disambiguate. The number of such cases drops in |S| = 32, indicating that disambiguation of verbs improves as |S| increases.

A.5 F1 score distribution
For ProtoNet with GloVe+GRU, we plot the distribution of macro F1 scores across the words in the meta-test set in Figure 7. The distribution is mostly right-skewed with very few words having scores in the range 0 to 0.2.

A.6 t-SNE visualizations
We provide t-SNE visualizations of the word representations generated by f_θ of ProtoNet with GloVe+GRU for three words (with a macro F1 score of 1) in the meta-test set in Figure 8. Even though the model receives the same input embedding for all senses, it manages to separate the senses into clusters on the basis of the representations of the support examples. This occurs even though ProtoNet does not perform any fine-tuning step on the support set. Moreover, the query examples also appear to lie within the same clusters, close to the prototypes. ELMo embeddings, being contextual, already capture information about how the various senses are represented. In order to compare them against the representations generated by ProtoNet with ELMo+MLP, we again provide t-SNE visualizations. We plot the ELMo embeddings of three words in the meta-test set in Figures 9a, 9b and 9c. We also show the prototypes computed from these embeddings for illustration. For the same three words, we plot the representations obtained from f_θ of ProtoNet with ELMo+MLP in Figures 9d, 9e and 9f. It can be observed that the ELMo embeddings alone are not well-clustered with respect to the senses. On the other hand, ProtoNet manages to separate the senses into clusters, which aids in making accurate predictions on the query set.
These visualizations further demonstrate ProtoNet's success in disambiguating new words. From a learning-to-learn standpoint, the model has learned how to separate the senses in a high-dimensional space so as to disambiguate them. Proto(FO)MAML often improves upon this good initialization during its inner-loop updates.

A.7 Results on the meta-validation set
To facilitate reproducibility, we provide the results on the meta-validation set for all the methods that involved hyperparameter tuning in Table 7.