Knowledge Guided Metric Learning for Few-Shot Text Classification

Humans can distinguish new categories very efficiently from few examples, largely because they can leverage knowledge obtained from relevant tasks. However, deep learning based text classification models tend to struggle to achieve satisfactory performance when labeled data are scarce. Inspired by human intelligence, we propose to introduce external knowledge into few-shot learning to imitate human knowledge. To this end, we investigate a novel parameter generator network, which uses the external knowledge to generate different metrics for different tasks. Armed with this network, similar tasks can use similar metrics while different tasks use different metrics. Through experiments, we demonstrate that our method outperforms the state-of-the-art (SoTA) few-shot text classification models.


Introduction
Humans are adept at quickly learning from a small number of examples. This motivates research on few-shot learning (Vinyals et al., 2016; Finn et al., 2017), which aims to recognize novel categories from very few labeled examples.
The key challenge in few-shot learning is to make full use of the limited labeled examples to find the "right" generalizations. Metric-based approaches (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Zhang et al., 2020), which learn to represent examples in an appropriate feature space and use a distance metric to predict labels, are effective ways to address this challenge. However, directly employing metric-based approaches in text classification faces the problem that tasks are diverse and differ significantly from each other, since words that are highly informative for one task may not be relevant for other tasks (Bao et al., 2019). Therefore, a single metric is insufficient to cope with all these tasks in few-shot text classification (Yu et al., 2018).
To adapt metric learning to significantly diverse tasks, we propose a knowledge guided metric learning method. This method is inspired by the fact that human beings approach diverse tasks armed with knowledge obtained from relevant tasks (Lake et al., 2017). We use external knowledge from a knowledge base (KB) to imitate human knowledge, whereas the role of external knowledge has been ignored in previous methods (Yu et al., 2018; Bao et al., 2019; Geng et al., 2019, 2020). In detail, we resort to distributed representations of the KB instead of symbolic facts, since symbolic facts suffer from poor generalization and data sparsity. Based on such KB embeddings, we investigate a novel parameter generator network (Ha et al., 2016; Jia et al., 2016) to generate task-relevant relation network parameters. With these generated parameters, the task-relevant relation network is able to apply diverse metrics to diverse tasks and ensure that similar tasks use similar metrics while different tasks use different metrics.
In summary, the major contributions of this paper are:
• Inspired by human intelligence, we present the first approach that introduces external knowledge into few-shot learning.
• A novel parameter generator network based on external knowledge is proposed to generate diverse metrics for diverse tasks.
• Experimental results on the public dataset show that our model significantly outperforms previous methods.

Figure 1: The main architecture for a C-way N-shot (C = 3, N = 2) problem with one query example.

Problem Setting
In both the training and test stages, the labeled examples are called the support set, which serves as a meta-training set, and the meta-testing examples are called the query set. If the support set contains N labeled examples for each of C unique classes, the few-shot problem is called C-way N-shot. To guarantee good generalization performance at test time, the model is trained and evaluated by episodically sampling the support set and the query set (Vinyals et al., 2016). More concretely, in each meta-training iteration, an episode is formed by randomly selecting C classes from the training set, with N labeled examples for each of the C classes serving as the support set $S = \{(x_i, y_i)\}_{i=1}^{C \times N}$, and a fraction of the remaining examples of those C classes acting as the query set $Q = \{(x_j, y_j)\}_{j=1}^{m}$, where $x_i$ and $y_i \in \{1, \ldots, C\}$ are a sentence and its label, and $m$ is the number of query samples. The model is trained on the support set S to minimize the loss of its predictions over the query set Q. This training procedure is carried out iteratively, episode by episode, until convergence.
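The episodic sampling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the `data_by_class` layout, the function name `sample_episode`, and the toy review strings are all hypothetical.

```python
import random

def sample_episode(data_by_class, c_way, n_shot, m_query, rng):
    """Sample one C-way N-shot episode.

    data_by_class maps a class name to its list of sentences (assumed layout).
    Returns (support, query) as lists of (sentence, label), labels in 1..C.
    """
    classes = rng.sample(sorted(data_by_class), c_way)
    support, query_pool = [], []
    for label, cls in enumerate(classes, start=1):
        examples = list(data_by_class[cls])
        rng.shuffle(examples)
        # N labeled examples per class form the support set.
        support += [(x, label) for x in examples[:n_shot]]
        # The remaining examples of the same class are query candidates.
        query_pool += [(x, label) for x in examples[n_shot:]]
    # A fraction of the remainder (m examples) becomes the query set.
    query = rng.sample(query_pool, min(m_query, len(query_pool)))
    return support, query

# Toy data: four classes with six sentences each (hypothetical).
toy = {name: [f"{name} review {i}" for i in range(6)]
       for name in ("books", "dvd", "kitchen", "music")}
support, query = sample_episode(toy, c_way=3, n_shot=2, m_query=4,
                                rng=random.Random(0))
```

Training would repeat this call once per iteration, fitting the model on `support` and computing the loss on `query`.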

Sentence Embedding Network
In this network, a pre-trained BERT (Devlin et al., 2019) encoder is used to model sentences. Given an input text $x_i = (\text{[CLS]}, w_1, w_2, \ldots, w_T, \text{[SEP]})$, the output of the BERT encoder is denoted as $H(x_i) \in \mathbb{R}^{(T+2) \times d_1}$, where $d_1$ is the output dimension of the BERT encoder. We use the first token of the sequence (the classification token) as the sentence representation, denoted as $h(x_i)$.
In meta-learning, the representation of each class is the mean vector of the embedded sentences belonging to that class,
$$c_z = \frac{1}{|S_z|} \sum_{x_i \in S_z} h(x_i) \qquad (1)$$
where $S_z$ denotes the set of sentences labeled with class $z$. Following Sung et al. (2018), we use the concatenation operator to combine the query representation $h(x_j)$ with the class representation $c_z$,
$$p_{z,j} = \mathrm{concat}(c_z, h(x_j)) \qquad (2)$$
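To make the class representation and the combination step concrete, here is a small pure-Python sketch. The two-dimensional embeddings are invented for illustration, the function names are hypothetical, and placing the class vector before the query vector in the concatenation is an assumption.

```python
def class_prototypes(support_embeddings, support_labels, num_classes):
    """Mean embedding per class; labels are integers in 1..num_classes."""
    dim = len(support_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for h, y in zip(support_embeddings, support_labels):
        counts[y - 1] += 1
        for d in range(dim):
            sums[y - 1][d] += h[d]
    return [[v / counts[z] for v in sums[z]] for z in range(num_classes)]

def combine(prototype, query_embedding):
    """Concatenate class and query representations (order is an assumption)."""
    return list(prototype) + list(query_embedding)

# Toy 2-way 2-shot support set with made-up 2-d sentence embeddings.
support_h = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
support_y = [1, 1, 2, 2]
protos = class_prototypes(support_h, support_y, num_classes=2)
pairs = [combine(c, [0.5, 0.5]) for c in protos]  # one pair per class
```

Each entry of `pairs` is the input that a relation network would score for one (class, query) combination.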

Knowledge Guided Relation Network
This module takes the combined representation (shown in Equation 2) and the knowledge of the support set as input, and produces a scalar in the range of 0 to 1 representing the similarity between the query sentence and the class representation, which is called the relation score. Compared with the original relation network (Sung et al., 2018), we decompose the relation network into two parts, a task-agnostic relation network and a task-relevant relation network, which serve two purposes: the task-agnostic relation network models a basic metric function, while the task-relevant relation network adapts to diverse tasks.
Task-Agnostic Relation Network. The task-agnostic relation network uses a single learned metric for all tasks, the same as the original relation network (Sung et al., 2018). With this unified metric, C task-agnostic relation scores $r^{agn}_{z,j}$ are generated to model the relation between one query input $x_j$ and each class representation $c_z$,
$$r^{agn}_{z,j} = \mathrm{RN}^{agn}(p_{z,j} \mid \theta^{agn}) \in \mathbb{R}, \quad z = 1, 2, \ldots, C \qquad (3)$$
where $\mathrm{RN}^{agn}$ denotes the task-agnostic relation network and $\theta^{agn}$ are learnable parameters.
Task-Relevant Relation Network. Armed with external knowledge, the task-relevant relation network is able to apply diverse metrics to diverse tasks. In detail, for each support set S (which contains C × N labeled sentences), we retrieve a set of potentially relevant KB concepts $K(S)$, where each concept $k_i$ is associated with a KB embedding $e_i \in \mathbb{R}^{d_2}$ (we describe these processes in the following section). We average these KB embeddings element by element to form the knowledge representation of this support set,
$$\mathbf{k} = \frac{1}{|K(S)|} \sum_{k_i \in K(S)} e_i$$
Then we use this knowledge representation to generate the task-relevant relation network parameters,
$$\theta^{rel} = M \mathbf{k}$$
where $\mathbf{k} \in \mathbb{R}^{d_2}$ is the knowledge representation of the support set, $M \in \mathbb{R}^{d_3 \times d_2}$ are learnable parameters, and $d_3$ denotes the number of parameters of the task-relevant relation network. With these generated parameters, we use the task-relevant relation network to generate C task-relevant relation scores $r^{rel}_{z,j}$ for the relation between one query input $x_j$ and each class representation $c_z$,
$$r^{rel}_{z,j} = \mathrm{RN}^{rel}(p_{z,j} \mid \theta^{rel}) \in \mathbb{R}, \quad z = 1, 2, \ldots, C$$
where $\mathrm{RN}^{rel}$ denotes the task-relevant relation network. Finally, the relation score is defined as
$$r_{z,j} = \mathrm{sigmoid}(r^{agn}_{z,j} + r^{rel}_{z,j})$$
where the sigmoid function keeps the score in the range of 0 to 1. Following Sung et al. (2018), each relation network consists of two fully connected layers, and the mean squared error (MSE) loss is used to train the model. The relation score is regressed to the ground truth: matched pairs have similarity 1 and mismatched pairs have similarity 0.
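The following pure-Python sketch illustrates how a knowledge representation can generate the weights of a small relation network. It is a toy under stated assumptions, not the authors' implementation: the hidden size, the ReLU between the two layers, the flat parameter layout, the addition of the two scores before the sigmoid, and all numeric values are assumptions, and the function names are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    """Element-wise average of equally sized vectors (knowledge representation)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def num_params(d_in, hidden):
    # Assumed layout: W1 (hidden x d_in), b1 (hidden), w2 (hidden), b2 (1).
    return hidden * d_in + 2 * hidden + 1

def two_layer_score(p, theta, hidden):
    """Two fully connected layers mapping p to a scalar (ReLU is an assumption)."""
    d_in = len(p)
    split = hidden * d_in
    w1 = [theta[i * d_in:(i + 1) * d_in] for i in range(hidden)]
    b1 = theta[split:split + hidden]
    w2 = theta[split + hidden:split + 2 * hidden]
    b2 = theta[split + 2 * hidden]
    h = [max(0.0, a + b) for a, b in zip(matvec(w1, p), b1)]
    return sum(w * v for w, v in zip(w2, h)) + b2

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relation_score(p, theta_agn, generator, kb_embeddings, hidden):
    k = mean_vector(kb_embeddings)        # average the retrieved KB embeddings
    theta_rel = matvec(generator, k)      # generated task-relevant parameters
    r_agn = two_layer_score(p, theta_agn, hidden)
    r_rel = two_layer_score(p, theta_rel, hidden)
    return sigmoid(r_agn + r_rel)         # summing the two scores is assumed

def mse_loss(scores, targets):
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)

rng = random.Random(0)
d_in, hidden, d_kb = 4, 3, 5
d_params = num_params(d_in, hidden)
theta_agn = [rng.uniform(-0.1, 0.1) for _ in range(d_params)]
generator = [[rng.uniform(-0.1, 0.1) for _ in range(d_kb)] for _ in range(d_params)]
p = [0.2, -0.1, 0.4, 0.3]                 # a made-up combined representation
task_a = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
task_b = [[0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0]]
score_a = relation_score(p, theta_agn, generator, task_a, hidden)
score_b = relation_score(p, theta_agn, generator, task_b, hidden)
loss = mse_loss([score_a, score_b], [1.0, 0.0])
```

Because the generated parameters are a linear function of the knowledge representation, tasks with nearby knowledge representations receive nearby network weights, which is one way to read the claim that similar tasks use similar metrics.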

Knowledge Embedding and Retrieval
We use NELL (Carlson et al., 2010) as the KB, stored as (subject, relation, object) triples, where each triple is a fact indicating a specific relation between subject and object, e.g., (Intel, competes with, Nvidia).
Knowledge Embedding. Since symbolic facts suffer from poor generalization and data sparsity, we resort to a distributed representation of triples. In detail, given any triple (s, r, o), vector embeddings of the subject s, the relation r and the object o are learned jointly such that the validity of the triple can be measured in the real number space.
We adopt the BILINEAR model (Yang et al., 2015) to measure the validity of triples:
$$f(s, r, o) = \mathbf{s}^{\top} \mathrm{diag}(\mathbf{r}) \, \mathbf{o}$$
where $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{R}^{d_2}$ are the embeddings associated with s, r, o, respectively, and $\mathrm{diag}(\mathbf{r})$ is a diagonal matrix with the main diagonal given by the relation embedding $\mathbf{r}$. To learn these vector embeddings, a margin-based ranking loss is used, where triples in the KB serve as positive examples and negative triples are constructed by corrupting either subjects or objects.
Knowledge Retrieval. Inspired by previous studies (Yang and Mitchell, 2017), exact string matching (Charras and Lecroq, 2004) is used to recognize entity mentions in a given passage and link the recognized mentions to subjects in the KB. Then, we collect the corresponding objects (concepts) as candidates. After this retrieval process, we obtain a set of potentially relevant KB concepts, where each KB concept is associated with a KB embedding.
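As a worked illustration of the scoring function and the ranking objective, the sketch below computes the bilinear score with a diagonal relation matrix and a hinge-style margin loss. The embedding values, the margin of 1.0, the entity "Apple", and the helper names are assumptions for illustration; the paper does not specify the margin or the corruption procedure in this detail.

```python
import random

def bilinear_score(s, r, o):
    """s^T diag(r) o, which reduces to sum_i s_i * r_i * o_i."""
    return sum(si * ri * oi for si, ri, oi in zip(s, r, o))

def corrupt(triple, entities, rng):
    """Build a negative triple by replacing either the subject or the object."""
    s, r, o = triple
    if rng.random() < 0.5:
        s = rng.choice([e for e in entities if e != s])
    else:
        o = rng.choice([e for e in entities if e != o])
    return (s, r, o)

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Zero once the positive triple outscores the negative by the margin."""
    return max(0.0, margin - pos_score + neg_score)

# Made-up 2-d embeddings for illustration only.
entity_emb = {"Intel": [1.0, 2.0], "Nvidia": [2.0, 1.0], "Apple": [0.0, 0.0]}
relation_emb = {"competes with": [0.5, 1.0]}

def score(triple):
    s, r, o = triple
    return bilinear_score(entity_emb[s], relation_emb[r], entity_emb[o])

positive = ("Intel", "competes with", "Nvidia")
negative = corrupt(positive, list(entity_emb), random.Random(0))
loss = margin_ranking_loss(score(positive), score(negative))
```

Minimizing this loss over many positive and corrupted triples pushes valid facts to score higher than corrupted ones.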
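The retrieval step can be sketched as follows. This is a simplified stand-in, not the matching algorithm cited above: it uses case-insensitive whole-word regular expression matching, and every triple other than (Intel, competes with, Nvidia) is hypothetical.

```python
import re

def retrieve_concepts(passage, triples):
    """Return objects of triples whose subject is mentioned in the passage."""
    text = passage.lower()
    concepts = []
    for subject, _relation, obj in triples:
        # Whole-word match so that "Intel" does not fire on "intelligent".
        pattern = r"\b" + re.escape(subject.lower()) + r"\b"
        if re.search(pattern, text) and obj not in concepts:
            concepts.append(obj)
    return concepts

triples = [
    ("Intel", "competes with", "Nvidia"),
    ("Kindle", "is a", "e-reader"),      # hypothetical triple
    ("Canon", "produces", "camera"),     # hypothetical triple
]
found = retrieve_concepts("My Intel laptop outlasted my old Kindle.", triples)
```

Each retrieved concept would then be looked up in the KB embedding table, and the resulting vectors averaged into the knowledge representation of the support set.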

Dataset
Our model is evaluated on the widely used ARSC (Blitzer et al., 2007) dataset, which comprises reviews for 23 types of products on Amazon. For each product domain, there are three different binary classification tasks, giving 69 tasks in total. Following Yu et al. (2018), we select 12 tasks from four domains (Books, DVDs, Electronics, and Kitchen) as the test set, with only 5 examples per class as the support set.

Implementation Details
In our experiments, we use Hugging Face's implementation 1 of BERT (base version) and initialize the parameters of the BERT encoding layer with the pre-trained models officially released by Google 2 . To represent knowledge in NELL (Carlson et al., 2010), the BILINEAR model (Yang et al., 2015) is implemented with the open-source framework OpenKE (Han et al., 2018) to obtain the embeddings of entities and relations. The size of the entity and relation embeddings is set to 100. To train our model, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.00001. All experiments are run on an NVIDIA GeForce RTX 2080 Ti.

Experiment Results
Baseline. We compare our method to the following baselines: (1) Matching Network is a metric-based attention method for few-shot learning; (2) Prototypical Network is a metric-based method that uses sample averages as class prototypes; (3) MAML is an optimization-based method that learns to learn with gradients; (4) Relation Network is a metric-based method that uses two fully connected layers as the distance metric and sums up the sample vectors in the support set as class vectors; (5) Graph Network is a graph-based model that implements a task-driven message passing algorithm at the sample-wise level; (6) ROBUSTTC-FSL is an approach that combines adaptive metric methods by clustering the tasks; (7) Induction Network is a metric-based method that uses dynamic routing to learn class-wise representations.

| Model | Mean Acc (%) |
| --- | --- |
| Matching Network (Vinyals et al., 2016) | 65.73 |
| Prototypical Network (Snell et al., 2017) | 68.15 |
| MAML (Finn et al., 2017) | 78.33 |
| Graph Network (Garcia and Bruna, 2017) | 82.61 |
| Relation Network (Sung et al., 2018) | 83.07 |
| ROBUSTTC-FSL (Yu et al., 2018) | 83.12 |
| Induction Network (Geng et al., 2019) | 85.63 |
| Ours | 87.93 |

Table 1: Mean accuracy on ARSC.

Analysis. Experiment results on ARSC are presented in Table 1. We observe that our method achieves the best results among all meta-learning models. Both Induction Network and Relation Network use a single metric to measure the similarity. Compared with these methods, we attribute the improvements of our model to the fact that it can adapt to diverse tasks with diverse metrics. Compared with ROBUSTTC-FSL, our model leverages knowledge to obtain implicit task clusters and is trained in an end-to-end manner, which can mitigate error propagation.

1 https://huggingface.co/transformers
2 https://github.com/google-research/bert

Effectiveness of Introducing Knowledge
To analyze the contributions and effects of external knowledge in our approach, we perform ablation and replacement studies, which are shown in Table 2. Ablation means that we delete the task-relevant relation network, so that the model is reduced to the original BERT-based relation network. We observe that ablation degrades performance. To rule out the reduction in the number of parameters as the cause, we conduct a replacement experiment, in which we replace the task-relevant relation network with a second task-agnostic relation network. We find that increasing the number of parameters slightly improves performance, but there is still a large gap between this variant and our model. Therefore, we conclude that the effectiveness of our model is attributable to introducing external knowledge rather than to increasing the number of model parameters.

Different Strategies of Introducing Knowledge
To analyze different strategies of introducing knowledge in few-shot learning, we remove the task-relevant relation network and replace the BERT encoder in our method with the KT-NET encoder and the K-BERT encoder (Liu et al., 2019). In the KT-NET encoder, an attention mechanism is used to adaptively fuse selected knowledge with BERT. In the K-BERT encoder, a knowledge-rich sentence tree is the input of the model. Both of these methods introduce knowledge at the representation level, while our method injects knowledge at the task level. The results are shown in Table 3. Combining Table 2 and Table 3, we find that (1) introducing knowledge can improve the performance of few-shot text classification; (2) it is more effective to introduce knowledge at the task level than at the representation level.

Conclusion
Inspired by human intelligence, we introduce external knowledge into few-shot learning. To this end, we investigate a parameter generator network, which uses external knowledge to generate relation network parameters. With these parameters, the relation network can handle diverse tasks with diverse metrics. Through various experiments, we demonstrate the effectiveness of our model.