Meta-Information Guided Meta-Learning for Few-Shot Relation Classification

Few-shot classification requires classifiers to adapt to new classes with only a few training instances. State-of-the-art meta-learning approaches such as MAML learn how to initialize and fast adapt parameters from limited instances, which have shown promising results in few-shot classification. However, existing meta-learning models solely rely on implicit instance-based statistics, and thus suffer from instance unreliability and weak interpretability. To solve this problem, we propose a novel meta-information guided meta-learning (MIML) framework, where semantic concepts of classes provide strong guidance for meta-learning in both initialization and adaptation. In effect, our model can establish connections between instance-based information and semantic-based information, which enables more effective initialization and faster adaptation. Comprehensive experimental results on few-shot relation classification demonstrate the effectiveness of the proposed framework. Notably, MIML achieves comparable or superior performance to humans with only one shot on FewRel evaluation.


Introduction
Conventional machine learning algorithms, especially neural methods, require an adequate amount of data to learn model parameters. To alleviate this heavy reliance on annotated data, few-shot learning, which aims to adapt to new tasks with only a few training examples, has drawn increasing attention. Few-shot classification is a typical few-shot learning task, which samples several new classes with a handful of training examples (i.e., support instances) and query instances, and requires models to classify these queries into the given classes (Lake et al., 2011; Vinyals et al., 2016).
To grasp the patterns of new classes with limited examples, meta-learning was proposed. Inspired by human behaviors, meta-learning models focus on learning to learn: they learn how to better initialize parameters and fast adapt classification models from given instances. For example, MAML (Finn et al., 2017) finds the best initialization point of parameters, where it can take minimal efforts to reach the optimal points for each class. To this end, MAML adapts towards each class by gradient steps using support instances, and uses the loss of the adapted model on the query instances to optimize the initialization parameters.
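To make the two-level optimization concrete, here is a minimal first-order MAML sketch on toy linear regression tasks. The function names (`fomaml_step`, `mse_grad`), the first-order approximation, and all hyper-parameter values are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def mse_grad(theta, X, y):
    # Gradient of mean squared error for a linear model y_hat = X @ theta.
    return 2 * X.T @ (X @ theta - y) / len(y)

def fomaml_step(theta0, tasks, inner_lr=0.05, outer_lr=0.1, inner_steps=3):
    """One first-order MAML meta-update over a batch of regression tasks.

    Each task is (X_support, y_support, X_query, y_query). The initialization
    theta0 is nudged toward a point from which a few inner gradient steps
    perform well on the query set (first-order approximation: the query
    gradient at the adapted parameters is applied directly to theta0).
    """
    meta_grad = np.zeros_like(theta0)
    for Xs, ys, Xq, yq in tasks:
        theta = theta0.copy()
        for _ in range(inner_steps):          # fast adaptation on support set
            theta -= inner_lr * mse_grad(theta, Xs, ys)
        meta_grad += mse_grad(theta, Xq, yq)  # query loss drives the meta-update
    return theta0 - outer_lr * meta_grad / len(tasks)
```

Repeating `fomaml_step` over batches of tasks moves the initialization toward a point equidistant (in adaptation effort) from the task optima.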
However, meta-learning still faces three challenges: (1) Most meta-learning methods learn how to learn (i.e., how to initialize and adapt) solely from instance statistics, and thus inevitably suffer from data sparsity and noise in low-resource scenarios, especially in the text domain. (2) The approach of learning to learn, like the learning process itself, is a black box and thus lacks interpretability. (3) Most conventional meta-learning methods are designed for few-shot classification, and cannot well handle zero-shot scenarios, where no support instances are available. In contrast, humans usually learn novel concepts from high-level descriptive definitions, instead of solely from several unsystematic instances. For example, when learning a new relation art director, humans usually first get a rough estimation of the concept from its name and definition, and then reach a more precise understanding through concrete instances.

Figure 1: Diagram of meta-learning models. (a) MAML learns a class-agnostic representation θ^0 that can fast adapt to new classes. (b) MIML learns meta-parameters Φ to fast initialize class-aware parameters θ_i^0, and to quickly adapt to new classes using informative instances, where both phases are guided by meta-information. Informative instances and noisy instances are marked accordingly.
Inspired by the learning process of humans, we propose a novel Meta-Information guided Meta-Learning (MIML) framework, as shown in Figure 1. The meta-information derives from the semantic concepts of classes, and could provide strong guidance for both parameter initialization and fast adaptation in meta-learning. Specifically, MIML integrates meta-information in two essential components, namely the meta-information guided fast initialization and fast adaptation. (1) In meta-information guided fast initialization, instead of using a static class-agnostic initialization point for all classes as in MAML, MIML uses meta-information to estimate dynamic class-aware initialization parameters for each class. This alleviates the reliance on support instances to reach optimal adapted parameters. (2) In meta-information guided fast adaptation, MIML adapts the class-aware initialization parameters with gradient steps according to the support instances, where informative support instances are selected to contribute more to the adaptation gradients with a novel meta-information based attention mechanism. By integrating high-level meta-information and concrete instances, MIML achieves superior performance on low-resource tasks. Moreover, MIML also provides better interpretability in meta-learning process.
Note that ours is not the first attempt to use meta-information for low-resource classification tasks: in zero-shot learning, where there are no training examples for new classes at all, class names are used to produce semantic representations for classification (Socher et al., 2013; Frome et al., 2013; Norouzi et al., 2014). In few-shot scenarios, however, support examples can bring more direct supervision. In this paper, we argue that both signals are crucial to the learning process, and that combining them achieves the best results.
In experiments, the significant improvements on few-shot relation classification tasks demonstrate the effectiveness and robustness of MIML in low-resource relation classification. We show the advantage of MIML in handling noisy instances, and its potential in zero-shot classification. We also conduct comprehensive ablation study and visualization to better understand our model. In summary, our main contributions are twofold: (1) We propose a principled meta-information guided meta-learning framework for few-shot classification. To the best of our knowledge, we are the first to introduce meta-information to meta-learning for few-shot relation classification. (2) We conduct comprehensive experiments to demonstrate the effectiveness of MIML. Notably, MIML achieves human-level performance with only one shot on FewRel evaluation. We also show the robustness and interpretability of MIML, as well as its potential in zero-shot classification through experiments.

Preliminary
In few-shot classification, we aim to learn a model that can handle the classification task with only a few available training instances. Specifically, given a set of classes C sampled from the class distribution p(C), the model is required to first learn classifiers on the support set S, and then handle the classification task on the query set Q, where S and Q consist of instances {x_j, y_j}_{j=1}^{m} from the same classes, and x_j is an instance of class y_j. Few-shot classification is usually formalized in an N-way-K-shot setting, where C contains N different classes, and S contains K instances for each of the N classes.

Algorithm 1: Meta-Information Guided Meta-Learning
Require: p(C): distribution over classes
Require: β: meta learning rate
1: Randomly initialize meta-parameters Φ = {φ_e, φ_n, φ_a}
2: while not done do
3:     Sample a batch of classes C_i ∼ p(C)
4:     Sample support instance set S and query instance set Q
5:     for all C_i do
6:         Fast initialize the parameters of C_i: θ_i^0 = Ψ(c_i; φ_n)
7:         for t = 1, . . . , T do
8:             Compute gradients and learning rates for fast adaptation using the support set S
9:             Compute adapted parameters with gradient descent: $\theta_i^t = \theta_i^{t-1} - \sum_{(x_j, y_j) \in S} \alpha_{i,j} \nabla_{\theta_i} \mathcal{L}(f_{\theta^{t-1}, \{\phi_e, \phi_n\}}, x_j, y_j)$
10:    Meta-optimize Φ using the query instance set Q
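The episode sampling in lines 3-4 of Algorithm 1 can be sketched as follows (a hypothetical helper over a `{class_name: [instances]}` dictionary; not the authors' code):

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_query=5, seed=None):
    """Sample one N-way K-shot episode from {class_name: [instances]}.

    Returns a support set with k_shot instances per class and a query set
    with q_query instances per class, mirroring the sets S and Q.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)   # N sampled classes
    support, query = [], []
    for label in classes:
        picked = rng.sample(dataset[label], k_shot + q_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return classes, support, query
```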
Our work is inspired by MAML (Finn et al., 2017), an effective meta-learning approach to the few-shot classification problem. MAML contains two key phases: initialization and fast adaptation. Initialization aims to learn a globally shared initialization point of parameters for different classes, such that a few gradient steps of fast adaptation on the initialization parameters can produce good results on new classes. We refer readers to the paper (Finn et al., 2017) for more details about MAML.

Methodology
In this section, we introduce our meta-information guided meta-learning (MIML) framework. Despite the effectiveness of MAML, we observe that two assumptions in MAML limit the model capacity: (1) In initialization, MAML assumes that the parameters of different classes can be derived from a single set of initialization parameters within a few gradient steps. However, a single initialization point cannot well capture the knowledge shared across different classes, especially when the number of classes is large, making it difficult to adapt the initialized parameters to reasonable performance within a few gradient steps.
(2) In fast adaptation, MAML assumes that different instances in support set are equally important, and thus share the same learning rate for parameter adaptation. However, instances in text are usually diverse and noisy in practice, and noisy instances can dominate the model parameters in fast adaptation to produce inferior results (Koh and Liang, 2017).
To address the aforementioned problems, we propose MIML, which integrates meta-information into meta-learning and provides strong guidance in both the initialization and adaptation phases. The intuition behind MIML is that humans learn new concepts from both high-level meta-information and concrete instances. Specifically, MIML consists of four components:

Instance Encoder. Given a sentence and the corresponding entity pair, we employ deep neural networks (with meta-parameters φ_e) to construct the representation of the relation between the entity pair.

Meta-Information Guided Fast Initialization. In the fast initialization phase, MIML dynamically initializes the parameters for each class based on meta-information (with meta-parameters φ_n), which can be viewed as a rough but flexible estimation of class parameters from high-level semantics.

Meta-Information Guided Fast Adaptation. In the fast adaptation phase, MIML adapts the initialized parameters according to the performance on the support set, and selects informative support instances to contribute more to the adaptation gradients (with meta-parameters φ_a), which can be viewed as accurate fine-tuning from concrete instances.

Meta-Optimization. In the meta-optimization phase, the meta-parameters Φ = {φ_e, φ_n, φ_a} are optimized based on the performance of the adapted model on the query set.

The overall framework is shown in Algorithm 1.

Instance Encoder
Given a sentence and the corresponding target entity pair (i.e., head entity and tail entity), we employ the BERT model (Devlin et al., 2019) to encode the instance into contextualized representations, due to its effectiveness on a broad variety of NLP tasks. Specifically, sentences are first tokenized into word pieces (Wu et al., 2016). Inspired by Soares et al. (2019), to mark the positions of entities, we adopt four special tokens as entity markers, and insert them at the start and end of each entity. We select the representations of the start tokens of the head entity and tail entity on the top layer, and concatenate them to obtain the instance representation. The instance encoder can be formulated as $\mathbf{x}_j = g(x_j, h, t; \phi_e)$, where x_j is the sentence, h and t are the head and tail entities respectively, g(·) is the encoder, φ_e denotes the parameters of the encoder, and $\mathbf{x}_j \in \mathbb{R}^{d_s}$ is the instance representation.
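The entity-marker insertion described above can be sketched as follows (the marker token names `[H]`, `[/H]`, `[T]`, `[/T]` are placeholders; the paper only specifies that four special tokens are inserted at entity boundaries):

```python
def add_entity_markers(tokens, head_span, tail_span):
    """Insert four reserved marker tokens around the head and tail entities.

    Spans are (start, end) token indices with end exclusive. Markers are
    inserted right-to-left so earlier indices stay valid.
    """
    inserts = sorted([
        (head_span[0], "[H]"), (head_span[1], "[/H]"),
        (tail_span[0], "[T]"), (tail_span[1], "[/T]"),
    ], key=lambda p: p[0], reverse=True)
    out = list(tokens)
    for pos, marker in inserts:
        out.insert(pos, marker)
    return out
```

The encoder would then read the marked sequence and take the hidden states at the `[H]` and `[T]` positions as the relation representation.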

Meta-Information Guided Fast Initialization
Given a set of classes {C_1, C_2, ..., C_N} sampled from the class distribution p(C), MAML learns a class-agnostic initialization that can adapt to new classes via a few gradient steps. In comparison, we utilize meta-information for class-aware initialization in a generative manner via a meta-initializer module.
The meta-initializer module captures meta-knowledge shared across different classes, and generates the class-aware parameters from the semantic knowledge in the meta-information. We initialize the parameters of each class with meta-information derived from its semantic concepts. In this work, without loss of generality, we utilize class names as our meta-information, i.e., relation names such as founder of and birth place. Note that it is also convenient to generate class parameters from other meta-information such as textual descriptions and hierarchical ontologies. Specifically, given the name of a class C_i, we obtain the meta-information representation $c_i \in \mathbb{R}^{d_w}$ by averaging the word embeddings of the name. Then the parameters of the class are initialized via the meta-initializer module as $\theta_i^0 = \Psi(c_i; \phi_n)$, where $\theta_i^0 \in \mathbb{R}^{d_s}$ is the class-aware initialization parameter vector for class C_i, Ψ(·) is the meta-initializer, and φ_n is the corresponding meta-parameters. In our experiments, Ψ(·) is implemented as a fully connected layer. Intuitively, the meta-initializer mimics the learning process of humans, who usually first get a rough but flexible estimation of a new concept based on its high-level semantics. The initialized parameters θ_i^0 can be used to measure the classification score of an instance: $s_{i,j} = {\theta_i^0}^\top \mathbf{x}_j$, where s_{i,j} is the score of x_j being an instance of C_i. The probability p(y = C_i | x_j) is obtained by normalizing the score s_{i,j} with a softmax layer over all classes {C_1, C_2, ..., C_N}. The model after fast initialization is denoted as f_{θ^0, {φ_e, φ_n}}, where θ^0 = {θ_1^0, θ_2^0, ..., θ_N^0} denotes the initialized parameters.
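A minimal sketch of fast initialization and scoring, assuming the classification score is a dot product between the class parameter and the instance representation, with plain numpy arrays standing in for the learned layers:

```python
import numpy as np

def fast_initialize(name_embeddings, W, b):
    """Class-aware fast initialization: average the word embeddings of a
    class name to get c_i, then map it through a fully connected layer
    (the meta-initializer, here just W @ c + b) to theta_i^0."""
    c = np.mean(name_embeddings, axis=0)  # meta-information representation c_i
    return W @ c + b                       # theta_i^0 = Psi(c_i; phi_n)

def classify(x, thetas):
    """Score an instance against each class parameter (dot product is an
    assumed form of the score) and normalize with a softmax."""
    scores = np.array([theta @ x for theta in thetas])
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()
```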

Meta-Information Guided Fast Adaptation
In fast adaptation, like human learners, MIML fine-tunes the estimation of a new concept with concrete instances. Specifically, the initialized parameters θ^0 are adapted via gradient descent steps, according to the classification performance on the support set S. The adaptation iterates for T steps. At each time step t, the parameters are adapted as follows:

$\theta_i^t = \theta_i^{t-1} - \sum_{(x_j, y_j) \in S} \alpha_{i,j} \nabla_{\theta_i} \mathcal{L}(f_{\theta^{t-1}, \{\phi_e, \phi_n\}}, x_j, y_j)$

where L denotes the cross-entropy loss of a support instance (x_j, y_j) computed by the model f_{θ^{t-1}, {φ_e, φ_n}}, and α_{i,j} is the learning rate of θ_i on the support instance (x_j, y_j). The parameters after T steps of adaptation are denoted as θ^T. With a static learning rate for all instances, noisy instances can dominate the model parameters in fast adaptation (Koh and Liang, 2017), which leads to inferior performance. To select informative instances for fast adaptation, instead of using a static learning rate for all instances, MIML determines the learning rate of each instance dynamically with a selective attention mechanism:

$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'} \exp(e_{i,j'})}$

where e_{i,j} is the score of instance x_j for class C_i. Intuitively, e_{i,j} should be large if x_j is an informative instance of class C_i, so that x_j contributes more to the adaptation of θ_i, i.e., its learning rate is larger. The score is obtained by $e_{i,j} = q_i^\top \mathbf{x}_j$, where $q_i \in \mathbb{R}^{d_s}$ is the query vector for class C_i. We note that, similar to the classifier parameters, learning the query vectors also faces data sparsity in few-shot classification, since only a few training instances are available for each class. Thus we estimate the query vector from meta-information via a meta-querier module, $q_i = \Psi(c_i; \phi_a)$, where Ψ(·) is implemented as a fully connected layer with meta-parameters φ_a.
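The attention-based learning rates can be sketched as follows (a plain softmax over the scores, rescaled by a base rate, is an assumption about the exact normalization; the helper names are illustrative):

```python
import numpy as np

def attention_learning_rates(q_i, support_reps, base_lr=0.1):
    """Instance-specific learning rates for class C_i: a softmax over the
    dot-product scores e_ij = q_i . x_j, scaled so the rates sum to
    base_lr * K. Informative instances receive larger steps."""
    e = np.array([q_i @ x for x in support_reps])
    a = np.exp(e - e.max())
    a /= a.sum()
    return base_lr * len(support_reps) * a

def adapt(theta, grads, alphas):
    """One fast-adaptation step: each support instance contributes its
    loss gradient weighted by its own learning rate."""
    for g, a in zip(grads, alphas):
        theta = theta - a * g
    return theta
```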
In our experiments, we observe that the estimation of the class-aware parameters (i.e., the initialization parameters θ_i^0 and query vectors q_i) is prone to over-fitting, due to the limited number of classes, e.g., fewer than 100 in most datasets. This limits the diversity of inputs to the meta-initializer and meta-querier, which leads to complex hyper-planes in the meta-information space and hurts the generalization ability.
We address the problem by (1) regularizing class-aware parameters by L2 normalization, and (2) penalizing sharp changes in the meta-information space via virtual adversarial training (Miyato et al., 2017). Specifically, we normalize class-aware parameters to be of unit length in L2 norm. For virtual adversarial training, we add worst-case perturbations on the meta-information c i , such that the classification results on the query set reach the maximum changes. We measure the changes of classification results by Kullback-Leibler divergence, and penalize the changes to encourage a smooth meta-information space.

Meta-Optimization
After fast adaptation on the support instances, the meta-parameters Φ = {φ_e, φ_n, φ_a} are optimized according to the performance of the adapted model on the query set Q:

$\Phi \leftarrow \Phi - \beta \nabla_{\Phi} \sum_{(x_j, y_j) \in Q} \mathcal{L}(f_{\theta^T, \{\phi_e, \phi_n\}}, x_j, y_j)$

where β is the learning rate for the meta-parameters. In this way, MIML learns meta-parameters that can effectively customize initialization parameters for each class and select informative support instances for fast adaptation, so as to produce good classification results on the query set.

Implementation Details
All hyper-parameters are selected by grid search on the development set. The class distribution p(C) is implemented as a uniform distribution. We adopt Adam (Kingma and Ba, 2015) to optimize the meta-parameters. The meta learning rate β is 1 for the meta-initializer and meta-querier, and 5e-5 for the instance encoder. We employ 50-dimensional GloVe (Pennington et al., 2014) word embeddings and BERT_BASE (Devlin et al., 2019), as implemented by Wolf et al. (2019), as the instance encoder. The hidden dimensions d_s and d_w are 1,536 and 50 respectively. The number of adaptation steps T is 150.
In virtual adversarial training, we first randomly generate a perturbation vector δ_1 for the meta-information representation c_i. The perturbation vector δ_1 is scaled such that its L2 norm is 1e-3. We add δ_1 to c_i, and compute the worst-case perturbation δ_2 based on the gradient. Finally, δ_2 is scaled to an L2 norm of 1e-3 and added to c_i to obtain the perturbed representation.
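The two-step perturbation can be sketched as follows (here `grad_fn` is a hypothetical callable returning the gradient of the KL-divergence objective with respect to the perturbed meta-information vector):

```python
import numpy as np

def vat_perturb(c, grad_fn, eps=1e-3):
    """Two-step virtual adversarial perturbation of a meta-information
    vector c: scale a random direction to eps, evaluate the gradient of
    the KL objective at the perturbed point, then rescale that gradient
    direction to eps and apply it."""
    rng = np.random.default_rng(0)
    d1 = rng.normal(size=c.shape)
    d1 *= eps / (np.linalg.norm(d1) + 1e-12)  # random perturbation delta_1
    g = grad_fn(c + d1)                        # gradient at the perturbed point
    d2 = g * (eps / (np.linalg.norm(g) + 1e-12))  # worst-case delta_2, scaled
    return c + d2
```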

Experiments
In this section, we empirically evaluate MIML on few-shot relation classification. To evaluate the robustness of MIML, we conduct experiments in the presence of noisy instances. We also show the potential of MIML in zero-shot classification. Ablation study and visualization are conducted to better understand the inner mechanism of MIML.

Experiment Settings
We first introduce the experiment settings, including datasets, evaluation protocol and baselines.
Dataset. We evaluate MIML on FewRel (Han et al., 2018), a widely-used few-shot relation classification dataset. FewRel contains 70,000 labeled sentences in 100 relations (i.e., each relation has 700 sentences). The relation annotations are first generated under the distant supervision assumption (Mintz et al., 2009) by aligning Wikipedia and Wikidata (Vrandečić and Krötzsch, 2014), and then labeled by human annotators. The training set contains 44,800 sentences in 64 relations, the validation set has 11,200 sentences in 16 relations, and the test set has the remaining 14,000 sentences in 20 relations.
Evaluation Protocol. Following the settings in Han et al. (2018), we consider four few-shot settings in evaluation, namely 5-way-1-shot, 5-way-5-shot, 10-way-1-shot and 10-way-5-shot. The N-way-K-shot setting indicates that each evaluation batch has N classes that do not appear in the training set, and each class has K support instances. Fewer shots or more ways imply more challenging settings. We adopt the classification accuracy on query instances as the evaluation metric.
Baseline. We compare MIML with strong baseline methods for few-shot classification. Meta Network (Munkhdalai and Yu, 2017) and SNAIL (Mishra et al., 2018) are classical meta-learning models that learn to fast adapt to new classes. GNN (Garcia and Estrach, 2018) performs message passing over instance graphs. Prototypical Network (Snell et al., 2017) constructs the prototypes of new classes by averaging their instance representations. MLMAN (Ye and Ling, 2019) obtains prototypes via a multi-level matching and aggregation network. We directly report the accuracies of these models (with CNN encoders), and human performance, from the FewRel leaderboard. We also compare with strong baselines using BERT (Devlin et al., 2019) encoders. BERT-PAIR (Gao et al., 2019b) measures the similarity of an instance pair using BERT. In addition, we implement an enhanced Prototypical Network and MAML (Finn et al., 2017) with a BERT encoder for fair comparison.

Table 2: Accuracies (%) on few-shot relation classification with noise on the FewRel development set.

Main Results
We report the main results in Table 1, from which we have the following observations: (1) MIML consistently outperforms all baseline methods in four settings. Notably, MIML achieves comparable or superior performance to humans with only one shot. To the best of our knowledge, we are the first to achieve human-level performance with only one shot on FewRel without tailored pretraining for RE. The results demonstrate that MIML can effectively leverage high-level meta-information to provide strong guidance for meta-learning.
(2) The advantages of MIML are more significant in more challenging settings, i.e., with fewer shots or more ways. For example, MIML achieves an 8.5-point absolute accuracy improvement over MAML in the 10-way-1-shot setting. This is because, in comparison to the static class-agnostic initialization in MAML, meta-information guided fast initialization in MIML produces more flexible class-aware initializations, which alleviates the heavy reliance on support instances. In Section 4.3, we further show the advantage of MIML when multiple shots are available in the presence of noise.

Robustness to Noisy Instances
Instances in real-world few-shot text classification tasks can be diverse and noisy, especially when multiple support instances are available. Previous work has shown that noisy instances can dominate the model parameters (Koh and Liang, 2017); this is especially true for meta-learning methods such as MAML, where adaptation is driven by instance gradients and noisy instances incur substantially higher losses. To demonstrate the robustness of MIML in the presence of noise, we randomly corrupt 0%, 10%, 20% and 30% of the support instances by replacing them with noisy instances randomly sampled from different relations in FewRel. In addition to Prototypical Network and MAML, we also compare MIML with the hybrid attention-based prototypical network (Gao et al., 2019a) (Proto-HATT), which uses hybrid attention to denoise Prototypical Network. The results are shown in Table 2, from which we observe that: (1) The performance of MAML degrades significantly as the noise rate increases, since its fast adaptation process can be dominated by noisy instances. Prototypical Network constructs the prototype as the average of all instances, and shows smaller performance drops. The results reveal the disadvantage of gradient-based meta-learning models in dealing with noisy instances.
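The corruption procedure can be sketched as follows (a hypothetical helper; `noise_pool` holds sentences drawn from other relations):

```python
import random

def corrupt_support(support, noise_pool, noise_rate, seed=0):
    """Replace a fraction of (sentence, label) support pairs with sentences
    sampled from other relations, keeping the original (now wrong) labels,
    to mimic the noise setting in the experiments."""
    rng = random.Random(seed)
    support = list(support)
    n_noisy = round(noise_rate * len(support))
    for idx in rng.sample(range(len(support)), n_noisy):
        _, label = support[idx]
        support[idx] = (rng.choice(noise_pool), label)  # wrong sentence, same label
    return support
```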
(2) MIML consistently outperforms baseline methods in different noise rates. Specifically, MIML exhibits smaller drops in performance as compared to MAML and Prototypical Network. The results show that meta-information guided fast adaptation can effectively select informative instances, which helps MIML overcome the inherent disadvantage of gradient-based meta-learning models, and achieve more robust fast adaptation in the presence of noise.

Zero-Shot Classification
In this section, we show the potential of MIML in zero-shot classification. Specifically, we remove the support instances in the evaluation phase in the 5-way and 10-way settings, and ask the model to classify query instances with the class-aware initialization parameters alone. We compare MIML with strong zero-shot classification baselines. DeViSE (Frome et al., 2013) utilizes word embeddings of class names to classify instances from unseen classes, and we implement the DeViSE model with a BERT encoder. SK4 (Zhang et al., 2019) incorporates rich semantic knowledge of classes, including word embeddings, class descriptions, class hierarchy, and commonsense knowledge graphs.

Setting        | Random | DeViSE       | SK4          | MIML
5-way-0-shot   | 20.00  | 55.90 ± 0.09 | 79.68 ± 0.12 | 79.54 ± 0.06
10-way-0-shot  | 10.00  | 42.29 ± 0.08 | 66.17 ± 0.11 | 61.14 ± 0.10

Table 3: Experimental results of zero-shot classification on the FewRel development set.

We report the results in Table 3, from which we observe that MIML achieves reasonable performance compared to models tailored for the zero-shot classification problem. This is because the class-aware fast initialization parameters in MIML are guided by meta-information, and can thus serve as classifiers without further adaptation on support instances. In summary, the results show that MIML can effectively integrate high-level meta-information and concrete instances for low-resource classification tasks, including both few-shot and zero-shot classification.

Ablation Study
To investigate the contribution of different components in MIML, we conduct ablation study in 10-way-5-shot setting, by removing each component, including meta-information guided fast initialization (MI) and adaptation (MA), class-aware parameter normalization (NM) and virtual adversarial training (VAT). Table 5 shows the results of ablation study.
We can observe that all components contribute to the performance of MIML. The performance drops most significantly when removing class-aware parameter normalization. This is because estimating high-dimensional parameters in a generative manner is prone to over-fitting and high variance, which can be effectively regularized by class-aware parameter normalization. Meta-information guided fast initialization also contributes significantly to the performance, indicating the importance of class-aware initialization for meta-learning models.

Visualization
In addition to the improvements in performance, the meta-information guided meta-learning process in MIML can also provide better interpretability in few-shot classification problems. To give a more intuitive picture and show the interpretability of MIML, we visualize the workflow of MIML in the presence of 20% noise in 5-way-5-shot setting, and compare it with MAML. Specifically, we visualize the initialization representations and adaptation steps using principal component analysis (Jackson, 2005). From Figure 2, we have the following observations: (1) In comparison to MAML, the initialization parameters in MIML reflect the semantic similarity between classes. For example, the initialization point of relation sport is close to member of, and far from child. This is achieved by the semantic guidance from high-level meta-information.
(2) The fast adaptation of MAML is highly influenced by noisy instances, and exhibits high variance in its adaptation trajectories. In comparison, noisy instances in MIML are assigned smaller learning rates by the proposed attention mechanism (not shown in the figure), and thus produce smaller noisy gradient steps, which results in more stable adaptation trajectories.

Figure 2: Visualization of the initialization and adaptation process of meta-learning models, in the 5-way-5-shot setting with 20% noise. At each iteration, the adaptation gradients for a class parameter θ_i come from three parts: informative instances from class C_i (green arrows), noisy instances for class C_i (red arrows), and instances from other classes (blue arrows). Best viewed in color.
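For reference, the 2-D projection used in such visualizations can be computed with a few lines of SVD-based principal component analysis (a generic sketch, not the authors' plotting code):

```python
import numpy as np

def pca_2d(X):
    """Project row vectors (e.g., class parameters along an adaptation
    trajectory) onto their top two principal components, via SVD of the
    centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T  # coordinates in the top-2 principal directions
```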

Related Work
Few-Shot Learning. Few-shot learning aims to grasp new tasks with only a handful of training data. There are two main lines of approaches: (1) Metric-learning methods learn an embedding space that can well measure the similarities between instances. Koch et al. (2015) and Vinyals et al. (2016) use vector distance functions to measure the similarities of examples, while Sung et al. (2018) and Garcia and Estrach (2018) use neural networks to learn the metrics. Besides, Snell et al. (2017) propose to compute a prototype of each few-shot class for classification. Specifically targeting few-shot relation classification, Gao et al. (2019a) introduce a hybrid attention mechanism to alleviate the noisy-data problem. Ye and Ling (2019), Soares et al. (2019), Gao et al. (2019b) and Sui et al. (2020) utilize local feature comparison to further improve few-shot performance.
(2) Meta-learning models, on the other hand, transfer the experience about how to "learn" a new class from the training set to the test domain. One line of meta-learning uses recurrent networks to grasp the meta-knowledge and predict the updated parameters in a black-box manner (Ravi and Larochelle, 2017; Munkhdalai and Yu, 2017; Mishra et al., 2018). Another direction is to learn how to better initialize parameters for new classes (Finn et al., 2017; Finn et al., 2018) or to adapt faster (Bertinetto et al., 2018; Zintgraf et al., 2019; Rajeswaran et al., 2019) through meta-training. Our work is mainly based on MAML (Finn et al., 2017), which we briefly introduced in the Preliminary section. Many efforts have been devoted to improving MAML. In addition to the initialization parameters, Li et al. (2017) propose to also meta-learn the adaptation learning rate from implicit instance statistics. Rusu et al. (2018) learn a data-dependent representation of model parameters for initialization, and perform gradient-based meta-learning in the low-dimensional space. Yao et al. (2019) cluster relevant tasks and initialize tasks within the same cluster with the same parameters. In comparison, MIML integrates meta-information into meta-learning, which provides strong guidance in both initialization and adaptation.
Zero-Shot Learning. Zero-shot learning focuses on grasping new tasks with no training data, and usually relies on meta-information, such as names or descriptions, to learn new classes. There have been many efforts on zero-shot learning in the cross-modal scenario, where class names serve as meta-information for images (Socher et al., 2013; Frome et al., 2013; Norouzi et al., 2014). The general idea of these approaches is to align the semantic spaces of images and their names.
Existing meta-learning approaches provide an efficient framework for transfer learning and fast adaptation, while zero-shot models prove the effectiveness of meta-information. To the best of our knowledge, MIML is the first attempt to combine meta-information with meta-learning for few-shot classification.

Conclusion and Future Work
In this work, we propose a meta-information guided meta-learning framework (MIML) for few-shot relation classification. We conduct comprehensive experiments and achieve human-level performance in few-shot relation classification with only one shot. In addition, we show the advantage and interpretability of MIML in handling noisy instances, and its potential in zero-shot classification.
We plan to explore the following directions in future work: (1) We will explore more meta-information for meta-learning, such as class descriptions and knowledge graphs. (2) We will develop more sophisticated models to capture the fine-grained interactions between high-level meta-information and concrete instances, to better guide meta-learning for few-shot classification.

A Few-Shot Text Classification
In addition to relation classification, MIML can also potentially be applied to other few-shot classification tasks. We perform experiments on the text classification dataset RCV1 (Lewis et al., 2004), which contains Reuters newswire articles under different topics. Following Bao et al. (2019), we use a subset of RCV1 with 740 articles in 37 topics for training and 680 articles in 34 topics for validation. We compare with baselines from Bao et al. (2019), where each model is treated as a combination of a text representation and a learning algorithm: (1) Text Representations. AVG calculates the average embedding of the words as the representation. IDF weights the word embeddings by inverse document frequency. CNN represents text by the outputs of a one-dimensional convolution layer followed by a max-pooling layer. DS (Distributed Signature) uses attention scores learned by a meta-learning framework to weight word embeddings (Bao et al., 2019). When implementing the BERT encoder, we form the input by concatenating the first 60 and the last 40 tokens of each article for better efficiency.
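As an illustration of the IDF representation described above, here is a minimal sketch (hypothetical helper names; the embedding table `emb` is assumed given):

```python
import math
import numpy as np

def idf_weights(corpus_tokens):
    """Inverse document frequency over a tokenized corpus."""
    n = len(corpus_tokens)
    df = {}
    for doc in corpus_tokens:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}

def idf_avg(doc, idf, emb):
    """IDF-weighted average of word embeddings for one document:
    rarer words contribute more to the document vector."""
    ws = np.array([idf.get(w, 0.0) for w in doc])
    vecs = np.array([emb[w] for w in doc])
    return (ws[:, None] * vecs).sum(0) / (ws.sum() + 1e-12)
```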
(2) Learning Algorithms. In addition to PROTO and MAML, we compare with three other learning algorithms. NN finds the nearest neighbor under Euclidean distance. FT first pre-trains a classifier using all training examples, and then fine-tunes it on the support set. DS-ML estimates attention scores over word embeddings via a meta-learning framework (Bao et al., 2019).
The results are shown in Table 5, from which we observe that MIML achieves competitive performance on few-shot text classification, demonstrating its effectiveness. We leave exploring the potential of MIML in other few-shot classification tasks as future work.