FewRel 2.0: Towards More Challenging Few-Shot Relation Classification

We present FewRel 2.0, a more challenging task to investigate two aspects of few-shot relation classification models: (1) Can they adapt to a new domain with only a handful of instances? (2) Can they detect none-of-the-above (NOTA) relations? To construct FewRel 2.0, we build upon the FewRel dataset by adding a new test set in a quite different domain, and a NOTA relation choice. With the new dataset and extensive experimental analysis, we found (1) that the state-of-the-art few-shot relation classification models struggle on these two aspects, and (2) that the commonly-used techniques for domain adaptation and NOTA detection still cannot handle the two challenges well. Our research calls for more attention and further efforts to these two real-world issues. All details and resources about the dataset and baselines are released at https://github.com/thunlp/fewrel.


Introduction
Few-shot learning, which requires models to handle new classification tasks with only a handful of training instances, has drawn much attention in recent years (Ravi and Larochelle, 2017; Vinyals et al., 2016; Munkhdalai and Yu, 2017; Snell et al., 2017). To advance this field in NLP, FewRel was proposed as a large-scale dataset for exploring few-shot learning in relation classification. Many efforts (Gao et al., 2019; Soares et al., 2019) have been devoted to the new task, and some methods even exceed human performance on FewRel.
Building on FewRel, we propose FewRel 2.0, a new task covering two real-world issues that FewRel ignores: (1) few-shot domain adaptation, and (2) few-shot none-of-the-above detection.
Few-shot domain adaptation (few-shot DA) evaluates the ability of few-shot models to transfer across domains, which is crucial for real-world applications because test domains usually lack annotations and can differ vastly from the training domains. To this end, we construct a new test set that differs greatly from the original FewRel dataset, and carry out extensive experiments with state-of-the-art few-shot models and commonly used domain adaptation methods. Preliminary results in Figure 1 show that even the most effective methods on FewRel drop drastically in performance on the new test set, indicating that few-shot DA is challenging and requires further investigation.
Few-shot none-of-the-above detection (few-shot NOTA) extends the existing N-way K-shot setting in few-shot learning. The original N-way K-shot setting samples N classes, together with K supporting instances and several queries from each class for each test batch, and assumes that all queries belong to the sampled N classes. In few-shot NOTA, however, a query may also be none-of-the-above (NOTA), which adds one more option to the classification and challenges existing few-shot methods. Since few-shot NOTA has not yet been widely explored, we propose several solutions based on state-of-the-art few-shot models and evaluate them under the few-shot NOTA setting. Figure 1 shows that, although these solutions achieve promising results, there is still room for improvement on few-shot NOTA.
In the following sections, we first describe the two newly-added challenges in FewRel 2.0, then introduce possible directions for addressing these two issues, and finally present results and observations from our experiments.

2 FewRel 2.0

Formulation for N-Way K-Shot Setting
The original FewRel task adopts the N-way K-shot setting. The whole dataset is divided into training, validation and test subsets, which share no relation types. Models are evaluated with batches sampled from the test set, each of which consists of $(R, S, x, r)$, where $R = \{r_1, r_2, \ldots, r_N\}$ is the sampled relation set, $r \in R$ is the correct relation label for the query $x$, and $S = \{x_r^j \mid r \in R,\ 1 \le j \le K\}$ is the supporting set containing K instances for each relation. Models should predict the relation label $y \in R$ for the query instance $x$ based on the given $S$ and $R$. Both of the following two challenges are based on this N-way K-shot setting.
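As a rough illustration of the (R, S, x, r) batch described above, the following sketch samples one N-way K-shot episode. The dict-of-lists data layout and the name `sample_episode` are assumptions made for this example; they are not taken from the FewRel toolkit.

```python
import random
from typing import Dict, List, Tuple


def sample_episode(
    data: Dict[str, List[str]],  # hypothetical layout: relation name -> sentences
    n_way: int,
    k_shot: int,
    rng: random.Random,
) -> Tuple[List[str], Dict[str, List[str]], str, str]:
    """Return (R, S, x, r) for one N-way K-shot episode."""
    relations = rng.sample(sorted(data), n_way)  # R: N distinct relations
    true_relation = rng.choice(relations)        # r: label of the query
    support: Dict[str, List[str]] = {}
    query = ""
    for rel in relations:
        # Draw K supporting instances, plus one extra for the query relation
        # so that the query never appears in its own supporting set.
        need = k_shot + 1 if rel == true_relation else k_shot
        picked = rng.sample(data[rel], need)
        support[rel] = picked[:k_shot]
        if rel == true_relation:
            query = picked[k_shot]
    return relations, support, query, true_relation
```

A model under this setting would receive `relations`, `support` and `query`, and be scored on whether its prediction equals `true_relation`.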

Few-Shot Domain Adaptation
Both the training and test sets of the original FewRel dataset are constructed by manually annotating distantly supervised (Bunescu and Mooney, 2007; Mintz et al., 2009) results obtained from the Wikipedia corpus and the Wikidata (Vrandečić and Krötzsch, 2014) knowledge base. In other words, they come from the same domain, yet in a real-world scenario we might train models on one domain and perform few-shot learning on a different one. For example, we may train models on Wikipedia, which has large amounts of data and adequate annotations, and then perform few-shot learning on domains that suffer from data sparsity, such as literature, finance and medicine. Note that these corpora not only differ vastly from each other in morphology and syntax, but the relation sets defined on these domains also differ widely, which makes transferring knowledge across domains more challenging.
To explore few-shot DA, we construct a new test set by aligning PubMed, a database containing large amounts of biomedical literature, with UMLS, a large-scale knowledge base in the biomedical sciences. We then ask annotators to judge whether each instance obtained from distant supervision is correct. Every sentence is assigned to at least two annotators, and if their annotations disagree, a third annotator is assigned. In the end, we gather a valid dataset with 25 relations and 100 instances for each relation.
For few-shot DA, we use the original FewRel training set for training and the newly annotated dataset for testing, as shown in Table 1. In addition, we use the SemEval-2010 task 8 dataset (Hendrickx et al., 2009) as the validation set, since both the corpus and the schema of SemEval-2010 task 8 come from domains different from the original FewRel dataset and the newly annotated test set.

Few-Shot None-of-the-Above Detection
In the N-way K-shot setting, all queries are assumed to belong to the given relation set, yet sentences expressing no specific relation, or relations outside the given set, should also be taken into consideration, for they make up the vast majority of text. This calls for the none-of-the-above (NOTA) relation, which indicates that the query instance does not express any of the given relations. Although NOTA is common in some conventional classification tasks, where it is usually regarded as an extra class, detecting it can be hard in few-shot learning, because the given relation set changes from batch to batch, so the NOTA relation has to cover a different semantic space each time. An example of NOTA is given in Table 1.
We formalize few-shot NOTA based on the N-way K-shot setting.
For the query instance $x$, the correct relation label becomes $r \in \{r_1, r_2, \ldots, r_N, \mathrm{NOTA}\}$ rather than $r \in \{r_1, r_2, \ldots, r_N\}$. We use the parameter NOTA rate to describe the proportion of NOTA queries during the whole test phase. For example, 0% NOTA rate means no queries are NOTA and 50% NOTA rate means half of the queries have the label NOTA.
The NOTA queries are sampled from relations outside the given N relations. To be more specific, denoting the whole test set as $D_{test}$, the set containing all instances in the relation set $R$ as $D_R$ and the NOTA rate as $\alpha$, a fraction $\alpha$ of the query instances (NOTA queries) are from $D_{test} \setminus D_R$ and a fraction $1 - \alpha$ of the instances are from $D_R$.
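The query sampling just described can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout and the name `sample_query` are hypothetical, each query is drawn as NOTA with probability equal to the NOTA rate (so the rate holds in expectation rather than exactly), and a relation is picked uniformly before an instance is picked within it.

```python
import random
from typing import Dict, List, Tuple

NOTA = "NOTA"


def sample_query(
    test_data: Dict[str, List[str]],  # hypothetical layout: relation -> sentences
    relations: List[str],             # the sampled relation set R
    nota_rate: float,                 # alpha, between 0 and 1
    rng: random.Random,
) -> Tuple[str, str]:
    """Draw one (query, label) pair under a given NOTA rate."""
    if rng.random() < nota_rate:
        # NOTA query: taken from a relation outside R (D_test minus D_R).
        outside = [rel for rel in sorted(test_data) if rel not in relations]
        rel = rng.choice(outside)
        return rng.choice(test_data[rel]), NOTA
    # Regular query: taken from one of the N given relations (D_R).
    rel = rng.choice(relations)
    return rng.choice(test_data[rel]), rel
```

A full sampler would also exclude the supporting instances from the candidate queries; that step is omitted here for brevity.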
Note that during the test phase, all queries are from the test set, though models can sample instances from the training set as supporting instances for the NOTA relation (this method is described explicitly in Section 4). Also note that, to better demonstrate the effects of the NOTA relation, we use the original FewRel dataset for few-shot NOTA instead of the new test set, which removes the influence of domain adaptation.

3 Approaches for Few-Shot DA
Many efforts have been devoted to domain adaptation, such as subspace mapping (Pan et al., 2010; Fernando et al., 2013), finding domain-invariant spaces (Baktashmotlagh et al., 2013; Ganin et al., 2016), feature augmentation (Blitzer et al., 2006) and minimax estimators (Provost and Fawcett, 2001). Among them, adversarial training (Goodfellow et al., 2015; Ganin et al., 2016) has proved effective in finding domain-invariant features. It is a game between an encoder and a discriminator, where the encoder tries to generate domain-invariant features while the discriminator tries to tell which domain the features are from.
Here we follow prior work on adversarial training and use a two-layer perceptron network as the discriminator. While training the few-shot learning task, we feed the sentence encoder E and the discriminator D with corpora from the training domain and the test domain, and optimize the min-max game
$$\min_E \max_D \sum_{x \in C_0} \log [D(E(x))]_0 + \sum_{x \in C_1} \log [D(E(x))]_1,$$
where $[\cdot]_i$ is the i-th element of the vector, $C_0$ is the training corpus and $C_1$ is the test corpus.
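One common way to write such a min-max game is as a domain-classification log-likelihood that the discriminator maximizes and the encoder minimizes. The sketch below assumes that form and only shows how the two opposing losses relate; it takes the discriminator's output probabilities as plain lists, and the function names are illustrative. A real implementation would compute these with a neural network library and alternate gradient updates.

```python
import math
from typing import Dict, Sequence


def domain_objective(
    probs_train: Sequence[Sequence[float]],  # D(E(x)) for x in the training corpus
    probs_test: Sequence[Sequence[float]],   # D(E(x)) for x in the test corpus
) -> float:
    """Summed log-probability that D assigns each feature to its true domain."""
    value = sum(math.log(p[0]) for p in probs_train)  # true domain index 0
    value += sum(math.log(p[1]) for p in probs_test)  # true domain index 1
    return value


def adversarial_losses(
    probs_train: Sequence[Sequence[float]],
    probs_test: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Losses to minimize: D wants a high objective, E wants a low one."""
    value = domain_objective(probs_train, probs_test)
    return {"discriminator": -value, "encoder": value}
```

When the discriminator separates the domains well, the objective is close to zero and the encoder loss is high; when the features are domain-invariant and the discriminator outputs 0.5 for both domains, the objective falls to log 0.5 per instance.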

Approaches for Few-Shot NOTA
A simple way to handle NOTA is to regard it as an extra class in the N-way K-shot setting. To be more specific, we can sample instances outside the N relations as the supporting data of NOTA, and perform (N + 1)-way K-shot learning. Compared with current methods that ignore NOTA, this approach does not bring much improvement, since the supporting data for NOTA actually belong to several different relations and are scattered in the feature space, making it hard to perform classification.
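To make the extra-class idea concrete, here is a small sketch that applies it to a nearest-prototype classifier in the style of Prototypical Networks. It assumes instances are already encoded as plain feature vectors and uses squared Euclidean distance; the helper names are hypothetical and the encoder is left out entirely.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_nota(
    query: Vector,
    support: Dict[str, List[Vector]],  # relation -> K encoded supporting instances
    nota_support: List[Vector],        # encoded instances from outside the N relations
) -> str:
    """(N + 1)-way classification: NOTA gets a prototype like any other class."""
    prototypes = {rel: mean_vector(vecs) for rel, vecs in support.items()}
    prototypes["NOTA"] = mean_vector(nota_support)
    return min(prototypes, key=lambda rel: squared_distance(query, prototypes[rel]))
```

The weakness noted above is visible even in this toy form: if the NOTA supporting vectors come from unrelated relations, for example `[-10, 0]` and `[10, 0]`, their mean `[0, 0]` describes neither of them and may sit right next to a genuine relation's prototype.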
To better address few-shot NOTA, we propose a model named BERT-PAIR based on the sequence classification model in BERT (Devlin et al., 2019). We pair each query instance with every supporting instance, concatenate each pair into one sequence, and feed the concatenated sequence to the BERT sequence classification model to score whether the two instances express the same relation. Denote the BERT model as $B$, the query instance as $x$ and the paired supporting instance as $x_r^j$ (the j-th supporting instance for the relation r). $B(x, x_r^j)$ outputs a two-element vector whose elements are the scores of the pair sharing the same relation and not sharing the same relation, respectively. The probability over each relation in the few-shot scenario, including NOTA, is computed as
$$p(y = r \mid x) = \frac{\exp(o_r)}{\sum_{r' \in R} \exp(o_{r'})},$$
where y is the predicted label and $R = \{r_1, \ldots, r_N, \mathrm{NOTA}\}$ is the relation set including NOTA. For $r \in \{r_1, \ldots, r_N\}$, $o_r$ is calculated by averaging,
$$o_r = \frac{1}{K} \sum_{j=1}^{K} [B(x, x_r^j)]_0.$$
The score for NOTA, $o_{\mathrm{NOTA}}$, is calculated by the equation
$$o_{\mathrm{NOTA}} = \min_{r \in \{r_1, \ldots, r_N\}} \frac{1}{K} \sum_{j=1}^{K} [B(x, x_r^j)]_1.$$
Then we can treat NOTA the same as other relations and optimize the model with the cross-entropy loss, which is commonly used in few-shot learning and other classification tasks.
Table 2: Accuracies (%) on few-shot DA. "On 1.0" represents the results on the original FewRel dataset and "On 2.0" represents the results on the new test set. The models with "-ADV" use adversarial training described in Section 3.
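The pairwise scoring scheme can be sketched without BERT by passing in any function that maps a (query, supporting instance) pair to a two-element score. In the sketch below, `score_pair` stands in for the BERT sequence classifier and `toy_scorer` is a purely hypothetical word-overlap substitute. Taking the NOTA logit as the minimum over relations of the averaged "not the same relation" score is an assumption of this example, as is the softmax over the resulting logits.

```python
import math
from typing import Callable, Dict, List, Tuple

# (score for "same relation", score for "not the same relation")
PairScorer = Callable[[str, str], Tuple[float, float]]


def relation_probabilities(
    query: str,
    support: Dict[str, List[str]],  # relation -> K supporting sentences
    score_pair: PairScorer,
) -> Dict[str, float]:
    """Average pair scores per relation, add a NOTA logit, apply softmax."""
    same: Dict[str, float] = {}
    different: Dict[str, float] = {}
    for rel, instances in support.items():
        scores = [score_pair(query, inst) for inst in instances]
        same[rel] = sum(s[0] for s in scores) / len(scores)
        different[rel] = sum(s[1] for s in scores) / len(scores)
    logits = dict(same)
    # Assumption: NOTA is only as likely as the weakest "not the same" evidence.
    logits["NOTA"] = min(different.values())
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {rel: math.exp(value - peak) for rel, value in logits.items()}
    total = sum(exps.values())
    return {rel: value / total for rel, value in exps.items()}


def toy_scorer(query: str, instance: str) -> Tuple[float, float]:
    """Hypothetical stand-in for the pair classifier: counts shared words."""
    overlap = float(len(set(query.split()) & set(instance.split())))
    return overlap, 1.0 - overlap
```

With this design a query is labeled NOTA only when it fails to match every given relation, because a single strong match drives the minimum "not the same" score down.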

Baseline Models for Few-Shot Learning
We pick the two best models from previously reported results, GNN (Satorras and Estrach, 2018) and Prototypical Networks (Snell et al., 2017), as our baseline models. As for the encoders, besides the CNN encoder used in prior work, we also adopt BERT, since it achieves state-of-the-art results on multiple tasks (Devlin et al., 2019). For all models and encoders, we follow the parameter settings from prior work and Devlin et al. (2019).

Evaluation Results on Few-Shot DA
Table 2 reports the few-shot DA results of the baseline models and of the BERT-PAIR model in Section 4. We get three observations from the results:
(1) All few-shot models suffer dramatic drops in performance when tested on a different domain.
(2) Adversarial training does improve the results on the new test domain, yet considerable room for improvement remains.
(3) BERT-PAIR outperforms all other few-shot models on both the 1.0 and 2.0 test sets.
In addition, to estimate how much room for improvement remains, we move 10 relations (1,000 instances) from the 2.0 test set into the training set, then train and evaluate BERT-PAIR on the new split. We get 72.30% for 5-way 1-shot and 80.50% for 5-way 5-shot, 16 and 13 points higher than the current best results. That only 1,000 additional training instances produce such a large gap indicates that there is still substantial room for improvement.

Evaluation Results on Few-Shot NOTA
We evaluate Prototypical Networks with the naive NOTA solution described in Section 4 and BERT-PAIR under the NOTA setting. All models are trained with a 50% NOTA rate and tested under four different NOTA rates: 0%, 15%, 30% and 50%.
To show how accuracy falls when the NOTA relation is ignored, we also report results for models that do not consider NOTA (marked with * in Figure 2). The evaluation results are shown in Figure 2; for detailed numbers on few-shot NOTA, please refer to Table 3. From Figure 2 we can conclude that:
(1) Treating NOTA as the (N + 1)-th relation helps with few-shot NOTA, though accuracy still falls quickly as the NOTA rate increases.
(2) BERT-PAIR works better under the NOTA setting thanks to its binary-classification formulation, and stays stable as the NOTA rate rises.
(3) Though BERT-PAIR achieves promising results, large gaps still exist between the conventional (0% NOTA rate) and NOTA settings (gaps of 8 points for 5-way 1-shot and 7 points for 5-way 5-shot with 50% NOTA rate), which calls for further research to address the challenge.

Conclusion
In this paper, we propose FewRel 2.0, a more challenging few-shot relation classification task with a new test set from the biomedical domain and the none-of-the-above setting. The purpose of the new task is to explore two aspects that are ignored in previous work: few-shot domain adaptation (few-shot DA) and few-shot none-of-the-above detection (few-shot NOTA). Extensive experiments demonstrate that existing state-of-the-art few-shot models struggle on the new task. We also point out some possible directions for handling these two issues, implement several new models and evaluate them on the new task. Though they achieve promising improvements, these commonly used techniques are still not satisfactory solutions for few-shot DA and few-shot NOTA, and both real-world challenges require further exploration.