FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation

We present a Few-Shot Relation Classification Dataset (FewRel), consisting of 70,000 sentences on 100 relations derived from Wikipedia and annotated by crowdworkers. The relation of each sentence is first recognized by distant supervision methods, and then filtered by crowdworkers. We adapt the most recent state-of-the-art few-shot learning methods for relation classification and conduct a thorough evaluation of these methods. Empirical results show that even the most competitive few-shot learning models struggle on this task, especially as compared with humans. We also show that a range of different reasoning skills are needed to solve our task. These results indicate that few-shot relation classification remains an open problem and still requires further research. Our detailed analysis points to multiple directions for future research.


Introduction
Relation classification (RC) is an important task in NLP, aiming to determine the correct relation between two entities in a given sentence. Many works have been proposed for this task, including kernel methods (Zelenko et al., 2002; Mooney and Bunescu, 2006), embedding methods (Gormley et al., 2015), and neural methods (Zeng et al., 2014). The performance of these conventional models depends heavily on annotated data that is time-consuming and labor-intensive to produce, which makes it hard for them to generalize well. Adopting distant supervision is a primary approach to alleviating this problem for RC (Mintz et al.; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al.; Lin et al., 2016), which heuristically aligns knowledge bases (KBs) and text to automatically annotate adequate amounts of training instances. We evaluate the model proposed by Lin et al. (2016), which is followed by the recent state-of-the-art methods (Zeng et al., 2017; Ji et al., 2017; Wu et al., 2017; Feng et al., 2018; Zeng et al., 2018), on the benchmark dataset NYT-10 (Riedel et al., 2010). Though it achieves promising results on common relations, the performance on a relation drops dramatically when its number of training instances decreases. About 58% of the relations in NYT-10 are long-tail, with fewer than 100 instances. Furthermore, distant supervision suffers from the wrong labeling problem, which makes it harder to classify long-tail relations. Hence, it is necessary to study training RC models with insufficient training instances.

* The first four authors contribute equally. The order is determined by dice rolling.
† Z. Wang is now at New York University.
‡ Correspondence author.

Table 1
Supporting Set
(A) capital of
    (1) London is the capital of the U.K.
    (2) Washington is the capital of the U.S.A.
(B) member of
    (1) Newton served as the president of the Royal Society.
    (2) Leibniz was a member of the Prussian Academy of Sciences.
(C) birth name
    (1) Samuel Langhorne Clemens, better known by his pen name Mark Twain, was an American writer.
    (2) Alexei Maximovich Peshkov, primarily known as Maxim Gorky, was a Russian and Soviet writer.
Test Instance
(A) or (B) or (C)
    Euler was elected a foreign member of the Royal Swedish Academy of Sciences.

We formulate RC as a few-shot learning task in this paper, which requires models capable of handling classification tasks with only a handful of training instances, as shown in Table 1. Many efforts have been devoted to few-shot learning. Early works (Caruana, 1995; Bengio, 2012; Donahue et al., 2014) apply transfer learning methods to fine-tune models pre-trained on common classes containing adequate instances to uncommon classes with only a few instances. Then metric learning methods (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017) were proposed to learn the distance distributions among classes, so that similar classes are adjacent in the distance space. The metric methods also take advantage of nonparametric estimation to make models efficient and general. Recently, the idea of meta-learning has been proposed, which encourages models to learn fast-learning abilities from previous experience and rapidly generalize to new concepts. Many meta-learning models (Ravi and Larochelle, 2017; Santoro et al., 2016; Finn et al., 2017; Munkhdalai and Yu, 2017) achieve state-of-the-art results on several few-shot benchmarks.
Though meta-learning methods develop fast, most of these works evaluate on two popular datasets, Omniglot (Lake et al., 2015) and mini-ImageNet (Vinyals et al., 2016). Both datasets concentrate on image classification. Many works in NLP mainly focus on the zero-shot/semi-supervised scenario (Xie et al., 2016; Ma et al., 2016; Carlson et al., 2009), which incorporates extra information to classify objects never appearing in the training sets. However, the few-shot scenario needs models to classify objects with few instances without any extra information. Recently, Yu et al. (2018) proposed a multi-metric method for few-shot text classification. However, systematic research on adopting few-shot learning for NLP tasks is still lacking. We propose FewRel: a new large-scale supervised Few-shot Relation Classification dataset.
To address the wrong labeling problem in most distantly supervised RC datasets, we apply crowd-sourcing to manually remove the noise [i]. Besides constructing the dataset, we systematically implement the most recent state-of-the-art few-shot learning methods and adapt them for RC. We conduct a detailed evaluation of all these models on our dataset. Though the state-of-the-art few-shot learning methods achieve much lower results than humans on our challenging dataset, they significantly outperform the vanilla RC models, indicating that incorporating few-shot learning is promising and needs further research.
[i] Many previous works, such as Roth et al. (2013), Luo et al. (2017), and Xin et al. (2018), have worked on automatically removing noise from distant supervision. Instead, we use crowd-sourcing methods to achieve a high accuracy.
In summary, our contribution is three-fold:
(1) We formulate RC as a few-shot learning task, and propose a new large supervised few-shot RC dataset.
(2) We systematically adapt the most recent state-of-the-art few-shot learning methods for RC, which may further benefit other NLP tasks.
(3) We conduct a comprehensive evaluation of few-shot learning methods on our dataset, which indicates some promising research directions for RC.

FewRel Dataset
In this section, we describe the process of creating FewRel in detail. The whole procedure can be divided into two steps: (1) We create a large candidate set of sentences aligned to relations via distant supervision. (2) We ask human annotators to filter out the wrongly labeled sentences for each relation to obtain a clean RC dataset.

Distant Supervision
For the first step, we use Wikipedia as the corpus [ii] and Wikidata as the KB. Wikidata is a large-scale KB where many entities are already linked to Wikipedia articles. The articles in Wikipedia also contain anchors linking to each other. Thus it is convenient to align sentences in Wikipedia articles to KB facts in Wikidata. We also employ an entity linking technique to extract more unanchored entities in articles. We first adopt named entity recognition via spaCy [iii] to find possible entity mentions, then match each mention with the name of an entity in the KB, and link the mention to the entity if successfully matched.
For each sentence s in Wikipedia articles containing head and tail entities e_1 and e_2, if there exists a Wikidata statement (e_1, e_2, r) meaning e_1 and e_2 have the relation r, we denote the (s, e_1, e_2, r) tuple as an instance and add it to the candidate set. Empirically, many instances of a given relation contain the same entity pair. For such relations, classifiers may prefer memorizing the entity pairs in the training instances rather than grasping the sentence semantics. Therefore, in the candidate set of each relation, we only keep 1 instance for each unique entity pair. Finally, we remove relations with fewer than 1000 instances, and randomly keep 1000 instances for each of the remaining relations. As a result, we get a candidate set of 122 relations and 122,000 instances.
[ii] We use whole Wikipedia articles as the corpus, not just the first sentence.
[iii] https://spacy.io/
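The filtering steps described above can be illustrated with a minimal sketch. It assumes instances are already aligned (sentence, head, tail, relation) tuples; the function name `build_candidate_set` and the choice to keep the first instance seen for each entity pair are our assumptions for illustration, since the text does not say which instance is kept.

```python
import random
from collections import defaultdict


def build_candidate_set(instances, per_relation=1000, seed=0):
    """Filter distantly supervised instances into a candidate set.

    instances: iterable of (sentence, head, tail, relation) tuples.
    Returns a dict mapping each kept relation to `per_relation` instances.
    """
    rng = random.Random(seed)
    by_relation = defaultdict(dict)
    for sentence, head, tail, relation in instances:
        # Keep one instance per unique entity pair (assumption: the first seen).
        by_relation[relation].setdefault(
            (head, tail), (sentence, head, tail, relation)
        )

    candidates = {}
    for relation, pairs in by_relation.items():
        unique = list(pairs.values())
        # Remove relations with fewer than `per_relation` instances.
        if len(unique) < per_relation:
            continue
        # Randomly keep exactly `per_relation` instances for the rest.
        candidates[relation] = rng.sample(unique, per_relation)
    return candidates
```

With `per_relation=1000`, this mirrors the thresholds stated in the text; smaller values are convenient for trying the function on toy data.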

Human Annotation
Next, we invite some well-educated annotators to filter the raw data on a platform similar to Amazon MTurk that we developed ourselves. The platform presents each annotator with one instance at a time, showing the sentence, the two entities in the sentence, and the corresponding relation labeled by distant supervision. The platform also provides the names of the entities and the relation in Wikidata, accompanied by the detailed description of that relation. The annotator is then asked to judge whether the relation could be deduced only from the sentence semantics. We also ask the annotator to mark an instance as negative if the sentence is not complete, or the mention is falsely linked with the entity.
Relations are randomly assigned to annotators from the candidate set, and each annotator consecutively annotates 20 instances of the same relation before switching to the next relation. To ensure the labeling quality, each instance is labeled by at least two annotators. If the two annotators disagree on an instance, it is assigned to a third annotator. As a result, each instance has at least two identical annotations, which form the final decision. After the annotation, we remove relations with fewer than 700 positive instances. For the remaining 105 relations, we calculate the inter-annotator agreement for each relation using the free-marginal multirater kappa (Randolph, 2005), and keep the top 100 relations.
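As an illustration of the aggregation rule and the agreement statistic, the sketch below resolves the final label of an instance and computes a free-marginal kappa. The function names are hypothetical, and the kappa routine assumes every instance has the same number of ratings; the text does not say how third annotations enter the agreement computation.

```python
from collections import Counter


def final_label(first, second, third=None):
    """Resolve the final binary label of one instance."""
    if first == second:
        return first
    if third is None:
        raise ValueError("a disagreement needs a third annotation")
    # With binary labels the third vote always matches one of the first two.
    return third


def free_marginal_kappa(ratings, num_categories=2):
    """Free-marginal multirater kappa.

    ratings: list of per-instance label lists, each of the same length n >= 2.
    """
    n = len(ratings[0])
    if n < 2 or any(len(labels) != n for labels in ratings):
        raise ValueError("every instance needs the same number (>= 2) of ratings")
    agreement = 0.0
    for labels in ratings:
        counts = Counter(labels)
        # Fraction of rater pairs that agree on this instance.
        agreement += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_observed = agreement / len(ratings)
    p_chance = 1.0 / num_categories  # free marginals: uniform chance agreement
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For example, four instances rated by two annotators with one disagreement give an observed agreement of 0.75 and a kappa of 0.5 for binary labels.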

Dataset Statistics
The final FewRel dataset consists of 100 relations, each with 700 instances. A full list of relations, including their names and descriptions, is provided in Appendix A.2. The average number of tokens in each sentence is 24.99, and there are 124,577 unique tokens in total. Following recent meta-learning tasks (Vinyals et al., 2016), which use separate sets of classes for training and testing, we use 64, 16, and 20 relations for training, validation, and testing respectively. Table 2 provides a comparison of our FewRel dataset to two other popular few-shot classification datasets, Omniglot and mini-ImageNet. [...] (Strassel et al., 2008), TACRED dataset (Zhang et al., 2017), and NYT-10 dataset (Riedel et al., 2010). While some RC datasets contain instances with no relations (negative), we ignore such instances for comparison.
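A class-disjoint split of this kind can be sketched as follows. The random shuffle and the function name `split_relations` are assumptions made for illustration; the text only states the sizes of the three relation sets, not how relations were assigned to them.

```python
import random


def split_relations(relations, sizes=(64, 16, 20), seed=0):
    """Split relation ids into disjoint train / validation / test sets."""
    relations = list(relations)
    if sum(sizes) != len(relations):
        raise ValueError("split sizes must add up to the number of relations")
    rng = random.Random(seed)
    rng.shuffle(relations)  # assumption: relations are assigned at random
    n_train, n_val, _ = sizes
    train = relations[:n_train]
    val = relations[n_train:n_train + n_val]
    test = relations[n_train + n_val:]
    return train, val, test
```

Because the split is over relations rather than sentences, no relation seen during training appears at validation or test time.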

Experiments
We conduct comprehensive evaluations of vanilla RC models with simple strategies such as finetune or kNN on our new dataset. We also evaluate the recent state-of-the-art few-shot learning methods.
For relation classification, a data instance x_i^j is a sentence accompanied by a pair of entities. The query data x is an unlabeled instance to classify, and y ∈ R is the prediction of x given by F, where R is the set of candidate relations and F is the classifier. In recent research on few-shot learning, the N way K shot setting is widely adopted. We follow this setting for the few-shot relation classification problem. To be exact, for N way K shot learning, the support set contains N relations with K labeled instances each.
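To make the N way K shot setting concrete, the following minimal sketch samples one evaluation episode. The function name `sample_episode`, the `n_query` parameter, and the dictionary layout of `data` are hypothetical choices for illustration.

```python
import random


def sample_episode(data, n_way, k_shot, n_query=1, rng=None):
    """Sample one N way K shot episode.

    data: dict mapping a relation name to a list of its instances.
    Returns (relations, support, query); support and query are lists of
    (instance, label) pairs, where label indexes into `relations`.
    """
    rng = rng or random.Random()
    relations = rng.sample(sorted(data), n_way)
    support, query = [], []
    for label, relation in enumerate(relations):
        # Draw support and query instances without overlap.
        picked = rng.sample(data[relation], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return relations, support, query
```

For a 5 way 1 shot task, `sample_episode(data, 5, 1)` yields five support instances, one per relation, plus one query instance per relation.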

Experiment Settings
We consider four types of few-shot tasks in our experiments: 5 way 1 shot, 5 way 5 shot, 10 way 1 shot, and 10 way 5 shot. Under this setting, we evaluate different few-shot training strategies and state-of-the-art few-shot learning methods built upon two widely used instance encoders, CNN (Zeng et al., 2014) and PCNN (Zeng et al., 2015). For both CNN and PCNN, the sentence is first converted into input vectors by mapping each word to the concatenation of its word embedding and position embeddings. In CNN, the input vectors pass through a convolution layer, a max-pooling layer, and a non-linear activation layer to produce the final sentence embedding. PCNN is a variant of CNN that replaces the max-pooling operation with a piecewise max-pooling operation.
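The two encoders can be sketched in NumPy as below. This is an illustrative sketch rather than the evaluated implementation: the tanh activation, zero padding, treating each entity as a single token index, and the exact segment boundaries for piecewise pooling are all assumptions, and every function name is hypothetical.

```python
import numpy as np


def relative_positions(length, entity_index, max_dist):
    """Clipped token offsets from an entity, shifted to be non-negative."""
    offsets = np.arange(length) - entity_index
    return np.clip(offsets, -max_dist, max_dist) + max_dist


def embed(tokens, head, tail, word_emb, pos1_emb, pos2_emb):
    """Concatenate word and two position embeddings for each token."""
    tokens = np.asarray(tokens)
    max_dist = (pos1_emb.shape[0] - 1) // 2
    p1 = relative_positions(len(tokens), head, max_dist)
    p2 = relative_positions(len(tokens), tail, max_dist)
    return np.concatenate([word_emb[tokens], pos1_emb[p1], pos2_emb[p2]], axis=1)


def conv1d(x, kernel, bias):
    """x: (T, D); kernel: (W, D, H); zero padding keeps the length T."""
    width = kernel.shape[0]
    pad = width // 2
    padded = np.pad(x, ((pad, width - 1 - pad), (0, 0)))
    # windows[t, w] is the padded input at position t + w.
    windows = np.stack([padded[i:i + x.shape[0]] for i in range(width)], axis=1)
    return np.einsum("twd,wdh->th", windows, kernel) + bias


def cnn_encode(x, kernel, bias):
    """Convolution, max-pooling over time, then a non-linearity (tanh assumed)."""
    return np.tanh(conv1d(x, kernel, bias).max(axis=0))


def pcnn_encode(x, kernel, bias, head, tail):
    """Piecewise max-pooling over the three segments split by the entities."""
    feats = conv1d(x, kernel, bias)
    first, second = sorted((head, tail))
    pieces = [feats[:first + 1], feats[first + 1:second + 1], feats[second + 1:]]
    # An empty segment contributes zeros (assumption).
    pooled = [p.max(axis=0) if len(p) else np.zeros(feats.shape[1]) for p in pieces]
    return np.tanh(np.concatenate(pooled))
```

With H convolution filters, `cnn_encode` returns an H-dimensional sentence embedding, while `pcnn_encode` returns 3H dimensions, one block per segment.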
To evaluate these two vanilla models on the few-shot RC task, we first consider two training strategies, namely Finetune and kNN. The Finetune baseline learns to classify all relations on the training set with CNN/PCNN, and then tunes parameters on the support set. We only tune the parameters of the output layer, and keep other parameters unchanged. The kNN baseline also jointly classifies all relations during training, while at test time it uses the neural network to embed all the instances and then adopts k-nearest-neighbor (kNN) to classify the test instances.
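The test-time step of the kNN baseline can be sketched as follows, assuming the support and query instances have already been embedded by the trained encoder. Euclidean distance and the default `k=1` are assumptions, since the text does not state the distance or the value of k; `knn_predict` is a hypothetical name.

```python
from collections import Counter

import numpy as np


def knn_predict(support_emb, support_labels, query_emb, k=1):
    """Classify one query embedding by majority vote of its k nearest supports.

    support_emb: (S, H) array; support_labels: length-S sequence; query_emb: (H,).
    """
    dists = np.linalg.norm(support_emb - query_emb, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    votes = Counter(support_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In the 1 shot settings there is a single support instance per relation, so `k=1` reduces to picking the relation of the closest support embedding.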
By adapting them to relation classification, we also evaluate four recently proposed few-shot learning methods, including Meta Network (Munkhdalai and Yu, 2017), GNN (Satorras and Estrach, 2018), SNAIL (Mishra et al., 2018), and Prototypical Network (Snell et al., 2017). We briefly describe these baselines in Sec. 3.3; readers familiar with these methods can safely skip that subsection. The hyperparameters of each model are selected via grid search against the validation set.
Human performance is also evaluated under the 5 way 1 shot and 10 way 1 shot settings. A human labeler is given 5/10 instances from different relations and one extra test instance, and is asked to decide which relation the test instance belongs to. Note that these labelers are not provided with the names of the relations or any extra information. Since the 5 way 5 shot and 10 way 5 shot settings are easier, we only evaluate human performance on 5 way 1 shot and 10 way 1 shot.

Baselines of Few-shot Learning Models
Meta Network Meta Network (Munkhdalai and Yu, 2017) is a meta learning algorithm utilizing a high level meta learner on top of the traditional classification model, or base learner, to supervise the training process. The weights of base learner are divided into two groups, fast weights and slow weights. Fast weights are generated by the meta learner, whereas slow weights are simply updated by minimizing classification loss. The fast weights are expected to help the model generalize to new tasks with very few training instances.
GNN GNN (Satorras and Estrach, 2018) tackles the few-shot learning problem by considering each supporting instance or query instance as a node in the graph. For those instances in the support sets, label information is also embedded into the corresponding node representations. Graph neural networks are then employed to propagate the information between nodes. A query instance is expected to receive information from support sets in order to make the classification. In our adaptation, while the instances are encoded by CNNs, labels are represented by one-hot encoding.
SNAIL SNAIL (Mishra et al., 2018) is a meta learning model that utilizes temporal convolutional neural networks and attention modules for fast learning from past experience. SNAIL arranges all the supporting instance-label pairs into a sequence and appends the query instance behind them. Such an order agrees with the temporal order of the learning process, where we learn information by reading supporting instances before making predictions for unlabeled instances. Temporal convolution (a 1-D convolution) is then performed along the sequence to aggregate information across different time steps, and a causally masked attention model is used over the sequence to aggregate useful information from earlier instances to later ones.
Prototypical Networks Prototypical Network (Snell et al., 2017) is a few-shot classification model based on the assumption that for each class there exists a prototype. The model tries to find the prototypes for classes from supporting instances, and compares the distance between the query instance and each prototype under a certain distance metric. Prototypical Network learns an embedding function u to embed each class's instances, and computes each prototype by averaging over all the output embeddings of instances in the support set S that are labeled with the corresponding class.
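The prototype computation and nearest-prototype classification can be sketched as below, operating on embeddings already produced by the embedding function u. The squared Euclidean distance and the softmax over negative distances follow Snell et al. (2017); the function names are hypothetical, and this is a sketch of the inference step only, not the training procedure.

```python
import numpy as np


def compute_prototypes(support_emb, support_labels, n_way):
    """Average the support embeddings of each class into one prototype."""
    support_labels = np.asarray(support_labels)
    return np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in range(n_way)]
    )


def classify(query_emb, prototypes):
    """Return predicted classes and class probabilities for each query.

    query_emb: (Q, H); prototypes: (N, H).
    """
    diffs = query_emb[:, None, :] - prototypes[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)  # squared Euclidean distance
    logits = -sq_dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

In the 1 shot settings each prototype is simply the embedding of the single support instance of its relation.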

Result Analysis and Future Work
We report evaluation results in Table 4. In our preliminary experiments, PCNN with few-shot learning methods performs 3-10 percentage points worse than CNN; therefore only CNN results are shown in our experimental results. From the results, we observe that integrating few-shot learning methods into CNN significantly outperforms CNN/PCNN with Finetune or kNN, which means adapting few-shot learning methods for RC is promising. However, there are still huge gaps between their performance and that of humans, which means our dataset is a challenging testbed for both relation classification and few-shot learning.
Table 5: Examples from relation "educated at". Different colors indicate different entities, blue for head entity, and red for tail entity.
Sentence | Reasoning
Chris Bohjalian graduated from Amherst College Summa Cum Laude, where he was a member of the Phi Beta Kappa Society. | Simple Pattern
James Alty obtained a 1st class honours (Physics) at Liverpool University. | Commonsense Reasoning
He was a professor at Reed College, where he taught Steve Jobs, and replaced Lloyd J. Reynolds as the head of the calligraphy program. | Logical Reasoning
He and Cesare Borgia were thought to be close friends since childhood, going on to accompany one another during their studies at the University of Pisa. | Coreference Reasoning

In this paper, we propose a new large and high-quality dataset, FewRel, for the few-shot relation classification task. This dataset provides a new point of view for RC, and also a new benchmark for few-shot learning. Through the evaluation of different few-shot learning methods, we find that even the best model performs much worse than humans, which suggests there is still much room for few-shot learning methods to improve.
The most challenging characteristic of our dataset is the diversity in expressing the same relation. We provide some examples from FewRel in Table 5, showing different reasoning modes needed for classifying some instances. Future research may consider incorporating commonsense knowledge or improved causal modules.