Learning to Decouple Relations: Few-Shot Relation Classification with Entity-Guided Attention and Confusion-Aware Training

This paper aims to enhance few-shot relation classification, especially for sentences that jointly describe multiple relations. Because some relations frequently co-occur in the same context, previous few-shot relation classifiers struggle to distinguish them with few annotated instances. To alleviate this relation confusion problem, we propose CTEG, a model equipped with two novel mechanisms to learn to decouple these easily-confused relations. On the one hand, an Entity-Guided Attention (EGA) mechanism, which leverages the syntactic relations and relative positions between each word and the specified entity pair, is introduced to guide the attention to filter out information causing confusion. On the other hand, a Confusion-Aware Training (CAT) method is proposed to explicitly learn to distinguish relations by playing a pushing-away game between classifying a sentence into its true relation and its confusing relation. Extensive experiments are conducted on the FewRel dataset, and the results show that our proposed model achieves results comparable to, and even much better than, strong baselines in terms of accuracy. Furthermore, an ablation test and a case study verify the effectiveness of the proposed EGA and CAT, especially in addressing the relation confusion problem.


Introduction
Relation classification (RC) aims to identify the relation between two specified entities in a sentence. Previous supervised approaches to this task heavily depend on human-annotated data, which limits their performance on relations with insufficient instances. Therefore, making RC models capable of identifying relations with few training instances becomes a crucial challenge. Inspired by the success of few-shot learning methods in the computer vision community (Vinyals et al., 2016; Sung et al., 2017; Santoro et al., 2016) and in other natural language processing tasks (Chen et al., 2016; Qin et al., 2020), Han et al. (2018) first introduce few-shot learning to the RC task and propose the FewRel dataset. Recently, many works have focused on this task and achieved remarkable performance (Gao et al., 2019a; Snell et al., 2017; Ye and Ling, 2019).
Previous few-shot relation classifiers perform well on sentences with only one relation of a single entity pair. However, in real natural language, a sentence usually jointly describes multiple relations of different entity pairs. Since these relations frequently co-occur in the same context, previous few-shot RC models struggle to distinguish them with few annotated instances. For example, Table 1 shows three instances from the FewRel dataset, where each sentence describes multiple relations, with the corresponding keyphrases highlighted (colored) as evidence. When two entities (bold black) in the sentence are specified, there is a great chance that the instance is incorrectly categorized into a confusing relation (red) instead of the true relation (blue). Specifically, the first instance should be categorized into the true relation 'parents-child' based on the given entity pair and the natural language (NL) expression 'a daughter of'. However, since the sentence also includes the NL expression 'his wife', it is probably misclassified into the confusing relation 'husband-wife'. In this paper, we name this the relation confusion problem.
True Relation | Confusing Relation | Instance
parents-child | husband-wife | She was a daughter of prince Wilhelm of Baden and his wife princess Maria of Lichtenberg, as well as an elder sister of prince Maximilian.
husband-wife | uncle-nephew | He was the youngest son of Prescott Sheldon Bush and his wife Dorothy Walker Bush, and the uncle of former president George W. Bush.
uncle-nephew | parents-child | Snowdon is the son of princess Margaret, countess of Snowdon, and the 1st earl of Snowdon, thus he is the nephew of Queen Elizabeth II.

Table 1: Example sentences containing confusing relations. The specified entities are marked in bold italics. The blue and red words correspond to the true and confusing relations, respectively.
To address the relation confusion problem, it is crucial for a model to be aware of which NL expressions cause confusion and to learn to avoid mapping the instance into its easily-confused relation. From these perspectives, we propose two assumptions. Firstly, in a sentence, words that are highly relevant to the given entities are more important in expressing the true relation. Secondly, explicitly learning to map an instance into its confusing relation with augmented data in turn boosts a few-shot RC model in identifying the true relation. Based on these assumptions, we propose CTEG, a few-shot RC model with two novel mechanisms: (1) An Entity-Guided Attention (EGA) encoder, which leverages the syntactic relations and relative positions between each word and the specified entity pair to softly select the information of words expressing the true relation and filter out the information causing confusion.
(2) A Confusion-Aware Training (CAT) method, which explicitly learns to distinguish relations by playing a pushing-away game between classifying a sentence into its true relation and its confusing relation. In addition, inspired by the success of pre-trained language models, our approach is based on BERT (Devlin et al., 2018), which has proven effective especially for few-shot learning tasks.
Specifically, the backbone of our model's encoder is a transformer equipped with the proposed EGA, which guides the calculation of self-attention distributions by weighting the attention logits with entity-guided gates. The gates measure the relevance between each word and the two given entities. Two types of information are used to calculate each word's gate. One is the relative position (Zeng et al., 2015a), i.e., the relative distance between a word and an entity in the input sequence. The other is the syntactic relation proposed in this paper, defined as the dependency relations between each word and the entities. Based on this information, the entity-guided gates in EGA are able to select the important words and control the contribution of each word in self-attention.
We also propose CAT to explicitly force the model to asynchronously learn to classify an instance into its true relation and into its confusing relation. After each training step, CAT first selects the misclassified sentences and regards the relations they are misclassified into as their confusing relations. It then uses these misclassified instances and their confusing relations as augmented data for an additional training process, which learns the mapping from these instances to their confusing relations. Afterwards, CAT adopts the KL divergence (Kullback and Leibler, 1951) to teach the model to distinguish the true relation from the confusing one, so that true relation classification benefits from confusing relation identification.
The contributions of this paper are summarized as follows: (1) We propose an Entity-Guided Attention encoder, which can select crucial words and filter out NL expressions causing confusion based on their relevance to the specified entities. (2) We propose a Confusion-Aware Training process to enhance the model with the ability to distinguish true and confusing relations. (3) We conduct extensive experiments on the few-shot RC dataset FewRel, and the results show that our model achieves results comparable to, and even much better than, strong baselines. Furthermore, ablation and case studies verify the effectiveness of the proposed EGA and CAT, especially in addressing the relation confusion problem.

EGA: Entity-Guided Attention Encoder
The inputs of our model are a sentence S = (w_1, ..., w_n) with n words, and two pairs of integers s_1 = (l_1, r_1) and s_2 = (l_2, r_2) representing the start and end positions of the two specified entities. Firstly, we convert the words into a sequence of vectors e^w_1, ..., e^w_n using an embedding layer initialized from BERT. We then use two types of relevance information, i.e., the relative position and the syntactic relation between each word and the specified entity pair, to construct entity-guided gates for information selection.
Relative Position. Relative position information is widely used in the relation classification task (Zeng et al., 2015a). It is defined as the relative distances pos_{i,1} and pos_{i,2} from the current word w_i to the two specified entities in the sentence, and is represented as an embedding e^pos_i via a lookup operation.

Syntactic Relation. In addition to the relative position, we further introduce syntactic relations to measure the relevance between each word and the specified entities. The syntactic relations are derived from dependency parse trees obtained with the Stanford Parser 1. For example, Figure 2(a) shows the original dependency tree of the sentence "Chen-chun-chang is a mathematician who works in model-theory", where "Chen-chun-chang" and "model-theory" are the entities. In this paper, we assume that words directly connected to the given entities are more important in expressing the true relations. Therefore, dependency relations that connect the specified entities to other words are retained and the others are discarded, which yields a pruned dependency tree, as shown in Figure 2(b). Based on the pruned dependency tree, each word in the sentence is assigned two tags t_i = (t_{i,1}, t_{i,2}) as the proposed syntactic relations. Taking the tag t_{i,1} of word w_i, which corresponds to the first entity, as an example: if w_i is part of the first entity, t_{i,1} is assigned the value 'self'; if w_i is directly connected to the first entity in the dependency tree, t_{i,1} is assigned the dependency relation, e.g., 'nmod'; and if w_i is neither connected to nor part of the first entity, t_{i,1} is assigned 'other'. Following this strategy, the syntactic relations of the sentence in Figure 2 are shown in Table 2.
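The tag-assignment rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the representation of dependency edges as (head, dependent, relation) index triples, and the toy sentence and parse, are assumptions for the example.

```python
def syntactic_tags(n_words, entity_span, dep_edges):
    """Assign one syntactic-relation tag per word w.r.t. a single entity.

    entity_span: (l, r) inclusive word indices of the entity.
    dep_edges:   list of (head, dependent, relation) triples from a parser.
    Tagging rule from the paper: 'self' for words inside the entity,
    the dependency relation for words directly connected to it in the
    pruned tree, and 'other' for everything else.
    """
    l, r = entity_span
    tags = ["other"] * n_words
    for i in range(l, r + 1):
        tags[i] = "self"
    for head, dep, rel in dep_edges:
        # keep only edges that touch the entity (the "pruned" tree)
        if l <= head <= r and not (l <= dep <= r):
            tags[dep] = rel
        elif l <= dep <= r and not (l <= head <= r):
            tags[head] = rel
    return tags

# toy sentence: "Chen is a mathematician who works in model-theory"
# word indices:   0    1  2  3            4   5     6  7
edges = [(3, 0, "nsubj"), (3, 1, "cop"), (3, 2, "det"),
         (5, 4, "nsubj"), (3, 5, "acl:relcl"), (5, 7, "nmod"), (7, 6, "case")]
# tags w.r.t. the first entity "Chen": word 0 -> 'self', word 3 -> 'nsubj', rest 'other'
print(syntactic_tags(8, (0, 0), edges))
```

Running the same function with the second entity's span, (7, 7), would instead tag words 5 and 6 with their dependency relations, giving the second tag of each pair t_i.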
Finally, the two dependency tags of each word, t_i = (t_{i,1}, t_{i,2}), are converted into continuous vectors via an embedding lookup operation and concatenated into a vector e^syn_i.

Entity-Guided Gate The proposed EGA learns entity-guided gates G = (g_1, ..., g_n) for all words in a given sentence based on the above two types of information. Intuitively, if a word w_i is directly connected to the given entities in the dependency tree, its information tends to be more important. Specifically, the relative position embedding and the syntactic relation embedding are first concatenated into e^p_i = [e^pos_i, e^syn_i], where e^p_i ∈ R^{2d_pos+2d_syn}. We then adopt a transformer encoder (Vaswani et al., 2017) followed by a single-layer feed-forward neural network (FFN) with a sigmoid(·) activation to derive the sequence of entity-guided gates, i.e., g_i = sigmoid(FFN(Transformer(e^p_1, ..., e^p_n)_i)).

Gated Self-Attention A pre-trained transformer encoder with M layers, equipped with the proposed EGA, is used to learn the representation of a sentence. The backbone of each layer is a self-attention layer, which calculates attention weights for word pairs in the sentence. We define the self-attention weights of the t-th layer as Att_t, and the corresponding hidden states of the sentence are represented as H_t.
To obtain the attention weights Att_t, the scaled attention logits are multiplied by the entity-guided gates G with broadcasting, followed by a softmax(·) operation. The gated self-attention and the calculation of H_t can be formalized as Att_t = softmax(((H_{t-1}W_q)(H_{t-1}W_k)^T / sqrt(d_k)) ⊙ G) and H_t = Att_t (H_{t-1}W_v), where W_k, W_q, W_v are trainable parameters.
Finally, the vector s representing the sentence is obtained from H_M (Equation 6), where H_M is the output of the M-th layer of the encoder.
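The gated self-attention step above can be illustrated with a small numpy sketch. This is not the paper's implementation: it shows a single layer with one head, and the hidden states, weights, and gate values are random placeholders; the key point is that the scaled logits are multiplied by the gates of the key words before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(H, Wq, Wk, Wv, g):
    """One gated self-attention layer (sketch).

    H: (n, d) hidden states; g: (n,) entity-guided gates in (0, 1).
    The scaled attention logits are multiplied by the gates of the key
    words (broadcast over query rows) before the softmax, so weakly
    gated words contribute less to every updated representation.
    """
    d = Wq.shape[1]
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    logits = (Q @ K.T) / np.sqrt(d)       # (n, n) scaled dot-product
    att = softmax(logits * g[None, :])    # gate each key column, then normalize
    return att @ V, att

rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
g = np.array([0.9, 0.1, 0.8, 0.05, 0.7])  # gates as produced by the EGA module
out, att = gated_self_attention(H, Wq, Wk, Wv, g)
```

Each row of `att` still sums to one, so the gates reweight rather than remove attention mass; in the full model this operation is applied inside every one of the M BERT layers.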

Classification
The classifier performs N-way-K-shot classification following the few-shot learning paradigm and the prototypical network (Snell et al., 2017). Specifically, for a relation r_j where j ∈ [1, N], K sentences are first sampled from its instances, and these sentences are used to calculate a representation of the relation named the prototype c_j. We define the representations of the K sentences as s^c_{j,1}, ..., s^c_{j,K}, and the prototype is calculated as their average, c_j = (1/K) Σ_{k=1}^{K} s^c_{j,k}. Given the representation s_q of a query sentence and the prototypes (c_1, ..., c_N) of the N relations, the model aims to classify s_q into one of the N candidate relations. We first obtain the distance distribution δ = (δ_1, ..., δ_N) by calculating the Euclidean distance between s_q and each prototype, i.e., δ_j = ||s_q − c_j||_2. Then, according to δ, the sentence is classified into the nearest relation r̂.
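The prototype computation and nearest-prototype classification can be sketched as follows. The sentence vectors here are hypothetical 2-d toy data standing in for the encoder outputs s.

```python
import numpy as np

def prototypes(support):           # support: (N, K, d) sentence representations
    return support.mean(axis=1)    # c_j = mean of the K support vectors

def classify(query, protos):
    """Nearest-prototype classification by Euclidean distance."""
    delta = np.linalg.norm(protos - query[None, :], axis=1)  # (N,) distances
    return int(delta.argmin()), delta

# 3-way 2-shot toy episode with well-separated classes (hypothetical data)
support = np.array([[[0., 0.], [0., 1.]],
                    [[5., 5.], [5., 6.]],
                    [[9., 0.], [9., 1.]]])
protos = prototypes(support)
pred, delta = classify(np.array([5., 5.2]), protos)
print(pred)  # the query lies nearest to class 1's prototype
```

The distance vector `delta` is exactly the distance distribution δ used by the confusion-aware training described below.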
To enable the classifier to learn confusing relations, we further project the distance distribution δ into δ̃ via an FFN with a tanh(·) activation function, i.e., δ̃ = FFN_tanh(δ). The δ̃ is used to predict the confusing relation, defined as r̃, during the confusion-aware training (CAT) stage introduced in Section 2.3.

CAT: Confusion-Aware Training
The confusion-aware training is based on two asynchronous processes: true relation identification and confusing relation identification. When classifying a sentence, the former uses its true relation as the target, and the latter uses its confusing relation as the target. Specifically, given a sentence with true relation r, the training objective of the true relation identification is defined as L = CrossEntropy(OneHot(r), Softmax(δ)).

For the confusing relation identification, we first pick up the misclassified sentences after each training step of true relation identification, and use their prediction results as the targets. Formally, assuming a sentence is misclassified into an incorrect r̃, the objective function of the confusing relation identification L̃ is defined as L̃ = CrossEntropy(OneHot(r̃), Softmax(δ̃)).

Besides, the KL divergence is adopted as another objective, which allows the model to learn to perform confusion decoupling. The KL divergence pushes apart the distance distributions δ and δ̃, with the loss defined as L_kl = −D_KL(Softmax(δ) ‖ Softmax(δ̃)), so that minimizing L_kl maximizes the divergence between the two distributions. Through minimizing L_kl, the model explicitly learns to distinguish relations by playing a pushing-away game between classifying a sentence into its true relation and its confusing relation. In other words, our model learns to explicitly decouple r and r̃ for classification based on the specified entities in a given sentence. It is worth noting that only the misclassified sentences are used for updating the objective L_kl. The final objective function of our model is L_all = L + L̃ + L_kl.
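The three CAT objectives for one misclassified query can be sketched numerically. This is a hedged reconstruction, not the released code: in particular, converting distances to probabilities via softmax over negative distances, and the sign of the KL term, are our assumptions about the omitted formulas.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(onehot, p):
    return -float(np.sum(onehot * np.log(p + 1e-12)))

def kl(p, q):
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

def cat_loss(delta, delta_tilde, true_r, confusing_r, n_rel):
    """Combined confusion-aware objectives for one misclassified query.

    delta:       distance distribution for true-relation prediction
    delta_tilde: its FFN projection for confusing-relation prediction
    Smaller distance = better match, so probabilities use -delta.
    The KL term enters with a minus sign: minimizing it *maximizes*
    the divergence, i.e. the "pushing-away game" of the paper.
    """
    p, p_tilde = softmax(-delta), softmax(-delta_tilde)
    one = np.eye(n_rel)
    L = cross_entropy(one[true_r], p)                    # true relation id.
    L_tilde = cross_entropy(one[confusing_r], p_tilde)   # confusing relation id.
    L_kl = -kl(p, p_tilde)                               # push distributions apart
    return L + L_tilde + L_kl

loss = cat_loss(np.array([2.0, 0.5, 3.0, 4.0, 4.0]),
                np.array([1.0, 3.0, 3.0, 4.0, 4.0]),
                true_r=1, confusing_r=0, n_rel=5)
```

In training, correctly classified queries would contribute only the L term; the L̃ and L_kl terms are added only for the misclassified ones.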

Experiments
In this section, we report our experimental results from the following four aspects. We first show the comparison between our model CTEG and the baselines on the FewRel dataset in Section 3.3. We then demonstrate the effectiveness of the proposed Entity-Guided Attention (EGA) and Confusion-Aware Training (CAT) through ablation studies in Section 3.4. To show the role of EGA and CAT more intuitively and clearly, we present visualized examples in the case study in Section 3.5. Furthermore, we verify that our model is capable of addressing the relation confusion problem to some extent in Section 3.6.

Implementation Details
Dataset The FewRel dataset (Han et al., 2018) contains 100 relations, which are split up into 64 for training, 16 for validation and 20 for testing. Each relation has 700 instances generated by distant supervision (Mintz et al., 2009). All the instances are annotated with a specified entity pair.
Settings The dimension of word embeddings is set to 768 for consistency with the base model of BERT (Devlin et al., 2018). The maximum input length is set to 100. Following BERT, the layer number M of the transformer encoder with EGA is 12, and all of its parameters are initialized with the pre-trained BERT model. The relative position and syntactic relation embedding dimensions are both set to 50, and the transformer encoder for obtaining entity-guided gates uses a hidden size of 230 and 2 self-attention heads. In addition, the model is optimized by the Adam algorithm with the learning rate and the weight decay set to 1 × 10^−5 and 1 × 10^−6, respectively.

Baselines
We implement four baselines on the FewRel dataset: Proto, Proto-HATT (Gao et al., 2019a), MLMAN (Ye and Ling, 2019) and BERT-PAIR (Gao et al., 2019b). All the baselines are based on the few-shot learning framework. Specifically, for each training step, N relations are first sampled from the training set. For each of these relations, K out of 700 instances are sampled to construct a supporting set, based on which a relation representation named the prototype is calculated. Given an instance of the N relations to be classified, the models classify it by calculating its distances to the N prototypes.
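The episode-sampling procedure shared by these baselines can be sketched as follows. The dictionary layout of the dataset and the query count per relation are assumptions for illustration.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query=1):
    """Sample one N-way K-shot training episode (sketch).

    dataset: dict mapping relation name -> list of instances.
    Returns a support set of K instances per sampled relation, plus
    query instances (with integer labels) drawn from the same relations.
    """
    relations = random.sample(sorted(dataset), n_way)
    support, queries = {}, []
    for label, rel in enumerate(relations):
        picked = random.sample(dataset[rel], k_shot + n_query)
        support[rel] = picked[:k_shot]            # K supporting instances
        for inst in picked[k_shot:]:              # held-out query instances
            queries.append((inst, label))
    return support, queries

# toy dataset: 8 relations with 10 placeholder instances each
data = {f"P{i}": [f"sent-{i}-{j}" for j in range(10)] for i in range(8)}
support, queries = sample_episode(data, n_way=5, k_shot=5)
```

Each episode's support set feeds the prototype computation, and the queries are classified against those prototypes, mirroring the 5-way-5-shot evaluation configuration used later.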
Proto & Proto-HATT Both models adopt convolutional neural networks (CNN) as encoders. Proto calculates the prototype by averaging the representations of the K instances in the supporting set, and classifies the query using the Euclidean distance. Differently, Proto-HATT further proposes a hybrid attention scheme that includes an instance-level attention and a feature-level attention, where the former highlights the crucial support sentences when calculating the prototype, and the latter selects more effective features when calculating distances.
MLMAN Different from Proto and Proto-HATT, MLMAN encodes each query and the supporting set interactively by considering their matching information at multiple levels. At the local level, the representations of an instance and a supporting set are matched following the sentence matching framework (Chen et al., 2017b) and aggregated by max and average pooling. At the instance level, the matching degree is first calculated via a multi-layer perceptron (MLP). Then, taking the matching degrees as weights, the instances in a supporting set are aggregated to obtain the class prototype for final classification.
BERT-PAIR This model is based on the sentence-pair classification model of BERT. The sentence to be classified is first paired with each of the supporting instances, and each pair is concatenated into one sequence. BERT takes this sequence as input and returns a relevance score, which measures whether the given sentence expresses the same relation as the corresponding supporting instance.

Comparison with Baselines
Same as Proto, we set N = 5, 10 and K = 1, 5 for N-way-K-shot (NvK) few-shot learning. Average accuracy is used as the evaluation metric for relation classification performance. The results in Table 3 show that our model EGA with CAT, named CTEG, outperforms the three strong baselines Proto, Proto-HATT and MLMAN by a significant margin in all settings. These improvements are mainly brought by our EGA and CAT, which help the model correctly classify easily-confused instances. In addition, applying pre-trained BERT also contributes to the performance. Compared with BERT-PAIR, our CTEG achieves better results on the 5v5, 10v1 and 10v5 settings and comparable results on the 5v1 setting on the test set, while on the dev set our CTEG is slightly lower than BERT-PAIR on the 5v1 and 10v1 settings. We attribute the lower dev-set performance on the 5v1 and 10v1 settings to the fact that BERT-PAIR encodes the two sentences together, which benefits information fusion, while models based on the prototypical network rely on a larger number K of supporting instances to obtain a better prototype.

Ablation Study
We conduct ablation studies and show the results in Table 3. Firstly, we turn off the CAT of our full model, denoted "w/o CAT". In this case, the average results drop by 0.43-1.76 points on the four settings, indicating that CAT improves the classification performance. We then report three groups of results to verify the effectiveness of EGA. Specifically, our model without EGA, which only adopts BERT as the encoder, is denoted "w/o EGA". It is worth noting that in this case the model cannot identify which words in a given sentence are entities. When the EGA is removed, the performance decreases markedly, by 5.81-14.13 points, which shows that entity information is crucial for relation classification. Furthermore, "w/ Pos" means the entity-guided gates in EGA are calculated using only the relative position information, and "w/ Syn" using only the syntactic relation information. Compared with "w/o EGA", the results of these two groups are significantly improved. The syntactic relation information is more powerful than the relative position information, which means that considering the dependency relations between each word and the specified entity pair boosts performance over simply adopting the traditional relative position information. In addition, it can be seen that the smaller the supporting set (1-shot vs. 5-shot), the larger the absolute gain our CAT and EGA modules achieve. This phenomenon shows that our method performs well with fewer available supporting instances.

How to Gate The self-attention mechanism updates the representation of each query word by fusing the information of all key words in a given sentence. In this process, an attention score is calculated to weight the contribution of each key word. In this paper, we propose to use gates to further adjust these attention scores.
In our proposed EGA, each entity-guided gate reflects the relation between a key word and the specified entity pair, which differs across key words. We also implement a baseline QGG with query-guided gates, where each gate reflects the relation between the key word and the query word. Specifically, this relation is modeled by their syntactic relation if the key word is a specified entity, and by an 'other' relation otherwise. The results of the two kinds of gates in Table 4 show that our model CTEG w/ Syn, which only models syntactic relations, outperforms the QGG baseline, further verifying that our EGA with entity-guided gates effectively leverages the specified entity information to select input information.
What and When to Gate In our EGA, the entity-guided gates are applied from the beginning of the encoding process by multiplying them with the self-attention scores in each transformer layer, so the information of the words is selected while their representations are being learned. Another baseline, denoted FHG, instead multiplies the gates with the final transformer hidden states of the words; in this case, the information of all words has been fully fused before the gating mechanism performs any selection. As shown in Table 4, compared with our model CTEG, the accuracy of FHG drops by 1.95 points. This indicates that gating the attention scores early during encoding, as in our EGA, is more effective than gating only the final hidden states.

Case Study
EGA visualized example The entity-guided gates in EGA are expected to emphasize the words that are most related to the true relation. To verify the effectiveness of EGA intuitively, we show the heat map of the entity-guided gates for a given instance in Figure 3. This instance is sampled from the 'parent-children' relation in the validation set of FewRel. As shown in the map, the words 'his mother is' are given higher scores; obviously, these three words are important for expressing the 'parent-children' relation.
Figure 3: An example of the entity-guided gates of a given sentence.

CAT visualized example In Figure 4, we visualize the distance distributions between the given sentence and its candidate relations. The four subfigures show the distance distributions calculated by different models, including our true relation identification and confusing relation identification. Among the five candidate relations, R2 in green is the true relation of the sentence, and R1 in red is the confusing relation into which the sentence is usually misclassified. Each edge in the subfigures represents the distance from the sentence to the corresponding relation, and the solid edge indicates the nearest one. Specifically, (a) shows the distances calculated by a randomly initialized network. (b) shows the classification result of Proto; in this case, the query is misclassified into R1. (c) and (d) show the final classification results of our CAT. The distance distribution between the query and the confusing relation calculated by our CAT is shown in (d); it can be seen that the model succeeds in making the query closer to the confusing relation R1, as expected. After that, the distance distribution information is propagated to the true relation training via the KL divergence, which pushes the distance distribution of the true relation prediction away from that of the confusing relation. As (c) shows, the sentence is pushed away from R1 and gets closer to the true relation R2. This example validates our assumption that explicit learning of confusing relations facilitates the identification of true relations.

Relation Confusion Problem
In this section, we discuss the effectiveness of our model on confusion decoupling, using confusion matrices for evaluation.
Confusing Relations Selection We first analyze the classification results of the baseline models Proto and Proto-HATT. Based on our statistics, we find that three of the 16 relations in the FewRel validation set are most easily confused with each other. Their relation indexes are P25, P26 and P40, and the corresponding relations are "Parents-Child", "Husband-Wife" and "Uncle-Nephew". We test our model and the baseline models under the 5-way-5-shot configuration. For the three easily-confused relations, we respectively record the number of their sentences that are correctly classified and that are misclassified into the other two relations, and use the results to construct the confusion matrices.

In Figure 5, we report the classification results of different models on the three confusing relations P25, P26 and P40. In the confusion matrices, the horizontal axis represents the true relation of the sentences, and the vertical axis represents the classification results of these sentences by different models. For each matrix, supposing a given relation such as P25 has X sentences participating in the test, and the numbers of sentences classified into P25, P26 and P40 are respectively a, b and c, then the elements in the first row of the matrix are calculated as (a/X, b/X, c/X). Given a relation, we expect the models to classify more of its sentences into the true relation and fewer into the confusing relations. From this perspective, comparing the confusion matrices of CTEG and the baseline models, it can be seen that our full model CTEG achieves the best performance in identifying these easily confused relations. "w/o EGA" has the weakest ability to decouple the confusing relations, because it is not provided with any entity information to identify the true relation.

Figure 5: Confusion matrices of the three easily confused relations, where different colors represent the classification results of different models.
Based on the results of "w/o EGA", "w/ Pos" and "w/ Syn", we can see that both the relative position and the syntactic relation bring significant improvements. In addition, compared with our full model, the performance of "w/o CAT" shows that CAT helps to decouple the confusing relations.
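The row-normalized confusion matrix described above can be computed with a short numpy sketch; the label counts below are hypothetical and only illustrate the (a/X, b/X, c/X) normalization.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix as plotted in Figure 5 (sketch).

    Row i gives, for relation i with X test sentences, the fraction of
    those sentences classified into each candidate relation: (a/X, b/X, c/X).
    """
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, pred_labels):
        counts[t, p] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# hypothetical predictions for P25, P26, P40 (encoded as classes 0, 1, 2)
true = [0] * 10 + [1] * 10 + [2] * 10
pred = [0] * 8 + [1] * 2 + [1] * 9 + [0] + [2] * 7 + [0] * 3
cm = confusion_matrix(true, pred, 3)
print(cm[0])  # first row: 8 of 10 P25 sentences kept, 2 confused with P26
```

A better-decoupling model concentrates mass on the diagonal of this matrix, which is exactly how the comparison between CTEG and the ablated variants is read off Figure 5.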

Related Work
Few-shot Relation Classification Relation classification (RC) aims to identify the semantic relation between two entities in a sentence, which underlies many natural language processing tasks, such as question answering (Yu et al., 2017) and knowledge graph completion (Shang et al., 2019). It has attracted increasing attention over the past few years (Jia et al., 2019; Feng et al., 2018; Vinyals et al., 2018; Adel and Schütze, 2017; Yang et al., 2016a). Previous supervised approaches to this task heavily rely on labeled training data, which limits their ability to classify relations with insufficient instances. To address this problem, Han et al. (2018) first introduce few-shot learning to the RC task; few-shot learning has proven effective in the computer vision community and has many applications (Vinyals et al., 2016; Sung et al., 2017; Santoro et al., 2016). Earlier works on few-shot RC are based on the widely used prototypical network (Snell et al., 2017; Ye and Ling, 2019). Recently, pre-trained language models (LMs) have shown significant power in many natural language processing tasks. To this end, Gao et al. (2019c) adopt the most representative pre-trained LM, BERT (Devlin et al., 2018), for few-shot RC, and their work shows that BERT brings significant improvements in classification performance. Furthermore, the approach proposed by Soares et al. (2019) is also based on BERT and achieves the state-of-the-art result on the few-shot RC task.
Syntactic Relation Previous RC models usually use relative position information to identify which words in a sentence are the entities, e.g., Zeng et al. (2015b). In addition, the syntactic information of sentences has proven useful in many natural language processing tasks (Faleńska and Kuhn, 2019; Ma et al., 2020; Chen et al., 2017a). Inspired by Yang et al. (2016b), who adopt the dependency parse tree for RC (Ma et al., 2020), we also introduce the dependency relation as another type of position information to emphasize the specified entities, and propose a novel application of syntactic positions.

Conclusions
In this paper, we propose CTEG, a model equipped with two novel mechanisms, namely Entity-Guided Attention (EGA) and Confusion-Aware Training (CAT), to address the relation confusion problem in few-shot relation classification (RC). We conduct extensive experiments on the benchmark FewRel dataset, and the experimental results show that our model achieves significant improvements on few-shot RC. Ablation studies verify the effectiveness of the proposed EGA and CAT mechanisms. The case study and further analysis demonstrate that our model is able to decouple easily-confused relations.