Logic-guided Semantic Representation Learning for Zero-Shot Relation Classification

Relation classification aims to extract semantic relations between entity pairs from sentences. However, most existing methods can only identify seen relation classes that occurred during training. To recognize unseen relations at test time, we explore the problem of zero-shot relation classification. Previous work regards the problem as reading comprehension or textual entailment, which requires manually written descriptive information to make relation types understandable. As a result, the rich semantic knowledge of the relation labels is ignored. In this paper, we propose a novel logic-guided semantic representation learning model for zero-shot relation classification. Our approach builds connections between seen and unseen relations via implicit and explicit semantic representations with knowledge graph embeddings and logic rules. Extensive experimental results demonstrate that our method can generalize to unseen relation types and achieve promising improvements.


Introduction
Relation Classification (RC) is an important task in information extraction, aiming to extract the relation between two given entities based on their related context. RC has attracted increasing attention due to its broad applications in many downstream tasks, such as knowledge base construction (Luan et al., 2018) and question answering (Yu et al., 2017).
Conventional supervised RC approaches cannot satisfy the practical needs of relation classification. In the real world, there exist massive amounts of fine-grained relations, whereas the labeled relation types are limited and each type usually has only a certain number of labeled samples. Consequently, supervised models cannot generalize to new (unseen) relations (i.e., the model will fail when predicting a type with no training examples). For example, in Figure 1, basin country is an unseen relation type with no labeled sentence in the training stage. It is therefore important for models to be able to extract relations in a zero-shot scenario.
Previous zero-shot relation classification (ZSRC) approaches leverage transfer learning procedures by reading comprehension (Levy et al., 2017), textual entailment (Obamuyide and Vlachos, 2018), and so on. However, those methods have to rely on manually written descriptive information to improve the understandability of relation types. Inspired by zero-shot learning in computer vision (Palatucci et al., 2009), it is natural to learn a mapping from the feature space of input samples to a semantic space, such as that of the class labels, through a projection function. The hypothesis is that this builds semantic connections between seen and unseen relations. Conventional approaches usually leverage word embeddings of labels as a common semantic space. We argue that for relation classification, rich semantic knowledge in the relation label space is neglected:
Implicit Semantic Connection with Knowledge Graph Embedding. Previous studies (Yang et al., 2014) have shown that the Knowledge Graph Embeddings (KGEs) of semantically similar relations are located near each other in the latent space. For instance, the relations place lived and nationality are more relevant to each other, whereas the relation profession has less correlation with the former two relations. Thus, it is natural to leverage this knowledge from KGs to build connections between seen and unseen relations.
[Figure: panel label "Feature representation"; example sentences "Beijing is the capital city which is located in northern part of China." and "Western Sahara borders the North Atlantic Ocean."]
Explicit Semantic Connection with Rule Learning. We humans can easily recognize unseen relations via symbolic reasoning. As shown in the example in Figure 1, with the rule that basin country of(y,z) can be deduced if located in country(x,y) and next to body of water(x,z) hold, we can recognize the unseen relation basin country of based on the seen relations located in country and next to body of water. It is therefore intuitive to infuse rule knowledge to bridge the connections between seen and zero-shot relations.
Motivated by this, we take the first step and propose a novel approach, namely Logic-guided Semantic Representation Learning (LSRL), for zero-shot relation classification. To begin with, we utilize pre-trained knowledge graph embeddings such as TransE (Bordes et al., 2013) to build the implicit semantic connection. KGE embeds entities and relations into a continuous vector space and can capture semantic connections between relations in that space. Further, we leverage logic rules mined from the knowledge graph via AMIE (Galárraga et al., 2013) and introduce rule-guided representation learning to obtain the explicit semantic connection. It should be noted that our approach is model-agnostic, and therefore orthogonal to existing approaches. We integrate our approach with two well-known zero-shot methods, namely DeViSE and ConSE (Norouzi et al., 2013). Extensive experimental results demonstrate the efficacy of our approach.
The main contributions of this work are as follows:
• We introduce implicit semantic connection with knowledge graph embedding and explicit semantic connection with rule learning for zero-shot relation classification.
• We propose a novel rule-guided semantic representation learning method to build connections between the seen and unseen relations. Our work is model-agnostic and can be plugged into different kinds of zero-shot learning approaches.
• Extensive experimental results show the efficacy of our approach and also reveal the usefulness of knowledge graph embedding and rule learning.

Related Work
Relation Classification. Relation classification (RC) was first proposed in MUC 1998 and aims to predict the relation between two entities in a specific context. Many mature models have been developed to address this problem, including traditional methods (GuoDong et al., 2005), deep neural network approaches (Zeng et al., 2015; Zhang et al., 2018; Zhang et al., 2019a; Yu et al., 2020; Wang et al., 2020), and joint models (Zheng et al., 2017; Ye et al., 2020). However, these are all supervised approaches, which can only infer relations existing in the training set and are incapable of making predictions for newly added relations.
Zero-shot relation classification (ZSRC) was first proposed by Levy et al. (2017), who extract new relations by reducing relation classification to answering simple reading comprehension questions. Later, Obamuyide and Vlachos (2018) formulate relation extraction as a textual entailment problem and consider the input instance and the relation description as the premise and hypothesis, respectively. However, these approaches require human annotators to construct questions or write descriptions for relations, which is labor-intensive. In contrast, our zero-shot relation classification approach does not need any human involvement and can be integrated into most existing RC models.
Zero-shot Learning. In the computer vision field, zero-shot learning (ZSL) has attracted a lot of attention. The key that underpins ZSL in image recognition is to exploit the shared semantic representations between seen and unseen classes and transfer them to the visual representations of samples. DeViSE is a ZSL model that learns a linear mapping between image features and the semantic space using an efficient ranking loss formulation. Norouzi et al. (2013) propose ConSE, which first predicts seen class posteriors and then projects image features into the word2vec space by considering the convex combination of the top T most probable seen classes. The semantic representation in those approaches is learned from auxiliary information attached to the class labels, such as attribute descriptions (Jayaraman and Grauman, 2014; Farhadi et al., 2009) and embedding representations (Romera-Paredes and Torr, 2015; Akata et al., 2016). Different from the zero-shot approaches in computer vision, we construct the semantic space from knowledge graph information rather than from word embeddings or attributes.
Knowledge Graph Embedding. In recent years, various KG embedding methods, including translation-based, semantic matching, and neural network methods, have been devised to learn vector representations for the entities and relations of a KG. Translation-based models (Bordes et al., 2013; Ji et al., 2015; Lin et al., 2015b) use distance-based scoring functions to assess the plausibility of a triple. For example, in TransE (Bordes et al., 2013), the score function is $f_r(h, t) = \|h + r - t\|^{2}_{\ell_{1/2}}$. Semantic matching models (Yang et al., 2014; Nickel et al., 2016; Liu et al., 2017) employ similarity-based scoring functions to compute the energy of relational triples, where the scoring function of the representative model DistMult (Yang et al., 2014) is the bilinear form $f_r(h, t) = h^{\top} \mathrm{diag}(r)\, t$. Neural network models learn to express entities and relations through neural networks, such as CNN-based methods (Dettmers et al., 2018) and GNN-based methods (Schlichtkrull et al., 2018). We utilize KG embedding models instead of word embeddings to learn the representations of relations, so that these representations depend only on the structure of the KG and not on the relations' names. In this way, the connections between relations are also captured.
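To make the distance-based scoring idea concrete, the following is a minimal sketch of a TransE-style score in plain Python. The function name `transe_score`, the toy two-dimensional embeddings, and the relation names are illustrative assumptions and do not come from the paper; the sketch uses the unsquared L1 or L2 norm, which ranks triples in the same order as a squared norm would.

```python
import math


def transe_score(h, r, t, norm=1):
    """TransE-style distance ||h + r - t||; a lower score means a more plausible triple."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    if norm == 2:
        return math.sqrt(sum(d * d for d in diff))
    raise ValueError("norm must be 1 or 2")


# Toy embeddings with made-up values, for illustration only.
beijing = [1.0, 0.0]
china = [1.0, 1.0]
capital_of = [0.0, 1.0]   # chosen so that beijing + capital_of == china
profession = [2.0, -1.0]  # an unrelated relation

plausible = transe_score(beijing, capital_of, china)
implausible = transe_score(beijing, profession, china)
print(plausible, implausible)
```

Under these toy values the triple that satisfies the translation assumption receives the smaller distance, which is the property the embedding training objective tries to enforce for observed triples.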
Rule Learning. Rules over a KG can capture connections between relations, and a variety of methods for rule learning have been studied, such as Inductive Logic Programming (ILP) algorithms, rule mining methods, and embedding-based methods. ILP is formalized in first-order logic and has strong representational power, but does not scale to large datasets (Sadeghian et al., 2019). To address this, several efficient rule miners for KGs have been developed, such as RDF2rules (Wang and Li, 2015), ScaleKB (Chen et al., 2016), and AMIE+ (Galárraga et al., 2015). In addition, embedding-based rule learning methods have gained attention. RLvLR (Omran et al., 2018) utilizes embeddings to guide rule extraction and reduce the search space. DistMult (Yang et al., 2014) utilizes learned embeddings of entities and relations to extract logical rules. Ho et al. (2018) introduce a framework for rule learning guided by external sources. In this paper, we adopt the widely used rule mining method proposed in (Galárraga et al., 2013) to extract rules from the KG. To the best of our knowledge, ours is the first approach to address zero-shot relation classification with the assistance of logical rules from a KG.

Preliminaries
We start by defining some notations and terms. $R_S$ denotes the set of seen relations during training and $R_U$ denotes the set of unseen relations for testing. $D_{tr} = \{(s_i, h_i, t_i, y_i), i = 1, ..., N_s\}$ denotes the training dataset, where $s_i$ represents one sentence and $(h_i, t_i)$ is an entity pair mentioned in $s_i$. $N_s$ is the number of training instances and $y_i \in R_S$ denotes the relation of the entity pair. Analogously, $D_{ts} = \{(s_j, h_j, t_j, y_j), j = 1, ..., N_u\}$ denotes the test dataset, where $N_u$ is the number of test instances and $y_j \in R_U$ denotes the relation of $h_j$ and $t_j$ in the sentence $s_j$. The overall framework is illustrated in Figure 2 and is composed of three modules:
Feature Representation (§3.2) encodes the input sentence into the feature space and is aimed at capturing syntactic features of sentences.
Semantic Representation Learning (§3.3) maps relation types into a semantic space and builds up the connections between seen and unseen relations. Specifically, we propose logic-guided semantic representation learning with knowledge graph embedding and rule learning.
Inference (§3.4) predicts relation types by computing the similarity between the feature representation of the current input sentence and the semantic representations of all unseen relations. We predict the unseen relation whose label is most similar in the semantic space.

Feature Representation
The input of feature representation is a sentence, and the output is its vector representation. First, we use the Piecewise Convolutional Neural Networks (PCNNs) (Zeng et al., 2015) model to encode the input instance, and then use two types of projection functions, DeViSE and ConSE, to get the final feature representation of the input instance.
PCNNs have been proven to be effective in RC. The model feeds the concatenation of word embeddings and position embeddings into a Convolutional Neural Network (CNN) to obtain the hidden layer representation $h$. Then, $h$ is divided into three parts based on the two entities' positions, and max pooling is performed on each part to obtain $(pl_1, pl_2, pl_3)$. The final feature encoding of the input sentence, $f = [pl_1; pl_2; pl_3]$, is the concatenation of the three pooled segments.
DeViSE formulates ZSL as a regression problem and learns a linear function to project the input representation into the target semantic space.
ConSE maps the input representation into the target semantic space via a convex combination. It trains a classifier $C$ on the training dataset $D_{tr}$ and obtains the top $T$ most probable seen relation types $R^S_t$ together with their probabilities $p_t$. With $E(R^S_t)$ denoting the embedding of $R^S_t$, the sum of the $E(R^S_t)$ weighted by $p_t$ is regarded as the feature representation of the input.
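As a rough illustration of the three steps described above, the sketch below implements piecewise max pooling, a DeViSE-style linear projection, and a ConSE-style convex combination in plain Python. It is not the authors' implementation: the function names, the segment boundary convention, the zero padding for empty segments, the bias term in the linear map, and the normalization of the top-T probabilities are all assumptions made for this example, and the numbers are made up.

```python
def piecewise_max_pool(hidden, pos1, pos2):
    """Split per-token hidden vectors at the two entity positions and max-pool each part."""
    first, second = sorted((pos1, pos2))
    dim = len(hidden[0])
    # Assumed convention: each entity token closes the segment it belongs to.
    segments = [hidden[:first + 1], hidden[first + 1:second + 1], hidden[second + 1:]]
    pooled = []
    for seg in segments:
        if seg:
            pooled.extend(max(vec[d] for vec in seg) for d in range(dim))
        else:
            pooled.extend([0.0] * dim)  # assumption: an empty segment pools to zeros
    return pooled  # f = [pl_1; pl_2; pl_3]


def devise_projection(f, weight, bias):
    """DeViSE-style linear map W f + b into the semantic space (bias term is an assumption)."""
    return [sum(w * x for w, x in zip(row, f)) + b for row, b in zip(weight, bias)]


def conse_projection(probs, label_embeddings, top_t=3):
    """ConSE-style convex combination of the top-T seen relation embeddings."""
    top = sorted(probs, key=probs.get, reverse=True)[:top_t]
    z = sum(probs[rel] for rel in top)  # normalize so the weights sum to 1
    dim = len(label_embeddings[top[0]])
    return [sum(probs[rel] * label_embeddings[rel][d] for rel in top) / z for d in range(dim)]


# Five tokens with two CNN channels each; entities at token positions 1 and 3.
hidden = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
f = piecewise_max_pool(hidden, pos1=1, pos2=3)

# DeViSE-style projection with a hand-written 2 x 6 weight matrix.
weight = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
g_devise = devise_projection(f, weight, bias=[0.0, 0.5])

# ConSE-style projection from made-up classifier probabilities over seen relations.
probs = {"located_in_country": 0.6, "next_to_body_of_water": 0.3, "profession": 0.1}
seen_embeddings = {
    "located_in_country": [1.0, 0.0],
    "next_to_body_of_water": [0.0, 1.0],
    "profession": [-1.0, -1.0],
}
g_conse = conse_projection(probs, seen_embeddings, top_t=2)
print(f, g_devise, g_conse)
```

The ConSE-style output always lies inside the convex hull of the selected seen relation embeddings, whereas the linear projection can reach any point of the semantic space.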

Semantic Representation Learning
Semantic representation builds connections between unseen and seen relations in ZSRC via external resources. We describe the following three kinds of embedding representations in a semantic space.
Word Embedding, denoted as $E_{wd}$, is the commonly used method. However, it faces the challenges analyzed in the introduction. To capture the rich explicit and implicit semantic connections between relations, we introduce two forms of embedding methods based on KGs.
KG Embedding, denoted as $E_{kg}$, embeds relations and entities into latent low-dimensional continuous vectors. In KG embedding methods, the score of a triple $(h, r, t)$ can be calculated from the head entity embedding $E(h)$, the relation embedding $E(r)$, and the tail entity embedding $E(t)$. Different methods follow different assumptions. For example, the typical method TransE (Bordes et al., 2013) supposes a translation law in the semantic space, where $E(h) + E(r) = E(t)$ for positive triples in a KG. Therefore, a relation embedding obtained from KG embedding methods depends on the triples the relation belongs to and is not affected by the words it contains. However, the words contained in the relation string sometimes also reveal the semantics of the relation, in which case word embeddings and KG embeddings can complement each other. Hence, we also consider combining them: the KG+Word embedding is obtained by applying a linear transformation to the concatenation $[E_{kg}; E_{wd}]$, where $[x; y]$ denotes the concatenation of $x$ and $y$.
Rule-guided Embedding, denoted as $E_{rl}$, represents rules in vector space instead of in symbolic form. In a knowledge graph, logic rules show the connections between relations. They take the form body ⇒ head, where head is a binary atom and body is a conjunction of binary and unary atoms, such as the rule spouse(x,y) ∧ father(y,z) ⇒ mother(x,z); the number of atoms in the body is the length of the corresponding rule. We adopt a typical rule mining method, AMIE (Galárraga et al., 2013), to generate rules from structural KGs. In addition to rules, AMIE also produces the PCA confidence conf, which is used to filter out rules. Inspired by (Zhang et al., 2019b), we apply a simple but effective embedding-based method to incorporate symbolic rules into the semantic space and generate $E_{rl}$. Taking TransE as an example, it assumes $h + r \approx t$ for a positive triple $(h, r, t)$. According to this assumption, we can get $r_1 + r_2 = r_3$ if the rule $r_1(x, y) \wedge r_2(y, z) \Rightarrow r_3(x, z)$ exists, as mentioned in (Lin et al., 2015a).
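The additive relation between rule body and rule head follows from the translation assumption in a few steps. The derivation below is a short worked version, writing $E(\cdot)$ for the TransE embedding of an entity or relation and treating the approximations informally:

```latex
\begin{aligned}
r_1(x, y) \;&\Rightarrow\; E(x) + E(r_1) \approx E(y) \\
r_2(y, z) \;&\Rightarrow\; E(y) + E(r_2) \approx E(z) \\
\text{substituting } E(y): \quad & E(x) + E(r_1) + E(r_2) \approx E(z) \\
r_3(x, z) \;&\Rightarrow\; E(x) + E(r_3) \approx E(z) \\
\text{comparing the last two lines:} \quad & E(r_1) + E(r_2) \approx E(r_3)
\end{aligned}
```

The same identity can be rearranged to isolate any one of the three relations, for instance $E(r_2) \approx E(r_3) - E(r_1)$, which is what allows an unseen relation in the body of a rule to be estimated from the seen relations around it.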
Thus, if a rule contains an unseen relation, the embedding of the unseen relation can be calculated from the other, seen relations in this rule. Moreover, one unseen relation may be involved in multiple rules; in that case, we calculate the unseen relation's embedding as the confidence-weighted average of the embeddings derived from its rules. Here $R^U_i$ represents the $i$-th unseen relation, $Rule^U_{ij}$ is the $j$-th rule in the set of rules about $R^U_i$ with the top $K$ highest PCA confidence scores, and $conf_j$ represents the PCA confidence of $Rule^U_{ij}$. For example, with two rules about the unseen relation $r$, $R1: r_A \wedge r_B \Rightarrow r$ and $R2: r_C \wedge r \Rightarrow r_D$, following TransE's assumption, we calculate the embedding of $r$ via
$$E_{rl}(r) = \frac{conf_1 \, (E(r_A) + E(r_B)) + conf_2 \, (E(r_D) - E(r_C))}{conf_1 + conf_2}.$$
Similar to the KG+Word embedding, we also consider the KG+Rule embedding, denoted as $E_{kr}$, as the two might also complement each other. $E_{kr}$ is calculated as a weighted combination of $E_{kg}$ and $E_{rl}$, where $\lambda$ is a hyperparameter representing the combination weight between KG embedding and rules. It is set to 0.5 in our experiments. Meanwhile, we calculate the Rule+Word embedding, denoted as $E_{rw}$, by replacing $E_{kg}$ in Equation 5 with $E_{rl}$.
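The rule-guided computation described above can be sketched in a few lines of plain Python. This is an illustrative reading of the text, not the authors' code: the tuple layout `(body1, body2, head, conf)`, the function names, the default `top_k`, the toy embeddings, and the form `lam * E_kg + (1 - lam) * E_rl` for the KG+Rule combination are assumptions, and each rule is assumed to have length 2 and to mention the target relation exactly once.

```python
def rule_guided_embedding(target, rules, embeddings, top_k=2):
    """Confidence-weighted average of the estimates each rule gives for `target`."""
    relevant = [rule for rule in rules if target in rule[:3]]
    best = sorted(relevant, key=lambda rule: rule[3], reverse=True)[:top_k]
    if not best:
        raise ValueError(f"no rule mentions {target!r}")
    dim = len(next(iter(embeddings.values())))
    weighted = [0.0] * dim
    conf_sum = 0.0
    for body1, body2, head, conf in best:
        if head == target:
            # body1 + body2 = target
            estimate = [a + b for a, b in zip(embeddings[body1], embeddings[body2])]
        else:
            # target + other = head, hence target = head - other
            other = body2 if body1 == target else body1
            estimate = [h - o for h, o in zip(embeddings[head], embeddings[other])]
        weighted = [w + conf * e for w, e in zip(weighted, estimate)]
        conf_sum += conf
    return [w / conf_sum for w in weighted]


def combine(e_kg, e_rl, lam=0.5):
    """Assumed form of the KG+Rule combination with weight lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(e_kg, e_rl)]


# Toy embeddings of seen relations (made-up values).
seen = {"r_A": [1.0, 0.0], "r_B": [0.0, 1.0], "r_C": [1.0, 1.0], "r_D": [2.0, 3.0]}
rules = [
    ("r_A", "r_B", "r", 0.8),  # R1: r_A AND r_B => r
    ("r_C", "r", "r_D", 0.2),  # R2: r_C AND r => r_D
]
e_rl = rule_guided_embedding("r", rules, seen)
e_kr = combine([0.0, 0.0], e_rl, lam=0.5)  # made-up KG embedding for r
print(e_rl, e_kr)
```

With these values R1 suggests [1, 1] and R2 suggests [1, 2], and the higher confidence of R1 pulls the result toward the first estimate.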

Inference
During prediction, we compute the similarity between the feature representation $f$ of the input sentence and the semantic representation of each unseen relation. The similarity function sim() can be cosine similarity or Euclidean distance. Unseen relations with higher similarity to the sentence feature representation are more likely to be predicted.
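A minimal sketch of this inference step, assuming cosine similarity and a dictionary of unseen relation embeddings, could look as follows. The function names, the relation names, and the vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_unseen_relation(f, unseen_embeddings):
    """Return the unseen relation most similar to the projected sentence feature f."""
    scores = {rel: cosine_similarity(f, emb) for rel, emb in unseen_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Projected sentence feature and unseen relation embeddings (made-up values).
f = [0.9, 0.1]
unseen = {"basin_country": [1.0, 0.2], "mother": [-0.5, 1.0]}
prediction, scores = predict_unseen_relation(f, unseen)
print(prediction, scores)
```

Swapping in Euclidean distance would only require replacing the scoring function and selecting the minimum instead of the maximum.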

Experiment
In the experiments, we want to explore:
1) Whether embeddings based on KGs are more useful for the ZSRC task than word embeddings?
2) What is the factor that can strengthen semantic representations in ZSRC?
3) Whether and how can logical knowledge help build better semantic space?

Datasets
Different from the zero-shot relation classification datasets of (Levy et al., 2017) and (Obamuyide and Vlachos, 2018), our method considers rules of relations during training rather than question templates or relation descriptions. We construct a new dataset based upon the Wikipedia-Wikidata (Sorokin and Gurevych, 2017) relation extraction dataset, which contains 353 relations and 856,217 instances. To evaluate the capability of injecting rule logic into zero-shot prediction models, we ensure that relations in our dataset have certain connections. We first cluster all 353 relations based on word embeddings and then, within each cluster, divide relations into seen and unseen according to their number of instances, using a given threshold (1200). We drop the relations of any cluster in which every relation has fewer than 500 instances, on the assumption that there is no support from related seen labels. Manual adjustments are further applied to get the final dataset. The final dataset contains 100 relations, 70 of which are seen relations and the remaining 30 are unseen ones.
We use Wikipedia documents to train word embeddings, where words that appear more than ten times are preserved in the vocabulary, following a common setting in other works. Word2vec is applied for word embedding training with the window size set to 5. For KG embeddings, we use TransE to train the embeddings of entities and relations on Wikidata, which contains 20,982,733 entities and 594 relations in total. The embedding size is set to 100, the margin to 1.0, and the learning rate to 0.01. For the PCNN layer, we set the kernel size to 3, the position embedding size to 5, the number of channels to 250, the margin to 2.0, the learning rate to 0.01, and dropout to 0.5. For ConSE, the top 3 seen classes are chosen for prediction. For DeViSE, the margin is set to 1.0. For rule mining, we set the maximum length of rules to 2.
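The per-cluster seen/unseen split in the dataset construction can be sketched as below. This is one possible reading of the procedure, not the authors' script: it assumes that relations with at least 1200 instances become seen and the rest become unseen, and that a cluster is dropped when every relation in it has fewer than 500 instances. The function name, relation names, and instance counts are made up, and the manual adjustments mentioned in the text are not modeled.

```python
def split_cluster(cluster, seen_threshold=1200, drop_threshold=500):
    """Split one cluster (relation -> instance count) into seen and unseen relations.

    Returns None when the whole cluster is dropped.
    """
    if all(count < drop_threshold for count in cluster.values()):
        return None  # no sufficiently frequent relation could support the others
    seen = sorted(rel for rel, count in cluster.items() if count >= seen_threshold)
    unseen = sorted(rel for rel, count in cluster.items() if count < seen_threshold)
    return seen, unseen


# Hypothetical clusters with made-up instance counts.
sports_cluster = {"member_of_sports_team": 5400, "drafted_by": 800, "league": 2100}
rare_cluster = {"rare_relation_a": 120, "rare_relation_b": 340}

sports_split = split_cluster(sports_cluster)
rare_split = split_cluster(rare_cluster)
print(sports_split, rare_split)
```

Splitting inside clusters, rather than over the whole relation set, is what keeps every unseen relation close to at least some seen relations.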

Whether KG-based embeddings are more useful than word embeddings?

To compare the effectiveness of KG-based embeddings and word embeddings, we regard methods that use $E_{wd}$ as baselines. Results of two kinds of methods, namely KG-based embeddings and combinations of any two embeddings, are listed to explore the usefulness of KGs in the ZSRC task. The experiments also distinguish the results obtained with the two different projection functions in the ZSL problem, i.e., ConSE and DeViSE. During testing, we rank the similarity scores between the feature representations of test sentences and all unseen relations' semantic representations based on cosine similarity and obtain the ranks of the true labels. Hit@K (K = 1, 2, 5) is used as the evaluation metric.
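The ranking-based evaluation described above can be sketched as follows. The helper names, the relation names, and the similarity scores are made up for illustration; ties are broken by the stable sort order of the dictionary, which is an assumption of this sketch.

```python
def rank_of_true_label(scores, true_label):
    """1-based rank of the true label when relations are sorted by descending similarity."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_label) + 1


def hit_at_k(all_scores, true_labels, k):
    """Fraction of test sentences whose true label is ranked within the top k."""
    hits = sum(
        1
        for scores, label in zip(all_scores, true_labels)
        if rank_of_true_label(scores, label) <= k
    )
    return hits / len(true_labels)


# Similarity scores of three test sentences against three unseen relations (made up).
all_scores = [
    {"drafted_by": 0.9, "occupant": 0.5, "mother": 0.1},
    {"drafted_by": 0.2, "occupant": 0.7, "mother": 0.6},
    {"drafted_by": 0.8, "occupant": 0.3, "mother": 0.5},
]
true_labels = ["drafted_by", "mother", "occupant"]

hit1 = hit_at_k(all_scores, true_labels, k=1)
hit2 = hit_at_k(all_scores, true_labels, k=2)
hit5 = hit_at_k(all_scores, true_labels, k=5)
print(hit1, hit2, hit5)
```

In this toy case the true labels are ranked first, second, and third respectively, so the metric grows from one third at K = 1 to 1.0 once K covers all candidates.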
The overall results are shown in Table 2. Under the ConSE structure, methods that incorporate semantic representations based on the knowledge graph significantly outperform word embedding. Specifically, on Hit@1, KG embedding improves over word embedding by 18% and rule embedding by 19%. The performance of word+KG embedding and word+rule-based embedding also improves substantially, and the combination of KG+rule-based embedding achieves the best performance. Thus, we can conclude that KG-based embeddings are superior to word embedding.
Additional inspection of the table shows that the results with ConSE are better than those with DeViSE in all embedding settings. The reason lies in the difference between the two models. The representation space of ConSE is limited to a combined space spanned by the seen classes, while DeViSE can map an instance embedding to the whole relation space. Thus, a dataset with stronger relevance between relations is more friendly to ConSE, which is the case in our ZSRC dataset.

What is the factor that can strengthen semantic representations in ZSRC?
We analyze the question via a comparison between word embeddings and KG embeddings. A closer inspection of specific relations is given in Table 3. For most unseen relations, KG embeddings perform better than word embeddings. Especially for drafted by, occupant, and office contested, word embeddings predict almost nothing, while KG embeddings achieve 81%, 31%, and 26%, respectively. The reason may be that word embedding is not sufficient to capture complete, accurate, or even logic-level connections between relations. For example, for the relation drafted by, which means "allocate certain players to teams in some sports", it is difficult for word embedding to capture its connection to the relations member of sports team and educated at, because the word draft has other senses, such as "draft a document", that appear much more commonly in the corpus. By contrast, KG embeddings are trained on the entity pairs and relationships existing within a whole knowledge base, thereby capturing the more accurate meaning of, and connections between, these relations.
Table 3: F1 scores of KG embedding and word embedding when using ConSE as the projection function. The top 3 most influential seen relations for each unseen relation are also presented.
Negative examples are also found in Table 3, such as mother, for which KG embedding predicts poorly but word embedding achieves 83%. The reason may be that the number of training triples for mother is relatively small in our dataset, leading to poor embeddings. KG embedding suffers from sparsity due to the imperfect KG, whereas word embedding excels at capturing contextually similar words, such as relating mother to father or spouse.
These results show that successfully building accurate or even logic-level connections between seen and unseen relations is an essential factor for zero-shot tasks, and this is why KG embeddings perform better than word embeddings.

Whether and how can logical knowledge help build better semantic space?
We investigate this question via logic rule analysis. General inspection of Table 2 reveals that rule-based embedding is slightly better than KG embedding alone, and KG+Rule embedding achieves the best result, with a 3∼4% improvement in overall Hit@1 score under ConSE. Case studies over different kinds of semantic representations are listed in Table 4. They show that, for most relations, such as nominated for, producer, and lyrics by, rule embedding achieves results at least comparable with KG embedding. For some relations, such as producer, rule embedding slightly outperforms KG embedding. This may be because rule embedding can capture logic-level connections between seen and unseen relations. For example, the unseen relation nominated for is logically related to the two seen relations award received and winner through the rule nominated for(x,z) ⇐ award received(x,y) ∧ winner(y,z). The most interesting case is the relation mother, for which KG embedding falls behind word embedding because it is poorly trained, while rule embedding achieves scores comparable with word embedding. The reason may be that rule embedding strengthens the embedding by incorporating more knowledge from the related relations contained in the rules, thus correcting the relation embedding.
Figure 3: Heatmap constructed from the results of ConSE+KG, reflecting the influence of seen relations on unseen relations. The horizontal axis represents seen classes and the vertical axis represents unseen classes.
We also present a heatmap of the ConSE+KG results corresponding to Table 4, as shown in Figure 3. From the heatmap, we can see that KG embeddings can capture logical connections between unseen and seen relations. For example, the relation award received plays an important role in the prediction of the unseen relation nominated for, which is exactly consistent with the rule award received(x,y) ∧ winner(y,z) ⇒ nominated for(x,z) in Table 4. Similar correspondences are found for other relations. These results and analyses show that logical connections between relations expressed by rules can help build correct and explicit connections between unseen and seen relations, thereby building a better semantic space for the ZSRC task.

Conclusion and Future Work
We have studied the zero-shot relation classification task and taken the first step towards bridging symbolic reasoning with semantic representations. Extensive experiments demonstrate the efficacy of our approach, revealing the advantages of knowledge graph embeddings and rules. In the future, we plan to explore more efficient approaches to obtain symbolic rules and to build end-to-end reasoning approaches for zero-shot tasks.