ZS-BERT: Towards Zero-Shot Relation Extraction with Attribute Representation Learning

While relation extraction is an essential task in knowledge acquisition and representation, and newly emerging relations are common in the real world, little effort has been made to predict unseen relations that cannot be observed at the training stage. In this paper, we formulate the zero-shot relation extraction problem by incorporating the text descriptions of seen and unseen relations. We propose a novel multi-task learning model, Zero-Shot BERT (ZS-BERT), to directly predict unseen relations without hand-crafted attribute labeling and multiple pairwise classifications. Given training instances consisting of input sentences and the descriptions of their seen relations, ZS-BERT learns two functions that project sentences and relation descriptions into an embedding space, by jointly minimizing the distances between them and classifying seen relations. By generating the embeddings of unseen relations and new-coming sentences with these two functions, we use nearest neighbor search to obtain the predictions of unseen relations. Experiments conducted on two well-known datasets show that ZS-BERT outperforms existing methods by at least a 13.54% improvement in F1 score.


Introduction
Relation extraction is an important task in natural language processing, which aims to infer the semantic relation between a pair of entities within a given sentence. Many applications build on relation extraction, such as extending knowledge bases (KB) (Lin et al., 2015) and improving question answering. Existing approaches to this task usually require large-scale labeled data; however, the labeling cost is a considerable obstacle. Some recent studies generate labeled data with distant supervision (Mintz et al., 2009; Ji et al., 2017). Nevertheless, when the relation extraction task is put in the wild, existing supervised models cannot well recognize the relations of instances that are extremely rare or never covered by the training data. That said, in a real-world setting, we should not presume that the relations/classes of new-coming sentences are always included in the training data. It is thus crucial to devise new models that can predict classes not defined or observed beforehand. Such a task is referred to as zero-shot learning (ZSL) (Norouzi et al., 2013; Lampert et al., 2014; Ba et al., 2015; Kodirov et al., 2017). The idea of ZSL is to connect seen and unseen classes by finding an intermediate semantic representation. Unlike the common way of training a supervised model, the seen and unseen classes are disjoint between the training and testing stages. Hence, ZSL models need to generate transferable knowledge between them. With a model for ZSL relation extraction, we will be able to extract unobserved relations and to deal with new relations resulting from the birth of new entities.
Existing studies on ZSL relation extraction are few and face some challenges. First, the typical study (Levy et al., 2017) cannot perform zero-shot relation classification without additional human effort, as it solves the problem by pre-defining question templates. However, it is infeasible and impractical to manually create templates for new-coming unseen relations under the zero-shot setting. We would expect a model that can produce accurate zero-shot predictions without hand-crafted labeling. In this work, we take advantage of relation descriptions, which are usually publicly available, to achieve this goal. Second, although there exist studies that also utilize the accessibility of relation descriptions (Obamuyide and Vlachos, 2018), they simply treat zero-shot prediction as a text entailment task and only output a binary label indicating whether the entities in the input sentence can be depicted by a given relation description. Such a problem formulation requires the impractical execution of multiple classifications over all relation descriptions, and cannot make relations comparable with each other.

[Figure 1: An illustrating example. Training: seen relations such as member of (organization, musical group, or club to which the subject belongs) and main subject (primary topic of a work), with an example sentence "He had roles in two 2008 films: the sci-fi film 'Jumper' and the World War II drama 'Defiance'". Testing: "During the Philippine-American War, Mark Twain wrote a short pacifist story titled 'The War Prayer'".]

This paper presents a novel model, Zero-Shot BERT (ZS-BERT), to perform zero-shot learning for relation extraction and cope with the challenges mentioned above. ZS-BERT takes two model inputs. One is the input sentence containing the pair of target entities, and the other is the relation description, i.e., text describing the relation of the two target entities. The model output is the attribute vector^1 depicting the relation. The attribute vector can be considered a semantic representation of the relation, and will be used to generate the final prediction of unseen relations. We believe that better utilization of relation descriptions through representation learning is more cost-effective than collecting large numbers of instances with labeled relations. Therefore, an essential benefit of ZS-BERT is freedom from heavy-cost crowdsourcing or annotation, i.e., annotating what kinds of attributes a class has, which is commonly required in zero-shot learning (Lu et al., 2018; Lampert et al., 2009).

^1 The terms "attribute vector", "embedding", and "representation" are used interchangeably throughout this paper.

Figure 1 depicts the overview of the proposed ZS-BERT, which consists of five steps. Each training instance is a pair of an input sentence X_i and its corresponding relation's description D_j. First, we learn a projection function f that projects the input sentence X_i to its corresponding attribute vector, i.e., the sentence embedding. Second, we learn another mapping function g that encodes the relation description D_j into its corresponding attribute vector, which is the semantic representation of D_j. Third, given the training instance (X_i, D_j), we train ZS-BERT by minimizing the distance between the attribute vectors f(X_i) and g(D_j) in the embedding space. Fourth, with the learned g, we can project an unseen relation's description D_l into the embedding space so that unseen classes are as separated as possible for zero-shot prediction. Last, given a new input sentence Z_k, we use its attribute vector f(Z_k) to find the nearest neighbor in the embedding space as the final prediction. In short, the main idea of ZS-BERT is to learn the representations of relations based on their descriptions, and to align them with the representations of input sentences, at the training stage. We then exploit the learned projection functions f and g to predict unseen relations for new sentences, so that zero-shot relation extraction can be achieved. Our contributions can be summarized as below.
• Conceptually, we formulate the zero-shot relation extraction problem by leveraging the text descriptions of seen and unseen relations. To the best of our knowledge, this is the first attempt to directly predict unseen relations under the zero-shot setting by learning representations from relation descriptions.
• Technically, we propose a novel deep learning-based model, ZS-BERT, to tackle the zero-shot relation extraction task. ZS-BERT learns projection functions to align the input sentence with its relation in the embedding space, and is thus capable of predicting relations that were not seen during the training stage.
• Empirically, experiments conducted on two well-known datasets show that ZS-BERT significantly outperforms state-of-the-art methods in predicting unseen relations under the ZSL setting. We also show that ZS-BERT can be quickly adapted and generalized to few-shot learning when a small fraction of labeled data for unseen relations is available.

Related Work
BERT-based Relation Extraction. Contextual representation of words is effective for NLP tasks.
BERT (Devlin et al., 2019) is a pre-trained language model that learns useful contextual word representations. BERT can be readily adopted for supervised or few-shot relation extraction. R-BERT (Wu and He, 2019) utilizes BERT to generate contextualized word representations, along with entity information, to perform supervised relation extraction, and has shown promising results. BERT-PAIR (Gao et al., 2019) makes use of the pre-trained BERT sentence classification model for few-shot relation extraction. By pairing each query sentence with all sentences in the support set, it obtains the similarities between sentences from the pre-trained BERT, and accordingly classifies new classes with a handful of instances. These models aim to solve the general relation extraction task, in which ground-truth labels are more or less available, rather than the zero-shot setting.
Zero-shot Relation Extraction. Relevant studies on zero-shot relation extraction are limited. To the best of our knowledge, the two most similar papers consider zero-shot relation extraction as two different tasks. Levy et al. (2017) treat zero-shot relation extraction as a question answering task. They manually define 10 question templates to represent relations, and generate predictions by training a reading comprehension model to answer which relation satisfies the given sentence and question. However, performing ZSL this way requires human effort to define question templates for unseen relations. Such annotation requiring domain knowledge is infeasible in the wild as more unseen relations arrive. By contrast, the data requirement of ZS-BERT is relatively lightweight: for each relation, we only need one description that expresses its semantic meaning. Relation descriptions are easier to collect, as we may access them from open resources. Under such circumstances, we are free from putting additional effort into annotation.
Obamuyide and Vlachos (2018) formulate ZSL relation extraction as a textual entailment task, which requires the model to predict whether an input sentence containing two entities matches the description of a given relation. They use the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2016) and the Conditioned Inference Model (CIM) (Rocktäschel et al., 2015) as their entailment methods. By pairing each input sentence with every relation description, they train the models to answer whether the paired texts are a contradiction or an entailment. This allows the models to perform inference on pairs of input sentences and unseen relation descriptions, and thus to predict unseen relations accordingly.

Problem Definition
Let Y_s = {y_s^1, ..., y_s^n} and Y_u = {y_u^1, ..., y_u^m} denote the sets of seen and unseen relation labels, respectively, in which n = |Y_s| and m = |Y_u| are the numbers of relations in the two sets. The two sets are disjoint, i.e., Y_s ∩ Y_u = ∅. For each relation label in the seen and unseen sets, we denote the corresponding attribute vectors as a_s^i ∈ R^d and a_u^i ∈ R^d, respectively. We are given a training set with N samples, each consisting of an input sentence X_i, entities e_i1 and e_i2, and the description D_i of the corresponding seen relation y_s^i, denoted as S_i = (X_i, e_i1, e_i2, D_i). Our goal is to train a zero-shot relation extraction model M, i.e., M(S_i) → y_s^i ∈ Y_s, based on the training set, such that using M to predict the unseen relation of a testing instance S', i.e., M(S') → y_u^j ∈ Y_u, achieves performance as good as possible.
We train the model M so that the semantics of the input sentence and the relation description can be aligned. We learn M by minimizing the distance between the two embedding vectors f(X_i) and g(D_i), where the learnable functions f and g project X_i and D_i into the embedding space, respectively. When a new unseen relation y_u^j and its description are in hand, we can project the description of y_u^j into the embedding space by function g. At testing time, a new instance S' = (Z_j, e_j1, e_j2, D_j) arrives, in which Z_j denotes a new sentence containing entities e_j1 and e_j2. We project Z_j into the embedding space by the learned function f, and find the nearest neighboring unseen relation y_u^j, where both Z_j and y_u^j are unknown at the training stage.

The Proposed ZS-BERT Model
We give an overview of ZS-BERT in Figure 2. The input sentence X_i is tokenized and fed into the upper-part ZS-BERT encoder to obtain contextual representations. We extract the representation of [CLS], H_0, and the two entities' representations, H_e^1 and H_e^2, and then concatenate them to derive the sentence embedding â_s^i through a fully-connected layer and an activation operation. In the bottom part, we use Sentence-BERT (Reimers and Gurevych, 2019) to obtain the attribute vector a_s^i for a seen relation by encoding the corresponding relation description D_i. We train ZS-BERT under a multi-task learning structure. One task is to minimize the distance between the attribute vector a_s^i and the sentence embedding â_s^i. The other is to classify the seen relation y_s^i at the training stage, in which a softmax layer that accepts the sentence embedding is used to produce the relation classification probability. At the testing stage, by obtaining the embeddings of new-coming sentences and unseen relations, we use the sentence embedding and nearest neighbor search to obtain the prediction of unseen relations.

Learning Relation Attribute Vectors
For each seen and unseen relation, we learn a representation that depicts the corresponding semantic attributes based on the relation description D_i. Most relations are well-defined and their descriptions are accessible from online open resources such as Wikidata. We feed the relation description D_i into a pre-trained Sentence-BERT encoder (Reimers and Gurevych, 2019) to generate a sentence-level representation as the relation's attribute vector a^i. This procedure is shown in the bottom part of Figure 2. The ground-truth relation of the example is publisher, along with its description "Organization or person responsible for publishing books, games or software." We feed only the relation description to Sentence-BERT in order to obtain the attribute vector. That said, we consider the derived Sentence-BERT to be a projection function g that transforms the relation description D_i into a^i. Note that the relation attribute vectors produced by Sentence-BERT are fixed during model training.
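As a rough illustration of this step, the following numpy sketch derives one fixed attribute vector per relation description by mean-pooling token embeddings, which is the default pooling strategy of Sentence-BERT. The function names and the random token embeddings are illustrative stand-ins, not the authors' code; in ZS-BERT the token embeddings come from a frozen pre-trained Sentence-BERT encoder.

```python
import numpy as np

def mean_pool(token_embeddings):
    # Sentence-level vector as the mean over token vectors (the default
    # pooling strategy of Sentence-BERT).
    return token_embeddings.mean(axis=0)

def build_attribute_vectors(tokenized_descriptions):
    # Stack one fixed attribute vector a^i per relation description.
    # These vectors stay frozen during ZS-BERT training.
    return np.stack([mean_pool(t) for t in tokenized_descriptions])

# Toy stand-in: 2 relation descriptions with 5 and 8 tokens, dimension d = 4.
rng = np.random.default_rng(0)
descs = [rng.normal(size=(5, 4)), rng.normal(size=(8, 4))]
A = build_attribute_vectors(descs)  # shape (2, 4): one row per relation
```

Since descriptions have different token counts, pooling is what makes every relation's attribute vector the same length regardless of description length.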

Input Sentence Encoder
We utilize BERT (Devlin et al., 2019) to generate the contextual representation of each token. We first tokenize the input sentence X_i with WordPiece tokenization (Sennrich et al., 2016). Two special tokens, [CLS] and [SEP], are appended to the first and last positions, respectively. Since the entities themselves matter in relation extraction, we use an entity marker vector, consisting of all zeros except at the indices where the entities appear in the sentence, to indicate the positions of entities e_i1 and e_i2. Let H_0 be the hidden state of the first special token [CLS]. We use a tanh activation function, together with a fully connected layer, to derive the representation vector H'_0:

H'_0 = W_0 (tanh(H_0)) + b_0,

where W_0 and b_0 are learnable weight and bias parameters. We obtain the hidden state vectors of the two entities, H_e^1 and H_e^2, by averaging their respective tokens' hidden state vectors. An entity can be recognized via simple element-wise multiplication between the entity marker vector and the token hidden vectors. Specifically, if an entity e consists of multiple tokens whose indices range from q to r, we average the hidden state vectors and apply an activation operation with a fully connected layer to generate the representation of that entity:

H_e^c = W_e (tanh( (1/(r−q+1)) Σ_{t=q}^{r} H_t )) + b_e, where c = 1, 2.

Note that the representations of the two entities H_e^c (c = 1, 2) share the same parameters W_e and b_e. We then derive the sentence embedding â_s^i by concatenating H'_0, H_e^1, and H_e^2, followed by a hidden layer:

â_s^i = W_1 (tanh(H'_0 ⊕ H_e^1 ⊕ H_e^2)) + b_1,

where W_1 and b_1 are learnable parameters, the dimensionality of â_s^i is d, and ⊕ is the concatenation operator.
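The encoder head above can be sketched in a few lines of numpy. This is a toy illustration under the stated equations, not the authors' implementation: the BERT hidden states are random stand-ins, and all parameter names mirror the notation in the text.

```python
import numpy as np

def tanh_linear(x, W, b):
    # Fully connected layer preceded by a tanh activation: W tanh(x) + b.
    return W @ np.tanh(x) + b

def encode_sentence(H, span1, span2, W0, b0, We, be, W1, b1):
    """Derive the sentence embedding from token hidden states.
    H: (seq_len, h) token hidden states (position 0 is [CLS]).
    span1, span2: (q, r) inclusive token index ranges of the two entities."""
    H0p = tanh_linear(H[0], W0, b0)                       # H'_0 from [CLS]
    He = [tanh_linear(H[q:r + 1].mean(axis=0), We, be)    # shared We, be
          for (q, r) in (span1, span2)]
    concat = np.concatenate([H0p, He[0], He[1]])          # H'_0 ⊕ H^1_e ⊕ H^2_e
    return W1 @ np.tanh(concat) + b1                      # sentence embedding â

h, d = 6, 4                                # toy hidden size and output dim
rng = np.random.default_rng(1)
H = rng.normal(size=(10, h))               # stand-in for BERT hidden states
W0, b0 = rng.normal(size=(h, h)), np.zeros(h)
We, be = rng.normal(size=(h, h)), np.zeros(h)
W1, b1 = rng.normal(size=(d, 3 * h)), np.zeros(d)
a_hat = encode_sentence(H, (2, 4), (6, 6), W0, b0, We, be, W1, b1)
```

Note that both entity spans pass through the same `We`/`be`, reflecting the parameter sharing stated above, and the output dimension d matches the attribute vectors so the two can be compared in one embedding space.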

Model Training
The training of our ZS-BERT model consists of two objectives. The first is to minimize the distance between the input sentence embedding â_s^i and the corresponding relation attribute vector a_s^i (i.e., positive pairs), while ensuring that pairs of input sentence embeddings and mismatched relation attribute vectors (i.e., negative pairs) are farther away from each other. The black arrow connecting a_s^i and â_s^i in Figure 2 visualizes that we take both a_s^i and â_s^i into consideration to achieve this goal. This is also reflected in the first term of our loss function introduced below. The second objective is to maximize the accuracy of relation classification over seen relations using a cross-entropy loss. We transform the sentence embedding, through a softmax layer, into an n-dimensional (n = |Y_s|) classification probability distribution over seen relations:

p(y_s | X_i, θ) = softmax(W* (tanh(â_s^i)) + b*),

where y_s ∈ Y_s is a seen relation, θ denotes the model parameters, W* ∈ R^{n×h}, h is the dimension of the hidden layer, and b* ∈ R^n. Note that for predicting unseen relations under the zero-shot setting, we do not use this probability distribution but the intermediately produced input sentence embedding â_s^i.
The objective function of ZS-BERT is as follows:

L = (1 − α) · (1/N) Σ_{i=1}^{N} max(0, γ − a_s^i · â_s^i + max_{j≠i}(a_s^j · â_s^i)) + α · (1/N) Σ_{i=1}^{N} CrossEntropy(y_s^i, ŷ_s^i),   (2)

where N is the number of samples, a_s^i is the relation attribute vector, â_s^i is the input sentence embedding, and α balances the two terms. The first term in Eq. (2) sets a margin γ > 0 such that the inner product of the positive pair (i.e., a_s^i · â_s^i) must exceed the maximum over negative pairs (i.e., max_{j≠i}(a_s^j · â_s^i)) by more than the pre-decided threshold γ. With the introduction of γ, the loss increases according to the difference between the positive pair and the closest negative pair. This design of the loss function can be viewed as ranking the correct relation attribute higher than the closest incorrect one. In addition, γ also prevents the embedding space from collapsing: if we only minimized the distance of positive pairs with a loss such as mean squared error, the optimization could drive every vector in the embedding space too close to one another. We examine how different γ values affect performance in the experiments. To keep the computational complexity low, we consider only the mismatched relations within a batch as the negative samples j. The second term in Eq. (2) is the commonly used cross-entropy loss, which decreases as the prediction ŷ_s^i is correctly classified. This multi-task structure is expected to refine the input sentence embeddings and simultaneously yield high prediction accuracy on seen relations.
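The two-term loss can be sketched directly from its description: a hinge-style ranking term over in-batch negatives plus a cross-entropy term, balanced by α. This numpy version is an illustration under the stated formulation (the function name and toy inputs are ours, not from the paper's code).

```python
import numpy as np

def zs_bert_loss(A, A_hat, labels, logits, gamma=7.5, alpha=0.4):
    """Toy version of the two-term objective.
    A: (N, d) relation attribute vectors a_s^i (row i matches sentence i).
    A_hat: (N, d) sentence embeddings â_s^i.
    labels: (N,) seen-relation indices; logits: (N, n) classifier scores."""
    N = len(A_hat)
    scores = A_hat @ A.T                       # scores[i, j] = a_s^j · â_s^i
    pos = np.diag(scores)                      # positive-pair inner products
    # Closest in-batch negative: max over mismatched relations j != i.
    neg = np.where(np.eye(N, dtype=bool), -np.inf, scores).max(axis=1)
    ranking = np.maximum(0.0, gamma - pos + neg).mean()
    # Cross entropy over seen relations (log-softmax of the logits).
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(N), labels].mean()
    return (1 - alpha) * ranking + alpha * ce
```

With well-aligned positive pairs (pos exceeding every negative by more than γ) the ranking term vanishes and only the cross entropy remains, which matches the intuition that γ enforces separation rather than exact matching.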

Generating Zero-Shot Prediction
With the trained model, when the descriptions of new relations are in hand, we can generate their attribute vectors a_u^j. As a new input sentence Z_i arrives, we can also produce its sentence embedding â_u^i via:

â_u^i = W_1 (tanh(H'_0 ⊕ H_e^1 ⊕ H_e^2)) + b_1,

where W_1 and b_1 are the learned parameters and H'_0 is the transformed [CLS] representation. The prediction of unseen relations is achieved by nearest neighbor search: for the input sentence embedding â_u^i, we find the nearest attribute vector a_u^j and take the corresponding relation as the predicted unseen relation. This can be depicted by:

C(Z_i) = argmin_j dist(â_u^i, a_u^j),

where the function C returns the predicted relation of the new input sentence Z_i, a_u^j is the j-th attribute vector among all unseen relations in the embedding space, â_u^i is the new input sentence embedding, and dist is a distance function. Here the negative inner product is used as dist, since we aim to take the nearest neighboring relation as the predicted outcome.
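The prediction step reduces to a few lines. The sketch below uses the negative inner product as dist, as stated above; the vectors are illustrative toy values, not trained embeddings.

```python
import numpy as np

def predict_unseen(a_hat, A_unseen):
    # Nearest-neighbor zero-shot prediction: dist is the negative inner
    # product, so argmin(dist) returns the unseen relation whose attribute
    # vector has the largest inner product with the sentence embedding.
    dist = -(A_unseen @ a_hat)
    return int(np.argmin(dist))

A_unseen = np.array([[1.0, 0.0],   # attribute vectors of 3 unseen relations
                     [0.0, 1.0],
                     [0.7, 0.7]])
a_hat = np.array([0.1, 0.9])       # embedding of the new input sentence
pred = predict_unseen(a_hat, A_unseen)   # → 1 (largest inner product: 0.9)
```

Because the search is over unseen relations only, no retraining is needed when new relations appear: encoding their descriptions with g and appending rows to the attribute matrix suffices.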

Evaluation Settings
Datasets. Two datasets are employed: Wiki-ZSL and FewRel (Han et al., 2018). Wiki-ZSL originates from Wiki-KB (Sorokin and Gurevych, 2017) and is generated with distant supervision. That said, in Wiki-ZSL, entities are extracted from complete articles in Wikipedia and linked to the Wikidata knowledge base so that their relations can be obtained. Since 395,976 instances (about 26% of the total data) do not contain relations in the original Wiki-KB data, we discard instances with relation "none". To ensure sufficient data instances for each relation in zero-shot learning, we further filter out relations that appear fewer than 300 times. Eventually, this yields Wiki-ZSL, a subset of Wiki-KB.
On the other hand, FewRel (Han et al., 2018) is compiled in a similar way, collecting entity-relation triplets with sentences, but has been further filtered by crowd workers, which ensures data quality and class balance. Although FewRel was originally proposed for few-shot learning, it is also suitable for zero-shot learning as long as the relation labels in the training and testing data are disjoint. The statistics of the Wiki-KB, Wiki-ZSL, and FewRel datasets are shown in Table 1.
ZSL Settings. We randomly select m relations as unseen ones (m = |Y_u|) and randomly split the whole dataset into training and testing data, while ensuring that these m relations do not appear in the training data, so that Y_s ∩ Y_u = ∅. We repeat the experiment 5 times, with random selection of the m relations and random training-testing splitting, and report the average results. We also vary m to examine how performance is affected. We use Precision (P), Recall (R), and F1 as the evaluation metrics. As for the hyperparameters and configuration of ZS-BERT, we use Adam (Kingma and Ba, 2014) as the optimizer, in which the initial learning rate is 5e-6, the hidden layer size is 768, the dimension of the input sentence embedding and attribute vector is 1024, the batch size is 4, γ = 7.5, and α = 0.4.
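A minimal sketch of such a split, under the simplifying assumption that the test set consists of the instances whose relations were drawn as unseen (the function name and toy data are illustrative, not the authors' preprocessing code):

```python
import random

def zero_shot_split(instances, m, seed=0):
    """Pick m relations as unseen (Y_u) and route every instance whose
    relation is unseen to the test set, guaranteeing Y_s ∩ Y_u = ∅.
    instances: list of (sentence, relation_label) pairs."""
    rng = random.Random(seed)
    relations = sorted({r for _, r in instances})
    unseen = set(rng.sample(relations, m))
    train = [(s, r) for s, r in instances if r not in unseen]
    test = [(s, r) for s, r in instances if r in unseen]
    return train, test, unseen

data = [("s1", "member of"), ("s2", "publisher"), ("s3", "main subject"),
        ("s4", "publisher"), ("s5", "owned by")]
train, test, unseen = zero_shot_split(data, m=2)
```

Repeating this with different seeds reproduces the paper's protocol of averaging over 5 random selections of the m unseen relations.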
Competing Methods. The compared methods consist of two categories: supervised relation extraction (SRE) models and text entailment models. The former includes CNN-based SRE (Zeng et al., 2014), Bi-LSTM SRE (Zhang et al., 2015), Attentional Bi-LSTM SRE (Zhou et al., 2016), and R-BERT (Wu and He, 2019). These SRE models use different ways to extract features from the input sentences and perform prediction. They have achieved great performance with full supervision but cannot carry out zero-shot prediction. To make them capable of zero-shot prediction, and to have a fair comparison, instead of using a softmax layer to output a probability vector whose dimension equals the number of seen relations, we change the last hidden layer of each SRE competing method to a fully-connected layer with a tanh activation function, whose embedding dimension d is the same as ZS-BERT's. Nearest neighbor search is then applied over input sentence embeddings and relation attribute vectors to generate zero-shot predictions.
Two text entailment models, ESIM (Chen et al., 2016) and CIM (Rocktäschel et al., 2015), are also used for comparison. These two models follow a well-known implementation (Obamuyide and Vlachos, 2018) that formulates zero-shot relation extraction as a text entailment task, which accepts a sentence and a relation description as input and outputs a binary label indicating whether they are semantically matched. ESIM uses a bi-LSTM (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) to encode the two input sequences, passes them through the local inference model, and produces the prediction via a softmax layer. CIM replaces the bi-LSTM block with a conditional version, i.e., the representation of the sentence is conditioned on its relation description. Note that although other zero-shot relation extraction approaches exist, such as that of Levy et al. (2017), their formulation of the ZSL task and their data requirements are quite different from our present work. Specifically, their method requires pre-defined question templates, whereas our model does not. Hence it would be unfair to compare with those approaches.

Experimental Results
Main Results. The experimental results with varying numbers m of unseen relations are shown in Table 2. First, the proposed ZS-BERT steadily outperforms the competing methods on both datasets across different numbers of unseen relations. The superiority of ZS-BERT is most significant when m = 5. Such results not only validate the effectiveness of leveraging relation descriptions, but also demonstrate the usefulness of the proposed multi-task learning structure, which better encodes the semantics of input sentences and keeps relation attribute vectors differentiated from each other. Second, although the text entailment models ESIM and CIM perform well among the competing methods, their performance is still clearly lower than ZS-BERT's. The reason is that their approaches cannot precisely distinguish the semantics of input sentences and relation descriptions in the embedding space. Third, we also find that the improvement of ZS-BERT is larger when m is smaller; increasing m weakens the superiority of ZS-BERT. It is straightforward that as the number of unseen relations increases, predicting the right relation becomes more difficult since there are more possible choices. We also speculate an underlying reason: although ZS-BERT can effectively capture the latent attributes of each relation, relations themselves can be semantically similar to one another to some extent, and more unseen relations increase the possibility of predicting a relation that is semantically close but actually wrong. To verify this conjecture, we give an example in the case study.

Hyperparameter Sensitivity. We examine how the primary hyperparameters, the margin γ and the balance coefficient α in Eq. (2), affect the performance of ZS-BERT. Fixing m = 10 and varying γ and α, the results in terms of F1 scores on the two datasets are shown in Figure 3.
It is noteworthy that γ does have an impact on performance, since it determines the condition under which the loss value increases, based on the difference between the positive pair and the negative pair. Nevertheless, higher values of γ do not always lead to better performance. This is reasonable: when γ is too low, the distance between positive and negative pairs is not pushed far enough apart, so nearest neighbor search is more likely to reach the wrong relations. In contrast, when γ is too high, it is hard for training to converge to a point where the distance between relations is that large. We would suggest setting γ = 7.5 to obtain satisfying results across datasets. As for the balance coefficient α in the loss function, we find that α = 0.4 achieves the best performance, indicating that the margin loss plays the more significant role in training ZS-BERT. Also notice that when α = 1.0, the performance drops dramatically, showing that the margin loss is essential to our model. This is reasonable as well: since our model relies on the quality of the embeddings, relying entirely on the cross-entropy loss leads to failure of zero-shot prediction. The better the separation between the embeddings of different relations, the more likely our model is to generate accurate zero-shot predictions. In addition, since nearest neighbor search is performed to generate the zero-shot prediction, the choice of distance function dist() can also be treated as a hyperparameter. Applying the inner product, Euclidean distance, and cosine similarity as dist() in ZS-BERT, we report their F1 scores with different m on the two datasets in the right of Figure 4. The results indicate that the inner product is a proper distance function for zero-shot relation extraction with ZS-BERT.
Few-shot Prediction. To further understand the capability of ZS-BERT, we conduct a few-shot prediction experiment. Following the setting of an existing work (Obamuyide and Vlachos, 2018), we make a small fraction of unseen data instances available at the training stage. That said, for each originally unseen relation, we move a small fraction of its sentences, along with the relation description, from the testing stage to the training stage. Varying this fraction on the x-axis, we report the results of few-shot prediction in Figure 4. We find that ZS-BERT reaches about 80% F1 score with only 2% of unseen instances as supervision.
Such results demonstrate the ability of ZS-BERT to recognize rare samples and its capability for few-shot learning. As expected, the more instances of unseen relations available at the training stage, the higher the F1 score. When the fraction equals 10%, ZS-BERT can even achieve a 90% F1 score on the Wiki-ZSL dataset.

Case Study
We categorize four types of incorrectly predicted unseen relations for the analysis: (1) The predicted relation is not precise for the targeted entity pair but may be suitable for other entities that also appear in the sentence.
(2) The true relation is not appropriate because it comes from distant supervision.
(3) The predicted relation is ambiguous or is a synonym of other relations. (4) The relation is wrongly predicted but should have been correctly classified. For each of these four types, we provide an example in Table 3. In case (1), the targeted entities are Anaconda and The Pinkprint, and ZS-BERT yields publisher as the prediction, which would actually be correct if the targeted entities were Anaconda and Minaj. This shows ZS-BERT is able to infer a plausible relation for entities in the given sentence, but can sometimes be misled by non-targeted entities even though we use an entity mask to indicate the targeted entities. Case (2) shows the noise originating from distant labeling. That is, even a human cannot identify that the relation between Heaven and Hell is opposite of in this specific sentence; they just happen to appear together, and their relation recorded in Wikidata is opposite of. In case (3), the predicted unseen relation is manufacturer, while the ground truth is publisher. Both manufacturer and publisher describe someone making or producing something, although their domains are slightly different. This exhibits the capability of ZS-BERT to identify the input sentence with an abstract attribute, because relations possessing similar semantics have similar attribute vectors in the embedding space. Finally, in case (4), the model gives a wrong prediction that is not even close or related, which may be due to noise or information loss when transferring knowledge between relations.
Among these four groups, we are especially interested in case (3), since the semantic similarity between relations in the embedding space greatly impacts performance. We select five semantically distant relations, and another five relations among which two or three possess similar semantics, to inspect their distributions in the embedding space. We feed sentences with these relations and generate their embeddings using ZS-BERT and R-BERT (Wu and He, 2019) for comparison. We choose R-BERT because it is the strongest embedding-based competing method for zero-shot prediction by nearest neighbor search. Note that since the predictions by the text entailment-based models, ESIM and CIM, neither resort to similarity search nor directly predict unseen relations in one pass, we cannot include them in this analysis. We visualize the embedding space by t-SNE (Maaten and Hinton, 2008), as shown in Figure 5. We find that when the relations are somewhat similar in meaning (Figure 5(a),(c)), some of the data points mingle across clusters, as they indeed have close semantic relationships. Take subsidiary and owned by as examples: "Company A is a subsidiary of company B" and "Company A is owned by company B" refer to the same thing. This happens with both ZS-BERT and R-BERT, but to different extents; the embeddings produced by R-BERT are clearly more tangled. We also plot the other five relations, among which there is no ambiguity (Figure 5(b),(d)). Their embeddings are clearly more separated between different relations, and the embeddings generated by ZS-BERT lead to larger inter-relation distances. This again exhibits the usefulness of the proposed ranking loss and multi-task learning structure.

Conclusions
In this work, we present a novel and effective model, ZS-BERT, to tackle the zero-shot relation extraction task. With its multi-task learning structure and high-quality contextual representation learning, ZS-BERT not only embeds input sentences well in the embedding space but also substantially improves performance. We have also conducted extensive experiments to study different aspects of ZS-BERT, from hyperparameter sensitivity to a case study, and eventually show that ZS-BERT steadily outperforms existing relation extraction models under zero-shot settings. Furthermore, learning effective embeddings for relations may also benefit semi-supervised or few-shot learning by utilizing prototypes of relations as auxiliary information.