Distantly Supervised Relation Extraction with Sentence Reconstruction and Knowledge Base Priors

We propose a multi-task, probabilistic approach to facilitate distantly supervised relation extraction by bringing closer the representations of sentences that contain the same Knowledge Base pairs. To achieve this, we bias the latent space of sentences via a Variational Autoencoder (VAE) that is trained jointly with a relation classifier. The latent code guides the pair representations and influences sentence reconstruction. Experimental results on two datasets created via distant supervision indicate that multi-task learning results in performance benefits. Additional exploration of employing Knowledge Base priors in the VAE reveals that the sentence space can be shifted towards that of the Knowledge Base, offering interpretability and further improving results.


Introduction
Distant supervision (DS) is a setting where information from existing, structured knowledge, such as Knowledge Bases (KB), is exploited to automatically annotate raw data. For the task of relation extraction, this setting was popularised by Mintz et al. (2009). Sentences containing a pair of interest were annotated as positive instances of a relation if and only if the pair was found to share this relation in the KB. However, due to the strictness of this assumption, relaxations were proposed, such as the at-least-one assumption introduced by Riedel et al. (2010): instead of assuming that all sentences in which a known related pair appears express the relationship, we assume that at least one of these sentences (collectively called a bag of sentences) expresses the relationship. Figure 1 shows example bags for two entity pairs.

Figure 1: Example of the bag-level setting in distantly supervised relation extraction and the main idea of our approach. Sentences are adapted from the NYT10 dataset (Riedel et al., 2010).
The usefulness of distantly supervised relation extraction (DSRE) is reflected in facilitating automatic data annotation, as well as the usage of such data to train models for KB population (Ji and Grishman, 2011). However, DSRE suffers from noisy instances, long-tail relations and unbalanced bag sizes. Typical noise reduction methods have focused on using attention (Lin et al., 2016; Ye and Ling, 2019) or reinforcement learning (Qin et al., 2018b). For long-tail relations, relation type hierarchies and entity descriptors have been proposed (She et al., 2018; Hu et al., 2019), while limited bag sizes are usually tackled through incorporation of external data (Beltagy et al., 2019), information from KBs (Vashishth et al., 2018) or pre-trained language models (Alt et al., 2019). Our goal is not to investigate noise reduction, since it has already been widely addressed. Instead, we aim to propose a more general framework that can be easily combined with existing noise reduction methods or pre-trained language models.
Methods that combine information from Knowledge Bases in the form of pre-trained Knowledge Graph (KG) embeddings have been particularly effective in DSRE. This is expected since such embeddings capture broad associations between entities, thus assisting the detection of facts. Existing approaches either encourage explicit agreement between sentence- and KB-level classification decisions (Xu and Barbosa, 2019), minimise the distance between KB pair and sentence embeddings, or directly incorporate KB embeddings into the training process in the form of attention queries (Han et al., 2018; She et al., 2018; Hu et al., 2019). Although these signals are beneficial, direct usage of KB embeddings in the model often requires explicit KB representations of entities and relations, leading to poor generalisation to unseen examples. In addition, forcing decisions between KB and text to be the same makes the connection between context-agnostic (from the KB) and context-aware (from sentences) pairs rigid, as they often express different things.
Variational Autoencoders (VAEs) (Kingma and Welling, 2013) are latent variable encoder-decoder models that parameterise posterior distributions using neural networks. As such, they learn an effective latent space which can be easily manipulated. Sentence reconstruction via encoder-decoder networks helps sentence expressivity by learning semantic and syntactic similarities in the sentence space. On the other hand, signals from a KB can assist detection of factual relations. We aim to combine these two using a VAE together with a bag-level relation classifier. We then force each sentence's latent code to be close either to the Normal distribution (Bowman et al., 2016) or to a prior distribution obtained from KB embeddings. This latent code is incorporated into sentence representations for classification and is responsible for sentence reconstruction. As it is influenced by the prior, we essentially inject signals from the KB into the target task. In addition, sentence reconstruction learns to preserve elements that are useful for the bag relation. To the best of our knowledge, this is the first attempt to combine a VAE with a bag-level classifier for DSRE.
Finally, there are methods for DSRE that follow a rather flawed evaluation setting, where several test pairs are included in the training set. Under this setting, the generalisability of such methods can be exaggerated. We test these approaches on data without overlaps and find that their performance is severely degraded. With this comparison, we aim to promote evaluation on the amended version of existing DSRE data that prevents memorisation of test pair relations. Our contributions are threefold:
• We propose a multi-task learning setting for DSRE. Our results suggest that combining bag classification with bag reconstruction improves the target task.
• We propose a probabilistic model that makes the space of sentence representations resemble that of a KB, promoting interpretability.
• We compare existing approaches on data without train-test pair overlaps to enforce fairer comparison between models.
2 Proposed Approach

Task Description
In DSRE, the bag setting is typically adopted. A model's input is a pair of named entities e_1, e_2 (mapped to a Knowledge Base) and a bag of sentences B = {s_1, s_2, ..., s_n} in which the pair occurs, retrieved from a raw corpus. The goal of the task is to identify the relation(s), from a predefined set R, that the two entities share, based on the sentences in the bag B. Since each pair can share multiple relations at the same time, the task is considered a multi-label classification problem.
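For concreteness, the following is a minimal sketch of how a single bag-level instance could be stored; the class and field names are hypothetical and do not correspond to the released datasets.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BagInstance:
    """One bag-level instance: an entity pair and all sentences mentioning it (hypothetical schema)."""
    head: str                                            # KB identifier of the first entity
    tail: str                                            # KB identifier of the second entity
    sentences: List[str]                                 # the bag B of sentences containing the pair
    relations: List[int] = field(default_factory=list)  # indices into R; multi-label

# Example bag with two sentences and one known KB relation.
bag = BagInstance(
    head="E1_KB_ID", tail="E2_KB_ID",
    sentences=["sentence mentioning both entities .",
               "another sentence mentioning both entities ."],
    relations=[3],
)
```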

Overall Framework
Our proposed approach is illustrated in Figure 2. The main goal is to create a joint learning setting where a bag of sentences is encoded and reconstructed and, at the same time, the bag representation is used to predict relation(s) shared between two given entities. The architecture receives as input a bag of sentences for a given pair and outputs (i) predicted relations for the pair and (ii) the reconstructed sentences in the bag. The two outputs are produced by two branches: the left branch, corresponding to bag classification, and the right branch, corresponding to bag reconstruction. Both branches start from a shared encoder and they communicate via the latent code of a VAE that is responsible for the information used in the representation and reconstruction of each sentence in the bag. Naturally, both branches have an effect on one another during training.
Autoencoders are encoder-decoder networks that compress and then reconstruct an input (e.g. a sentence). They learn an informative representation of the input as a dense, lower-dimensional feature vector, namely the latent code. This intermediate representation is then used to fully reconstruct the original input. Variational Autoencoders (VAEs) (Kingma and Welling, 2013) offer better generalisation capabilities compared to standard autoencoders by sampling the features of the latent code from a prior distribution that we assume to be similar to the distribution of the data.

Encoder
We form the input of the network similarly to previous work. Each sentence in the input bag is transformed into a sequence of vectors. Words and positions are mapped into real-valued vectors via a word embedding layer E^(w) and position embedding layers E^(p), similarly to Lin et al. (2016). The concatenation of the word (w) and position (p) embeddings, [w_t; p_t], forms the representation of each word t in the input sentence. A Bidirectional Long Short-Term Memory (BiLSTM) network (Hochreiter and Schmidhuber, 1997) acts as the encoder, producing contextualised representations for each word.
The representations of the left-to-right and right-to-left passes of the BiLSTM are summed to produce the output representation o_t of each word t in the input sentence. We use the last hidden state h and cell state c of each sentence s to construct the parameters of a posterior distribution q_φ(z|s) using two linear layers,

μ = W_μ [h; c] + b_μ ,    log σ² = W_σ [h; c] + b_σ ,    (1)

where μ and σ² are the parameters of a multivariate Gaussian representing the feature space of the sentence. This distribution is approximated via a latent code z, using the reparameterisation trick (Kingma and Welling, 2013) to enable back-propagation, as follows:

z = μ + σ ⊙ ε ,    ε ∼ N(0, I).    (2)

This trick essentially forms the posterior as a function of the Normal distribution.
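A minimal PyTorch sketch of this encoding step is given below; the dimensions, layer names and the exact way the BiLSTM states are combined are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=256, latent_dim=50):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        # two linear layers parameterise the Gaussian posterior q_phi(z|s)
        self.to_mu = nn.Linear(4 * hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(4 * hidden_dim, latent_dim)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) word+position embeddings
        out, (h, c) = self.bilstm(x)
        # sum the two directions to get one contextualised vector per word
        out = out[..., :out.size(-1) // 2] + out[..., out.size(-1) // 2:]
        # last hidden and cell states of both directions summarise the sentence
        summary = torch.cat([h[0], h[1], c[0], c[1]], dim=-1)
        mu, logvar = self.to_mu(summary), self.to_logvar(summary)
        # reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return out, z, mu, logvar
```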

Decoder
The decoder network is a uni-directional LSTM network that reconstructs each sentence in the input bag. The input is formed in two steps. Firstly, the latent code z is given as the initial hidden state h_0 of the decoder via a linear layer transformation. Secondly, the same latent code is concatenated with the representation of each word w_t in the input sequence of the decoder.
A percentage of the words in the decoder's input is randomly replaced by the UNK token, forcing the decoder to rely on the latent code for word prediction, similarly to Bowman et al. (2016).
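A sketch of the decoder input formation and word dropout in PyTorch; the dropout rate, dimensions and the use of tanh on the initial state are assumptions for illustration, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class SentenceDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, latent_dim=50, hidden_dim=256,
                 unk_id=1, word_dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.z_to_h0 = nn.Linear(latent_dim, hidden_dim)   # latent code -> initial hidden state
        self.lstm = nn.LSTM(emb_dim + latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.unk_id, self.word_dropout = unk_id, word_dropout

    def forward(self, tokens, z):
        # tokens: (batch, seq_len) decoder input ids; z: (batch, latent_dim)
        if self.training and self.word_dropout > 0:
            # randomly replace a fraction of input words with UNK so the decoder
            # cannot rely on teacher forcing alone and must use the latent code
            mask = torch.rand_like(tokens, dtype=torch.float) < self.word_dropout
            tokens = tokens.masked_fill(mask, self.unk_id)
        emb = self.embed(tokens)
        # concatenate z with every word embedding of the decoder input
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)
        h0 = torch.tanh(self.z_to_h0(z)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(torch.cat([emb, z_rep], dim=-1), (h0, c0))
        return self.out(out)   # logits over the vocabulary at each position
```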

Learning
The optimisation objective of the VAE, namely the Evidence Lower BOund (ELBO), is the combination of two losses. The first is the reconstruction loss, i.e. the cross entropy between the actual sentence s and its reconstruction ŝ. The second is the Kullback-Leibler divergence (D_KL) between a prior distribution p_θ(z), which the latent code is assumed to follow, and the posterior q_φ(z|s), which the encoder produces:

L_ELBO = L_rec(s, ŝ) + β D_KL( q_φ(z|s) || p_θ(z) ).

The first loss is responsible for the accurate reconstruction of each word in the input, while the second acts as a regularisation term that encourages the posterior of each sentence to be close to the prior. The additional parameter β in front of the D_KL is typically introduced to overcome KL vanishing, a phenomenon where the posterior collapses to the prior and the decoder effectively ignores the latent code (Bowman et al., 2016).
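A sketch of the ELBO computation together with a logistic annealing schedule for β (the use of logistic KL annealing is stated in the training details; the schedule constants below are illustrative).

```python
import math
import torch
import torch.nn.functional as F

def elbo_loss(logits, targets, mu, logvar, beta, pad_id=0):
    """Reconstruction loss + beta-weighted KL against a standard Normal prior (sketch)."""
    # cross entropy between the reconstructed and the actual sentence
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), ignore_index=pad_id, reduction="sum")
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

def beta_schedule(step, k=0.0025, x0=2500):
    """Logistic KL annealing from ~0 towards 1; k and x0 are illustrative values."""
    return 1.0 / (1.0 + math.exp(-k * (step - x0)))
```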

Bag Classification
Moving on to the left branch of Figure 2, in order to represent a bag we first need to represent each sentence inside it. We realise this using information produced by the VAE as follows.

Sentence Representation
Given the contextualised output of the encoder o, we construct entity representations e_1 and e_2 for a given pair in a sentence by averaging the representations of the words included in each entity mention,

e_i = (1 / |e_i|) Σ_{t ∈ e_i} o_t ,

where |e_i| corresponds to the number of words inside the mention span of entity e_i. A sentence representation s is then formed by concatenating the entity representations with the latent code z of the sentence produced by the VAE, as described in Equation (2):

s = [z; e_1; e_2].
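In code, the sentence representation could be assembled as below; this is a sketch that simplifies batching and assumes entity spans are given as (start, end) token indices.

```python
import torch

def sentence_representation(word_out, e1_span, e2_span, z):
    """s = [z; e1; e2]: entity vectors averaged over their mention spans (sketch).

    word_out: (batch, seq_len, hidden) contextualised encoder outputs o
    e1_span, e2_span: (start, end) token indices of the two entity mentions
    z: (batch, latent_dim) latent code of the sentence
    """
    e1 = word_out[:, e1_span[0]:e1_span[1]].mean(dim=1)   # (batch, hidden)
    e2 = word_out[:, e2_span[0]:e2_span[1]].mean(dim=1)   # (batch, hidden)
    return torch.cat([z, e1, e2], dim=-1)
```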

Bag Representation
In order to form a unified bag representation B for a pair, we adopt the popular selective attention approach introduced by Lin et al. (2016). In particular, we first map relations into real-valued vectors via a relation embedding layer E^(r). Each relation embedding is then used as a query over the sentences in the bag, resulting in |R| bag representations for each pair,

a_r^(s_i) = exp(s_i · r) / Σ_{s_j ∈ B} exp(s_j · r) ,    B_r = Σ_{s_i ∈ B} a_r^(s_i) s_i ,

where r is the embedding associated with relation r, s_i is the representation of sentence s_i ∈ B, a_r^(s_i) is the weight of sentence s_i with respect to relation r and B_r is the final bag representation for relation r.
During classification, we select the probability of predicting a relation category r using the bag representation that was constructed when the respective relation embedding r was the query,

p(r|B) = σ(W_c B_r + b_c)_r ,

and binary cross entropy loss is applied on the resulting predictions,

L_BCE = − Σ_{r ∈ R} [ y_r log p(r|B) + (1 − y_r) log(1 − p(r|B)) ] ,

where W_c and b_c are learned parameters of the classifier, σ is the sigmoid activation function, p(r|B) is the probability associated with relation r given a bag B and y_r is the ground truth for this relation, with possible values 1 or 0.
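A compact sketch of selective attention and the per-relation classification step for a single bag; the dot-product scoring and the way parameters are shared are assumptions in the spirit of Lin et al. (2016), not the authors' exact code.

```python
import torch
import torch.nn as nn

class SelectiveAttentionClassifier(nn.Module):
    def __init__(self, sent_dim, num_rel):
        super().__init__()
        self.rel_embed = nn.Embedding(num_rel, sent_dim)   # relation query vectors r
        self.classifier = nn.Linear(sent_dim, num_rel)     # W_c, b_c

    def forward(self, sent_reps):
        # sent_reps: (bag_size, sent_dim) representations of all sentences in one bag
        scores = sent_reps @ self.rel_embed.weight.t()      # (bag_size, num_rel)
        alpha = torch.softmax(scores, dim=0)                # attention weights per relation query
        bag_reps = alpha.t() @ sent_reps                    # (num_rel, sent_dim): one B_r per relation
        # for each relation r, score B_r with the r-th row of the classifier
        logits = (self.classifier.weight * bag_reps).sum(-1) + self.classifier.bias
        return torch.sigmoid(logits)                        # p(r|B) for every relation r

# training: binary cross entropy against the multi-hot label vector y, e.g.
# loss = torch.nn.functional.binary_cross_entropy(probs, y)
```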

Knowledge Base Priors
In the scenario where no KB information is incorporated into the model, we simply assume that the prior distribution of the latent code p_θ(z) is a standard Gaussian with zero mean and identity covariance, N(0, I).
To integrate information about the nature of triples into the bag-level classifier, we create KB-guided priors as an alternative to the standard Gaussian. In particular, we train a link prediction model, such as TransE, on a subset of the Knowledge Graph that was used to originally create the dataset. Using the link prediction model, we obtain entity embeddings for the subset KB. A KB-guided prior can thus be constructed for each pair, as another Gaussian distribution with mean value equal to the KB pair representation and covariance equal to the identity matrix,

p_θ(z) = N(μ_KB, I),    (8)

where μ_KB is the pair representation constructed from the vectors e_h and e_t of the entities e_head and e_tail, as obtained by training a link prediction algorithm on the KB.
The link prediction algorithm is trained so that representations of pairs expressing the same relations lie close in space. Hence, by using KB priors we try to force the distribution of sentences in a bag to follow the distribution of the pair in the KB. If one of the pair entities does not exist in the KB subset, the mean vector of the pair's prior is set to zero, resulting in a standard Gaussian prior. Finally, KB priors are only used during training. Consequently, the model does not use any direct KB information during inference.
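Relative to the standard Gaussian prior, the only change is the mean of the prior in the KL term, as sketched below; how μ_KB is assembled from the TransE entity vectors is left abstract here, since only the entity vectors e_h and e_t are specified above.

```python
import torch

def kl_to_kb_prior(mu, logvar, mu_kb):
    """KL( N(mu, diag(sigma^2)) || N(mu_kb, I) ), used only at training time (sketch).

    mu_kb is the KB pair representation built from pre-trained TransE entity vectors;
    it is a zero vector (i.e. a standard Normal prior) when an entity is missing from
    the KB subset.
    """
    var = logvar.exp()
    return 0.5 * torch.sum(var + (mu - mu_kb).pow(2) - 1.0 - logvar)
```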

Training Objective
We train bag classification and sentence reconstruction jointly. The final optimisation objective is formed as

L = λ L_BCE + (1 − λ) L_ELBO ,

where λ corresponds to a weight in [0, 1]. We weigh the classification loss more than the ELBO to allow the model to better fit the target task.
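The joint objective then reduces to a weighted sum of the two losses; the λ value below is purely illustrative, as the paper only states that λ lies in [0, 1] and that classification is weighted more heavily.

```python
def joint_loss(bce_loss, rec_loss, kl_loss, beta, lam=0.75):
    """lambda-weighted combination of classification and ELBO terms (sketch).

    lam is an assumed value for illustration; beta follows the KL annealing schedule
    (or is fixed when KB priors are used).
    """
    elbo = rec_loss + beta * kl_loss
    return lam * bce_loss + (1.0 - lam) * elbo
```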
3 Experimental Settings

Datasets
We experiment with the following two datasets.
NYT10. The widely used New York Times dataset (Riedel et al., 2010) contains 53 relation categories, including a negative relation (NA) indicating no relation between the two entities. We use the version of the data provided by the OpenNRE framework (Han et al., 2019), which removes overlapping pairs between the train and test data. The dataset statistics are shown in Table 1. Additional information can be found in Appendix A.1.
For the choice of the Knowledge Base, we use a subset of Freebase that includes the 3 million entities with the most connections, similar to Xu and Barbosa (2019). For all pairs appearing in the test set of NYT10 (both positive and negative), we remove all links in the subset of Freebase to ensure that we do not memorise any relations between them. The resulting KB contains approximately 24 million triples.
WIKIDISTANT. The WikiDistant dataset is almost double the size of NYT10 and contains 454 target relation categories, including the negative relation. It was recently introduced by Han et al. (2020) as a cleaner and better-structured bag-level dataset compared to NYT10, with fewer negative instances.
For the Knowledge Base, we use the version of Wikidata provided by Wang et al. (2019b) (in particular the transductive split), containing approximately 5 million entities. Similarly to Freebase, we remove all links between pairs in the test set from the resulting KB, which contains approximately 20 million triples after pruning.

Evaluation Metrics
Following prior work, we consider the Precision-Recall Area Under the Curve (AUC) as the primary metric for both datasets. We additionally report Precision at N (P@N), that measures the percentage of correct classifications for the top N most confident predictions.
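Both metrics can be computed from the flattened (pair, relation) prediction scores, e.g. with scikit-learn, as in the sketch below.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def pr_auc_and_p_at_n(scores, labels, n=100):
    """PR-AUC over all (pair, relation) predictions and P@N for the top-n scores (sketch)."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    pr_auc = auc(recall, precision)
    order = np.argsort(-np.asarray(scores))
    p_at_n = np.asarray(labels)[order][:n].mean()
    return pr_auc, p_at_n
```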

Training
To obtain the KB priors, we train TransE on the subsets of Freebase and Wikidata using the implementation of the DGL-KE toolkit (Zheng et al., 2020) for 500K steps and with a dimensionality equal to the dimension of the latent code. The main model was implemented in PyTorch (Paszke et al., 2019). We use the Adam (Kingma and Ba, 2014) optimiser with learning rate 0.001. KL logistic annealing is incorporated only in the case where the prior is the Normal distribution, to avoid KL vanishing (Bowman et al., 2016). Early stopping is used to determine the best epoch based on the AUC score on the validation set. Words in the vocabulary are initialised with pre-trained, 50-dimensional GloVe embeddings (Pennington et al., 2014). We limit the vocabulary size to the top 40K and 50K most frequent words for NYT10 and WIKIDISTANT, respectively. To enable fast training, we use Adaptive Softmax (Grave et al., 2017). The maximum sentence length is restricted to 50 words for NYT10 and 30 words for WIKIDISTANT. Each bag in the training set is allowed to contain a maximum of 500 randomly selected sentences. For prediction on the validation and test sets, all sentences (with full length) are used.

Baselines
In this work we compare with various models applied on the NYT10 dataset. PCNN-ATT (Lin et al., 2016) is one of the first neural models that uses a PCNN encoder and selective attention over the instances in a bag, similar to our approach. RESIDE (Vashishth et al., 2018) utilises syntactic, entity and relation type information as additional input to the network to assist classification. DISTRE (Alt et al., 2019) fine-tunes a pre-trained Transformer language model for bag-level classification.
We report results on both the filtered data (520K), which do not contain train-test pair overlaps, as well as the non-filtered version (570K), to better compare with prior work (more information about the two versions can be found in Appendix A.1). With the exception of DISTRE, all prior approaches were originally applied on the 570K version. Hence, performance of prior work on the 520K version corresponds to re-runs of existing implementations (via their open-source code). For the non-filtered version, results are taken from the respective publications; for PCNN-ATT we re-run both the 520K and the 570K versions using the OpenNRE toolkit.
For the WIKIDISTANT dataset, we compare with the PCNN-ATT model, as this is the only model currently applied on this data (Han et al., 2020). We also compare our proposed approach with two additional baselines. The first baseline model (Baseline) does not use the VAE component at all; in this case the sentence representation is simply created using the last hidden state of the encoder, s = [h; e_1; e_2], instead of the latent code. The second model (p_θ(z) ∼ N(0, I)) incorporates reconstruction with a standard Gaussian prior, while the final model (p_θ(z) ∼ N(μ_KB, I)) corresponds to our proposed model with KB priors.

Results
The results of the proposed approach versus existing methods on the NYT10 dataset are shown in Table 2. The addition of reconstruction further improves performance by 3.6 percentage points (pp), while KB priors offer an additional 4.3pp. Compared with DISTRE, our model achieves comparable performance, even though it does not use a pre-trained language model. As we observe from the precision-recall curve in Figure 3, our model is competitive with DISTRE for up to 35% of the recall range, but for the tail of the distribution a pre-trained language model has better results. This can be attributed to the world knowledge it has obtained via pre-training, which is far more extensive than a KB subset. Overall, for the reduced version of the dataset, the VAE with KB-guided priors surpasses all previous methods across the entire recall range. For the 570K version, our model is superior to other approaches in terms of AUC score, even for our baseline. We speculate this is because we incorporate argument representations into the bag representation. As a result, the model learns strong argument representations for pairs that overlap between the training and test sets.
Regarding the results on the WIKIDISTANT dataset in Table 3, once again we observe that reconstruction helps improve performance. However, it appears that KB priors have a negative effect. We find that in the NYT10 dataset 96% of the training pairs are associated with a prior, whereas this portion is only 72% for WIKIDISTANT. The reason for this discrepancy could be the reduced coverage, which potentially causes confusion between the two signals (if a pair does not have a KB prior, it is assigned the standard Normal prior instead). To test this hypothesis, we re-run our models on a subset of the training data, removing pairs that do not have a KB prior. As observed in the second half of Table 3, priors do have a positive impact under this setting, indicating the importance of high coverage of prior-associated pairs. We use this setting for the remainder of the paper.

Analysis
We then check whether the latent space has indeed learned some information about the KB triples, by visualising the t-SNE plots of the priors, i.e. the μ_KB vectors resulting from training TransE (Equation (8)), and the posteriors, i.e. the μ vectors resulting from the VAE encoder (Equation (1)). Figure 4a illustrates the space of the priors in Freebase for the most frequent relation categories in the NYT10 training set (we plot t-SNEs for the training set instead of the validation/test sets because the WIKIDISTANT validation set contains too few pairs belonging to the top-10 categories; the NYT10 validation set t-SNE can be found in Appendix A.5). As can be observed, the separation is obvious for most categories, with a few overlaps. Relations place of birth, place lived and place of death appear to reside in the same region. This is expected, as these relations can be shared by a pair simultaneously. Another overlap is identified for contains, administrative divisions and capital. Again, these are similar relations found between certain entity types (e.g. location, province, city). Figure 4b shows the t-SNE plot for a collection of latent vectors (random selection of 2 sentences in a positive bag). The space is very similar to that of the KB and the same overlapping regions are clearly observed. A difference is that it appears to be less compact, as not all sentences in a bag express the exact same relation. Similar observations hold for the Wikidata priors, as shown in Figure 4c. Looking at the space of the posteriors, we can see that although separation is achieved for most categories, there are two relations that are not so well separated in the posterior space. We find that has part (cyan) and part of (orange) are opposite relations, which TransE can effectively learn thanks to its properties. However, the model appears unable to fully separate the two. These relations are expressed in the same manner, with only the order of the arguments changed. As there is no restriction regarding the argument order in our model, directionality can sometimes be an issue.
Finally, in order to check how the prior constraints affect sentence reconstruction, we illustrate reconstructions of sentences from the validation sets of NYT10 in Table 4 and WIKIDISTANT in Table 5. In detail, we give the input sentence to the network and employ greedy decoding using either the mean of the latent code or a random sample.
Manual inspection of the reconstructions reveals that KB priors generate longer sentences than the Normal prior by repeating several words (especially UNK). In fact, the VAE with KB priors fails to generate plausible and grammatical examples for NYT10, as shown in Table 4. Reconstructions for WIKIDISTANT are slightly better, due to the less noisy nature of the dataset. In both cases, we see that the reconstructions contain words that are useful for the target relation, e.g. words that refer to places such as new york and new jersey for the relation contains between bay village and ohio, or sport-related terms (football, team, league) for the statistical leader relationship between wayne rooney and england national team.

Example reconstructions from Table 5 (WIKIDISTANT validation set), where MEAN and SAMPLE denote greedy decoding from the mean of the latent code and from a random sample, respectively:

INPUT: wayne rooney plays as a striker for manchester united and the england national team
MEAN: _ 's first game was the first time in the game against the new york yankees .
SAMPLE: he made his debut for the club in the # fa cup final against arsenal at wembley stadium .
MEAN: he was a member of the club 's first team , and was a member of the club 's _ club
SAMPLE: he made his debut in the russian professional football league for fc _ ...

INPUT: ng 's first role was in the # michael hui comedy film " the private eyes " .
MEAN: the film was adapted into the # film ' the _ ' , directed by _ .
SAMPLE: in # , he appeared in ' the _ ' , a # film adaptation of the same name by _ .
MEAN: _ 's first film was ' the _ ' , starring _ and starring _ .
SAMPLE: _ , who was the first female actress to win the academy award for best actress .

Related Work

Distantly Supervised RE. Distant supervision for relation extraction was popularised by Mintz et al. (2009) and further established with the widely used NYT10 corpus by Riedel et al. (2010). Methods investigating this problem can be divided into several categories. Initial approaches were mostly graphical models, adopted to perform multi-instance learning (Riedel et al., 2010), sentential evaluation (Hoffmann et al., 2011; Bai and Ritter, 2019) or multi-instance learning and multi-label classification (Surdeanu et al., 2012). Subsequent approaches utilised neural models, with the approach of Zeng et al. (2015) introducing Piecewise Convolutional Neural Networks (PCNN) into the task. Later approaches focused on noise reduction via selection of informative instances using either soft constraints, i.e. attention mechanisms (Lin et al., 2016; Ye and Ling, 2019), or hard constraints, explicitly selecting non-noisy instances with reinforcement learning (Feng et al., 2018; Qin et al., 2018b,a) and curriculum learning (Huang and Du, 2019). Noise at the word level was addressed in Liu et al. (2018a) via sub-tree parsing of sentences. Adversarial training has been shown to improve DSRE in Wu et al. (2017), while additional unlabelled examples were exploited to assist classification with Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) in Li et al. (2019). Recent methods use additional information from external resources such as entity types and relations (Vashishth et al., 2018), entity descriptors (Ji et al., 2017; She et al., 2018; Hu et al., 2019) or Knowledge Bases (Xu and Barbosa, 2019; Li et al., 2020b).
Sequence-to-Sequence Methods. Autoencoders and variational autoencoders have recently been investigated for relation extraction, primarily for detection of relations between entity mentions in sentences. Marcheggiani and Titov (2016) proposed discrete-state VAEs for link prediction, reconstructing one of the two entities of a pair at a time. Ma et al. (2019) investigated conditional VAEs for sentence-level relation extraction, showing that they can generate relation-specific sentences. Our overall approach shares similarities with this work since we also use VAEs for RE, though in a bag-level rather than a sentence-level setting. VAEs have also been investigated for RE in the biomedical domain (Zhang and Lu, 2019), where additional non-labelled examples were incorporated to assist classification. This work also has commonalities with ours, but the major difference is that the former uses two different encoders while we use only one, shared between bag classification and bag reconstruction. Other SEQ2SEQ methods treat RE as a sequence generation task. Encoder-decoder networks were proposed for joint extraction of entities and relations (Trisedya et al., 2019; Nayak and Ng, 2020), generation of triples from sequences (Liu et al., 2018b) or generation of sequences from triples (Trisedya et al., 2018; Zhu et al., 2019).

VAE Priors. Different types of prior distributions have been proposed for VAEs, such as the VampPrior (Tomczak and Welling, 2018), Gaussian mixture priors (Dilokthanakul et al., 2016), Learned Accept/Reject Sampling (LARS) priors (Bauer and Mnih, 2019), non-parametric priors (Goyal et al., 2017) and others. User-specific priors have been used in collaborative filtering for item recommendation (Karamanolakis et al., 2018), while topic-guided priors were employed for generation of topic-specific sentences (Wang et al., 2019a). In our approach we investigate how to incorporate KB-oriented Gaussian priors into DSRE, using a link prediction model to parameterise their mean vector.

Conclusions
We proposed a probabilistic approach for distantly supervised relation extraction which incorporates context-agnostic Knowledge Base triple information as latent signals into context-aware bag-level entity pair representations. Our method is based on a variational autoencoder that is trained jointly with a relation classifier. KB information, obtained via a link prediction model, is used in the form of prior distributions over the VAE latent code for each pair. The proposed approach brings closer sentences that contain the same KB pairs and does not require any external information at inference time.
Experimental results suggest that jointly reconstructing sentences while performing relation classification is helpful for distantly supervised RE, and that KB priors further boost performance. Analysis of the generated latent representations showed that we can indeed manipulate the space of sentences to match the space of KB triples, while reconstruction is encouraged to preserve topic-related terms.
Future work will target experimentation with different link prediction models and the handling of non-informative sentences. Finally, incorporating large pre-trained language models (LMs) into VAEs is a recent and promising direction (Li et al., 2020a), which could be combined with KBs, as injecting such information into LMs has been shown to further improve their performance (Peters et al., 2019).

A Appendix
A.1 The NYT10 Dataset
As described in Bai and Ritter (2019), the NYT10 dataset has been released in several versions. The original one follows the setting of Riedel et al. (2010), where two separate sets of data were created, while later versions (Lin et al., 2016) contain overlapping pairs between the training and test sets.
It is also important to note that NYT10 has been used by the community in two settings: bag-level and sentence-level. In the bag-level setting, a pair's relation is defined based on a bag of sentences that contain the pair. On the contrary, in the sentence-level setting a pair's relation is predicted for each sentence individually. Training data are obtained using distant supervision, while test data are manually annotated (Hoffmann et al., 2011).

A.2 Data Pre-processing Details
We found that the dataset includes several duplicate instances, i.e. the exact same sentence with the exact same pair. We remove such cases from our training data since they can bias the training process. However, they are preserved in the validation and test sets for a fair comparison with other methods. We convert the dataset to lowercase and replace all digits with the hash character (#). We randomly select 10% of the training bags as our validation set.
Sentence Length Filtering. We restrict the length of a sentence to 50 words for the NYT10 dataset and to 30 for the WIKIDISTANT dataset. If at least one of the arguments of a pair is located in a span after the maximum sentence length, then the sentence is resized to contain the words from the first argument until the second. We also add a maximum of 5 words to the left and 5 words to the right if the total length allows. If the length of the resized sentence is still larger than the maximum, the sentence is truncated to the maximum length. Detailed statistics are shown in Tables 6, 7 and 8.
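A sketch of this resizing step, assuming the entity mention spans are given as token indices (the fallback truncation is an assumption for cases the text leaves unspecified):

```python
def resize_sentence(tokens, e1_span, e2_span, max_len, window=5):
    """Resize an over-long sentence to the region covering both entity mentions,
    plus up to `window` words on each side (sketch of the filtering described above)."""
    if len(tokens) <= max_len:
        return tokens
    start = min(e1_span[0], e2_span[0])
    end = max(e1_span[1], e2_span[1])
    start = max(0, start - window)
    end = min(len(tokens), end + window)
    return tokens[start:end][:max_len]  # final truncation if the region is still too long
```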
Vocabulary construction. In order to construct the word vocabulary, we use the unique sentences contained in the training set, as resulting from the removal of duplicate instances and the sentence length filtering. Since each sentence in the dataset can contain multiple pairs, it is repeated for each pair; using non-unique sentences would inflate the frequency counts of certain words and produce a misleading vocabulary. We restrict the vocabulary to the 40K most frequent words for NYT10, with a coverage of 97.78% of the training set, and to the 50K most frequent words for WIKIDISTANT, with a coverage of 96%. Other words are replaced with the UNK token.
A.3 Hyper-parameter Settings
DSRE Models. Table 9 shows the parameters used for training the model on the NYT10 and WIKIDISTANT datasets. In the VAE setting, Adaptive Softmax (Grave et al., 2017) is used to enable fast reconstruction over the large output vocabulary.
Knowledge Base Embeddings. In order to train KB entity embeddings we used the DGL-KE toolkit (Zheng et al., 2020). We use the same set of hyper-parameters for both Freebase and Wikidata, as shown in Table 10. For Freebase we select 5,000 triples as the validation set, while for Wikidata we use the validation set provided in the transductive setting (5,136 triples).
A.5 Additional Plots
Figure 5 illustrates the t-SNE plot of the latent space for the NYT10 validation set. We observe clusters similar to those of the KB (Figure 4a). Figure 6 illustrates the PR curves for the non-filtered version of the NYT10 dataset (570K). Here, KB priors perform comparably with the Normal prior but mostly improve the tail of the distribution (after 50% of the recall range). We could not obtain the PR curve for the JOINTNRE method, thus it is not present in the figure.