Jointly Extracting Relations with Class Ties via Effective Deep Ranking

Connections between relations in relation extraction, which we call class ties, are common. In the distantly supervised scenario, one entity tuple may have multiple relation facts, so exploiting class ties between the relations of one entity tuple is promising for distantly supervised relation extraction. However, previous models either ignore this property or fail to model it effectively. In this work, to effectively leverage class ties, we propose to perform joint relation extraction with a unified model that integrates a convolutional neural network (CNN) with a general pairwise ranking framework, in which three novel ranking loss functions are introduced. Additionally, an effective method is presented to relieve the severe class imbalance caused by NR (not relation) during model training. Experiments on a widely used dataset show that leveraging class ties enhances extraction and demonstrate the effectiveness of our model at learning class ties. Our model significantly outperforms the baselines, achieving state-of-the-art performance.


Introduction
Relation extraction (RE) aims to classify the relations between two given named entities from natural-language text. Supervised machine learning methods require large amounts of labeled data to work well. With the rapid growth in the number of relation types, traditional methods cannot keep pace because of the limited labeled data. To narrow the gap caused by data sparsity, Mintz et al. (2009) propose distant supervision (DS) for relation extraction, which automatically generates training data by aligning a knowledge base (i.e., Freebase (Bollacker et al., 2008)) with texts.

Relation facts: place lived (Patsy Ramsey, Atlanta); place of birth (Patsy Ramsey, Atlanta)
Sentence #1: "Patsy Ramsey has been living in Atlanta since she was born." — Latent label: place of birth
Sentence #2: "Patsy Ramsey always loves Atlanta since it is her hometown." — Latent label: place lived
Table 1: Training instances generated by Freebase.

Class ties are the connections between relations in relation extraction. In general, class ties come in two types: weak class ties and strong class ties. Weak class ties mainly involve the co-occurrence of relations, such as place of birth and place lived, or CEO of and founder of. In contrast, strong class ties mean that relations have latent logical entailments. Take the two relations capital of and city of for example: if an entity tuple has the relation capital of, it must also express the relation fact city of, because the two relations satisfy the entailment capital of ⇒ city of. Obviously, the reverse induction does not hold. Further, consider the sentence "Jonbenet told me that her mother [Patsy Ramsey]_e1 never left [Atlanta]_e2 since she was born." in the DS scenario. This sentence expresses two relation facts: place of birth and place lived. However, the word "born" is a strong bias toward extracting place of birth, so it may not be easy to predict the relation place lived; but if we can incorporate the weak ties between the two relations, extracting place of birth will provide evidence for predicting place lived.
Exploiting class ties is necessary for DS-based relation extraction. In the DS scenario, there is a challenge that one entity tuple can have multiple relation facts, as shown in Table 1, which is called relation overlapping (Hoffmann et al., 2011; Surdeanu et al., 2012). However, the relations of one entity tuple can have the class ties mentioned above, which can be leveraged to enhance relation extraction, since they narrow down the potential search space and reduce the uncertainty between relations when predicting unknown relations. For example, if an entity pair has the relation CEO of, it is highly likely to also have founder of.
To exploit class ties between relations, we propose to jointly extract all positive labels of one entity tuple while considering pairwise connections between positive and negative labels, inspired by (Fürnkranz et al., 2008; Zhang and Zhou, 2006). For the two relations with class ties shown in Table 1, joint extraction of the two relations lets the model learn their class ties (co-occurrence) from the training samples and then leverage this learned information when extracting instances with unknown relations; separated extraction cannot achieve this, because it divides the labels apart and loses the co-occurrence information. To distinguish positive labels from negative ones, we adopt pairwise ranking to rank positive labels higher than negative ones, exploiting the pairwise connections between them. In a word, joint extraction exploits class ties between relations, and pairwise ranking separates positive labels from negative ones. Furthermore, combining information across sentences is more appropriate for joint extraction, since it provides additional evidence from other sentences for extracting each relation (Zheng et al., 2016; Lin et al., 2016). In Table 1, sentence #1 is the evidence for place of birth, but it also expresses the meaning of "living in someplace", so it can be aggregated with sentence #2 to extract place lived. Meanwhile, the word "hometown" in sentence #2 provides evidence for place of birth and should be combined with sentence #1 to extract place of birth.
In this work, we propose a unified model that integrates pairwise ranking with a CNN to exploit class ties. Inspired by the effectiveness of deep learning for modeling sentences (LeCun et al., 2015), we use a CNN to encode sentences.

Figure 1: The main architecture of our model.

Similar to (Santos et al., 2015; Lin et al., 2016), we introduce two variant methods to combine the embedded sentences into one bag representation vector, aiming to aggregate information across sentences; we then measure the similarity between the bag representation and each relation class in real-valued space. With the two variants for combining sentences, three novel pairwise ranking loss functions are proposed for joint extraction. Besides, to relieve the harmful impact of class imbalance from NR (not relation) (Japkowicz and Stephen, 2002) on training our model, we cut down loss propagation from the NR class during training. Our experimental results on the dataset of Riedel et al. (2010) show that: (1) our model is much more effective than the baselines; (2) leveraging class ties enhances relation extraction, and our model learns class ties effectively through joint extraction; (3) a much better model can be trained after relieving the class imbalance from NR.
Our contributions in this paper can be summarized as follows:
• We propose to leverage class ties to enhance relation extraction. An effective deep ranking model that integrates a CNN with a pairwise ranking framework is introduced to exploit class ties.
• We propose an effective method to relieve the impact of data imbalance from NR on model training.
• Our method achieves state-of-the-art performance.

Related Work
We summarize related work in two main areas:

Distant Supervision Relation Extraction
Previous works on DS-based RE either ignore class ties between relations or do not leverage them effectively. Riedel et al. (2010) introduce multi-instance learning to relieve the wrong labelling problem, ignoring class ties. Afterwards, Hoffmann et al. (2011) and Surdeanu et al. (2012) model this problem with multi-instance multi-label learning to extract overlapping relations. Though they also propose joint extraction of relations, they only use information from a single sentence, losing information from other sentences. Han and Sun (2016) use a Markov logic model to capture consistency between relation labels; in contrast, our model leverages deep ranking to learn class ties automatically.
With the remarkable success of deep learning in CV and NLP (LeCun et al., 2015), deep learning has been applied to relation extraction (Zeng et al., 2014, 2015; Santos et al., 2015; Lin et al., 2016); the specific architecture can be a CNN (Zeng et al., 2014), an RNN, etc. Zeng et al. (2015) propose a piecewise convolutional neural network with multi-instance learning for DS-based relation extraction, which improves precision and recall significantly. Afterwards, Lin et al. (2016) introduce an attention mechanism (Luong et al., 2015; Bahdanau et al., 2014) to select sentences, relieving the wrong labelling problem while using all the information across sentences. However, these two deep learning based models only make separated extractions and thus cannot model class ties between relations.

Deep Learning to Rank
Deep learning to rank has been widely used as a classification model in many problems. In image retrieval, deep semantic ranking has been applied to multi-label image retrieval. In text matching, Severyn and Moschitti (2015) adopt learning to rank combined with a deep CNN for matching short text pairs. In traditional supervised relation extraction, Santos et al. (2015) design a pairwise loss function based on a CNN for single-label relation extraction. Building on the advantages of deep learning to rank, we combine pairwise learning to rank (LTR) (Liu, 2009) with a CNN in our model, aiming to jointly extract multiple relations.

Proposed Model
In this section, we first define the notation used in this paper, then introduce the CNN used for sentence embedding, and finally present our algorithm for learning class ties between the relations of one entity tuple.

Notation
We define the set of relation classes as L = {1, 2, ..., C} and the set of entity tuples as T = {t_1, t_2, ..., t_N}. The dataset is constructed as follows: for each entity tuple t_i ∈ T with relation class set L_i ⊆ L, we collect all the mentions X_i that contain t_i; the dataset we use is thus D = {(t_i, L_i, X_i)}_{i=1}^{N}. We use class embeddings W ∈ R^{|L|×d} to represent the relation classes.
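As a concrete illustration of this bag-level dataset structure, the following sketch (with hypothetical field names and the Table 1 example) shows how each entity tuple is paired with its possibly multi-element label set and its bag of mentions:

```python
# Illustrative sketch of D = {(t_i, L_i, X_i)}: field names are our own,
# not from the paper. Each bag carries a label SET, not a single label.
dataset = [
    {
        "tuple": ("Patsy Ramsey", "Atlanta"),
        "labels": {"place_of_birth", "place_lived"},   # L_i ⊆ L, may hold several relations
        "mentions": [
            "Patsy Ramsey has been living in Atlanta since she was born.",
            "Patsy Ramsey always loves Atlanta since it is her hometown.",
        ],
    },
]

bag = dataset[0]
# The relation-overlapping case that motivates class ties:
assert len(bag["labels"]) > 1
```

Joint extraction treats the whole label set of a bag as one training target, rather than splitting it into independent single-label examples.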

CNN for Sentence Embedding
We adopt the effective CNN architecture from (Zeng et al., 2015; Lin et al., 2016) to encode sentences and briefly introduce it in this section. More details of our CNN can be found in the previous work.

Words Representations
• Word Embedding Given a word embedding matrix V ∈ R^{l_w × d_1}, where l_w is the size of the word dictionary and d_1 is the dimension of the word embeddings, the words of a mention x = {w_1, w_2, ..., w_n} are represented by real-valued vectors from V.
• Position Embedding The position embedding of a word encodes the distances from the word to the two entities in a mention. We add position information to the word representations by appending position embeddings to the word embedding of every word. Given a position embedding matrix P ∈ R^{l_p × d_2}, where l_p is the number of distinct distances and d_2 is the dimension of the position embeddings, the dimension of the word representations becomes d_w = d_1 + 2 × d_2 (one position embedding per entity).
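The concatenation above can be sketched as follows; the matrix sizes are illustrative placeholders, not the paper's actual hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
l_w, d1 = 1000, 50      # vocabulary size, word-embedding dim (illustrative)
l_p, d2 = 123, 5        # number of distinct distances, position-embedding dim
V = rng.normal(size=(l_w, d1))   # word-embedding matrix
P = rng.normal(size=(l_p, d2))   # position-embedding matrix

def embed(word_ids, dist_e1, dist_e2):
    """Concatenate each token's word embedding with its two position
    embeddings (distance to entity 1 and to entity 2)."""
    return np.concatenate([V[word_ids], P[dist_e1], P[dist_e2]], axis=1)

q = embed(np.array([3, 7, 9]), np.array([0, 1, 2]), np.array([2, 1, 0]))
assert q.shape == (3, d1 + 2 * d2)   # d_w = d_1 + 2 * d_2
```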

Convolution, Piecewise max-pooling
After transforming the words in x to real-valued vectors, we obtain the sentence matrix q ∈ R^{n × d_w}. Let K = {K_1, K_2, ..., K_{d_s}} be the set of kernels, where d_s is the number of kernels. Given the window size d_win and one kernel K_k ∈ R^{d_win × d_w}, the convolution operation is defined as:

m_j = K_k ⊗ q_{[j−d_win+1 : j]} + b_k,

where m is the vector obtained by conducting the convolution along q and b ∈ R^{d_s} is the bias vector. Vectors whose indexes fall outside the range [1, n] are replaced with zero vectors. In piecewise max-pooling, the sentence is divided into three parts: m_{[p_0:p_1]}, m_{[p_1:p_2]} and m_{[p_2:p_3]} (p_1 and p_2 are the positions of the entities, p_0 is the beginning of the sentence and p_3 is the end of the sentence). Piecewise max-pooling is defined as:

z_j = max(m_{[p_{j−1} : p_j]}), 1 ≤ j ≤ 3,

where z ∈ R^3 is the result of processing mention x with kernel K_k. Given the full set of kernels K and following the above steps, the mention x is embedded into o ∈ R^{3 d_s}.
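A minimal sketch of the piecewise max-pooling step for a single kernel follows; the exact segment boundary convention (whether an entity position belongs to the left or right segment) is our assumption, since the paper does not spell it out:

```python
def piecewise_max_pool(m, p1, p2):
    """Split the per-position convolution outputs m (length n) at the two
    entity positions p1 < p2 and max-pool each of the three segments,
    giving z ∈ R^3 for one kernel."""
    segments = [m[:p1], m[p1:p2], m[p2:]]
    return [max(seg) for seg in segments]

m = [0.1, 0.9, 0.3, 0.5, 0.2, 0.7]       # toy convolution outputs
z = piecewise_max_pool(m, p1=2, p2=4)
assert z == [0.9, 0.5, 0.7]
```

Running this for all d_s kernels and concatenating the resulting 3-vectors yields the o ∈ R^{3 d_s} representation described above.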

Non-Linear Layer, Regularization
To learn high-level features of mentions, we apply a non-linear layer after the pooling layer, followed by a dropout layer to prevent overfitting. We define the final fixed-size sentence representation as s = g(o) ◦ h, where g(·) is a non-linear function (we use tanh(·) in this paper) and h is a Bernoulli random vector whose entries are 1 with probability p.

Learning Class Ties by Joint Extraction with Pairwise Ranking
As mentioned above, to learn class ties, we make joint extraction while considering pairwise connections between positive and negative labels; pairwise ranking is applied to achieve this goal. Besides, combining information across sentences is necessary for joint extraction. More specifically, as shown in Figure 2, from bottom to top, all information from the sentences is propagated forward to provide enough information for joint extraction; from top to bottom, pairwise ranking jointly extracts the positive relations by combining their losses, which are back-propagated to the CNN to learn class ties.

Combining Information across Sentences
We propose two options for combining sentences to provide enough information for joint extraction.

Figure 2: Illustration of the mechanism by which our model captures class ties between relations.
• AVE The first option is the average method. This method treats all sentences equally and directly averages the sentence embeddings dimension-wise:

s = (1/n) Σ_{j=1}^{n} s_j,

where n is the number of sentences and s is the representation vector combining all sentence embeddings. Because it weights all sentences equally, this method may introduce considerable noise from two sources: (1) wrongly labelled data; (2) mentions irrelevant to a given relation class, since all sentences containing the same entity tuple are combined to construct the bag representation.
• ATT The second option is the sentence-level attention algorithm used by Lin et al. (2016), which measures the importance of each sentence in order to relieve the wrong labelling problem. For every sentence, ATT calculates a weight by comparing the sentence to one relation. We first calculate the similarity between a sentence embedding and a relation class c as:

e_j = s_j · W[c] + a,

where e_j is the similarity between sentence embedding s_j and relation class c, and a is a bias factor, set to 0.5 in this paper. We then apply Softmax to rescale e = {e_j}_{j=1}^{n} and obtain the weight α_j for s_j:

α_j = exp(e_j) / Σ_{k=1}^{n} exp(e_k),

so the function to merge s with ATT is:

s = Σ_{j=1}^{n} α_j s_j.
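The two combination options can be contrasted with a small sketch; the dot-product similarity and the bias `a` follow the description above, while the tiny 2-sentence example is our own:

```python
import math

def combine_ave(S):
    """AVE: weight all sentence embeddings equally (dimension-wise mean)."""
    n = len(S)
    return [sum(col) / n for col in zip(*S)]

def combine_att(S, w_c, a=0.5):
    """ATT (sketch): weight each sentence by the softmax of its similarity
    e_j = s_j · w_c + a to the class embedding w_c."""
    e = [sum(si * wi for si, wi in zip(s, w_c)) + a for s in S]
    m = max(e)                                # subtract max for stability
    exp_e = [math.exp(v - m) for v in e]
    Z = sum(exp_e)
    alpha = [v / Z for v in exp_e]            # softmax over sentences
    dim = len(S[0])
    return [sum(alpha[j] * S[j][i] for j in range(len(S))) for i in range(dim)]

S = [[1.0, 0.0], [0.0, 1.0]]   # two toy sentence embeddings
w_c = [10.0, 0.0]              # class embedding aligned with sentence 1
assert combine_ave(S) == [0.5, 0.5]
s_att = combine_att(S, w_c)
assert s_att[0] > 0.99         # attention concentrates on sentence 1
```

The contrast shows why AVE can dilute the bag representation with irrelevant mentions, while ATT down-weights sentences unrelated to the class being scored.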

Joint Extraction by Combining Losses to Learn Class Ties
First, we present the score function that measures the similarity between s and a relation c.
• Score Function We use the dot product to score s against relation c:

F(s, c) = s · W[c].

There are other options for the score function; for example, a margin-based loss function can measure the similarity between s and W[c] by distance. Because the score function is not a central issue in our model, we adopt the dot product, also used by Santos et al. (2015) and Lin et al. (2016).
Now we start to introduce the ranking loss function.
Pairwise ranking aims to learn a score function F(s, c) that ranks positive classes higher than negative ones. This goal can be summarized as:

F(s, c+) ≥ F(s, c−) + β, ∀ c+ ∈ L_k, ∀ c− ∈ L − L_k, (9)

where β is a margin factor which controls the minimum margin between the positive scores and the negative scores.
To learn class ties between relations, we extend formula (9) to joint extraction and propose three ranking loss functions, one for each variant of combining sentences. The proposed loss functions are as follows:
• with AVE (Variant-1) We define the margin-based loss function with the AVE option for aggregating sentences as:

L = Σ_{c+ ∈ L_k} [ρ(σ+ − F(s, c+))]_+ + |L_k| · [ρ(σ− + F(s, c−))]_+, (10)

where [·]_+ = max(0, ·); ρ is the rescale factor, σ+ is the positive margin and σ− is the negative margin. Similar to Santos et al. (2015), this loss function is designed to rank positive classes higher than negative ones, controlled by the margin σ+ − σ−: F(s, c+) is pushed above σ+ and F(s, c−) below −σ−. In our work, we set ρ = 2, σ+ = 2.5 and σ− = 0.5, adopted from Santos et al. (2015). Similar to Weston et al. (2011) and Santos et al. (2015), we update one negative class per training round, but to balance the loss between positive and negative classes, we multiply the right term in function (10) by |L_k| to expand the negative loss. We apply mini-batch stochastic gradient descent (SGD) to minimize the loss function. The negative class is chosen as the one with the highest score among all negative classes (Santos et al., 2015), i.e.:

c− = argmax_{c ∈ L − L_k} F(s, c). (11)

• with ATT (Variant-2) We define the loss function for the ATT option of combining sentences as:

L = Σ_{c+ ∈ L_k} ( [ρ(σ+ − F(s_{c+}, c+))]_+ + [ρ(σ− + F(s_{c+}, c−))]_+ ), (12)

where s_c denotes the bag representation whose attention weights are merged by comparing the sentence embeddings with relation class c, and c− is chosen by:

c− = argmax_{c ∈ L − L_k} F(s_{c+}, c), (13)

which means we update one negative class in every training round. We keep the values of ρ, σ+ and σ− the same as in function (10). From this loss function we can see that, for each class c+ ∈ L_k, the model captures the most related information from the sentences to merge s_{c+}, then ranks F(s_{c+}, c+) higher than every negative score F(s_{c+}, c−) with c− ∈ L − L_k. We use the same update algorithm to minimize this loss.
• Extended with ATT (Variant-3) According to function (12), for each c+ we only select one negative class when updating the parameters, which only considers the connections between positive classes and negative ones and ignores the connections among positive classes. We therefore extend function (12) to better exploit class ties by also considering the connections among positive classes. The extended loss function is:

L = Σ_{c* ∈ L_k} ( Σ_{c+ ∈ L_k} [ρ(σ+ − F(s_{c*}, c+))]_+ + [ρ(σ− + F(s_{c*}, c−))]_+ ), (14)

where, in function (14), c− is selected as:

c− = argmax_{c ∈ L − L_k} F(s_{c*}, c),

and we use the same update method as discussed above. From function (14) we can see that, for each c* ∈ L_k, after merging the bag representation s_{c*} with c*, we share s_{c*} with all the other positive classes and update their class embeddings with it; in this way, the connections between positive classes can be captured and learned by our model. In loss functions (10), (12) and (14), we combine the losses from all positive labels, making joint extraction capture the class ties among relations. If we instead made separated extractions, the losses from the positive labels would be divided apart and would not carry enough information about the connections between positive labels. The connections between positive labels and negative ones are exploited by controlling the margins σ+ and σ−.
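A Variant-1-style loss can be sketched as follows. This is a simplified illustration under our own reading of the description above (hinge form, hardest-negative selection, |L_k| rescaling of the negative term); the paper's exact formulation may differ in detail:

```python
def rank_loss_ave(scores, positive, rho=2.0, sigma_pos=2.5, sigma_neg=0.5):
    """Sketch of a joint pairwise ranking loss over one bag.
    scores[c] = F(s, c); `positive` is the label set L_k.
    Positives are pushed above sigma_pos, the hardest negative is pushed
    below -sigma_neg, and the negative term is rescaled by |L_k|."""
    neg = [c for c in range(len(scores)) if c not in positive]
    c_neg = max(neg, key=lambda c: scores[c])          # hardest negative (eq. 11 style)
    loss = sum(max(0.0, rho * (sigma_pos - scores[c])) for c in positive)
    loss += len(positive) * max(0.0, rho * (sigma_neg + scores[c_neg]))
    return loss

scores = [3.0, 2.8, -1.0, -2.0]                 # toy F(s, c) for 4 classes
assert rank_loss_ave(scores, {0, 1}) == 0.0     # both positives well separated
assert rank_loss_ave(scores, {0}) > 0.0         # class 1 becomes a high-scoring negative
```

The two assertions illustrate the joint-extraction intuition: when co-occurring relations are both treated as positives, neither penalizes the other, whereas separated extraction turns one true relation into a hard negative for the other.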

Relieving Impact of NR
In relation extraction, the dataset always contains a certain number of negative samples that do not express any relation and are classified as NR (not relation). Table 2 presents the proportion of NR samples in the SemEval-2010 Task 8 dataset (Erk and Strapparava, 2010) and the dataset of Riedel et al. (2010); in the latter dataset, almost all of the data is NR. Such data imbalance severely affects model training and makes the model sensitive only to high-proportion classes (He and Garcia, 2009).
To relieve the impact of NR in DS-based relation extraction, we cut the propagation of loss from NR: if relation c is NR, we set its loss to 0. Our method is similar to that of Santos et al. (2015), with a slight difference: Santos et al. (2015) directly omit the NR class embedding, but we keep it. If we use the ATT method to combine information across sentences, we cannot omit the NR class embedding, according to functions (6) and (7); instead, it is updated through the negative classes' losses.

Algorithm 1: Merging the loss function of Variant-3.
Input: L, (t_k, L_k, X_k) and S_k. Output: G_[ExATT]. Merge the representation s_{c*} by functions (5), (6), (7); ...
In Algorithm 1, we give the pseudocode for merging the loss with Variant-3 while relieving the impact of NR.
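The NR-masking idea can be sketched in isolation as follows. This is a simplified stand-in for its integration into Algorithm 1; the class index and per-class losses are illustrative:

```python
NR = 0  # index of the "not relation" class (illustrative choice)

def bag_loss(per_class_loss, labels):
    """Accumulate a bag's loss over its labels, but cut the loss
    propagated from the NR class so the heavily imbalanced NR samples
    do not dominate training."""
    total = 0.0
    for c in labels:
        if c == NR:
            continue            # NR contributes zero loss
        total += per_class_loss[c]
    return total

per_class = {0: 5.0, 1: 0.3, 2: 0.2}     # NR would otherwise dominate
assert bag_loss(per_class, {0}) == 0.0
assert abs(bag_loss(per_class, {1, 2}) - 0.5) < 1e-9
```

Note that the NR class embedding itself is kept, consistent with the text above: it still receives gradient when NR is selected as a hard negative for other bags.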

Dataset and Evaluation Criteria
We conduct our experiments on a widely used dataset developed by Riedel et al. (2010), which has also been used by Hoffmann et al. (2011), Surdeanu et al. (2012), Zeng et al. (2015) and Lin et al. (2016).

Experimental Settings
Word Embeddings. We use the word2vec implementation in gensim to train word embeddings on the NYT corpus. Similar to Lin et al. (2016), we keep the words that appear more than 100 times to construct the word dictionary and use "UNK" to represent all other words.

Table 3: Hyper-parameter settings.
Hyper-parameter Settings. Three-fold validation on the training dataset is adopted to tune the parameters, following Surdeanu et al. (2012). We use grid search to determine the optimal hyper-parameters: the word embedding size is selected from {50, 100, 150, 200, 250, 300}, the batch size from {80, 160, 320, 640}, the learning rate from {0.01, 0.02, 0.03, 0.04}, and the convolution window size from {1, 3, 5}. We keep the other hyper-parameters the same as Zeng et al. (2015): the number of kernels is 230, the position embedding size is 5 and the dropout rate is 0.5. Table 3 shows the detailed parameter settings.

Comparisons with Baselines
Baselines. We compare our model with the following baselines:
• Mintz (Mintz et al., 2009): the original distantly supervised model.
• MultiR (Hoffmann et al., 2011): a multi-instance learning based graphical model which aims to address the overlapping relation problem.
• PCNN+ATT (Lin et al., 2016): the state-of-the-art model on the dataset of Riedel et al. (2010), which applies ATT to combine the sentences.
Results and Discussion. We compare our three variants of loss functions with the baselines; the results are shown in Figure 3. From the results we can see that: (1) Rank + AVE (Variant-1) achieves results comparable to PCNN+ATT; (2) Rank + ATT (Variant-2) and Rank + ExATT (Variant-3) significantly outperform PCNN+ATT, with much higher precision and slightly higher recall overall; (3) Rank + ExATT (Variant-3) exhibits the best performance among all methods, including PCNN+ATT, Rank + AVE and Rank + ATT.

Impact of Joint Extraction and Class Ties
In this section, we conduct experiments to reveal the effectiveness of our model at learning class ties with the three variant loss functions mentioned above, and the impact of class ties on relation extraction. As mentioned above, we make joint extraction to learn class ties, so for this set of experiments we compare joint extraction with separated extraction. For separated extraction, we split each entity tuple's labels into single labels and, for each relation label, select only the sentences expressing that relation; we then use this dataset to train our model with the three variant loss functions. We conduct experiments with Rank + AVE (Variant-1), Rank + ATT (Variant-2) and Rank + ExATT (Variant-3), relieving the impact of NR. Aggregated P/R curves are drawn and precisions@N (100, 200, ..., 500) are reported to show model performance. The experimental results are shown in Figure 4 and Table 4. From the results we can see that: (1) for Rank + ATT and Rank + ExATT, joint extraction performs better than separated extraction, which demonstrates that class ties improve relation extraction and that the two methods are effective at learning class ties; (2) for Rank + AVE, surprisingly, joint extraction does not keep up with separated extraction. The explanation for the second phenomenon may be that the AVE method of aggregating sentences incorporates noisy data, consistent with the finding of Lin et al. (2016): under joint extraction, we combine all sentences containing the same entity tuple regardless of which class each sentence expresses, so weighting them equally introduces considerable noise.

Comparisons of Variant Joint Extractions
To make joint extraction, we have proposed three variant loss functions, Rank + AVE, Rank + ATT and Rank + ExATT, and Figure 3 shows that the three variants achieve different performances. In this experiment, we compare the three variants in detail. We conduct the experiments with the three variants under the setting of relieving the impact of NR and joint extraction. We draw the P/R curves and report the top-N (100, 200, ..., 500) precisions to compare model performance across the three variants.
From the results shown in Figure 5 and Table 5, we can see that: (1) comparing Rank + AVE with Rank + ATT, they achieve a similar maximal recall point, but Rank + ATT exhibits higher precision over the whole range of recall; (2) comparing Rank + ATT with Rank + ExATT, Rank + ExATT achieves much better performance, with a broader range of recall and higher precision over almost the entire range of recall.

Impact of NR Relation
The goal of this experiment is to inspect how much the NR relation affects model performance. We run Rank + AVE, Rank + ATT and Rank + ExATT both with and without relieving the impact of NR. We draw the aggregated P/R curves, shown in Figure 6, from which we can see that after relieving the impact of NR, model performance improves significantly.
We further evaluate the impact of NR on the convergence behavior of our model during training.

Figure 7: Impact of NR on model convergence. "+NR" means not relieving the NR impact; "-NR" is the opposite.

With the three variant loss functions, in each iteration we record the maximal value of the F-measure to represent model performance at the current epoch. Model parameters are tuned for 15 iterations and the convergence curves are shown in Figure 7. From the results, we find that "+NR" converges more quickly than "-NR" and reaches its final score at around epoch 11 or 12. In general, "-NR" converges more smoothly and ultimately achieves better performance than "+NR".

Case Study
Joint vs. Sep. Extraction (Class Ties). We randomly select an entity tuple, (Cuyahoga County, Cleveland), from the test set to examine its scores for every relation class with Rank + ATT, under the setting of relieving the impact of NR, with both joint extraction and separated extraction. This entity tuple has two relations, /location/./county seat and /location/./contains, which derive from the same root class and have weak class ties since both relate to the topic of "location". We rescale the scores by adding 10. The results are shown in Figure 8, from which we can see that under the joint extraction setting, the two gold relations have the highest scores among all relations, but under the separated extraction setting, only /location/./contains can be distinguished from the negative relations. This demonstrates that joint extraction is better than separated extraction at capturing the class ties between relations.
(F-measure: F = 2 · P · R / (P + R).)

Conclusion and Future Work
In this paper, we leverage class ties to enhance relation extraction via joint extraction, using pairwise ranking combined with a CNN. An effective method is proposed to relieve the impact of NR on model training. Experimental results on a widely used dataset show that leveraging class ties enhances relation extraction and that our model is effective at learning class ties. Our method significantly outperforms the baselines.
In the future, we will focus on two aspects: (1) our method considers pairwise interactions between labels, so to better exploit class ties we will extend it to exploit the influence of all other labels on each relation, moving from second-order to higher-order correlations (Zhang and Zhou, 2014); (2) we will apply class ties between labels to other problems, especially multi-label learning problems (Zhou et al., 2012) such as multi-category text categorization (Rousu et al., 2005) and multi-label image categorization (Zha et al., 2008).