Distant Supervised Relation Extraction with Separate Head-Tail CNN

Distant supervised relation extraction is an efficient and effective strategy for finding relations between entities in text. However, it inevitably suffers from the mislabeling problem, and the resulting noisy data hinder performance. In this paper, we propose the Separate Head-Tail Convolution Neural Network (SHTCNN), a novel neural relation extraction framework that alleviates this issue. In this method, we apply separate convolution and pooling to the head and tail entities respectively to extract better semantic features of sentences, and a coarse-to-fine strategy to filter out instances which do not express an actual relation, alleviating the noisy-data issue. Experiments on a widely used dataset show that our model achieves significant and consistent improvements in relation extraction compared to statistical and vanilla CNN-based methods.


Introduction
Relation extraction is a fundamental task in information extraction, which aims to extract relations between entities. For example, the sentence "Bill Gates is the CEO of Microsoft." expresses the relation /business/company/founders between the head entity Bill Gates and the tail entity Microsoft.
Traditional supervised relation extraction systems require a large amount of manually well-labeled relation data (Walker et al., 2005; Doddington et al., 2004; Gábor et al., 2018), which is extremely labor-intensive and time-consuming. (Mintz et al., 2009) instead proposes distant supervision, which exploits relational facts in knowledge bases. Distant supervision aligns entity mentions in plain text with those in a knowledge base and assumes that if two entities have a relation there, then all sentences containing these two entities express that relation. If there is no relation link between a certain entity pair in the knowledge base, the sentence is labeled as a Not-A-relation (NA) instance. Although distant supervision is an efficient and effective strategy for automatically labeling large-scale training data, it inevitably suffers from mislabeling problems due to its strong assumption. As a result, datasets created by distant supervision are usually very noisy. According to (Riedel et al., 2010), the precision of aligning Freebase to the New York Times corpus with distant supervision is about 70%; examples of labeled sentences from the New York Times corpus are shown in Table 1. Therefore, many efforts have been devoted to alleviating noise in distant supervised relation extraction.

Bag  Sentence                                                        Correct
b1   Barack Obama was born in the United States.                     True
b1   Barack Obama was the 44th president of the United States.       False
b2   Bill Gates is the CEO of Microsoft.                             True
b2   Bill Gates announced that he would be transitioning to a
     part-time role at Microsoft and full-time work in June 2006.    False

Table 1: Examples of relations annotated by distant supervision. Sentences in b1 are annotated with the place of birth relation and sentences in b2 with the business company founders relation.
Most previous work used a vanilla Convolution Neural Network (CNN) or Piecewise Convolution Neural Network (PCNN) as the sentence encoder. CNN/PCNN adopts a single group of weight-sharing filters to extract semantic features of sentences. Though effective and efficient, there is still room for improvement if we look deeper into the properties of relations. We find that semantic properties of relations such as symmetry and asymmetry are often overlooked when using CNN/PCNN. For example, "Bill Gates is the CEO of Microsoft." holds the relation /business/company/founders between the head entity Bill Gates and the tail entity Microsoft, while in the sentence "The most famous man in Microsoft is Bill Gates.", the head entity Microsoft and the tail entity Bill Gates do not share that relation. This indicates that the relation /business/company/founders is asymmetric. Most previous work uses position embeddings specified by the entity pair and piecewise pooling (Zeng et al., 2015; Lin et al., 2016) to predict relations. However, the above examples show that the two sentences share similar position embeddings due to the similar distances of their words to both entities. A vanilla CNN/PCNN is not sufficient to capture such semantic features because it treats the head and tail entities equally. Thus, it tends to "memorize" certain entity pairs and may learn similar context representations when dealing with such noisy asymmetric instances. In addition to relation properties, we also investigate noise sources in distant supervised relation extraction. NA instances usually account for a large portion of distant supervised datasets, making the data highly imbalanced. Similarly, in the object detection task (Lin et al., 2017), extreme class imbalance greatly hinders performance.
In this paper, in order to address the above deficiencies, we propose the Separate Head-Tail CNN (SHTCNN) framework, an effective strategy for distant supervised relation extraction. The framework is composed of two ideas. First, we employ separate head-tail convolution and pooling to embed the semantics of sentences targeting the head and tail entities respectively. By this means, we can capture better semantic properties of relations in the distant supervised data and further alleviate the mislabeling problem. Second, relations are classified from coarse to fine. To this end, an extra auxiliary network is adopted for NA/non-NA binary classification, which is expected to filter out as many easy NA instances as possible while maintaining high recall on all non-NA relations. Instances selected by the binary network are treated as non-NA examples for fine-grained multi-class classification. Inspired by RetinaNet (Lin et al., 2017), we use focal loss in the binary classification. We evaluate our model on a real-world distant supervised dataset. Experimental results show that our model achieves significant and consistent improvements in relation extraction compared to the selected baselines.

Related Work
Relation extraction is a crucial task and a heavily studied area in Natural Language Processing (NLP). Many efforts have been devoted to it, especially in the supervised paradigm. Conventional supervised methods require large amounts of human-annotated data, which is highly expensive and time-consuming. To deal with this issue, (Mintz et al., 2009) proposed distant supervision, which aligns Freebase relational facts with plain text to automatically generate relation labels for entity pairs. Apparently, this assumption is so strong that it inevitably brings the mislabeling problem.
Plenty of studies have been conducted to alleviate this problem. (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) introduce the multi-instance learning framework to the problem. (Riedel et al., 2010) and (Surdeanu et al., 2012) use graphical models to select valid sentences in a bag to predict relations. However, the main disadvantage of conventional statistical and graphical methods is that features explicitly derived from NLP tools cause error propagation and low precision.
As deep learning techniques (Bengio, 2009; LeCun et al., 2015) have become widely used, plenty of work adopts deep neural networks for distant supervised relation extraction. (Zeng et al., 2015) proposed the piecewise convolution neural network to model sentence representations under the multi-instance learning framework, using piecewise pooling based on entity positions to capture structural information. (Lin et al., 2016) proposed sentence-level attention, which is expected to dynamically reduce the weights of noisy instances. (Ji et al., 2017) adopted a similar attention strategy and combined entity descriptions to calculate weights over sentences. A soft-label method has also been proposed to reduce the influence of noisy instances at the entity level. (Jat et al., 2018) used word-level and entity-based attention for efficient relation extraction. Owing to the effectiveness of the self-attention mechanism, a structured word-level self-attention and a sentence-level attention mechanism, both 2-D matrices, have been proposed to learn rich aspects of the data. Plenty of knowledge-based strategies for distant supervised relation extraction have also been proposed. (Ji et al., 2017) uses hierarchical information of relations for relation extraction and achieves significant performance. (Lei et al., 2018) proposed the Cooperative Denoising framework, which consists of two base networks leveraging a text corpus and a knowledge graph respectively. (Vashishth et al., 2018) proposed RESIDE, a distantly supervised neural relation extraction method which utilizes additional side information from knowledge bases to improve relation extraction. Other work aims to incorporate the hierarchical information of relations for distantly supervised relation extraction. Although these methods achieved significant improvements in relation extraction, they tend to treat entities in sentences equally, or rely more or less on knowledge base information which may be unavailable in other domains.
In order to alleviate the mislabeling problem and reduce the burden of integrating external knowledge and resources, we propose SHTCNN to provide better sentence representations and reduce the impact of NA instances.

Methodology
In this section, we introduce our SHTCNN model. The overall framework is shown in Figure 1. Our model is built under the multi-instance learning framework. It splits the training set into n bags {⟨h_1, t_1⟩, ⟨h_2, t_2⟩, ..., ⟨h_n, t_n⟩}, each of which contains m sentences {s_1, s_2, ..., s_m} mentioning the same head entity h_i and tail entity t_i. Note that the sentence number m may differ between bags. Each sentence consists of a sequence of k words {x_1, x_2, ..., x_k}. First, the sentence representation s_i is obtained using our separate head-tail convolution and pooling over the words {x_1, x_2, ..., x_k}. Next, a selective attention mechanism is used to dynamically merge sentences into the bag representation b_i of the entity pair ⟨h_i, t_i⟩. On the bag level, a binary classifier trained with focal loss filters out easy NA instances, leaving the rest to a multi-class classifier for further fine-grained classification.
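As an illustration, the bag construction above amounts to grouping sentences by their entity pair. A minimal sketch, using toy triples that echo the examples in Table 1 (the corpus here is invented for illustration):

```python
from collections import defaultdict

# Toy corpus of (head entity, tail entity, sentence) triples; the entity
# names and sentences are illustrative, echoing the examples in Table 1.
corpus = [
    ("Barack Obama", "United States",
     "Barack Obama was born in the United States."),
    ("Barack Obama", "United States",
     "Barack Obama was the 44th president of the United States."),
    ("Bill Gates", "Microsoft",
     "Bill Gates is the CEO of Microsoft."),
]

# Multi-instance learning groups all sentences mentioning the same
# entity pair into one bag.
bags = defaultdict(list)
for head, tail, sentence in corpus:
    bags[(head, tail)].append(sentence)
```

Each bag then receives a single relation label from the knowledge base, and its size m naturally varies with how often the pair co-occurs in text.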

Sentence Encoder

Word Representation
First, the i-th word x_i in a sentence is mapped to a d_w-dimensional word embedding e_i. Then, to keep track of the positions of the head and tail entities, two d_p-dimensional position embeddings (Zeng et al., 2014, 2015) are also adopted for each word, p^1_i and p^2_i, recording the distances to the two entities respectively. Thus, the final word representation is the concatenation of these three vectors:

    x_i = [e_i; p^1_i; p^2_i] ∈ R^{d_w + 2 d_p}.
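The word representation above can be sketched as follows; the dimensions, vocabulary size, and distance-clipping scheme are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

d_w, d_p = 50, 5           # word / position embedding sizes (assumed values)
vocab, max_dist = 100, 20  # hypothetical vocabulary size and distance clip

word_emb = rng.normal(0, 0.05, (vocab, d_w))
pos_emb1 = rng.normal(0, 0.05, (2 * max_dist + 1, d_p))  # distance to head
pos_emb2 = rng.normal(0, 0.05, (2 * max_dist + 1, d_p))  # distance to tail

def word_repr(word_ids, head_pos, tail_pos):
    """Concatenate each word embedding with its two relative-position
    embeddings, yielding a (sentence_len, d_w + 2 * d_p) matrix."""
    reps = []
    for i, w in enumerate(word_ids):
        d1 = np.clip(i - head_pos, -max_dist, max_dist) + max_dist
        d2 = np.clip(i - tail_pos, -max_dist, max_dist) + max_dist
        reps.append(np.concatenate([word_emb[w], pos_emb1[d1], pos_emb2[d2]]))
    return np.stack(reps)

# Toy sentence of 5 word ids; head entity at position 0, tail at position 4.
sent = word_repr([3, 17, 42, 8, 5], head_pos=0, tail_pos=4)
```

Relative distances are clipped to a fixed range so that both position embedding tables stay finite, a common convention in CNN relation extractors.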

Separate Head-Tail Convolution and Pooling
Convolution layers are often utilized in relation extraction to capture local features in sliding windows and then perform relation prediction globally. In detail, convolution is an operation between a convolution matrix W and a window vector q_i ∈ R^{l×d}, the concatenation of the representations of the l words in the i-th window of the sentence s = {w_1, w_2, ..., w_n}, where d is the dimension of the word representations defined above.
Because the window may cross the sentence boundary while sliding, we use the wide convolution technique, adding special padding tokens on both sides of the sentence. Thus the i-th convolution output p_i is computed as:

    p_i = W q_i + b,

where b is a bias vector. Conventional PCNN uses piecewise pooling for relation extraction, which divides the convolution output into three segments based on the positions of the head and tail entities. Piecewise pooling is defined as:

    [x]_j = max_{i ∈ seg_j} p_i,   j = 1, 2, 3,

where j indicates the position of the segment in the sentence.
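A minimal numeric sketch of the piecewise pooling step described above; placing the segment boundaries immediately after each entity position is an assumption of this illustration:

```python
import numpy as np

def piecewise_max_pool(conv_out, head_idx, tail_idx):
    """Max-pool each filter over the three segments of the sentence
    split at the entity positions.  conv_out: (seq_len, n_filters)."""
    a, b = sorted((head_idx, tail_idx))
    segments = [conv_out[:a + 1], conv_out[a + 1:b + 1], conv_out[b + 1:]]
    pooled = [seg.max(axis=0) if len(seg) else np.zeros(conv_out.shape[1])
              for seg in segments]
    return np.concatenate(pooled)  # (3 * n_filters,)

# Toy convolution output: 6 positions, 4 filters, values 0..23 row by row.
conv_out = np.arange(24, dtype=float).reshape(6, 4)
vec = piecewise_max_pool(conv_out, head_idx=1, tail_idx=3)
```

With this toy input, the three segment maxima are rows 1, 3, and 5, so the pooled vector keeps one strong activation per filter per segment.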
As mentioned above, traditional methods obtain the representation of each sentence using the same group of convolution filters, which attends to the head and tail entities equally and ignores the semantic difference between them. We instead use two separate groups of convolution filters W_1, W_2 ∈ R^{d_s×d}, where d_s is the sentence embedding size. Moreover, plain piecewise pooling cannot properly handle examples whose relations are similar but asymmetric. In detail, we utilize the two separate head-tail entity convolutions W_1, W_2 to represent the sentence s_i as p^1_i and p^2_i.
To exploit the semantic properties of relations expressed by entity pairs, we use separate head-tail entity pooling. Targeting the head and tail entities, head-entity pooling and tail-entity pooling are applied to the two convolution results respectively, with p^1_i and p^2_i further segmented by the positions of the entity pair. Head-entity pooling is defined as:

    [h_i]_j = max_{t ∈ seg_j} [p^1_i]_t,   j = 1, 2, 3.

Similarly, tail-entity pooling is defined as:

    [t_i]_j = max_{t ∈ seg_j} [p^2_i]_t,   j = 1, 2, 3.

The i-th sentence vector s_i is the concatenation of h_i and t_i:

    s_i = [h_i; t_i].

Finally, we apply a non-linear activation function such as ReLU to the output.
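Putting the pieces together, a toy sketch of the separate head-tail encoder might look like the following. All dimensions are invented, and the exact segmentation used for each pooling branch is an assumption of this illustration (here both branches pool over three entity-delimited segments, differing only in their filter groups):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_s, l = 60, 8, 3                   # input dim, filters per group, window

W1 = rng.normal(0, 0.1, (d_s, l * d))  # head-entity filter group
W2 = rng.normal(0, 0.1, (d_s, l * d))  # tail-entity filter group
b1 = np.zeros(d_s)
b2 = np.zeros(d_s)

def wide_conv(W, b, x):
    """Wide 1-D convolution: zero-pad so output length equals input length."""
    pad = np.zeros((l // 2, x.shape[1]))
    xp = np.vstack([pad, x, pad])
    windows = np.stack([xp[i:i + l].ravel() for i in range(len(x))])
    return windows @ W.T + b           # (seq_len, d_s)

def piecewise_pool(p, head_idx, tail_idx):
    """Max-pool each filter over three segments split at entity positions."""
    a, c = sorted((head_idx, tail_idx))
    segs = [p[:a + 1], p[a + 1:c + 1], p[c + 1:]]
    return np.concatenate([s.max(axis=0) if len(s) else np.zeros(p.shape[1])
                           for s in segs])

x = rng.normal(size=(7, d))                      # toy sentence, 7 words
h = piecewise_pool(wide_conv(W1, b1, x), 1, 5)   # head-targeted branch
t = piecewise_pool(wide_conv(W2, b2, x), 1, 5)   # tail-targeted branch
s = np.maximum(np.concatenate([h, t]), 0)        # ReLU on [h; t]
```

Because the two branches have independent filters, swapping the head and tail entities changes the sentence vector, which is what lets the encoder distinguish asymmetric relations.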

Selective Attention
Bags contain sentences sharing the same entity pair. To alleviate the mislabeling problem at the sentence level, we adopt selective attention, which is widely used in prior work (Lin et al., 2016; Ji et al., 2017; LeCun et al., 2015). The representation of the bag b_i is the weighted sum of all sentence vectors in the bag:

    b_i = Σ_j α_j s_j,   α_j = exp(e_j) / Σ_k exp(e_k),   e_j = s_j A r,

where α_j is the weight of sentence representation s_j, A is a diagonal matrix, and r is the relation query vector.
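A sketch of selective attention with a diagonal A, assuming the standard softmax scoring form from (Lin et al., 2016); the sentence vectors here are random placeholders:

```python
import numpy as np

def selective_attention(S, A_diag, r):
    """Bag representation as an attention-weighted sum of sentence vectors.
    S: (m, d_s) sentence vectors; A_diag: (d_s,) diagonal of A;
    r: (d_s,) relation query vector."""
    scores = S @ (A_diag * r)            # e_j = s_j A r with diagonal A
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()          # softmax attention weights
    return alpha @ S                     # b = sum_j alpha_j * s_j

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 6))              # a toy bag of 4 sentence vectors
b = selective_attention(S, np.ones(6), rng.normal(size=6))
```

When the scores are all equal, the weights become uniform and the bag vector reduces to the average of the sentence vectors, which recovers the AVE baseline discussed in the experiments.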

Coarse-to-Fine Relation Classification
Traditional methods directly predict relation classes for each bag after obtaining bag representations. However, the large number of NA instances, containing mixed semantic information, hinders performance. To alleviate the impact of NA instances, we employ a binary classifier to filter out as many NA instances as possible, leaving hard NA instances to the multi-class classifier. Binary classification can also be viewed as an auxiliary task deciding whether an input bag holds the NA relation. In this method, NA is treated as the negative class while all non-NA labels are treated as the positive class, and we adopt focal loss (Lin et al., 2017), which is designed to address the class imbalance problem, for NA/non-NA classification. When predicting the class label y ∈ {0, 1} of the binary task, we first define the prediction score p_t for the true class:

    p_t = p        if y = 1,
    p_t = 1 − p    otherwise,

where p is the predicted probability of the positive class. The traditional weighted cross-entropy loss is then defined as:

    CE(p_t) = −α log(p_t),

where α is a hyper-parameter usually set according to the class ratio. Focal loss modifies it by changing α to (1 − p_t)^γ in order to dynamically adjust the weights between well-classified easy instances and hard instances:

    FL(p_t) = −(1 − p_t)^γ log(p_t).

For easy instances, the prediction score p_t is high and the loss low, and vice versa for hard instances. As a result, focal loss focuses on the hard NA instances. Finally, instances predicted as non-NA are passed to the multi-class classifier for fine-grained classification. Since some hard NA instances remain, we also add an "NA class" in the multi-class classification to further filter out instances which do not hold an exact relation.
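The focal loss just described can be sketched numerically; γ = 2 is the common default from (Lin et al., 2017), and the probabilities below are illustrative:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss for NA/non-NA binary classification.
    p: predicted probability of the positive (non-NA) class; y: 0/1 label."""
    p_t = p if y == 1 else 1.0 - p          # prediction score for true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# A well-classified easy instance vs. a hard instance:
easy = focal_loss(0.95, 1)   # high p_t -> loss strongly down-weighted
hard = focal_loss(0.30, 1)   # low p_t  -> loss kept large, stays in focus
```

With γ = 0 the modulating factor disappears and the loss reduces to plain cross-entropy, which makes the down-weighting effect of γ > 0 easy to verify.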

Optimization
In this section, we introduce the learning and optimization details of our SHTCNN model. As shown in Figure 1, the binary and multi-class networks share only the word representations. We define the binary and multi-class labels as br ∈ {0, 1} and mr ∈ {0, 1, 2, ..., n} respectively, where 0 represents the NA class in both. In binary classification, 1 represents all non-NA classes, while in multi-class classification each non-zero number represents a certain non-NA relation. Besides, we use Θ_1 and Θ_2 to denote the parameters of the binary and multi-class classification networks respectively. The objective function of our model is:

    J(Θ_1, Θ_2) = L_b(br; Θ_1) + L_m(mr; Θ_2),

where L_b is the focal loss of the binary classifier, L_m is the cross-entropy loss of the multi-class classifier over its n + 1 classes, and n is the number of relation classes. All models are optimized using Stochastic Gradient Descent (SGD).
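A sketch of the joint objective for a single bag, assuming the binary branch uses the focal loss and the multi-class branch a plain cross-entropy (the function names and the additive combination without extra weighting are assumptions of this illustration):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss on the NA/non-NA decision."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def joint_objective(p_bin, br, p_multi, mr, gamma=2.0):
    """J(Theta_1, Theta_2) for one bag: focal loss of the binary branch
    plus cross-entropy of the multi-class branch on the gold relation."""
    l_bin = focal_loss(p_bin, br, gamma)
    l_multi = -np.log(p_multi[mr])      # cross-entropy on gold class mr
    return l_bin + l_multi

# Toy bag: binary net predicts non-NA with p=0.8; the multi-class net puts
# probability 0.7 on relation class 1 (the gold label).
loss = joint_objective(0.8, 1, np.array([0.1, 0.7, 0.2]), 1)
```

In training, both terms would be summed over bags and minimized jointly with SGD over Θ_1 and Θ_2.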

Experiments
In this section, we first introduce the dataset and evaluation metrics. Then we list our experimental parameter settings. Afterwards, we compare the performance of our method with feature-based and selected neural-based methods. Finally, a case study shows that our SHTCNN is an effective method for extracting better semantic features.

Dataset and Evaluation Metrics
We evaluate our model on the widely used New York Times (NYT) dataset released by (Riedel et al., 2010). The dataset was generated by aligning Freebase (Bollacker et al., 2008) relations with the New York Times corpus.

Comparison with Baseline Methods
Following previous work (Mintz et al., 2009; Lin et al., 2016; Ji et al., 2017), we evaluate our model with the held-out evaluation. It evaluates models by comparing the relational facts discovered from the test articles with those in Freebase, providing an approximate measure of precision without requiring expensive human evaluation. We draw precision-recall curves for all models and also report Precision@N results to further verify the effect of our SHTCNN model.
For a fair comparison of sentence encoders, we selected the following baselines:

• Mintz: Multi-class logistic regression model used by (Mintz et al., 2009) for distant supervision.

• MultiR: Probabilistic graphical model under the multi-instance learning framework proposed by (Hoffmann et al., 2011).

• MIMLRE: Graphical model jointly modeling multiple instances and multiple labels proposed by (Surdeanu et al., 2012).

• PCNN: CNN-based model under the multi-instance learning framework for distant relation extraction proposed by (Zeng et al., 2015).

• PCNN-ATT: CNN-based model with an additional sentence-level attention mechanism for distant supervision proposed by (Lin et al., 2016).

• SHTCNN: The framework proposed in this paper; please refer to Section 3 for details.

Experimental Settings Word and Position Embeddings
Our model uses pre-trained word embeddings for the NYT corpus. Embeddings of blank words are initialized with zeros, while those of unknown words are initialized from a normal distribution with standard deviation 0.05. Position embeddings are initialized with Xavier initialization for all models. The two parts of our model share the same word and position embeddings as inputs.

Parameter Settings
We use cross-validation to determine the parameters of our model, with a grid search over the learning rate λ for SGD among {0.5, 0.1, 0.01, 0.001}, the sliding window size l among {1, 3, 5, 7}, the sentence embedding size d_s among {100, 150, 200, 300, 350, 400}, and the batch size among {64, 128, 256, 512}. Other parameters proved to have little effect on the results. We show the optimal parameter settings in Table 2.

Figure 2 shows the overall performance of our proposed SHTCNN against the baselines mentioned above. From the results, we observe that: (1) When recall is smaller than 0.05, all models have reasonable precision. When recall is higher, the precision of feature-based models decreases sharply compared to that of neural-based methods.
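The grid search over the parameter ranges above can be enumerated with itertools.product; this sketch only builds the candidate configurations (the dictionary keys are illustrative names, and each configuration would then be scored by cross-validation):

```python
from itertools import product

# Enumerate the grid-search ranges reported above.
grid = {
    "lr": [0.5, 0.1, 0.01, 0.001],          # SGD learning rate
    "window": [1, 3, 5, 7],                 # sliding window size l
    "d_s": [100, 150, 200, 300, 350, 400],  # sentence embedding size
    "batch": [64, 128, 256, 512],           # batch size
}

configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
```

This yields 4 × 4 × 6 × 4 = 384 candidate settings, small enough to sweep exhaustively.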

Top N Precision
We also conduct Precision@N tests on entity pairs with few instances. Three settings are used: ONE randomly selects one instance in each bag; TWO randomly selects two instances for each entity pair; ALL uses all instances in the bag for evaluation. From the results we observe that: (1) The performance of all models improves as the instance number increases, which shows that the more sentences are selected in the bag, the more information can be utilized.
(2) SHTCNN improves precision by over 8% over the PCNN, PCNN-AVE, and PCNN-ATT models. This indicates that, on noisy textual datasets, our SHTCNN is a more powerful sentence encoder that captures better semantic features.
(3) The Average method improves slowly as the instance number increases, which indicates that it cannot effectively extract relations and is easily distracted by noise in the bag.

Effectiveness of Separate Head-Tail CNN
To further verify the contribution and effectiveness of the two phases of our SHTCNN, we conduct two extra experiments. First, we evaluate the ability of our model to capture better sentence semantic features under different bag representation calculation methods. PCNN-AVE (Average) assumes that all sentences in a bag contribute equally to the representation of the bag, which brings in more noise from mislabeled sentences; as shown in Table 3, it accordingly underperforms PCNN-ATT in relation extraction. We evaluate our model using Average and Attention respectively. From the results in Figure 3, we observe that: (1) Both SHTCNN-AVE and SHTCNN-ATT achieve significantly better performance than their corresponding baselines, which proves that SHTCNN offers better sentence semantic features for bag representation with or without the selective attention mechanism.
(2) SHTCNN-AVE achieves performance similar to PCNN-ATT when recall is between 0.15 and 0.35. (3) When recall is greater than 0.35, SHTCNN-AVE performs even better than PCNN-ATT, which demonstrates that SHTCNN is relatively more robust and stable when dealing with noisier sentences.

Second, we explore the effect of separate head-tail convolution and pooling and the contribution of coarse-to-fine relation extraction. From the results shown in Figure 4, we observe that: (1) Both HT-ATT and Coarse-to-Fine improve the performance of PCNN-ATT over a wide range of recall, which indicates that separate head-tail convolution and pooling, as well as the coarse-to-fine strategy, perform better at predicting relations. (2) Figure 4 and Table 3 both show that separate head-tail convolution and pooling achieves much better results than the coarse-to-fine strategy alone, indicating that a better sentence encoder is more important in noisy environments.
(3) Our full model SHTCNN improves performance on the entire recall compared to using separate parts (solely separate head-tail convolution and pooling or only coarse-to-fine) of our model which suggests that combining two proposed methods together can achieve better results.

Case Study
In Table 4, we show some example corrections made by our SHTCNN model compared to traditional PCNN. To the left of the arrow is the class label predicted by PCNN for the sentence below; to the right is our prediction. We observe that the first sentence is labeled as /business/company/founders by both PCNN and SHTCNN, since close entities yield similar position embeddings, which benefits both models. The second sentence is similar but does not hold the relation; PCNN fails to recognize this, while SHTCNN corrects the label. Finally, the last sentence is longer and its entities are not as close as those in the first two sentences. Our model outperforms PCNN by successfully assigning the correct label to the sentence, which indicates that SHTCNN performs better at modeling relations in relatively long sentences.

Conclusion
In this paper, we propose SHTCNN, a novel neural framework which uses separate head-tail convolution and pooling for sentence encoding and classifies relations from coarse to fine. Extensive experiments show that, in our framework, separate head-tail convolution and pooling captures sentence semantic features better than baseline methods, even in noisier environments. Besides, the coarse-to-fine relation extraction strategy further improves and stabilizes the performance of our model. In the future, we will explore the following directions: (1) We will explore effective separate head-tail convolution and pooling on other sentence encoders such as RNNs. (2) Coarse-to-fine classification is an experimental method; we plan to further investigate noise sources in distant supervised datasets. (3) It will be promising to incorporate well-designed attention and self-attention mechanisms into the two parts of our framework to further improve performance. All code and data are available at: https://bit.ly/ds-shtcnn.