Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks

Two problems arise when using distant supervision for relation extraction. First, in this method, an already existing knowledge base is heuristically aligned to texts, and the alignment results are treated as labeled data. However, the heuristic alignment can fail, resulting in wrong label problem. In addition, in previous approaches, statistical models have typically been applied to ad hoc features. The noise that originates from the feature extraction process can cause poor performance. In this paper, we propose a novel model dubbed the Piecewise Convolu-tional Neural Networks (PCNNs) with multi-instance learning to address these two problems. To solve the ﬁrst problem, distant supervised relation extraction is treated as a multi-instance problem in which the uncertainty of instance labels is taken into account. To address the latter problem, we avoid feature engineering and instead adopt convolutional architecture with piecewise max pooling to automatically learn relevant features. Experiments show that our method is effective and outperforms several competitive base-line methods.


Introduction
In relation extraction, one challenge that is faced when building a machine learning system is the generation of training examples. One common technique for coping with this difficulty is distant supervision (Mintz et al., 2009) which assumes that if two entities have a relationship in a known knowledge base, then all sentences that mention these two entities will express that relationship in some way. Figure 1 shows an example of the auto- matic labeling of data through distant supervision. In this example, Apple and Steve Jobs are two related entities in Freebase 1 . All sentences that contain these two entities are selected as training instances. The distant supervision strategy is an effective method of automatically labeling training data. However, it has two major shortcomings when used for relation extraction. First, the distant supervision assumption is too strong and causes the wrong label problem. A sentence that mentions two entities does not necessarily express their relation in a knowledge base. It is possible that these two entities may simply share the same topic. For instance, the upper sentence indeed expresses the "company/founders" relation in Figure 1. The lower sentence, however, does not express this relation but is still selected as a training instance. This will hinder the performance of a model trained on such noisy data.
Second, previous methods (Mintz et al., 2009;Riedel et al., 2010;Hoffmann et al., 2011) have typically applied supervised models to elaborately designed features when obtained the labeled data through distant supervision. These features are often derived from preexisting Natural Language Processing (NLP) tools. Since errors inevitably exist in NLP tools, the use of traditional features leads to error propagation or accumulation. Distant supervised relation extraction generally ad- dresses corpora from the Web, including many informal texts. Figure 2 shows the sentence length distribution of a benchmark distant supervision dataset that was developed by Riedel et al. (2010). Approximately half of the sentences are longer than 40 words. McDonald and Nivre (2007) showed that the accuracy of syntactic parsing decreases significantly with increasing sentence length. Therefore, when using traditional features, the problem of error propagation or accumulation will not only exist, it will grow more serious.
In this paper, we propose a novel model dubbed Piecewise Convolutional Neural Networks (PC-NNs) with multi-instance learning to address the two problems described above. To address the first problem, distant supervised relation extraction is treated as a multi-instance problem similar to previous studies (Riedel et al., 2010;Hoffmann et al., 2011;Surdeanu et al., 2012). In multi-instance problem, the training set consists of many bags, and each contains many instances. The labels of the bags are known; however, the labels of the instances in the bags are unknown. We design an objective function at the bag level. In the learning process, the uncertainty of instance labels can be taken into account; this alleviates the wrong label problem.
To address the second problem, we adopt convolutional architecture to automatically learn relevant features without complicated NLP preprocessing inspired by Zeng et al. (2014). Our proposal is an extension of Zeng et al. (2014), in which a single max pooling operation is utilized to determine the most significant features. Although this operation has been shown to be effective for textual feature representation (Collobert et al., 2011;Kim, 2014), it reduces the size of the hidden layers too rapidly and cannot capture the structural information between two entities (Graham, 2014). For example, to identify the relation between Steve Jobs and Apple in Figure 1, we need to specify the entities and extract the structural features between them. Several approaches have employed manually crafted features that attempt to model such structural information. These approaches usually consider both internal and external contexts. A sentence is inherently divided into three segments according to the two given entities. The internal context includes the characters inside the two entities, and the external context involves the characters around the two entities . Clearly, single max pooling is not sufficient to capture such structural information. To capture structural and other latent information, we divide the convolution results into three segments based on the positions of the two given entities and devise a piecewise max pooling layer instead of the single max pooling layer. The piecewise max pooling procedure returns the maximum value in each segment instead of a single maximum value over the entire sentence. Thus, it is expected to exhibit superior performance compared with traditional methods.
The contributions of this paper can be summarized as follows.
• We explore the feasibility of performing distant supervised relation extraction without hand-designed features. PCNNS are proposed to automatically learn features without complicated NLP preprocessing.
• To address the wrong label problem, we develop innovative solutions that incorporate multi-instance learning into the PCNNS for distant supervised relation extraction.
• In the proposed network, we devise a piecewise max pooling layer, which aims to capture structural information between two entities.

Related Work
Relation extraction is one of the most important topics in NLP. Many approaches to relation extraction have been developed, such as bootstrapping, unsupervised relation discovery and supervised classification. Supervised approaches are the most commonly used methods for relation extraction and yield relatively high performance (Bunescu and Mooney, 2006;Zelenko et al., 2003;Zhou et al., 2005). In the supervised paradigm, relation extraction is considered to be a multi-class classification problem and may suffer from a lack of labeled data for training. To address this problem, Mintz et al. (2009) adopted Freebase to perform distant supervision. As described in Section 1, the algorithm for training data generation is sometimes faced with the wrong label problem.
To address this shortcoming, (Riedel et al., 2010;Hoffmann et al., 2011;Surdeanu et al., 2012) developed the relaxed distant supervision assumption for multi-instance learning. The term 'multiinstance learning was coined by (Dietterich et al., 1997) while investigating the problem of predicting drug activity. In multi-instance learning, the uncertainty of instance labels can be taken into account. The focus of multi-instance learning is to discriminate among the bags.
These methods have been shown to be effective for relation extraction. However, their performance depends strongly on the quality of the designed features. Most existing studies have concentrated on extracting features to identify the relations between two entities. Previous methods can be generally categorized into two types: feature-based methods and kernel-based methods. In feature-based methods, a diverse set of strategies is exploited to convert classification clues (e.g., sequences, parse trees) into feature vectors (Kambhatla, 2004;Suchanek et al., 2006). Feature-based methods suffer from the necessity of selecting a suitable feature set when converting structured representations into feature vectors. Kernel-based methods provide a natural alternative to exploit rich representations of input classification clues, such as syntactic parse trees. Kernelbased methods enable the use of a large set of features without needing to extract them explicitly. Several kernels have been proposed, such as the convolution tree kernel (Qian et al., 2008), the subsequence kernel (Bunescu and Mooney, 2006) and the dependency tree kernel (Bunescu and Mooney, 2005).
Nevertheless, as mentioned in Section 1, it is difficult to design high-quality features using existing NLP tools. With the recent revival of interest in neural networks, many researchers have investigated the possibility of using neural networks to automatically learn features (Socher et al., 2012;Zeng et al., 2014). Inspired by Zeng et al. (2014), we propose the use of PCNNs with multi-instance learning to automatically learn features for distant supervised relation extraction. Dietterich et al. (1997) suggested that the design of multi-instance modifications for neural networks is a particularly interesting topic.  successfully incorporated multiinstance learning into traditional Backpropagation (BP) and Radial Basis Function (RBF) networks and optimized these networks by minimizing a sum-of-squares error function. In contrast to their method, we define the objective function based on the cross-entropy principle.

Methodology
Distant supervised relation extraction is formulated as multi-instance problem. In this section, we present innovative solutions that incorporate multi-instance learning into a convolutional neural network to fulfill this task. PCNNs are proposed for the automatic learning of features without complicated NLP preprocessing. Figure 3 shows our neural network architecture for distant supervised relation extraction. It illustrates the procedure that handles one instance of a bag. This procedure includes four main parts: Vector Representation, Convolution, Piecewise Max Pooling and Softmax Output. We describe these parts in detail below.

Vector Representation
The inputs of our network are raw word tokens. When using neural networks, we typically transform word tokens into low-dimensional vectors. In our method, each input word token is transformed into a vector by looking up pre-trained word embeddings. Moreover, we use position features (PFs) to specify entity pairs, which are also transformed into vectors by looking up position embeddings.

Word Embeddings
Word embeddings are distributed representations of words that map each word in a text to a 'k'dimensional real-valued vector. They have recently been shown to capture both semantic and syntactic information about words very well, setting performance records in several word similarity tasks (Mikolov et al., 2013;Pennington et al., 2014). Using word embeddings that have been trained a priori has become common practice for A common method of training a neural network is to randomly initialize all parameters and then optimize them using an optimization algorithm. Recent research (Erhan et al., 2010) has shown that neural networks can converge to better local minima when they are initialized with word embeddings. Word embeddings are typically learned in an entirely unsupervised manner by exploiting the co-occurrence structure of words in unlabeled text. Researchers have proposed several methods of training word embeddings (Bengio et al., 2003;Collobert et al., 2011;Mikolov et al., 2013). In this paper, we use the Skip-gram model (Mikolov et al., 2013) to train word embeddings.

Position Embeddings
In relation extraction, we focus on assigning labels to entity pairs. Similar to Zeng et al. (2014), we use PFs to specify entity pairs. A PF is defined as the combination of the relative distances from the current word to e 1 and e 2 . For instance, in the following example, the relative distances from son to e 1 (Kojo Annan) and e 2 (Kofi Annan) are 3 and -2, respectively.

-2
Two position embedding matrixes (PF 1 and PF 2 ) are randomly initialized. We then transform the relative distances into real valued vectors by looking up the position embedding matrixes. In the example shown in Figure 3, it is assumed that the size of the word embedding is d w = 4 and that the size of the position embedding is d p = 1. In combined word embeddings and position embeddings, the vector representation part transforms an instance into a matrix S ∈ R s×d , where s is the sentence length and d = d w + d p * 2. The matrix S is subsequently fed into the convolution part.

Convolution
In relation extraction, an input sentence that is marked as containing the target entities corresponds only to a relation type; it does not predict labels for each word. Thus, it might be necessary to utilize all local features and perform this prediction globally. When using a neural network, the convolution approach is a natural means of merging all these features (Collobert et al., 2011).
Convolution is an operation between a vector of weights, w, and a vector of inputs that is treated as a sequence q. The weights matrix w is regarded as the filter for the convolution. In the example shown in Figure 3, we assume that the length of the filter is w (w = 3); thus, w ∈ R m (m = w * d).
We consider S to be a sequence {q 1 , q 2 , · · · , q s }, where q i ∈ R d . In general, let q i:j refer to the concatenation of q i to q j . The convolution operation involves taking the dot product of w with each w-gram in the sequence q to obtain another sequence c ∈ R s+w−1 : where the index j ranges from 1 to s + w − 1. Outof-range input values q i , where i < 1 or i > s, are taken to be zero. The ability to capture different features typically requires the use of multiple filters (or feature maps) in the convolution. Under the assumption that we use n filters (W = {w 1 , w 2 , · · · , w n }), the convolution operation can be expressed as follows: The convolution result is a matrix C = {c 1 , c 2 , · · · , c n } ∈ R n×(s+w−1) . Figure 3 shows an example in which we use 3 different filters in the convolution procedure.

Piecewise Max Pooling
The size of the convolution output matrix C ∈ R n×(s+w−1) depends on the number of tokens s in the sentence that is fed into the network. To apply subsequent layers, the features that are extracted by the convolution layer must be combined such that they are independent of the sentence length. In traditional Convolution Neural Networks (CNNs), max pooling operations are often applied for this purpose (Collobert et al., 2011;Zeng et al., 2014). This type of pooling scheme naturally addresses variable sentence lengths. The idea is to capture the most significant features (with the highest values) in each feature map. However, despite the widespread use of single max pooling, this approach is insufficient for relation extraction. As described in the first section, single max pooling reduces the size of the hidden layers too rapidly and is too coarse to capture finegrained features for relation extraction. In addition, single max pooling is not sufficient to capture the structural information between two entities. In relation extraction, an input sentence can be divided into three segments based on the two selected entities. Therefore, we propose a piecewise max pooling procedure that returns the maximum value in each segment instead of a single maximum value. As shown in Figure 3, the output of each convolutional filter c i is divided into three segments {c i1 , c i2 , c i3 } by Kojo Annan and Kofi Annan. The piecewise max pooling procedure can be expressed as follows: For the output of each convolutional filter, we can obtain a 3-dimensional vector p i = {p i1 , p i2 , p i3 }. We then concatenate all vectors p 1:n and apply a non-linear function, such as the hyperbolic tangent. Finally, the piecewise max pooling procedure outputs a vector: g = tanh(p 1:n ) where g ∈ R 3n . The size of g is fixed and is no longer related to the sentence length.

Softmax Output
To compute the confidence of each relation, the feature vector g is fed into a softmax classifier.
W 1 ∈ R n 1 ×3n is the transformation matrix, and o ∈ R n 1 is the final output of the network, where n 1 is equal to the number of possible relation types for the relation extraction system. We employ dropout (Hinton et al., 2012) on the penultimate layer for regularization. Dropout prevents the co-adaptation of hidden units by randomly dropping out a proportion p of the hidden units during forward computing. We first apply a "masking" operation (g•r) on g, where r is a vector of Bernoulli random variables with probability p of being 1. Eq.(5) becomes: Each output can then be interpreted as the confidence score of the corresponding relation. This score can be interpreted as a conditional probability by applying a softmax operation (see Section 3.5). In the test procedure, the learned weight vectors are scaled by p such thatŴ 1 = pW 1 and are used (without dropout) to score unseen instances.

Multi-instance Learning
In order to alleviate the wrong label problem, we use multi-instance learning for PCNNs. The PCNNs-based relation extraction can be stated as a quintuple θ = (E, PF 1 , PF 2 , W, W 1 ) 2 . The input to the network is a bag. Suppose that there are T bags {M 1 , M 2 , · · · , M T } and that the i-th bag contains q i instances M i = {m 1 i , m 2 i , · · · , m q i i }. The objective of multi-instance learning is to predict the labels of the unseen bags. In this paper, all instances in a bag are considered independently. Given an input instance m j i , the network with the parameter θ outputs a vector o, where the r-th component o r corresponds to the score associated Algorithm 1 Multi-instance learning 1: Initialize θ. Partition the bags into minibatches of size b s . 2: Randomly choose a mini-batch, and feed the bags into the network one by one. 3: Find the j-th instance m j i (1 ≤ i ≤ b s ) in each bag according to Eq. (9). 4: Update θ based on the gradients of m j i (1 ≤ i ≤ b s ) via Adadelta. 5: Repeat steps 2-4 until either convergence or the maximum number of epochs is reached.
with relation r. To obtain the conditional probability p(r|m, θ), we apply a softmax operation over all relation types: The objective of multi-instance learning is to discriminate bags rather than instances. To do so, we must define the objective function on the bags. Given all (T ) training bags (M i , y i ), we can define the objective function using cross-entropy at the bag level as follows: where j is constrained as follows: Using this defined objective function, we maximize J(θ) through stochastic gradient descent over shuffled mini-batches with the Adadelta (Zeiler, 2012) update rule. The entire training procedure is described in Algorithm 1. From the introduction presented above, we know that the traditional backpropagation algorithm modifies a network in accordance with all training instances, whereas backpropagation with multi-instance learning modifies a network based on bags. Thus, our method captures the nature of distant supervised relation extraction, in which some training instances will inevitably be incorrectly labeled. When a trained PCNN is used for prediction, a bag is positively labeled if and only if the output of the network on at least one of its instances is assigned a positive label.

Experiments
Our experiments are intended to provide evidence that supports the following hypothesis: automatically learning features using PCNNs with multiinstance learning can lead to an increase in performance. To this end, we first introduce the dataset and evaluation metrics used. Next, we test several variants via cross-validation to determine the parameters to be used in our experiments. We then compare the performance of our method to those of several traditional methods. Finally, we evaluate the effects of piecewise max pooling and multiinstance learning 3 .

Dataset and Evaluation Metrics
We evaluate our method on a widely used dataset 4 that was developed by (Riedel et al., 2010) and has also been used by (Hoffmann et al., 2011;Surdeanu et al., 2012). This dataset was generated by aligning Freebase relations with the NYT corpus, with sentences from the years 2005-2006 used as the training corpus and sentences from 2007 used as the testing corpus.
Following previous work (Mintz et al., 2009), we evaluate our method in two ways: the held-out evaluation and the manual evaluation. The heldout evaluation only compares the extracted relation instances against Freebase relation data and reports the precision/recall curves of the experiments. In the manual evaluation, we manually check the newly discovered relation instances that are not in Freebase.

Pre-trained Word Embeddings
In this paper, we use the Skip-gram model (word2vec) 5 to train the word embeddings on the NYT corpus. Word2vec first constructs a vocabulary from the training text data and then learns vector representations of the words. To obtain the embeddings of the entities, we concatenate the tokens of a entity using the ## operator when the entity has multiple word tokens. Since a comparison of the word embeddings is beyond the scope

Parameter Settings
In this section, we experimentally study the effects of two parameters on our models: the window size, w, and the number of feature maps, n.
Following (Surdeanu et al., 2012), we tune all of the models using three-fold validation on the training set. We use a grid search to determine the optimal parameters and manually specify subsets of the parameter spaces: w ∈ {1, 2, 3, · · · , 7} and n ∈ {50, 60, · · · , 300}. Table 1 shows all parameters used in the experiments. Because the position dimension has little effect on the result, we heuristically choose d p = 5. The batch size is fixed to 50. We use Adadelta (Zeiler, 2012) in the update procedure; it relies on two main parameters, ρ and ε, which do not significantly affect the performance (Zeiler, 2012). Following (Zeiler, 2012), we choose 0.95 and 1e −6 , respectively, as the values of these parameters. In the dropout operation, we randomly set the hidden unit activities to zero with a probability of 0.5 during training.

Held-out Evaluation
The held-out evaluation provides an approximate measure of precision without requiring costly human evaluation. Half of the Freebase relations are used for testing. The relation instances discovered from the test articles are automatically compared with those in Freebase.
To evaluate the proposed method, we select the following three traditional methods for comparison. Mintz represents a traditional distantsupervision-based model that was proposed by (Mintz et al., 2009). MultiR is a multi-instance learning method that was proposed by (Hoffmann et al., 2011). MIML is a multi-instance multilabel model that was proposed by (Surdeanu et al., 2012). Figure 4 shows the precision-recall curves for each method, where PCNNs+MIL denotes our method, and demonstrates that PCNNs+MIL achieves higher precision over the entire range of recall. PCNNs+MIL enhances the recall to ap-   proximately 34% without any loss of precision. In terms of both precision and recall, PCNNs+MIL outperforms all other evaluated approaches. Notably, the results of the methods evaluated for comparison were obtained using manually crafted features. By contrast, our result is obtained by automatically learning features from original words. The results demonstrate that the proposed method is an effective technique for distant supervised relation extraction. Automatically learning features via PCNNs can alleviate the error propagation that occurs in traditional feature extraction. Incorporating multi-instance learning into a convolutional neural network is an effective means of addressing the wrong label problem.

Manual Evaluation
It is worth emphasizing that there is a sharp decline in the held-out precision-recall curves of PC-NNs+MIL at very low recall (Figure 4). A manual check of the misclassified examples that were produced with high confidence reveals that the ma-jorities of these examples are false negatives and are actually true relation instances that were misclassified due to the incomplete nature of Freebase. Thus, the held-out evaluation suffers from false negatives in Freebase. We perform a manual evaluation to eliminate these problems. For the manual evaluation, we choose the entity pairs for which at least one participating entity is not present in Freebase as a candidate. This means that there is no overlap between the held-out and manual candidates. Because the number of relation instances that are expressed in the test data is unknown, we cannot calculate the recall in this case. Instead, we calculate the precision of the top N extracted relation instances. Table 2 presents the manually evaluated precisions for the top 100, top 200, and top 500 extracted instances. The results show that PC-NNs+MIL achieves the best performance; moreover, the precision is higher than in the held-out evaluation. This finding indicates that many of the false negatives that we predict are, in fact, true relational facts. The sharp decline observed in the held-out precision-recall curves is therefore reasonable.

Effect of Piecewise Max Pooling and
Multi-instance Learning In this paper, we develop a method of piecewise max pooling and incorporate multi-instance learning into convolutional neural networks for distant supervised relation extraction. To demonstrate the effects of these two techniques, we empirically study the performance of systems in which these techniques are not implemented through held-out evaluations ( Figure 5). CNNs represents convolutional neural networks to which single max pooling is applied. Figure 5 shows that when piecewise max pooling is used (PCNNs), better results are produced than those achieved using CNNs. Moreover, compared with CNNs+MIL, PCNNs achieve slightly higher precision when the recall is greater than 0.08. Since the parameters for all the model are determined by grid search, we can observe that CNNs cannot achieve competitive results compared to PCNNs when increasing the size of the hidden layer of convolutional neural networks. It means that we cannot capture more useful information by simply increasing the network parameter. These results demonstrate that the proposed piecewise max pooling technique is beneficial and Figure 5: Effect of piecewise max pooling and multi-instance learning.
can effectively capture structural information for relation extraction. A similar phenomenon is also observed when multi-instance learning is added to the network. Both CNNs+MIL and PCNNs+MIL outperform their counterparts CNNs and PCNNs, respectively, thereby demonstrating that incorporation of multi-instance learning into our neural network was successful in solving the wrong label problem. As expected, PCNNs+MIL obtains the best results because the advantages of both techniques are achieved simultaneously.

Conclusion
In this paper, we exploit Piecewise Convolutional Neural Networks (PCNNs) with multi-instance learning for distant supervised relation extraction. In our method, features are automatically learned without complicated NLP preprocessing. We also successfully devise a piecewise max pooling layer in the proposed network to capture structural information and incorporate multi-instance learning to address the wrong label problem. Experimental results show that the proposed approach offers significant improvements over comparable methods.