Relation Extraction: Perspective from Convolutional Neural Networks

Up to now, relation extraction systems have made extensive use of features generated by linguistic analysis modules. Errors in these features lead to errors in relation detection and classification. In this work, we depart from these traditional approaches with complicated feature engineering by introducing a convolutional neural network for relation extraction that automatically learns features from sentences and minimizes the dependence on external toolkits and resources. Our model takes advantage of multiple window sizes for filters and pre-trained word embeddings as an initializer on a non-static architecture to improve the performance. We emphasize the relation extraction problem with an unbalanced corpus. The experimental results show that our system significantly outperforms not only the best baseline systems for relation extraction but also the state-of-the-art systems for relation classification.


Introduction
Learning to extract semantic relations between entity pairs from text plays a vital role in information extraction, knowledge base population and question answering, to name a few. The relation extraction (RE) task can be divided into two steps: detecting if a relation utterance corresponding to some entity mention pair of interest in the same sentence represents some relation and classifying the detected relation mentions into some predefined classes. If we only need to categorize the given relation mentions that are known to express some expected relation (perfect detection), we are left with the relation classification (RC) task. One variation of relation classification is that one might have non-relation examples in the dataset but the number of those is comparable to the number of the other examples. The non-relation examples, therefore, can be treated as a usual relation class. Relation extraction, on the other hand, often comes with a tremendously unbalanced dataset where the number of the non-relation examples far exceeds the others, making relation extraction more challenging but more practical than relation classification. Our present work focuses on the relation extraction task with an unbalanced corpus.
In the last decade, the relation extraction literature has been dominated by two methods, distinguished by the nature of the relation representation: the feature-based method (Kambhatla, 2004; Boschee et al., 2005; Zhou et al., 2005; Grishman et al., 2005; Jiang and Zhai, 2007; Chan and Roth, 2010; Sun et al., 2011; Nguyen and Grishman, 2014) and the kernel-based method (Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhang et al., 2006; Zhou et al., 2007; Qian et al., 2008; Nguyen et al., 2009; Sun and Han, 2014). The common characteristic of these methods is that they leverage a large body of linguistic analysis and knowledge resources to transform relation mentions into a rich representation to be used by a statistical classifier such as Support Vector Machines (SVM) or Maximum Entropy (MaxEnt). The linguistic analysis pipeline, which is itself hand-designed, includes tokenization, part-of-speech tagging, chunking, name tagging and parsing, often performed by existing natural language processing (NLP) modules.
While these methods allow the RE systems to inherit the knowledge discovered by the NLP community for the pre-processing tasks, they might be subject to the error propagation introduced by the imperfect quality of the supervised NLP toolkits. For instance, all the tasks mentioned in the pipeline above are known to suffer from a performance loss when they are applied to out-of-domain data (Blitzer et al., 2006; Daumé III, 2007; McClosky et al., 2010), causing the collapse of the RE systems based on them. In this paper, we target an independent RE system that both avoids complicated feature engineering and minimizes the reliance on the supervised NLP modules for features, potentially alleviating the error propagation and advancing our performance in this area.
To be concrete, our relation extraction system is provided only with raw sentences marked with the positions of the two entities of interest. The only elements we can derive from this structure are the words, the n-grams and their positions in the sentences, suggesting a paradigm in which relation mentions are represented by features that depend on these elements. Consequently, word embeddings that are capable of capturing latent semantic and syntactic properties of words (Bengio et al., 2001; Mnih and Hinton, 2007; Collobert and Weston, 2008; Mnih and Hinton, 2009; Turian et al., 2010; Mikolov et al., 2013) and convolutional neural networks (CNNs) that are able to recognize specific classes of n-grams and induce more abstract representations (Kalchbrenner et al., 2014) are a natural combination one should apply to obtain more effective representations for RE in this setting.
Convolutional neural networks (dating back to the 1980s) are a type of feed-forward artificial neural network whose layers are formed by a convolution operation followed by a pooling operation (LeCun et al., 1988; Kalchbrenner et al., 2014). Recently, with the emerging interest of the community in deep learning, CNNs have been revived and effectively applied in various NLP tasks, including semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling and classification (Kalchbrenner et al., 2014; Kim, 2014), name tagging and semantic role labeling (Collobert et al., 2011). For relation classification and extraction, there are two very recent works on CNNs for relation classification (Liu et al., 2013; Zeng et al., 2014); however, to the best of our knowledge, there has been no work on employing CNNs for relation extraction so far. This paper is the first attempt to fill that gap and serves as a baseline for future research in this area.
Our convolutional neural network is built upon those of Kalchbrenner et al. (2014) and Kim (2014), which were originally proposed for sentence classification and modeling. We adapt the network for relation extraction by introducing the position embeddings to encode the relative distances of the words in the sentence to the two entities of interest. Compared to the models in Liu et al. (2013) and Zeng et al. (2014) for relation classification that apply a single window size, our model for relation extraction incorporates various window sizes for convolutional filters, allowing the network to capture wider ranges of n-grams that are helpful for relation extraction. In addition, rather than initializing the word embeddings randomly as do Liu et al. (2013) and fixing the randomly generated position embeddings during training as do Zeng et al. (2014), we use pretrained word embeddings for initialization and optimize both word embeddings and position embeddings as model parameters. More importantly, rather than using exterior features (either from human annotation or other pre-processing modules) to enrich the representation as do Liu et al. (2013) and Zeng et al. (2014), our model (adapted for RC where entity heads are given) avoids usage of manual linguistic resources and supervised NLP toolkits constructed externally, utilizing word embeddings that can be trained automatically in an unsupervised framework as the only external resource for the whole system. We explore different model architectures systematically and demonstrate that the best model performance is achieved when multiple window sizes are implemented and the word embeddings, once initialized by some "universal" embeddings, are allowed to vary during the optimization process to reach an effective state for relation extraction. We evaluate our models on both relation classification and relation extraction tasks.
For relation classification, experiments show that our model (without any external features and resources) outperforms the state-of-the-art models whether the external features are included in these models or not. For relation extraction, our model is significantly better than the baseline models that use the words and the embeddings themselves as the features. In the following, we discuss related work in Section 2 and present our model in Section 3. We detail an extensive evaluation in Section 4 and finally conclude in Section 5.

Related Work
As our present work focuses on the supervised framework for relation extraction, we concentrate on the supervised systems in this section. Besides the supervised systems (either feature-based or kernel-based) mentioned above, some recent systems have employed the distant supervision (DS) approach for relation extraction. This approach is essentially similar to the traditional systems in representing relation mentions but attempts to generate training data automatically by leveraging large knowledge bases of facts and corpora (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012).
Regarding neural networks, their first application to NLP was language modeling, which has been useful for learning distributed representations (embeddings) for words (Bengio et al., 2001; Mnih and Hinton, 2007; Collobert and Weston, 2008; Mnih and Hinton, 2009; Turian et al., 2010; Mikolov et al., 2013). These word embeddings have opened a new direction for many other NLP tasks grounded on neural networks. Some of them are mentioned above. Other than that, a class of recursive neural networks (RNNs) and neural tensor networks have been proposed for paraphrase detection (Socher et al., 2011), parsing (Socher et al., 2013a), sentiment analysis (Socher et al., 2013b), knowledge base completion (Socher et al., 2013c), question answering (Mohit et al., 2014), etc. Among these RNN systems, the study that is most related to our relation extraction problem is Socher et al. (2012), which learns compositional vector representations for phrases and sentences through syntactic parse trees and applies these representations to relation classification. However, this method inherently requires syntactic parse trees, in contrast to our target of avoiding the use of any external features and resources for RC.

Convolutional Neural Network for Relation Extraction
Our convolutional neural network for relation extraction consists of four main layers: (i) the look-up tables to encode words in sentences by real-valued vectors, (ii) the convolutional layer to recognize n-grams, (iii) the pooling layer to determine the most relevant features and (iv) a logistic regression layer (a fully connected neural network with a softmax at the end) to perform classification (Collobert et al., 2011; Kim, 2014; Kalchbrenner et al., 2014).

Word Representation
The input to the CNN for relation extraction consists of sentences marked with the two entity mentions of interest. As CNNs can only work with fixed-length inputs, we compute the maximal separation between entity mentions linked by a relation and choose an input width greater than this distance. We ensure that every input (relation mention) has this length by trimming longer sentences and padding shorter sentences with a special token. Let $n$ be the length of the relation mentions and $x = [x_1, x_2, \ldots, x_n]$ be some relation mention where $x_i$ is the $i$-th word in the mention. Also, let $x_{i_1}$ and $x_{i_2}$ be the heads of the two entity mentions of interest. Before entering the network, each word $x_i$ is first transformed into a vector $e_i$ by looking up the word embedding table $W$, which can be initialized either by a random process or by some pretrained word embeddings. In addition, in order to embed the positions of the two entity heads as well as the other words in the relation mention into the representation, for each word $x_i$, its relative distances to the two entity heads, $i - i_1$ and $i - i_2$, are also mapped into real-valued vectors $d_{i_1}$ and $d_{i_2}$ respectively using a position embedding table $D$ (initialized randomly) (Collobert et al., 2011; Liu et al., 2013; Zeng et al., 2014). Note that the relative distances only range from $-n + 1$ to $n - 1$, so the position embedding matrix $D$ has size $(2n - 1) \times m_d$ ($m_d$ is a hyperparameter indicating the dimensionality of the position embedding vectors). Finally, the word embedding $e_i$ and the position embeddings $d_{i_1}$ and $d_{i_2}$ are concatenated into a single vector $x_i = [e_i^\top, d_{i_1}^\top, d_{i_2}^\top]^\top$ to represent the word $x_i$. As a result, the original sentence $x$ can now be viewed as a matrix $x$ of size $(m_e + 2m_d) \times n$ where $m_e$ is the dimensionality of the word embedding vectors.
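The construction of the input matrix can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`pad_or_trim`, `make_table`, `represent`), the `<pad>` and `<unk>` tokens, the uniform initialization range and the toy vocabulary are hypothetical and not taken from the paper, word indices are 0-based, and both entity heads are assumed to survive trimming.

```python
import random

def pad_or_trim(tokens, n, pad="<pad>"):
    """Force a relation mention to exactly n tokens (pad token is an assumption)."""
    return tokens[:n] + [pad] * max(0, n - len(tokens))

def make_table(rows, dim, rng):
    """Randomly initialized table with one dim-sized vector per row (range assumed)."""
    return [[rng.uniform(-0.25, 0.25) for _ in range(dim)] for _ in range(rows)]

def represent(tokens, i1, i2, vocab, W, D, n):
    """Return n column vectors [e_i ; d_i1 ; d_i2], i.e. a (m_e + 2*m_d) x n matrix."""
    columns = []
    for i, tok in enumerate(pad_or_trim(tokens, n)):
        e = W[vocab.get(tok, vocab["<unk>"])]
        # Relative distances lie in [-(n-1), n-1]; shift by n-1 to get a row of D.
        d1 = D[i - i1 + n - 1]
        d2 = D[i - i2 + n - 1]
        columns.append(e + d1 + d2)  # list concatenation, not element-wise addition
    return columns

rng = random.Random(0)
n, m_e, m_d = 6, 4, 2
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "fire": 3, "caused": 4, "damage": 5}
W = make_table(len(vocab), m_e, rng)   # word embedding table
D = make_table(2 * n - 1, m_d, rng)    # position embedding table, (2n - 1) x m_d
x = represent(["the", "fire", "caused", "damage"], 1, 3, vocab, W, D, n)
```

Here the two heads sit at positions 1 and 3, so the word at position 1 receives the position vector for distance 0 in its first position slot, and every column has m_e + 2 m_d = 8 entries.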

Convolution
In the next step, the matrix $x$ representing the input relation mention is fed into the convolutional layer to extract higher-level features. Given a window size $w$, a filter is seen as a weight matrix $f = [f_1, f_2, \ldots, f_w]$, where each column $f_j$ has the same dimensionality as the word vectors $x_i$. The core of this layer is obtained from the application of the convolutional operator on the two matrices $x$ and $f$ to produce a score sequence $s = [s_1, s_2, \ldots, s_{n-w+1}]$:
$$s_i = g\left(\sum_{j=0}^{w-1} f_{j+1}^\top x_{j+i} + b\right)$$
where $b$ is a bias term and $g$ is some non-linear function. This process can then be replicated for various filters with different window sizes to increase the n-gram coverage of the model.
For relation extraction, we call the n-grams accompanied by the relative positions of their words the augmented n-grams. It is instructive to think about the filter $f$ as representing some hidden class of the augmented n-grams and the scores $s_i$ as measuring the possibility that the augmented n-gram at position $i$ belongs to the corresponding hidden class (although these scores are not probabilities at all). The trained weights of the filter $f$ would then amount to a feature detector that learns to recognize the hidden class of the augmented n-grams (Kalchbrenner et al., 2014).
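As a rough sketch of the convolution step, the function below slides one filter over the word vectors and applies tanh to each windowed sum. It assumes the usual reading of the operator, in which each score is the non-linearity applied to the sum of dot products between the w filter columns and w consecutive word vectors, plus the bias. The function name `conv_scores` and the toy numbers are illustrative only, and indices are 0-based.

```python
import math

def conv_scores(x, f, b, g=math.tanh):
    """Score sequence of one filter.

    x: list of n word vectors, f: list of w filter columns of the same size,
    b: bias term, g: non-linear function. Returns n - w + 1 scores.
    """
    n, w = len(x), len(f)
    scores = []
    for i in range(n - w + 1):
        total = b
        for j in range(w):
            # dot product of filter column j with the word vector at position i + j
            total += sum(fv * xv for fv, xv in zip(f[j], x[i + j]))
        scores.append(g(total))
    return scores

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 toy word vectors
f = [[1.0, 0.0], [0.0, 1.0]]               # one filter with window size w = 2
s = conv_scores(x, f, b=0.0)               # two scores: tanh(2) and tanh(1)
```

Running several such filters with different numbers of columns is what gives the model its coverage of n-grams of different lengths.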

Pooling
The rationale of the pooling layer is to further abstract the features generated from the convolutional layer by aggregating the scores for each filter, introducing invariance to the absolute positions of the n-grams while preserving their relative positions with respect to each other and to the entity heads. A popular aggregating function is max, as it identifies the most important or relevant features from the score sequence. Concretely, for each filter $f$, its score sequence $s$ is passed through the max function to produce a single number: $p_f = \max\{s\} = \max\{s_1, s_2, \ldots, s_{n-w+1}\}$, which can be interpreted as estimating the possibility that some augmented n-gram of the hidden class of $f$ appears in the context.
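A small illustration of max pooling over score sequences follows. The names `max_pool` and `feature_vector` and all numbers are made up for the example; the point is only that filters with different window sizes yield sequences of different lengths, yet each contributes exactly one pooled value, and that shifting where the strongest n-gram occurs does not change that value.

```python
def max_pool(scores):
    """Pooled score of one filter: the largest entry of its score sequence."""
    return max(scores)

def feature_vector(score_sequences):
    """One pooled value per filter, regardless of each sequence's length."""
    return [max_pool(s) for s in score_sequences]

scores_w2 = [0.1, 0.9, -0.3, 0.2]   # n - w + 1 = 4 scores for w = 2, n = 5
scores_w3 = [0.4, -0.2, 0.7]        # n - w + 1 = 3 scores for w = 3, n = 5
z = feature_vector([scores_w2, scores_w3])

# The same scores with the best match one position later pool to the same value.
shifted_w2 = [0.2, 0.1, 0.9, -0.3]
```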

Regularization and Classification
In the final step, the pooling scores for every filter are concatenated into a single feature vector $z = [p_1, p_2, \ldots, p_m]$ to represent the relation mention. Here, $m$ is the number of filters in the model and $p_i$ is the pooling score of the $i$-th filter. Before actually applying this feature vector, following Kim (2014) and Hinton et al. (2012), we execute a dropout for regularization by randomly setting to zero a proportion $\rho$ of the elements of the feature vector $z$ to produce the vector $z_d$. The dropout vector $z_d$ is then fed into a fully connected layer of standard neural networks followed by a softmax layer to perform classification. The fully connected layer induces a weight matrix $C$ as model parameters. At test time, the unseen relation mentions are scored using the feature vectors that are not dropped out. We also rescale the weights whose $l_2$-norms exceed a hyperparameter, as in Kim (2014).
Overall, the parameters for the presented CNN are: the word embedding matrix W, the position embedding matrix D, the m filter matrices and the weight matrix C for the fully connected layer. The gradients are computed using back-propagation while training is done via stochastic gradient descent with shuffled mini-batches and the AdaDelta update rule (Zeiler, 2012; Kim, 2014).
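The regularization and classification steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: dropout is modeled as an independent Bernoulli mask with rate ρ, the fully connected layer is shown without a bias term because the text only names the weight matrix C, and applying the norm constraint to each row of C is an assumption. All function names and numbers are hypothetical.

```python
import math
import random

def dropout(z, rho, rng):
    """Zero each element of z independently with probability rho (training only)."""
    return [0.0 if rng.random() < rho else v for v in z]

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(z, C):
    """Fully connected layer (one weight row of C per class) followed by softmax."""
    return softmax([sum(c * v for c, v in zip(row, z)) for row in C])

def clip_l2(row, limit):
    """Rescale a weight vector whose l2-norm exceeds `limit` back to that norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return row if norm <= limit else [v * limit / norm for v in row]

rng = random.Random(0)
z = [0.5, -1.0, 2.0]                          # pooled feature vector, m = 3
C = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]       # two classes
train_probs = classify(dropout(z, 0.5, rng), C)
test_probs = classify(z, C)                   # no elements dropped at test time
clipped = clip_l2([3.0, 4.0], 3.0)            # norm 5 rescaled to norm 3
```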

Hyperparameters and Resources
For all the experiments below, we use: tanh for the non-linear function, 150 filters for each window size in the model and position embedding vectors with dimensionality of $m_d = 50$. Regarding the other parameters, we use the same values as Kim (2014), i.e., the dropout rate $\rho = 0.5$, the mini-batch size of 50 and the $l_2$-norm constraint of 3.
Finally, we utilize the pre-trained word embeddings word2vec from Mikolov et al. (2013), which have dimensionality of $m_e = 300$ and are trained on 100 billion words of Google News using the continuous bag-of-words architecture. These embeddings are publicly available. Vectors for the words not included in the pre-trained embeddings are initialized randomly. Besides the word embeddings word2vec, the model does not use any other NLP toolkits or resources.
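The initialization of the word embedding table described above might look like the sketch below. The dictionary `pretrained` is a stand-in for vectors already loaded from the word2vec release; the function name, the uniform range for unseen words and the tiny dimensionality are assumptions made for illustration.

```python
import random

def init_word_table(vocab, pretrained, dim, rng, scale=0.25):
    """Build one row per vocabulary word.

    Words found in `pretrained` copy their vector; all other words get a
    random vector (the uniform range `scale` is an assumed choice).
    """
    table = []
    for word in vocab:
        if word in pretrained:
            table.append(list(pretrained[word]))
        else:
            table.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return table

rng = random.Random(0)
pretrained = {"fire": [0.1, 0.2, 0.3]}        # toy stand-in, dim = 3 instead of 300
vocab = ["fire", "zzyzx"]                      # the second word is out of vocabulary
W = init_word_table(vocab, pretrained, dim=3, rng=rng)
```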

Datasets
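We evaluate our models on two datasets: the SemEval-2010 Task 8 dataset (Hendrickx et al., 2010) for relation classification and the ACE 2005 dataset for relation extraction. The distribution of examples over classes in the two datasets is shown in Table 1. As we can see, the ACE dataset is much more biased toward the Other class than the SemEval dataset and thus more appropriate for relation extraction experiments. The window size configurations we consider are the single window sizes 2, 3, 4 and 5 and the combinations (4,5), (3,4,5) and (2,3,4,5).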

Evaluation of Model Architectures
In each of these window size configurations, we evaluate the system on three different scenarios: (i) the word embeddings and the position embeddings are randomly initialized and optimized during the training process (denoted by nonstatic.rand), (ii) the word embeddings are initialized by the pretrained word embeddings; the position embeddings are initialized randomly and the two embeddings are kept unchanged during the training (denoted by static.word2vec), (iii) the two embeddings are initialized as in case (ii) but they are optimized as model parameters when the model is trained (denoted by nonstatic.word2vec). These experiments are carried out for relation extraction on the ACE 2005 dataset via 5-fold cross validation. Table 2 presents the system performance on Precision (P), Recall (R) and F1 score (F).
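The three scenarios differ only in how the word embeddings are initialized and in whether the two embedding tables are updated. A compact way to express this is sketched below; the dictionary layout, the function name `sgd_step` and the flat parameter lists are hypothetical, and a plain gradient step is used in place of the AdaDelta rule purely to keep the example short.

```python
# Position embeddings are randomly initialized in all three scenarios.
MODES = {
    "nonstatic.rand":     {"word_init": "random",   "update_embeddings": True},
    "static.word2vec":    {"word_init": "word2vec", "update_embeddings": False},
    "nonstatic.word2vec": {"word_init": "word2vec", "update_embeddings": True},
}

def sgd_step(params, grads, lr, mode):
    """One plain gradient step that leaves W and D untouched in a static mode."""
    frozen = set() if MODES[mode]["update_embeddings"] else {"W", "D"}
    return {
        name: list(values) if name in frozen
        else [v - lr * g for v, g in zip(values, grads[name])]
        for name, values in params.items()
    }

params = {"W": [1.0, 2.0], "D": [0.5], "C": [0.1, 0.2]}
grads = {"W": [1.0, 1.0], "D": [1.0], "C": [1.0, 1.0]}
static = sgd_step(params, grads, 0.1, "static.word2vec")
nonstatic = sgd_step(params, grads, 0.1, "nonstatic.word2vec")
```

In the static mode only the remaining parameters (here `C`) move, while in the non-static modes the embeddings are treated like any other model parameter.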
The key observations from the table are: (i) From rows 1, 2, 3, 4, we see that evaluating window sizes individually is quite intricate. It is unclear which window size is the best size for CNNs on relation extraction. For instance, in the nonstatic.rand mode, the window size 4 seems to outperform the others while in the other modes, the window sizes 3 and 5 turn out to be better. Besides, the performance gaps between the window sizes are small, making it hard to draw a conclusive judgement. In any case, the window size 2 seems to be the worst, suggesting that the 2-grams might be less informative than the others for representing relation mentions for CNNs on this dataset.
(ii) While the results on evaluating single window sizes are hard to analyze, the results for multiple window sizes are quite clear and conclusive. Moving from single window sizes of 2, 3, 4 or 5 (rows 1, 2, 3 and 4 respectively) to the configuration with two window sizes 4 and 5 (row 5) gives us consistent improvements on all the model architectures. The performance is then consistently enhanced when more window sizes are included, resulting in the best performance when all the window sizes 2, 3, 4 and 5 are employed. This demonstrates the advantages of the models with multiple window sizes over the single window size models in Liu et al. (2013) and Zeng et al. (2014).
(iii) Regarding different model architectures, the picture is even clearer. No matter which window size configuration is applied, we consistently see that the nonstatic.word2vec architecture performs most effectively, followed by the static.word2vec setting, which is in turn followed by the nonstatic.rand model. This suggests the undeniable benefits of initializing the word embeddings by some "universal" pre-trained values and updating the embeddings to reflect RE-specific embeddings when training the models (Collobert et al., 2011; Kim, 2014). For the next experiments, we always use all the window sizes 2, 3, 4 and 5 with the nonstatic.word2vec architecture.

Relation Extraction Experiment
We compare our system with the traditional feature-based relation extraction systems when these systems are only allowed to use the same information and resources as our system, i.e., the words in the relation mentions, the positions of the two entity heads and the word embeddings. Given the sentences and the positions of the two entity heads, the features that the state-of-the-art feature-based systems extract include: the heads of the two entity mentions; the words in the context before mention 1, after mention 2 and between the two mentions; the bigrams; the word sequences between the two entities; the order of the two mentions; and the number of words between the two mentions (Zhou et al., 2005; Jiang and Zhai, 2007; Sun et al., 2011). The feature-based system using this feature set is called Words. Armed with the word embeddings, one can further introduce these embeddings into the head words or the words in the context as additional features (Nguyen and Grishman, 2014). We call the system Words augmented with the embeddings for the two heads Words-HM-Wed and Words augmented with the embeddings for words in the contexts Words-WC-Wed. We apply the MaxEnt framework with L2 regularization in the Mallet toolkit to train these feature-based models, as in Jiang and Zhai (2007), Sun et al. (2011) and Nguyen and Grishman (2014). The first observation is that adding the word embeddings to the words in the context hurts the performance of the feature-based systems while augmenting the heads of the entities with word embeddings significantly improves the feature-based systems. This is consistent with the results reported by Nguyen and Grishman (2014) and demonstrates that the ability to wisely pick the words for embeddings and avoid embeddings on specific locations is crucial to the feature-based systems.
More importantly, our proposed CNN significantly outperforms all the baseline models at the confidence levels ≥ 95%, an improvement of 4.96% over the best feature-based system Words-HM-Wed (Nguyen and Grishman, 2014). This result indicates that CNNs are a better way to employ word embeddings for relation extraction.
Remember that although the traditional systems can achieve a performance greater than 72% on the ACE dataset (Qian et al., 2008; Sun et al., 2011), they come at the expense of elaborate feature engineering as well as much more expensive feature extraction. In particular, the feature extractors of these feature-based systems require: (i) the perfect entity and mention type information hand-labeled laboriously by human annotators; (ii) the extensive usage of the existing supervised NLP toolkits and resources (constituent and dependency parsers, dictionaries, gazetteers, etc.) which might be unavailable for various domains in reality. The absence of the perfect (hand-annotated) entity and mention type information (i.e., point (i) above) greatly impairs these feature-based systems' performance. For instance, both Plank and Moschitti (2013) and Nguyen and Grishman (2014) report a performance less than 60% on the ACE 2005 dataset when the perfect entity type and mention type features are not employed, although the other features with extensive feature engineering (i.e., point (ii) above) are still included. As a result, in a more realistic setting where hand-annotated features are prohibitively expensive, the proposed CNN requires much less feature engineering and resources but still performs better than the traditional feature-based systems.

Relation Classification Experiment
In order to further verify the effectiveness of the system, we test the system on the relation classification task with the SemEval 2010 dataset and compare the results with the state-of-the-art systems in this area. Table 4 describes the performance of various traditional systems that are based on classifiers such as MaxEnt and SVM with series of supervised and manual features, i.e., the features extracted from supervised pre-processing NLP modules and manual resources (Hendrickx et al., 2010), as well as the more recent systems based on convolutional neural networks (Zeng et al., 2014) (O-CNN), recursive neural networks (RNN), matrix-vector recursive neural networks (MVRNN) (Socher et al., 2012) or the log-quadratic factor-based compositional embedding model (FCM) (Yu et al., 2014). The reported numbers are macro-averaged F1-scores, computed by the officially provided scorer.
As we can see, among the systems not using any supervised and manual features, our system outperforms the best system (Yu et al., 2014) with an improvement of 2.2%. More interestingly, even without supervised and manual features, our system can still work comparably to the other systems utilizing these features as the vital components. For instance, the supervised features (dependency parse and name tagging) are crucial to FCM (Yu et al., 2014) to significantly improve its performance. We attribute our performance advantage over the closely-related system O-CNN (Zeng et al., 2014) to the multiple window sizes, the optimization of the position embeddings during training and possibly the superiority of the embeddings word2vec we use.

Impact of Unbalanced Dataset
Shifting from relation classification to relation extraction with an unbalanced corpus, we witness a large performance gap as described above. In this section, we study the impact of the unbalanced corpus on the performance of relation extractors for both convolutional neural networks and traditional feature-based approaches (Words and Words-HM-Wed). Figure 2 shows the curves. This is a 5-fold cross validation experiment and all the comparisons are significant at confidence levels ≥ 95%. From the figure, we see that all the models improve consistently as the ratio of positive to negative examples increases. The performance peaks with an improvement of about 20% for all models when the number of examples of the class Other is small relative to the others. In other words, the systems attain their best performance when relation extraction is reduced to the relation classification problem, suggesting that relation extraction is much more challenging than relation classification. Finally, for all the ratio values, we consistently see that the convolutional neural network is superior to the others, once again confirming its advantages.

Conclusion
We present a CNN for relation extraction that emphasizes an unbalanced corpus and minimizes usage of external supervised NLP toolkits for features. The network uses multiple window sizes for filters, position embeddings for encoding relative distances and pre-trained word embeddings for initialization in a non-static architecture. The experimental results demonstrate the effectiveness of the proposed CNN on both RC and RE. Our future work includes: (i) to enrich the representation of CNNs with more features for RE, (ii) to study the applications of CNNs on other related tasks, and (iii) to examine other neural network models for RE.