Relation Classification via Multi-Level Attention CNNs

Relation classification is a crucial ingredient in numerous information extraction systems seeking to mine structured facts from text. We propose a novel convolutional neural network architecture for this task, relying on two levels of attention in order to better discern patterns in heterogeneous contexts. This architecture enables end-to-end learning from task-specific labeled data, forgoing the need for external knowledge such as explicit dependency structures. Experiments show that our model outperforms previous state-of-the-art methods, including those relying on much richer forms of prior knowledge.


Introduction
Relation classification is the task of identifying the semantic relation holding between two nominal entities in text. It is a crucial component in natural language processing systems that need to mine explicit facts from text, e.g. for various information extraction applications as well as for question answering and knowledge base completion (Tandon et al., 2011; Chen et al., 2015). For instance, given the example input "Fizzy [drinks] and meat cause heart disease and [diabetes]." with annotated target entity mentions $e_1$ = "drinks" and $e_2$ = "diabetes", the goal would be to automatically recognize that this sentence expresses a cause-effect relationship between $e_1$ and $e_2$, for which we use the notation Cause-Effect($e_1$, $e_2$). Accurate relation classification facilitates precise sentence interpretations, discourse processing, and higher-level NLP tasks (Hendrickx et al., 2010). Thus, relation classification has attracted considerable attention from researchers over the course of the past decades (Zhang, 2004; Qian et al., 2009; Rink and Harabagiu, 2010).
* Equal contribution. † Corresponding author. Email: liuzy@tsinghua.edu.cn
In the example given above, the verb corresponds quite closely to the desired target relation. However, in the wild, we encounter a multitude of different ways of expressing the same kind of relationship. This challenging variability can be lexical, syntactic, or even pragmatic in nature. An effective solution needs to be able to account for useful semantic and syntactic features not only for the meanings of the target entities at the lexical level, but also for their immediate context and for the overall sentence structure.
Thus, it is not surprising that numerous feature- and kernel-based approaches have been proposed, many of which rely on a full-fledged NLP stack, including POS tagging, morphological analysis, dependency parsing, and occasionally semantic analysis, as well as on knowledge resources to capture lexical and semantic features (Kambhatla, 2004; Zhou et al., 2005; Suchanek et al., 2006; Qian et al., 2008). In recent years, we have seen a move towards deep architectures that are capable of learning relevant representations and features without extensive manual feature engineering or use of external resources. A number of convolutional neural network (CNN), recurrent neural network (RNN), and other neural architectures have been proposed for relation classification (Zeng et al., 2014; dos Santos et al., 2015; Xu et al., 2015b). Still, these models often fail to identify critical cues, and many of them still require an external dependency parser.
We propose a novel CNN architecture that addresses some of the shortcomings of previous approaches. Our key contributions are as follows:
1. Our CNN architecture relies on a novel multi-level attention mechanism to capture both entity-specific attention (primary attention at the input level, with respect to the target entities) and relation-specific pooling attention (secondary attention with respect to the target relations). This allows it to detect more subtle cues despite the heterogeneous structure of the input sentences, enabling it to automatically learn which parts are relevant for a given classification.
2. We introduce a novel pair-wise margin-based objective function that proves superior to standard loss functions.
3. We obtain new state-of-the-art results for relation classification with an F1 score of 88.0% on the SemEval 2010 Task 8 dataset, outperforming methods relying on significantly richer prior knowledge.

Related Work
Apart from a few unsupervised clustering methods (Hasegawa et al., 2004; Chen et al., 2005), the majority of work on relation classification has been supervised, typically cast as a standard multiclass or multi-label classification task. Traditional feature-based methods rely on a set of features computed from the output of an explicit linguistic preprocessing step (Kambhatla, 2004; Zhou et al., 2005; Boschee et al., 2005; Suchanek et al., 2006; Chan and Roth, 2010; Nguyen and Grishman, 2014), while kernel-based methods make use of convolution tree kernels (Qian et al., 2008), subsequence kernels (Mooney and Bunescu, 2005), or dependency tree kernels. These methods thus all depend either on carefully handcrafted features, often chosen on a trial-and-error basis, or on elaborately designed kernels, which in turn are often derived from other pre-trained NLP tools or lexical and semantic resources. Although such approaches can benefit from the external NLP tools to discover the discrete structure of a sentence, syntactic parsing is error-prone and relying on its success may also impede performance (Bach and Badaskar, 2007). Further downsides include their limited lexical generalization abilities for unseen words and their lack of robustness when applied to new domains, genres, or languages.
In recent years, deep neural networks have shown promising results. The Recursive Matrix-Vector Model (MV-RNN) by Socher et al. (2012) sought to capture the compositional aspects of the sentence semantics by exploiting syntactic trees. Zeng et al. (2014) proposed a deep convolutional neural network with softmax classification, extracting lexical and sentence level features. However, these approaches still depend on additional features from lexical resources and NLP toolkits. Yu et al. (2014) proposed the Factor-based Compositional Embedding Model, which uses syntactic dependency trees together with sentence-level embeddings. In addition to dos Santos et al. (2015), who proposed the Ranking CNN (CR-CNN) model with a class embedding matrix, Miwa and Bansal (2016) similarly observed that LSTM-based RNNs are outperformed by models using CNNs, due to limited linguistic structure captured in the network architecture. Some more elaborate variants have been proposed to address this, including bidirectional LSTMs (Zhang et al., 2015), deep recurrent neural networks (Xu et al., 2016), and bidirectional tree-structured LSTM-RNNs (Miwa and Bansal, 2016). Several recent works also reintroduce a dependency tree-based design, e.g., RNNs operating on syntactic trees (Hashimoto et al., 2013), shortest dependency path-based CNNs (Xu et al., 2015a), and the SDP-LSTM model (Xu et al., 2015b). Finally, Nguyen and Grishman (2015) train both CNNs and RNNs and variously aggregate their outputs using voting, stacking, or log-linear modeling. Although these recent models achieve solid results, ideally, we would want a simple yet effective architecture that does not require dependency parsing or training multiple models. Our experiments in Section 4 demonstrate that we can indeed achieve this, while also obtaining substantial improvements in terms of the obtained F1 scores.

The Proposed Model
Given a sentence $S$ with a labeled pair of entity mentions $e_1$ and $e_2$ (as in our example from Section 1), relation classification is the task of identifying the semantic relation holding between $e_1$ and $e_2$ among a set of candidate relation types (Hendrickx et al., 2010). Since the only input is a raw sentence with two marked mentions, it is non-trivial to obtain all the lexical, semantic and syntactic cues necessary to make an accurate prediction.
To this end, we propose a novel multi-level attention-based convolutional neural network model. A schematic overview of our architecture is given in Figure 1. The input sentence is first encoded using word vector representations, exploiting the context and a positional encoding to better capture the word order. A primary attention mechanism, based on diagonal matrices, is used to capture the relevance of words with respect to the target entities. To the resulting output matrix, one then applies a convolution operation in order to capture contextual information such as relevant n-grams, followed by max-pooling. A secondary attention pooling layer is used to determine the most useful convolved features for relation classification from the output based on an attention pooling matrix. The final output is given by a new objective function, described below. The remainder of this section will provide further details about this architecture. Table 1 provides an overview of the notation we will use for this.

Classification Objective
We begin with top-down design considerations for the relation classification architecture. For a given sentence $S$, our network will ultimately output some $w^O$. For every output relation $y \in \mathcal{Y}$, we assume there is a corresponding output embedding $W^L_y$, which will automatically be learnt by the network (dos Santos et al., 2015).
We propose a novel distance function $\delta_\theta(S)$ that measures the proximity of the predicted network output $w^O$ to a candidate relation $y$ as follows:
$$\delta_\theta(S, y) = \left\| \frac{w^O}{|w^O|} - W^L_y \right\| \tag{1}$$
using the $L_2$ norm (note that the $W^L_y$ are already normalized). Based on this distance function, we design a margin-based pairwise loss function $\mathcal{L}$ as
$$\mathcal{L} = \left[ \delta_\theta(S, y) + (1 - \delta_\theta(S, \hat{y}^-)) \right] + \beta \|\theta\|^2 = 1 + \delta_\theta(S, y) - \delta_\theta(S, \hat{y}^-) + \beta \|\theta\|^2 \tag{2}$$
where 1 is the margin, $\beta$ is a parameter, $\delta_\theta(S, y)$ is the distance between the network output $w^O$ and the embedding of the ground-truth label $y$, and $\delta_\theta(S, \hat{y}^-)$ refers to the distance between $w^O$ and a selected incorrect relation label $\hat{y}^-$. The latter is chosen as the one with the highest score among all incorrect classes (Weston et al., 2011; dos Santos et al., 2015), i.e. $\hat{y}^- = \operatorname{argmax}_{y' \in \mathcal{Y},\, y' \neq y} \delta(S, y')$.
This margin-based objective offers strong interpretability and effectiveness compared with empirical loss functions such as the ranking loss function in the CR-CNN approach by dos Santos et al. (2015). Based on a distance function motivated by word analogies (Mikolov et al., 2013b), we minimize the gap between predicted outputs and ground-truth labels, while maximizing the distance to the selected incorrect class. By minimizing this pairwise loss function iteratively (see Section 3.5), $\delta_\theta(S, y)$ is encouraged to decrease, while $\delta_\theta(S, \hat{y}^-)$ is encouraged to increase.
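The objective described in this section can be illustrated with a minimal Python sketch for a single sentence. The function names, the toy labels "A", "B", "C", and their two-dimensional embeddings are hypothetical and chosen only for illustration. The sketch assumes the distance is the L2 norm between the unit-normalized network output and a unit-norm label embedding, and it follows the selection rule as worded in the text (the argmax of the distance over incorrect labels).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distance(w_out, label_emb):
    """L2 distance between the normalized output and one unit-norm label embedding."""
    w = l2_normalize(w_out)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w, label_emb)))

def pairwise_margin_loss(w_out, label_embs, gold, params=(), beta=0.0, margin=1.0):
    """Return (loss, selected incorrect label) for one sentence."""
    dists = {y: distance(w_out, emb) for y, emb in label_embs.items()}
    # Selection rule copied from the text: argmax of delta over incorrect labels.
    negative = max((y for y in dists if y != gold), key=dists.get)
    # L2 regularization over whatever parameters are passed in (assumption).
    reg = beta * sum(p * p for p in params)
    return margin + dists[gold] - dists[negative] + reg, negative

# Toy, hypothetical unit-norm embeddings for three labels.
label_embs = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [-1.0, 0.0]}
loss, negative = pairwise_margin_loss([2.0, 0.0], label_embs, gold="A")
```

In this toy case the output points exactly at label "A", so its distance to the gold embedding is zero and the loss is driven entirely by the margin and the selected incorrect label.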

Input Representation
Given a sentence $S = (w_1, w_2, \ldots, w_n)$ with marked entity mentions $e_1$ ($= w_p$) and $e_2$ ($= w_t$), where $p, t \in [1, n]$ and $p \neq t$, we first transform every word into a real-valued vector to provide lexical-semantic features. Given a word embedding matrix $W^V$ of dimensionality $d_w \times |V|$, where $V$ is the input vocabulary and $d_w$ is the word vector dimensionality (a hyper-parameter), we map every $w_i$ to a column vector $w^d_i \in \mathbb{R}^{d_w}$. To additionally capture information about the relationship to the target entities, we incorporate word position embeddings (WPE) to reflect the relative distances between the $i$-th word and the two marked entity mentions (Zeng et al., 2014; dos Santos et al., 2015). For the given sentence in Fig. 1, the relative distances of the word "and" to entity $e_1$ "drinks" and $e_2$ "diabetes" are −1 and 6, respectively. Every relative distance is mapped to a randomly initialized position vector in $\mathbb{R}^{d_p}$, where $d_p$ is a hyper-parameter. For a given word $i$, we obtain two position vectors $w^p_{i,1}$ and $w^p_{i,2}$, with regard to entities $e_1$ and $e_2$, respectively. The overall word embedding for the $i$-th word is
$$w^M_i = \left[ (w^d_i)^\top, (w^p_{i,1})^\top, (w^p_{i,2})^\top \right]^\top. \tag{3}$$
Using a sliding window of size $k$ centered around the $i$-th word, we encode $k$ successive words into a vector $z_i \in \mathbb{R}^{(d_w + 2d_p)k}$ to incorporate contextual information as
$$z_i = \left[ (w^M_{i-(k-1)/2})^\top, \ldots, (w^M_{i+(k-1)/2})^\top \right]^\top. \tag{4}$$
An extra padding token is repeated multiple times at the beginning and end of the input so that every window is well-defined.
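The input encoding described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the helper names, the toy dimensions, the single position table shared by both entities, and the all-zero padding vector are hypothetical choices. The offset convention (entity position minus token position) is picked so that it reproduces the −1 and 6 quoted for the word "and".

```python
import random

def relative_offsets(n_tokens, entity_index):
    """Offset of each token to an entity mention: entity position minus token position."""
    return [entity_index - i for i in range(n_tokens)]

def build_inputs(tokens, e1, e2, word_vecs, pos_table, k, pad_vec):
    """Return one window vector z_i of size (d_w + 2*d_p)*k per token."""
    n = len(tokens)
    off1, off2 = relative_offsets(n, e1), relative_offsets(n, e2)
    # Per-token embedding: word vector followed by the two position vectors.
    w_m = [word_vecs[t] + pos_table[o1] + pos_table[o2]
           for t, o1, o2 in zip(tokens, off1, off2)]
    half = (k - 1) // 2
    padded = [pad_vec] * half + w_m + [pad_vec] * half
    # Window vector: concatenation of the k token embeddings centered on position i.
    return [sum(padded[i:i + k], []) for i in range(n)]

tokens = "Fizzy drinks and meat cause heart disease and diabetes".split()
d_w, d_p, k = 4, 2, 3                      # toy sizes, not the tuned hyper-parameters
rng = random.Random(0)
word_vecs = {t: [rng.uniform(-1, 1) for _ in range(d_w)] for t in dict.fromkeys(tokens)}
pos_table = {o: [rng.uniform(-1, 1) for _ in range(d_p)]
             for o in range(-len(tokens), len(tokens) + 1)}
pad_vec = [0.0] * (d_w + 2 * d_p)          # assumed padding embedding
z = build_inputs(tokens, e1=1, e2=8, word_vecs=word_vecs,
                 pos_table=pos_table, k=k, pad_vec=pad_vec)
```

Each resulting vector has (d_w + 2 d_p) k entries, matching the dimensionality stated for z_i.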

Input Attention Mechanism
While position-based encodings are useful, we conjecture that they do not suffice to fully capture the relationships of specific words with the target entities and the influence that they may bear on the target relations of interest. We design our model so as to automatically identify the parts of the input sentence that are relevant for relation classification. Attention mechanisms have successfully been applied to sequence-to-sequence learning tasks such as machine translation (Bahdanau et al., 2015; Meng et al., 2015) and abstractive sentence summarization (Rush et al., 2015), as well as to tasks such as modeling sentence pairs (Yin et al., 2015) and question answering (Santos et al., 2016). To date, these mechanisms have generally been used to allow for an alignment of the input and output sequence, e.g. the source and target sentence in machine translation, or for an alignment between two input sentences as in sentence similarity scoring and question answering.

Figure 2: Input and Primary Attention
In our work, we apply the idea of modeling attention to a rather different kind of scenario involving heterogeneous objects, namely a sentence and two entities. With this, we seek to give our model the capability to determine which parts of the sentence are most influential with respect to the two entities of interest. Consider that in a long sentence with multiple clauses, perhaps only a single verb or noun might stand in a relevant relationship with a given target entity.
As depicted in Fig. 2, the input representation layer is used in conjunction with diagonal attention matrices and convolutional input composition.
Contextual Relevance Matrices. Consider the example in Fig. 1, where the non-entity word "cause" is of particular significance in determining the relation. Fortunately, we can exploit the fact that there is a salient connection between the words "cause" and "diabetes" also in terms of corpus co-occurrences. We introduce two diagonal attention matrices $A^j$ with values $A^j_{i,i} = f(e_j, w_i)$ to characterize the strength of contextual correlations and connections between entity mention $e_j$ and word $w_i$. The scoring function $f$ is computed as the inner product between the respective embeddings of word $w_i$ and entity $e_j$, and is parametrized into the network and updated during the training process. Given the $A^j$ matrices, we define
$$\alpha^j_i = \frac{\exp(A^j_{i,i})}{\sum_{i'=1}^{n} \exp(A^j_{i',i'})} \tag{5}$$
to quantify the relative degree of relevance of the $i$-th word with respect to the $j$-th entity ($j \in \{1, 2\}$).
Input Attention Composition. Next, we take the two relevance factors $\alpha^1_i$ and $\alpha^2_i$ and model their joint impact for recognizing the relation via simple averaging as
$$r_i = z_i \, \frac{\alpha^1_i + \alpha^2_i}{2}. \tag{6}$$
Apart from this default choice, we also evaluate two additional variants. The first (Variant-1) concatenates the word vectors as
$$r_i = \left[ (z_i \alpha^1_i)^\top, (z_i \alpha^2_i)^\top \right]^\top \tag{7}$$
to obtain an information-enriched input attention component for this specific word, which contains the relation relevance to both entity 1 and entity 2. The second variant (Variant-2) interprets relations as mappings between two entities, and combines the two entity-specific weights as
$$r_i = z_i \, \frac{\alpha^1_i - \alpha^2_i}{2} \tag{8}$$
to capture the relation between them. Based on these $r_i$, the final output of the input attention component is the matrix $R = [r_1, r_2, \ldots, r_n]$, where $n$ is the sentence length.
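The primary attention and its composition can be sketched as follows. This is a minimal illustration, assuming that the relevance scores are inner products between token and entity embeddings, that they are normalized with a softmax over the sentence, and that Variant-2 uses half the difference of the two weights. The function names and the variant labels "average", "concat", and "difference" are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def input_attention(token_vecs, z, e1, e2, variant="average"):
    """Weight each window vector z_i by its relevance to the two entities."""
    # Diagonal attention entries: inner product of entity and token embeddings.
    alpha1 = softmax([dot(token_vecs[e1], w) for w in token_vecs])
    alpha2 = softmax([dot(token_vecs[e2], w) for w in token_vecs])
    rows = []
    for z_i, a1, a2 in zip(z, alpha1, alpha2):
        if variant == "average":        # default composition
            rows.append([x * (a1 + a2) / 2 for x in z_i])
        elif variant == "concat":       # Variant-1
            rows.append([x * a1 for x in z_i] + [x * a2 for x in z_i])
        elif variant == "difference":   # Variant-2 (assumed form)
            rows.append([x * (a1 - a2) / 2 for x in z_i])
        else:
            raise ValueError(f"unknown variant: {variant}")
    return rows, alpha1, alpha2

# Toy three-token sentence with entities at positions 0 and 2.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
r_avg, alpha1, alpha2 = input_attention(token_vecs, z, e1=0, e2=2)
r_cat, _, _ = input_attention(token_vecs, z, e1=0, e2=2, variant="concat")
```

Note that the concatenating variant doubles the length of each r_i, so the convolution filter that follows would need a correspondingly wider input.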

Convolutional Max-Pooling with Secondary Attention
After this operation, we apply convolutional max-pooling with another secondary attention model to extract more abstract higher-level features from the previous layer's output matrix $R$.
Convolution Layer. A convolutional layer may, for instance, learn to recognize short phrases such as trigrams. Given our newly generated input attention-based representation $R$, we accordingly apply a filter of size $d_c$ as a weight matrix $W_f$ of size $d_c \times k(d_w + 2d_p)$. Then we add a linear bias $B_f$, followed by a non-linear hyperbolic tangent transformation to represent features as follows:
$$R^* = \tanh(W_f R + B_f). \tag{9}$$
Attention-Based Pooling. Instead of regular pooling, we rely on an attention-based pooling strategy to determine the importance of individual windows in $R^*$, as encoded by the convolutional kernel. Some of these windows could represent meaningful n-grams in the input. The goal here is to select those parts of $R^*$ that are relevant with respect to our objective from Section 3.1, which essentially calls for a relation encoding process, while neglecting sentence parts that are irrelevant for this process. We proceed by first creating a correlation modeling matrix $G$ that captures pertinent connections between the convolved context windows from the sentence and the relation class embedding $W^L$ introduced earlier in Section 3.1:
$$G = R^{*\top} U \, W^L, \tag{10}$$
where $U$ is a weighting matrix learnt by the network. Then we apply a softmax function to this correlation modeling matrix $G$ to obtain an attention pooling matrix $A^p$ as
$$A^p_{i,j} = \frac{\exp(G_{i,j})}{\sum_{i'=1}^{n} \exp(G_{i',j})}, \tag{11}$$
where $G_{i,j}$ is the $(i, j)$-th entry of $G$ and $A^p_{i,j}$ is the $(i, j)$-th entry of $A^p$.
Finally, we multiply this attention pooling matrix with the convolved output $R^*$ to highlight important individual phrase-level components, and apply a max operation to select the most salient one (Yin et al., 2015; Santos et al., 2016) for a given dimension of the output. More precisely, we obtain the output representation $w^O$ as follows in Eq. (12):
$$w^O_i = \max_j \, (R^* A^p)_{i,j}, \tag{12}$$
where $w^O_i$ is the $i$-th entry of $w^O$ and $(R^* A^p)_{i,j}$ is the $(i, j)$-th entry of $R^* A^p$.
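The convolution and attention-based pooling steps can be traced with a small pure-Python sketch. It is written under several assumptions that the text does not settle: matrices are stored as lists of rows, the bias is one value per filter, the weighting matrix U is square of size d_c, the label embeddings form the columns of a d_c × |Y| matrix, and the softmax normalizes over the sentence positions separately for each relation. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def column_softmax(g):
    """Softmax over the rows of g, computed separately for each column."""
    cols = []
    for col in zip(*g):
        m = max(col)
        exps = [math.exp(x - m) for x in col]
        s = sum(exps)
        cols.append([e / s for e in exps])
    return transpose(cols)

def attention_pooling(r, w_f, b_f, u, w_l):
    """r: input_dim x n, w_f: d_c x input_dim, b_f: d_c, u: d_c x d_c, w_l: d_c x |Y|."""
    # Convolution followed by tanh; the result is d_c x n.
    r_star = [[math.tanh(x + b) for x in row] for row, b in zip(matmul(w_f, r), b_f)]
    # Correlation between convolved windows and relation embeddings (n x |Y|).
    g = matmul(matmul(transpose(r_star), u), w_l)
    a_p = column_softmax(g)
    scored = matmul(r_star, a_p)            # d_c x |Y|
    return [max(row) for row in scored]     # one value per filter

# Toy example: 2 input dimensions, 3 windows, 2 filters, 2 relations.
r = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_o = attention_pooling(r, w_f=identity, b_f=[0.0, 0.0], u=identity, w_l=identity)
```

Because each column of the attention pooling matrix sums to one, every entry of the pooled product is a weighted average of tanh activations and therefore stays strictly between −1 and 1.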

Training Procedure
We rely on stochastic gradient descent (SGD) to update the parameters with respect to the loss function in Eq. (2). The update rule, Eq. (13), uses two learning rates, $\lambda$ and $\lambda_1$, and incorporates the $\beta$ parameter from Eq. (2).

Experimental Setup
Dataset and Metric. We conduct our experiments on the commonly used SemEval-2010 Task 8 dataset (Hendrickx et al., 2010), which contains 10,717 sentences for nine types of annotated relations, together with an additional "Other" type. The nine types are: Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic, and Product-Producer, while the relation type "Other" indicates that the relation expressed in the sentence is not among the nine types. However, for each of the aforementioned relation types, the two entities can also appear in inverse order, which implies that the sentence needs to be regarded as expressing a different relation, namely the respective inverse one. For example, Cause-Effect($e_1$, $e_2$) and Cause-Effect($e_2$, $e_1$) are considered two distinct relations, so the total number $|\mathcal{Y}|$ of relation types is 19. The SemEval-2010 Task 8 dataset consists of a training set of 8,000 examples, and a test set with the remaining examples. We evaluate the models using the official scorer in terms of the Macro-F1 score over the nine relation pairs (excluding Other).
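The label set and split sizes described above follow from simple counting, as the short Python sketch below shows. The label string format and the helper name `build_label_set` are hypothetical conventions chosen for illustration, not the dataset's official encoding.

```python
RELATIONS = [
    "Cause-Effect", "Component-Whole", "Content-Container",
    "Entity-Destination", "Entity-Origin", "Instrument-Agency",
    "Member-Collection", "Message-Topic", "Product-Producer",
]

def build_label_set(relations):
    """Two directed labels per relation type, plus the undirected 'Other'."""
    directed = [f"{rel}({a},{b})"
                for rel in relations
                for a, b in (("e1", "e2"), ("e2", "e1"))]
    return directed + ["Other"]

labels = build_label_set(RELATIONS)
n_labels = len(labels)            # 9 * 2 + 1
n_total, n_train = 10717, 8000
n_test = n_total - n_train        # the remaining examples form the test set
```

The count of 19 labels matches the number of relation types stated in the text, and the remaining test set holds 2,717 examples.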
Settings. We use the word2vec skip-gram model (Mikolov et al., 2013a) to learn initial word representations on Wikipedia. Other matrices are initialized with random values following a Gaussian distribution. We apply a cross-validation procedure on the training data to select suitable hyperparameters. The choices generated by this process are given in Table 2.
In the comparison with previous approaches, we observe that our novel attention-based architecture achieves new state-of-the-art results on this relation classification dataset. Att-Input-CNN relies only on the primary attention at the input level, performing standard max-pooling after the convolution layer to generate the network output $w^O$, to which the new objective function is applied. With Att-Input-CNN, we achieve an F1-score of 87.5%, thus already outperforming not only the original winner of the SemEval task, an SVM-based approach (82.2%), but also the well-known CR-CNN model (84.1%) with a relative improvement of 4.04%, and the newly released DRNNs (85.8%) with a relative improvement of 2.0%, although the latter approach depends on the Stanford parser to obtain dependency parse information. Our full dual attention model Att-Pooling-CNN achieves an even more favorable F1-score of 88.0%.
Table 4 provides the experimental results for the two variants of the model given by Eqs. (7) and (8) in Section 3.3. Our main model outperforms the other variants on this dataset, although the variants may still prove useful when applied to other tasks.
To better quantify the contribution of the different components of our model, we also conduct an ablation study evaluating several simplified models. The first simplification is to use our model without the input attention mechanism but with the pooling attention layer. The second removes both attention mechanisms. The third removes both forms of attention and additionally uses a regular objective function based on the inner product $s = r \cdot w$ for a sentence representation $r$ and relation class embedding $w$. We observe that all three of our components lead to noticeable improvements over these baselines.
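The relative improvements quoted for Att-Input-CNN can be checked with a few lines of arithmetic. The helper name `relative_improvement` is hypothetical; the F1 values are the ones reported in the text.

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# F1 scores (in %) as reported in the text.
att_input_cnn, cr_cnn, drnns = 87.5, 84.1, 85.8
gain_over_cr_cnn = relative_improvement(att_input_cnn, cr_cnn)
gain_over_drnns = relative_improvement(att_input_cnn, drnns)
```

Rounded to two decimals and one decimal respectively, these reproduce the 4.04% and 2.0% figures given for CR-CNN and DRNNs.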

Detailed Analysis
Primary Attention. To inspect the inner workings of our model, we considered the primary attention matrices of our multi-level attention model. Fig. 3 plots the word-level attention values of the input attention layer for an example sentence, using the calculated attention values for every individual word in the sentence. We find that the word "inside" was assigned the highest attention value, while words such as "room" and "house" are also deemed important. This appears sensible in light of the ground-truth labeling as a Component-Whole($e_1$, $e_2$) relationship. Additionally, we observe that words such as "this", which are rather irrelevant with respect to the target relationship, indeed have significantly lower attention scores.
Our Architectures (F1 in %): Att-Input-CNN 87.5; Att-Pooling-CNN 88.0
Most Significant Features for Relations. Table 5 lists the top-ranked trigrams for each relation class $y$ in terms of their contribution to the score for determining the relation classification. Recall the definition of $\delta_\theta(S, y)$ in Eq. (1). In the network, we trace back the trigram that contributed most to the correct classification in terms of $\delta_\theta(S_i, y)$ for each sentence $S_i$. We then rank all such trigrams in the sentences in the test set according to their total contribution and list the top-ranked trigrams. In Table 5, we see that these are indeed very informative for deducing the relation. For example, the top trigram for Cause-Effect($e_2$, $e_1$) is "are caused by", which strongly implies that the first entity is an effect caused by the latter. Similarly, the top trigram for Entity-Origin($e_1$, $e_2$) is "from the e2", which suggests that $e_2$ could be an original location, at which entity $e_1$ may have been located.
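The ranking step described above amounts to summing per-sentence contributions by trigram and sorting within each relation. The sketch below illustrates only this aggregation. How the contribution of a trigram is traced back through the network is not specified here, so the input triples and their scores are entirely hypothetical, as is the function name `rank_trigrams`.

```python
from collections import defaultdict

def rank_trigrams(traced, top_n=3):
    """traced: (relation, trigram, contribution) triples, one per test sentence."""
    totals = defaultdict(lambda: defaultdict(float))
    for relation, trigram, contribution in traced:
        totals[relation][trigram] += contribution
    # Sort the trigrams of each relation by total contribution, highest first.
    return {rel: sorted(grams, key=grams.get, reverse=True)[:top_n]
            for rel, grams in totals.items()}

# Hypothetical traced contributions, for illustration only.
traced = [
    ("Cause-Effect(e2,e1)", "are caused by", 0.9),
    ("Cause-Effect(e2,e1)", "is caused by", 0.5),
    ("Cause-Effect(e2,e1)", "are caused by", 0.4),
    ("Entity-Origin(e1,e2)", "from the e2", 0.7),
]
ranked = rank_trigrams(traced)
```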
Relation: top-ranked trigrams for the (e1, e2) and (e2, e1) directions
Cause-Effect: e1 caused a, caused a e2, e2 caused by, e2 from e1, e1 resulted in, the cause of, is caused by, are caused by, had caused the, poverty cause e2, was caused by, been caused by
Component-Whole: e1 of the, of a e2, of the e2, with its e2, e1 consists of, in the e2, part of the, e1 has a, e1 comprises e2
Content-Container: in a e2, was hidden in, e1 with e2, filled with e2, inside a e2, was contained in, e1 contained a, full of e2
Entity-Destination: e1 into the, e1 into a, had thrown into, was put inside, in a e2
Entity-Origin: from this e2, is derived from, e1 e2 is, the e1 e2, from the e2, away from the, for e1 e2, the source of
Instrument-Agency: for the e2, is used by, by a e2, e1 use e2, with a e2, with the e2, a e1 e2, by using e2
Member-Collection: of the e2, in the e2, a e1 of, e1 of various, a member of, from the e2, e1 of e2, the e1 of
Message-Topic: on the e2, e1 asserts the, the e1 of, described in the, e1 points out, e1 is the, the topic for, in the e2
Product-Producer: e1 made by, made by e2, has constructed a, came up with, from the e2, by the e2, has drawn up, e1 who created
Error Analysis. Further, we examined some of the misclassifications produced by our model. In one typical example, a sentence is wrongly classified as belonging to the "Other" category, while the ground-truth label is Message-Topic($e_1$, $e_2$). The phrase "revolves around" does not appear in the training data, and moreover is used metaphorically, rather than in its original sense of turning around, making it difficult for the model to recognize the semantic connection. Another common issue stems from sentences of the form "... $e_1$ $e_2$ ...". Three such examples belong to three different relation classes, Component-Whole($e_2$, $e_1$), Entity-Origin($e_2$, $e_1$), and Instrument-Agency($e_1$, $e_2$), respectively, which are only implicit in the text, and the context is not particularly helpful. More informative word embeddings could conceivably help in such cases.
Convergence. Finally, we examine the convergence behavior of our two main methods. We plot the performance at each iteration for the Att-Input-CNN and Att-Pooling-CNN models in Fig. 4. It can be seen that Att-Input-CNN converges quite smoothly to its final F1 score, while for the Att-Pooling-CNN model, which includes an additional attention layer, the joint effect of the two attention layers induces stronger back-propagation effects. On the one hand, this leads to a seesaw phenomenon in the result curve, but on the other hand it enables us to obtain better-suited models with slightly higher F1 scores.

Conclusion
We have presented a CNN architecture with a novel objective and a new form of attention mechanism that is applied at two different levels. Our results show that this simple but effective model is able to outperform previous work relying on substantially richer prior knowledge in the form of structured models and NLP resources. We expect this sort of architecture to be of interest also beyond the specific task of relation classification, which we intend to explore in future work.