Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths

Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7%, higher than competing methods in the literature.


Introduction
Relation classification is an important NLP task. It plays a key role in various scenarios, e.g., information extraction (Wu and Weld, 2010), question answering (Yao and Van Durme, 2014), medical informatics (Wang and Fan, 2014), ontology learning (Xu et al., 2014), etc. The aim of relation classification is to categorize into predefined classes the relations between pairs of marked entities in given texts. For instance, in the sentence "A trillion gallons of [water]$_{e_1}$ have been poured into an empty [region]$_{e_2}$ of outer space," the entities water and region are of relation Entity-Destination($e_1$, $e_2$).
Traditional relation classification approaches rely largely on feature representation (Kambhatla, 2004), or kernel design (Zelenko et al., 2003; Bunescu and Mooney, 2005). The former method usually incorporates a large set of features; it is difficult to improve the model performance if the feature set is not very well chosen. The latter approach, on the other hand, depends largely on the designed kernel, which summarizes all data information. Deep neural networks, emerging recently, provide a way of highly automatic feature learning (Bengio et al., 2013), and have exhibited considerable potential (Zeng et al., 2014; Santos et al., 2015). However, human engineering, that is, incorporating human knowledge into the network's architecture, is still important and beneficial.
This paper proposes a new neural network, SDP-LSTM, for relation classification. Our model utilizes the shortest dependency path (SDP) between two entities in a sentence; we also design a long short term memory (LSTM)-based recurrent neural network for information processing. The neural architecture is mainly inspired by the following observations.
• Shortest dependency paths are informative (Fundel et al., 2007; Chen et al., 2014). To determine the two entities' relation, we find it mostly sufficient to use only the words along the SDP: they concentrate the most relevant information while diminishing less relevant noise. Figure 1 depicts the dependency parse tree of the aforementioned sentence. Words along the SDP form a trimmed phrase (gallons of water poured into region) of the original sentence, which conveys much information about the target relation. Other words, such as a, trillion, outer space, are less informative and may bring noise if not dealt with properly.
• Direction matters. Dependency trees are a kind of directed graph. The dependency relation between into and region is PREP; such a relation hardly makes any sense if the directed edge is reversed. Moreover, the entities' relation is itself directional, that is, $r(a, b)$ differs from $r(b, a)$ for the same relation $r$ and two entities $a$, $b$. Therefore, we think it necessary to let the neural model process information in a direction-sensitive manner. Out of this consideration, we separate an SDP into two sub-paths, each from an entity to the common ancestor node. The extracted features along the two sub-paths are concatenated to make the final classification.
• Linguistic information helps. For example, with prior knowledge of hyponymy, we know "water is a kind of substance." This is a hint that the entities, water and region, are more likely of the Entity-Destination relation than, say, Communication-Topic.
To gather heterogeneous information along the SDP, we design a multichannel recurrent neural network. It makes use of information from various sources, including words themselves, POS tags, WordNet hypernyms, and the grammatical relations between governing words and their children.
For effective information propagation and integration, our model leverages LSTM units during recurrent propagation. We also customize a new dropout strategy for our SDP-LSTM network to alleviate the problem of overfitting. To the best of our knowledge, we are the first to use LSTM-based recurrent neural networks for the relation classification task.
We evaluate our proposed method on the SemEval 2010 relation classification task, and achieve an $F_1$-score of 83.7%, higher than competing methods in the literature.
In the rest of this paper, we review related work in Section 2. In Section 3, we describe our SDP-LSTM model in detail. Section 4 presents quantitative experimental results. Finally, we have our conclusion in Section 5.

Related Work
Relation classification is a widely studied task in the NLP community. Various existing methods mainly fall into three classes: feature-based, kernel-based, and neural network-based.
In feature-based approaches, different sets of features are extracted and fed to a chosen classifier (e.g., logistic regression). Generally, three types of features are often used. Lexical features concentrate on the entities of interest, e.g., entities per se, entity POS, entity neighboring information. Syntactic features include chunking, parse trees, etc. Semantic features are exemplified by the concept hierarchy, entity class, entity mention. Kambhatla (2004) uses a maximum entropy model to combine these features for relation classification. However, different sets of handcrafted features are largely complementary to each other (e.g., hypernyms versus named-entity tags), and thus it is hard to improve performance in this way (Zhou et al., 2005).
Kernel-based approaches specify some measure of similarity between two data samples, without explicit feature representation. Zelenko et al. (2003) compute the similarity of two trees by utilizing their common subtrees. Bunescu and Mooney (2005) propose a shortest path dependency kernel for relation classification. Its main idea is that the relation strongly relies on the dependency path between two given entities. Wang (2008) provides a systematic analysis of several kernels and shows that relation extraction can benefit from combining convolution kernel and syntactic features. Plank and Moschitti (2013) introduce semantic information into kernel methods in addition to considering structural information only. One potential difficulty of kernel methods is that all data information is completely summarized by the kernel function (similarity measure), and thus designing an effective kernel becomes crucial.
Deep neural networks, emerging recently, can learn underlying features automatically, and have attracted growing interest in the literature. Socher et al. (2011) propose a recursive neural network (RNN) along sentences' parse trees for sentiment analysis; such a model can also be used to classify relations (Socher et al., 2012). Hashimoto et al. (2013) explicitly weight phrases' importance in RNNs to improve performance. Ebrahimi and Dou (2015) rebuild an RNN on the dependency path between two marked entities. Zeng et al. (2014) explore convolutional neural networks, by which they utilize sequential information of sentences. Santos et al. (2015) also use the convolutional network; besides, they propose a ranking loss function with data cleaning, and achieve the state-of-the-art result in SemEval-2010 Task 8.
In addition to the above studies, which mainly focus on relation classification approaches and models, other related research trends include information extraction from Web documents in a semi-supervised manner (Bunescu and Mooney, 2007; Banko et al., 2007), dealing with small datasets without enough labels by distant supervision techniques (Mintz et al., 2009), etc.

The Proposed SDP-LSTM Model
In this section, we describe our SDP-LSTM model in detail. Subsection 3.1 delineates the overall architecture of our model. Subsection 3.2 presents the rationale of using SDPs. Four different information channels along the SDP are explained in Subsection 3.3. Subsection 3.4 introduces the recurrent neural network with long short term memory, which is built upon the dependency path. Subsection 3.5 customizes a dropout strategy for our network to alleviate overfitting. We finally present our training objective in Subsection 3.6.

Overview
Figure 2 depicts the overall architecture of our SDP-LSTM network. First, a sentence is parsed to a dependency tree by the Stanford parser; the shortest dependency path (SDP) is extracted as the input of our network. Along the SDP, four different types of information, referred to as channels, are used, including the words, POS tags, grammatical relations, and WordNet hypernyms. (See Figure 2a.) In each channel, discrete inputs, e.g., words, are mapped to real-valued vectors, called embeddings, which capture the underlying meanings of the inputs.
Two recurrent neural networks (Figure 2b) pick up information along the left and right sub-paths of the SDP, respectively. (The path is separated by the common ancestor node of two entities.) Long short term memory (LSTM) units are used in the recurrent networks for effective information propagation. A max pooling layer thereafter gathers information from LSTM nodes in each path.
The pooling layers from different channels are concatenated, and then connected to a hidden layer. Finally, we have a softmax output layer for classification. (See again Figure 2a.)
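The pooling and concatenation steps described above can be sketched in a few lines. This is only a minimal pure-Python illustration, not the authors' implementation; the names `max_pool` and `concat_channels`, the fixed concatenation order, and the toy hidden states are assumptions made for the example.

```python
from typing import Dict, List

Vector = List[float]


def max_pool(states: List[Vector]) -> Vector:
    """Element-wise maximum over the hidden states of one sub-path."""
    if not states:
        raise ValueError("a sub-path needs at least one hidden state")
    return [max(dim) for dim in zip(*states)]


def concat_channels(pooled: Dict[str, Dict[str, Vector]]) -> Vector:
    """Concatenate pooled vectors over channels and sub-paths.

    The ordering (channels sorted by name, left before right) is an
    arbitrary but fixed choice made for this sketch.
    """
    out: Vector = []
    for channel in sorted(pooled):
        for side in ("left", "right"):
            out.extend(pooled[channel][side])
    return out


# Toy hidden states (2-dimensional) for the word channel only.
left_states = [[0.1, -0.3], [0.4, 0.2], [-0.5, 0.9]]
right_states = [[0.0, 0.7], [0.6, -0.1]]

pooled = {"word": {"left": max_pool(left_states),
                   "right": max_pool(right_states)}}
features = concat_channels(pooled)
print(features)  # [0.4, 0.9, 0.6, 0.7]
```

Max pooling makes the feature vector length independent of the sub-path length, which is why sub-paths of different lengths can feed the same hidden layer.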

The Shortest Dependency Path
The dependency parse tree is naturally suitable for relation classification because it focuses on the action and agents in a sentence (Socher et al., 2014). Moreover, the shortest path between entities, as discussed in Section 1, condenses the most illuminating information for the entities' relation.
We also observe that the sub-paths, separated by the common ancestor node of two entities, provide strong hints for the relation's directionality. Take Figure 1 as an example. Two entities water and region have their common ancestor node, poured, which separates the SDP into two parts: The first sub-path captures information of $e_1$, whereas the second sub-path is mainly about $e_2$. By examining the two sub-paths separately, we know $e_1$ and $e_2$ are of relation Entity-Destination($e_1$, $e_2$), rather than Entity-Destination($e_2$, $e_1$).
Following the above intuition, we design two recurrent neural networks, which propagate bottom-up from the entities to their common ancestor. In this way, our model is direction-sensitive.
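As an illustration of how the two sub-paths can be obtained, the sketch below walks from each entity up to the root and cuts both walks at the first shared node. It is a minimal example under stated assumptions: the head indices are written by hand for the running example (they are not parser output), and the helper names `path_to_root` and `split_sdp` are hypothetical.

```python
from typing import List, Tuple


def path_to_root(node: int, heads: List[int]) -> List[int]:
    """Nodes visited when walking from `node` up to the root (head == -1)."""
    path = [node]
    while heads[node] != -1:
        node = heads[node]
        path.append(node)
    return path


def split_sdp(e1: int, e2: int, heads: List[int]) -> Tuple[List[int], List[int]]:
    """Return the two sub-paths, each running from an entity up to the
    lowest common ancestor of both entities (inclusive)."""
    up1 = path_to_root(e1, heads)
    up2 = path_to_root(e2, heads)
    on_path2 = set(up2)
    ancestor = next(n for n in up1 if n in on_path2)
    left = up1[: up1.index(ancestor) + 1]
    right = up2[: up2.index(ancestor) + 1]
    return left, right


tokens = ["A", "trillion", "gallons", "of", "water", "have", "been",
          "poured", "into", "an", "empty", "region", "of", "outer", "space"]
# Hand-written head index per token (assumption for this example).
heads = [2, 2, 7, 2, 3, 7, 7, -1, 7, 11, 11, 8, 11, 14, 12]

left, right = split_sdp(tokens.index("water"), tokens.index("region"), heads)
print([tokens[i] for i in left])   # ['water', 'of', 'gallons', 'poured']
print([tokens[i] for i in right])  # ['region', 'into', 'poured']
```

Each list is ordered bottom-up, from the entity to the common ancestor, which matches the propagation direction of the two recurrent networks.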

Channels
We make use of four types of information along the SDP for relation classification. We call them channels as these information sources do not interact during recurrent propagation. Detailed channel descriptions are as follows.
• Word representations. Each word in a given sentence is mapped to a real-valued vector by looking up in a word embedding table. Trained without supervision on a large corpus, word embeddings are thought to capture words' syntactic and semantic information well (Mikolov et al., 2013b).
• Part-of-speech tags. Since word embeddings are obtained on a large generic corpus, the information they contain may not agree with a specific sentence. We deal with this problem by pairing each input word with its POS tag, e.g., noun, verb, etc. In our experiment, we only use a coarse-grained POS category, containing 15 different tags.
For the WordNet hypernym channel, the tagging tool (Ciaramita and Altun, 2006) assigns a hypernym to each word, from 41 predefined concepts in WordNet, e.g., noun.food, verb.motion, etc. Given its hypernym, each word gains a more abstract concept, which helps to build a linkage between different but conceptually similar words.
As we can see, POS tags, grammatical relations, and WordNet hypernyms are also discrete (like words per se). However, no prevailing embedding learning method exists for POS tags, say. Hence, we randomly initialize their embeddings, and tune them in a supervised fashion during training. We notice that these information sources contain far fewer symbols (15, 19, and 41, respectively) than the vocabulary size (greater than 25,000). Hence, we believe our strategy of random initialization is feasible, because the embeddings can be adequately tuned during supervised training.
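The random initialization of the three small embedding tables can be sketched as follows. The symbol counts (15, 19, 41) and the 50-dimensional size come from the text; everything else is an assumption for illustration, including the placeholder symbol names, the uniform initialization range, and the helper `init_embeddings`.

```python
import random
from typing import Dict, List


def init_embeddings(symbols: List[str], dim: int, rng: random.Random,
                    scale: float = 0.1) -> Dict[str, List[float]]:
    """Randomly initialize one embedding vector per discrete symbol.

    The uniform range [-scale, scale] is an assumption; the paper only
    states that these embeddings are initialized randomly and then tuned.
    """
    return {s: [rng.uniform(-scale, scale) for _ in range(dim)]
            for s in symbols}


rng = random.Random(0)

# Placeholder symbol inventories with the sizes reported in the text.
pos_tags = [f"POS_{i}" for i in range(15)]
grammatical_relations = [f"GR_{i}" for i in range(19)]
wordnet_hypernyms = [f"WN_{i}" for i in range(41)]

tables = {
    "pos": init_embeddings(pos_tags, 50, rng),
    "gr": init_embeddings(grammatical_relations, 50, rng),
    "wordnet": init_embeddings(wordnet_hypernyms, 50, rng),
}

# Looking up a symbol in a channel returns its (trainable) vector.
vector = tables["pos"]["POS_3"]
print(len(tables["pos"]), len(tables["gr"]), len(tables["wordnet"]), len(vector))
```

With so few symbols per table, each vector receives many gradient updates during supervised training, which is the feasibility argument made above.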

Recurrent Neural Network with Long Short Term Memory Units
The recurrent neural network is suitable for modeling sequential data by nature, as it keeps a hidden state vector $h$, which changes with input data at each step accordingly. We use the recurrent network to gather information along each sub-path in the SDP (Figure 2b).
The hidden state $h_t$, for the $t$-th word in the sub-path, is a function of its previous state $h_{t-1}$ and the current word $x_t$. Traditional recurrent networks have a basic interaction, that is, the input is linearly transformed by a weight matrix and nonlinearly squashed by an activation function. Formally, we have

$$h_t = f_h\left(W_{\text{in}} x_t + W_{\text{rec}} h_{t-1} + b_h\right)$$

where $W_{\text{in}}$ and $W_{\text{rec}}$ are weight matrices for the input and recurrent connections, respectively. $b_h$ is a bias term for the hidden state vector, and $f_h$ a non-linear activation function (e.g., tanh).
One problem of the above model is known as gradient vanishing or exploding. The training of neural networks requires gradient backpropagation. If the propagation sequence (path) is too long, the gradient may grow or decay exponentially, depending on the magnitude of $W_{\text{rec}}$. This makes training difficult.
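The exponential behavior can be seen in a deliberately simplified setting. The sketch below assumes a scalar, purely linear recurrence $h_t = w_{\text{rec}} h_{t-1}$ (no input and no activation function), so that the gradient of the last state with respect to the first is just $w_{\text{rec}}$ raised to the number of steps. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w_rec: float, steps: int) -> float:
    """Magnitude of d h_T / d h_0 for the scalar linear recurrence
    h_t = w_rec * h_{t-1}, unrolled over `steps` time steps."""
    return abs(w_rec) ** steps


for w_rec in (0.5, 1.0, 1.5):
    # Below 1 the factor shrinks toward zero; above 1 it blows up.
    print(w_rec, gradient_scale(w_rec, steps=20))
```

In a real recurrent network the factor also involves the derivative of the activation function and the spectrum of the recurrent weight matrix, but the qualitative picture is the same.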
Long short term memory (LSTM) units are proposed in Hochreiter (1998) to overcome this problem. The main idea is to introduce an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input. Many LSTM variants have been proposed in the literature. We adopt in our method a variant introduced by Zaremba and Sutskever (2014), also used in Zhu et al. (2014).
Concretely, the LSTM-based recurrent neural network comprises four components: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$ (depicted in Figure 3 and formalized through Equations 1-6 below).
The three adaptive gates $i_t$, $f_t$, and $o_t$ depend on the previous state $h_{t-1}$ and the current input $x_t$ (Equations 1-3). An extracted feature vector $g_t$ is also computed, by Equation 4, serving as the candidate memory cell.

$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \tag{1}$$
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \tag{2}$$
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \tag{3}$$
$$g_t = \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right) \tag{4}$$

The current memory cell $c_t$ is a combination of the previous cell content $c_{t-1}$ and the candidate content $g_t$, weighted by the forget gate $f_t$ and the input gate $i_t$, respectively (Equation 5).

$$c_t = i_t \otimes g_t + f_t \otimes c_{t-1} \tag{5}$$

The output of LSTM units is the recurrent network's hidden state, which is computed by Equation 6.

$$h_t = o_t \otimes \tanh(c_t) \tag{6}$$

In the above equations, $\sigma$ denotes a sigmoid function; $\otimes$ denotes element-wise multiplication.
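One step of such an LSTM unit, as described by Equations 1-6, can be written out in plain Python. This is a sketch for illustration, not the authors' code: the parameter layout (a dictionary mapping each of `i`, `f`, `o`, `g` to a hypothetical `(W, U, b)` triple) and the all-zero toy weights are assumptions chosen so the arithmetic is easy to follow.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: Matrix, U: Matrix, b: Vector, x: Vector, h: Vector) -> Vector:
    """Compute W x + U h + b."""
    return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix, Vector]]
              ) -> Tuple[Vector, Vector]:
    """One LSTM step: three gates, a candidate cell, then cell and output."""
    pre = {k: affine(*params[k], x, h_prev) for k in ("i", "f", "o", "g")}
    i = [sigmoid(z) for z in pre["i"]]    # input gate
    f = [sigmoid(z) for z in pre["f"]]    # forget gate
    o = [sigmoid(z) for z in pre["o"]]    # output gate
    g = [math.tanh(z) for z in pre["g"]]  # candidate memory cell
    c = [it * gt + ft * ct for it, gt, ft, ct in zip(i, g, f, c_prev)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy setup: 3-dimensional input, 2 hidden units, all-zero parameters.
zeros_w = [[0.0] * 3 for _ in range(2)]
zeros_u = [[0.0] * 2 for _ in range(2)]
params = {k: (zeros_w, zeros_u, [0.0, 0.0]) for k in ("i", "f", "o", "g")}

h_new, c_new = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(c_new)  # [0.5, -1.0]: every gate is 0.5 and the candidate is 0
```

With zero parameters the forget gate simply halves the previous cell content, which makes the gating role of $f_t$ visible in isolation.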

Dropout Strategies
A good regularization approach is needed to alleviate overfitting. Dropout, proposed recently by Hinton et al. (2012), has been very successful on feed-forward networks. By randomly omitting feature detectors from the network during training, it can obtain less interdependent network units and achieve better performance. However, the conventional dropout does not work well with recurrent neural networks with LSTM units, since dropout may hurt the valuable memorization ability of memory units.
As there is no consensus on how to drop out LSTM units in the literature, we try several dropout strategies for our SDP-LSTM network:
• Dropout embeddings;
• Dropout inner cells in memory units, including $i_t$, $g_t$, $o_t$, $c_t$, and $h_t$; and
• Dropout the penultimate layer.
As we shall see in Section 4.2, dropping out LSTM units turns out to be inimical to our model, whereas the other two strategies boost performance.
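The following equations formalize the dropout operations on the embedding layers, where $D$ denotes the dropout operator. Each dimension in the embedding vector, $x_t$, is set to zero with a predefined dropout rate.

$$i_t = \sigma\left(W_i D(x_t) + U_i h_{t-1} + b_i\right)$$
$$f_t = \sigma\left(W_f D(x_t) + U_f h_{t-1} + b_f\right)$$
$$o_t = \sigma\left(W_o D(x_t) + U_o h_{t-1} + b_o\right)$$
$$g_t = \tanh\left(W_g D(x_t) + U_g h_{t-1} + b_g\right)$$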
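The dropout operator on an embedding vector can be sketched as an independent coin flip per dimension. This is a minimal illustration under assumptions: the function name `dropout` is hypothetical, and the sketch applies no rescaling of the surviving dimensions, since the text does not say how training-time and test-time activations are reconciled.

```python
import random
from typing import List


def dropout(x: List[float], rate: float, rng: random.Random) -> List[float]:
    """Set each dimension of x to zero independently with probability `rate`.

    No rescaling is applied here (an assumption of this sketch).
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    return [0.0 if rng.random() < rate else v for v in x]


rng = random.Random(0)
embedding = [0.3, -1.2, 0.8, 0.5, -0.1, 0.9]
print(dropout(embedding, rate=0.5, rng=rng))
```

Because only the input embedding is masked, the recurrent connections and the memory cell are left intact, which is the property the embedding-level strategy relies on.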

Training Objective
The SDP-LSTM described above propagates information along a sub-path from an entity to the common ancestor node (of the two entities). A max pooling layer packs, for each sub-path, the recurrent network's states, $h$'s, to a fixed vector by taking the maximum value in each dimension. Such architecture applies to all channels, namely, words, POS tags, grammatical relations, and WordNet hypernyms. The pooling vectors in these channels are concatenated, and fed to a fully connected hidden layer. Finally, we add a softmax output layer for classification. The training objective is the penalized cross-entropy error, given by

$$J = -\sum_{i=1}^{n_c} t_i \log y_i + \lambda \left( \sum_{i=1}^{\omega} \|W_i\|_F^2 + \sum_{i=1}^{\upsilon} \|U_i\|_F^2 \right)$$

where $t \in \mathbb{R}^{n_c}$ is the one-hot represented ground truth and $y \in \mathbb{R}^{n_c}$ is the estimated probability for each class by softmax. ($n_c$ is the number of target classes.) $\|\cdot\|_F$ denotes the Frobenius norm of a matrix; $\omega$ and $\upsilon$ are the numbers of weight matrices (for $W$'s and $U$'s, respectively). $\lambda$ is a hyperparameter that specifies the magnitude of penalty on weights. Note that we do not add the $\ell_2$ penalty to bias parameters.
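The penalized cross-entropy objective can be computed directly from its definition. The following sketch is illustrative only; the function names and the toy weight matrices are assumptions, and the penalty covers weight matrices but not biases, as stated above.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def frobenius_sq(m: Matrix) -> float:
    """Squared Frobenius norm: sum of squared entries."""
    return sum(v * v for row in m for v in row)


def penalized_cross_entropy(t: Vector, y: Vector, Ws: List[Matrix],
                            Us: List[Matrix], lam: float) -> float:
    """Cross-entropy of one sample plus an L2 penalty on weight matrices."""
    cross_entropy = -sum(ti * math.log(yi) for ti, yi in zip(t, y) if ti > 0)
    penalty = lam * (sum(frobenius_sq(W) for W in Ws)
                     + sum(frobenius_sq(U) for U in Us))
    return cross_entropy + penalty


# Toy example with two classes and tiny weight matrices.
y = softmax([0.0, 0.0])        # uniform prediction: [0.5, 0.5]
t = [1.0, 0.0]                 # one-hot ground truth
Ws = [[[1.0, 2.0], [3.0, 4.0]]]
Us = [[[1.0]]]
loss = penalized_cross_entropy(t, y, Ws, Us, lam=1e-5)
print(loss)  # log(2) plus 1e-5 * (30 + 1)
```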
We pretrained word embeddings by word2vec (Mikolov et al., 2013a) on the English Wikipedia corpus; other parameters are initialized randomly. We apply stochastic gradient descent (with a minibatch size of 10) for optimization; gradients are computed by standard back-propagation. Training details are further introduced in Section 4.2.

Experiments
In this section, we present our experiments in detail. Our implementation is built upon Mou et al. (2015). Section 4.1 introduces the dataset; Section 4.2 describes hyperparameter settings. In Section 4.3, we compare SDP-LSTM's performance with other methods in the literature. We also analyze the effect of different channels in Section 4.4.

Dataset
The SemEval-2010 Task 8 dataset is a widely used benchmark for relation classification (Hendrickx et al., 2010). The dataset contains 8,000 sentences for training, and 2,717 for testing. We split 1/10 samples out of the training set for validation.
The target contains 19 labels: 9 directed relations (each counted in both directions), and an undirected Other class. The directed relations are listed below.
• Cause-Effect
• Component-Whole
• Content-Container
• Entity-Destination
• Entity-Origin
• Message-Topic
• Member-Collection
• Instrument-Agency
• Product-Producer
The following two sample sentences illustrate directed relations.
[People]$_{e_1}$ have been moving back into [downtown]$_{e_2}$.
Financial [stress]$_{e_1}$ is one of the main causes of [divorce]$_{e_2}$.
The dataset also contains an undirected Other class.Hence, there are 19 target labels in total.The undirected Other class takes in entities that do not fit into the above categories, illustrated by the following example.
We use the official macro-averaged $F_1$-score to evaluate model performance. This official measurement excludes the Other relation. Nonetheless, we have no special treatment of the Other class in our experiments, which is typical in other studies.
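To make the metric concrete, the sketch below computes a macro-averaged $F_1$ over all gold labels except Other. It is a simplified illustration and not the official SemEval scorer, whose handling of relation directionality is more involved; the function name `macro_f1` and the toy labels are assumptions.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             exclude: Sequence[str] = ("Other",)) -> float:
    """Average of per-class F1 over gold classes not listed in `exclude`.

    Predictions of an excluded class still count as errors for the
    class they should have been assigned to.
    """
    labels = sorted({g for g in gold if g not in exclude})
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["Cause-Effect", "Cause-Effect", "Entity-Destination", "Other"]
pred = ["Cause-Effect", "Other", "Entity-Destination", "Entity-Destination"]
score = macro_f1(gold, pred)
print(round(score, 4))  # 0.6667
```

In this toy case Cause-Effect has precision 1 and recall 0.5, Entity-Destination has precision 0.5 and recall 1, so both classes score 2/3 and so does their average.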

Hyperparameters and Training Details
This subsection presents hyperparameter tuning for our model. We set word embeddings to be 200-dimensional; POS, WordNet hypernym, and grammatical relation embeddings are 50-dimensional. Each channel of the LSTM network contains the same number of units as its source embeddings (either 200 or 50). The penultimate hidden layer is 100-dimensional. As it is not feasible to perform full grid search for all hyperparameters, the above values are chosen empirically. We add an $\ell_2$ penalty for weights with coefficient $10^{-5}$, which was chosen by validation from the set $\{10^{-2}, 10^{-3}, \cdots, 10^{-7}\}$.
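The reported sizes imply the width of the concatenated feature vector that enters the penultimate layer. The small calculation below is a sketch under one assumption, namely that the pooled vectors of both sub-paths of every channel are concatenated; the constant names are hypothetical.

```python
# Sizes reported in the text; the names are illustrative only.
CHANNEL_DIMS = {
    "word": 200,
    "pos": 50,
    "grammatical_relation": 50,
    "wordnet_hypernym": 50,
}
NUM_SUBPATHS = 2        # left and right sub-paths of the SDP
HIDDEN_DIM = 100        # penultimate hidden layer
L2_COEFFICIENT = 1e-5   # penalty coefficient chosen by validation

# Each channel contributes one pooled vector per sub-path, and each pooled
# vector has as many dimensions as the channel's LSTM has units.
pooled_dim = NUM_SUBPATHS * sum(CHANNEL_DIMS.values())
hidden_weights = pooled_dim * HIDDEN_DIM

print(pooled_dim)      # 700
print(hidden_weights)  # 70000
```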
We thereafter validate the dropout strategies proposed in Section 3.5. Since network units in different channels do not interact with each other during information propagation, we herein take one channel of LSTM networks to assess the efficacy. Taking the word channel as an example, we first drop out word embeddings. Then with a fixed dropout rate of word embeddings, we test the effect of dropping out LSTM inner cells and the penultimate units, respectively.
We find that dropout of LSTM units hurts the model, even if the dropout rate is as small as 0.1 (Figure 4b). Dropout of embeddings improves model performance by 2.16% (Figure 4a); dropout of the penultimate layer further improves it by 0.16% (Figure 4c). This analysis also provides, for other studies, some clues for dropout in LSTM networks.

Results
Neural networks were first used in this task by Socher et al. (2012). They build a recursive neural network (RNN) along a constituency tree for relation classification. They extend the basic RNN with matrix-vector interaction and achieve an $F_1$-score of 82.4%. Zeng et al. (2014) treat a sentence as sequential data and exploit the convolutional neural network (CNN); they also integrate word position information into their model. Santos et al. (2015) design a model called CR-CNN; they propose a ranking-based cost function and elaborately diminish the impact of the Other class, which is not counted in the official $F_1$-measure. In this way, they achieve the state-of-the-art result with an $F_1$-score of 84.1%. Without such special treatment, their $F_1$-score is 82.7%.
… an $F_1$-score of 82.2%. These results demonstrate the effectiveness of LSTM and directionality in relation classification.

Effect of Different Channels
This subsection analyzes how different channels affect our model. We first used word embeddings only as a baseline; then we added POS tags, grammatical relations, and WordNet hypernyms, respectively; we also combined all these channels into our models. Note that we did not try the latter three channels alone, because each of them on its own (e.g., POS) does not carry much information.
Adding either grammatical relations or WordNet hypernyms outperforms other existing methods (data cleaning not considered here). POS tagging is comparatively less informative, but still boosts the $F_1$-score by 0.63%.
We notice that the boosts are not simply added when channels are combined. This suggests that these information sources are complementary to each other in some linguistic aspects. Nonetheless, incorporating all four channels further pushes the $F_1$-score to 83.70%.

Conclusion
In this paper, we propose a novel neural network model, named SDP-LSTM, for relation classification. It learns features for relation classification iteratively along the shortest dependency path. Several types of information (words themselves, POS tags, grammatical relations and WordNet hypernyms) along the path are used. Meanwhile, we leverage LSTM units for long-range information propagation and integration. We demonstrate the effectiveness of SDP-LSTM by evaluating the model on the SemEval-2010 relation classification task, outperforming existing state-of-the-art methods (in a fair condition without data cleaning). Our result sheds some light on the relation classification task as follows.
• The shortest dependency path can be a valuable resource for relation classification, covering mostly sufficient information of target relations.
• Classifying relations is a challenging task due to the inherent ambiguity of natural languages and the diversity of sentence expression. Thus, integrating heterogeneous linguistic knowledge is beneficial to the task.
• Treating the shortest dependency path as two sub-paths, mapped to two different neural networks, helps to capture the directionality of relations.
• LSTM units are effective in feature detection and propagation along the shortest dependency path.

Figure 1: The dependency parse tree corresponding to the sentence "A trillion gallons of water have been poured into an empty region of outer space." Red lines indicate the shortest dependency path between entities water and region. An edge $a \to b$ refers to $a$ being governed by $b$. Dependency types are labeled by the parser, but not presented in the figure for clarity.

Figure 2: (a) The overall architecture of SDP-LSTM. (b) One channel of the recurrent neural networks built upon the shortest dependency path. The channels are words, part-of-speech (POS) tags, grammatical relations (abbreviated as GR in the figure), and WordNet hypernyms.

Figure 4: $F_1$-scores versus dropout rates. We first evaluate the effect of dropout embeddings (a). Then the dropout of the inner cells (b) and the penultimate layer (c) is tested with word embeddings being dropped out by 0.5.
• Grammatical relations. The same word pair can be linked by different dependency types: "beats $\xrightarrow{\text{nsubj}}$ it" is distinct from "beats $\xrightarrow{\text{dobj}}$ it." It is therefore useful to capture such grammatical relations in SDPs. In our experiment, grammatical relations are grouped into 19 classes, mainly based on a coarse-grained classification (De Marneffe et al., 2006).
• WordNet hypernyms. As illustrated in Section 1, hyponymy information is also useful for relation classification. (Details are not repeated here.) To leverage WordNet hypernyms, we use a tool developed by Ciaramita and Altun (2006).

Table 1 compares our SDP-LSTM with other state-of-the-art methods. The first entry in the table presents the highest performance achieved by traditional feature engineering. Hendrickx et al. (2010) leverage a variety of handcrafted features, and use SVM for classification; they achieve an $F_1$-score of 82.2%.

Table 1: Comparison of relation classification systems. The "†" remark refers to special treatment for the Other class.

Table 2: Effect of different channels.