Bidirectional Recurrent Convolutional Neural Network for Relation Classification

Relation classification is an important semantic processing task in the field of natural language processing (NLP). In this paper, we present a novel model BRCNN to classify the relation of two entities in a sentence. Some state-of-the-art systems concentrate on modeling the shortest dependency path (SDP) between two entities leveraging convolutional or recurrent neural networks. We further explore how to make full use of the dependency relations information in the SDP, by combining convolutional neural networks and two-channel recurrent neural networks with long short term memory (LSTM) units. We propose a bidirectional architecture to learn relation representations with directional information along the SDP forwards and backwards at the same time, which benefits classifying the direction of relations. Experimental results show that our method outperforms the state-of-the-art approaches on the SemEval-2010 Task 8 dataset.


Introduction
Relation classification aims to classify the semantic relations between two entities in a sentence. For instance, in the sentence "The [burst]$_{e_1}$ has been caused by water hammer [pressure]$_{e_2}$", the entities burst and pressure are of relation Cause-Effect($e_2$, $e_1$). Relation classification plays a key role in robust knowledge extraction, and has become a hot research topic in recent years.
Nowadays, deep learning techniques have made significant improvements in relation classification, compared with traditional relation classification approaches that focus on designing effective features (Rink and Harabagiu, 2010) or kernels (Zelenko et al., 2003; Bunescu and Mooney, 2005). Although traditional approaches are able to exploit the symbolic structures in sentences, they still have difficulty generalizing to unseen words. Some recent works learn features automatically based on neural networks (NN), employing continuous representations of words (word embeddings). The NN research for relation classification has centered around two main network architectures: convolutional neural networks and recursive/recurrent neural networks. Convolutional neural networks aim to generalize the local and consecutive context of the relation mentions, while recurrent neural networks adaptively accumulate the context information in the whole sentence via memory units, thereby encoding the global and possibly non-consecutive patterns for relation classification. Socher et al. (2012) learned compositional vector representations of sentences with a recursive neural network. Kazuma et al. (2013) proposed a simple customization of recursive neural networks. Zeng et al. (2014) proposed a convolutional neural network with position embeddings.
Recently, more attention has been paid to modeling the shortest dependency path (SDP) of sentences. Liu et al. (2015) developed a dependency-based neural network, in which a convolutional neural network is used to capture features on the shortest path and a recursive neural network is designed to model subtrees. Xu et al. (2015b) applied long short term memory (LSTM) based recurrent neural networks (RNNs) along the shortest dependency path. However, the SDP is a special structure in which every two neighboring words are separated by a dependency relation. Previous works treated dependency relations in the same way as words or some syntactic features like part-of-speech (POS) tags, because of the limitations of convolutional neural networks and recurrent neural networks. Our first contribution is that we propose a recurrent convolutional neural network (RCNN) that encodes the global pattern in the SDP with a two-channel LSTM-based recurrent neural network and captures local features of every two neighboring words linked by a dependency relation with a convolution layer.

Figure 1: The shortest dependency path representation for an example sentence from SemEval-08.
We further observe that the relationship between two entities is directed. For instance, Figure 1 shows that the shortest path of the sentence "The [burst]$_{e_1}$ has been caused by water hammer [pressure]$_{e_2}$." corresponds to relation Cause-Effect($e_2$, $e_1$), where $e_1$ refers to the entity at the front end of the SDP and $e_2$ refers to the entity at the back end of the SDP, and the inverse SDP corresponds to relation Cause-Effect($e_1$, $e_2$). Previous work (Xu et al., 2015b) simply transforms a (K+1)-relation task into a (2K+1)-class classification task, where 1 is the Other relation and K is the number of directed relations. Besides, the recurrent neural network is a biased model, where later inputs are more dominant than earlier inputs. This bias could reduce its effectiveness when it is used to capture the semantics of a whole shortest dependency path, because key components could appear anywhere in an SDP rather than only at the end.
Our second contribution is that we propose a bidirectional recurrent convolutional neural network (BRCNN) to learn representations with bidirectional information along the SDP forwards and backwards at the same time, which also strengthens the ability to classify the directions of relationships between entities. Experimental results show that the bidirectional mechanism significantly improves the performance.
We evaluate our method on the SemEval-2010 relation classification task, and achieve a state-of-the-art $F_1$-score of 86.3%.

The Proposed Method
In this section, we describe our method in detail. Subsection 2.1 provides an overall picture of our BRCNN model. Subsection 2.2 presents the rationale for using SDPs and some characteristics of the SDP. Subsection 2.3 describes the two-channel recurrent neural network, and the bidirectional recurrent convolutional neural network is introduced in Subsection 2.4. Finally, we present our training objective in Subsection 2.5.

Framework
Our BRCNN model is used to learn representations with bidirectional information along the SDP forwards and backwards at the same time. Figure 2 depicts the overall architecture of the BRCNN model.
Given a sentence and its dependency tree, we build our neural network on its SDP extracted from the tree. Along the SDP, two recurrent neural networks with long short term memory units are applied to learn hidden representations of words and dependency relations respectively. A convolution layer is applied to capture local features from the hidden representations of every two neighboring words and the dependency relation between them. A max pooling layer thereafter gathers information from the local features of the SDP or the inverse SDP. In the unidirectional model RCNN, a softmax output layer follows the pooling layer for classification.
On the basis of the RCNN model, we build a bidirectional architecture, BRCNN, taking the SDP and the inverse SDP of a sentence as input. During the training stage of a (K+1)-relation task, the two fine-grained softmax classifiers of the RCNNs each perform a (2K+1)-class classification. The pooling layers of the two RCNNs are concatenated and followed by a coarse-grained softmax output layer that performs a (K+1)-class classification. During the testing stage, the final (2K+1)-class distribution is the combination of the two (2K+1)-class distributions provided by the fine-grained classifiers.
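The relationship between the fine-grained and coarse-grained label spaces can be made concrete with a short sketch. It assumes an index layout that the paper does not specify: index 0 is Other, and each directed relation occupies two consecutive fine-grained indices, one per direction. The helper names are hypothetical; the nine relation names are those of SemEval-2010 Task 8.

```python
from typing import List

# The nine directed relations of SemEval-2010 Task 8 (K = 9).
RELATIONS: List[str] = [
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
]


def fine_labels(relations: List[str]) -> List[str]:
    """Build the (2K+1)-class label list: Other plus both directions of each relation."""
    labels = ["Other"]
    for rel in relations:
        labels.append(f"{rel}(e1,e2)")
        labels.append(f"{rel}(e2,e1)")
    return labels


def coarse_labels(relations: List[str]) -> List[str]:
    """Build the (K+1)-class label list: Other plus each undirected relation."""
    return ["Other"] + list(relations)


def fine_to_coarse(fine_index: int) -> int:
    """Map a fine-grained index to its coarse index under the assumed layout."""
    # Indices 2k-1 and 2k (k >= 1) are the two directions of relation k.
    return (fine_index + 1) // 2
```

Under this layout, both directions of a relation collapse onto the same coarse class, which is the information the coarse-grained classifier is trained on.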

The Shortest Dependency Path
If $e_1$ and $e_2$ are two entities mentioned in the same sentence such that they are observed to be in a relationship $R$, the shortest path between $e_1$ and $e_2$ condenses the most illuminating information for the relationship $R(e_1, e_2)$. This is because (1) if entities $e_1$ and $e_2$ are arguments of the same predicate, the shortest path between them will pass through the predicate; (2) if $e_1$ and $e_2$ belong to different predicate-argument structures that share a common argument, the shortest path will pass through this argument.
Bunescu and Mooney (2005) first used shortest dependency paths between two entities to capture the predicate-argument sequences, which provided strong evidence for relation classification. Xu et al. (2015b) captured information from the sub-paths separated by the common ancestor node of two entities in the shortest paths. However, the shortest dependency path between two entities is usually short (∼4 on average), and the common ancestor of some SDPs is $e_1$ or $e_2$, which leads to an imbalance between the two sub-paths.
We observe that, in the shortest dependency path, every two neighboring words $w_a$ and $w_b$ are linked by a dependency relation $r_{ab}$. The dependency relations between a governing word and its children make a difference in meaning. Besides, if we invert the shortest dependency path, it corresponds to the same relationship with the opposite direction. For example, in Figure 1, the shortest path is composed of substructures like "burst $\xrightarrow{nsubjpass}$ caused". Following the above intuition, we design a bidirectional recurrent convolutional neural network, which can capture features from the local substructures along the SDP and along its inverse at the same time.
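The substructures and the path inversion described above can be sketched in a few lines. This is a minimal illustration, assuming an SDP is stored as a list of words plus a list of the relations between neighbors; the helper names and the `^-1` suffix that marks a relation read backwards are hypothetical, and the path tokens other than "burst", "nsubjpass" and "caused" are illustrative rather than taken from the paper.

```python
from typing import List, Tuple

INVERSE_SUFFIX = "^-1"  # hypothetical marker for a relation traversed backwards


def invert_relation(rel: str) -> str:
    """Toggle the inverse marker on a dependency relation label."""
    if rel.endswith(INVERSE_SUFFIX):
        return rel[: -len(INVERSE_SUFFIX)]
    return rel + INVERSE_SUFFIX


def invert_sdp(words: List[str], relations: List[str]) -> Tuple[List[str], List[str]]:
    """Reverse the path and flip every relation, giving the backward SDP."""
    if len(relations) != len(words) - 1:
        raise ValueError("an SDP with n words needs n - 1 relations")
    return list(reversed(words)), [invert_relation(r) for r in reversed(relations)]


def dependency_units(words: List[str], relations: List[str]) -> List[Tuple[str, str, str]]:
    """Split the path into (word, relation, word) triples of neighboring words."""
    return [(words[k], relations[k], words[k + 1]) for k in range(len(relations))]
```

Inverting twice returns the original path, which is a quick sanity check when preparing forward and backward inputs.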

Two-Channel Recurrent Neural Network with Long Short Term Memory Units
The recurrent neural network is suitable for modeling sequential data, as it keeps a hidden state vector $h$, which changes with the input data at each step. We make use of words and dependency relations along the SDP for relation classification (Figure 2). We call them channels, as these information sources do not interact during recurrent propagation. Each word and dependency relation in a given sentence is mapped to a real-valued vector by looking it up in an embedding table. The embeddings of words are trained on a large corpus in an unsupervised manner and are thought to capture syntactic and semantic information, while the embeddings of dependency relations are initialized randomly.

The hidden state $h_t$ for the $t$-th input is a function of its previous state $h_{t-1}$ and the embedding $x_t$ of the current input. Traditional recurrent networks have a basic interaction, that is, the input is linearly transformed by a weight matrix and nonlinearly squashed by an activation function. Formally, we have

$$h_t = f(W_{in} \cdot x_t + W_{rec} \cdot h_{t-1} + b_h)$$

where $W_{in}$ and $W_{rec}$ are weight matrices for the input and recurrent connections, respectively, $b_h$ is a bias term for the hidden state vector, and $f$ is a non-linear activation function.

It is difficult to train RNNs to capture long-term dependencies because the gradients tend to either vanish or explode. Therefore, more sophisticated activation functions with gating units were designed. Long short term memory units were proposed by Hochreiter and Schmidhuber (1997) to overcome this problem. The main idea is to introduce an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input. Many LSTM variants have been proposed. We adopt in our method a variant introduced by Zaremba and Sutskever (2014). Concretely, the LSTM-based recurrent neural network comprises four components: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$.
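The basic recurrence and the idea of independent channels can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function names are hypothetical, tanh stands in for the unspecified activation $f$, and the initial hidden state is taken to be zero. The word channel and the relation channel would each call `run_channel` with their own parameters, so the two never interact during propagation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rnn_step(x: Vector, h_prev: Vector, W_in: Matrix, W_rec: Matrix, b_h: Vector) -> Vector:
    """One step of h_t = f(W_in x_t + W_rec h_{t-1} + b_h), with f = tanh (assumed)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_row, x))
            + sum(r * hi for r, hi in zip(r_row, h_prev))
            + b
        )
        for w_row, r_row, b in zip(W_in, W_rec, b_h)
    ]


def run_channel(inputs: List[Vector], W_in: Matrix, W_rec: Matrix, b_h: Vector) -> List[Vector]:
    """Run one channel over its input embeddings and return every hidden state."""
    h = [0.0] * len(b_h)  # zero initial state (an assumption)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_in, W_rec, b_h)
        states.append(h)
    return states
```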
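First, we compute the values for $i_t$, the input gate, and $g_t$, the candidate value for the states of the memory cells at time $t$:

$$i_t = \sigma(W_i \cdot x_t + U_i \cdot h_{t-1} + b_i)$$

$$g_t = \tanh(W_c \cdot x_t + U_c \cdot h_{t-1} + b_c)$$

Second, we compute the value for $f_t$, the activations of the memory cells' forget gates at time $t$:

$$f_t = \sigma(W_f \cdot x_t + U_f \cdot h_{t-1} + b_f)$$

Given the value of the input gate activations $i_t$, the forget gate activation $f_t$ and the candidate state value $g_t$, we can compute $c_t$, the memory cells' new state at time $t$:

$$c_t = i_t \otimes g_t + f_t \otimes c_{t-1}$$

With the new state of the memory cells, we can compute the value of their output gates and, subsequently, their outputs:

$$o_t = \sigma(W_o \cdot x_t + U_o \cdot h_{t-1} + b_o)$$

$$h_t = o_t \otimes \tanh(c_t)$$

In the above equations, $\sigma$ denotes the sigmoid function and $\otimes$ denotes element-wise multiplication; for each gate, $W$ and $U$ are the weight matrices applied to the input and to the previous hidden state, and $b$ is the bias term.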
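The four-step computation described above (input gate and candidate, forget gate, cell update, output gate and output) can be sketched in pure Python. The parameter layout is an assumption of this sketch: each gate holds one input matrix `W`, one recurrent matrix `U` and one bias `b`, which is the standard peephole-free formulation; the function and key names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Matrix, Vector]  # (W, U, b) for one gate


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pre_activation(gate: Gate, x: Vector, h_prev: Vector) -> Vector:
    """Compute W x + U h_prev + b for one gate."""
    W, U, b = gate
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h_prev))
        + b_k
        for w_row, u_row, b_k in zip(W, U, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Gate]) -> Tuple[Vector, Vector]:
    """One LSTM step; params maps 'i', 'g', 'f', 'o' to that gate's (W, U, b)."""
    i = [sigmoid(v) for v in pre_activation(params["i"], x, h_prev)]    # input gate
    g = [math.tanh(v) for v in pre_activation(params["g"], x, h_prev)]  # candidate state
    f = [sigmoid(v) for v in pre_activation(params["f"], x, h_prev)]    # forget gate
    c = [i_k * g_k + f_k * c_k for i_k, g_k, f_k, c_k in zip(i, g, f, c_prev)]
    o = [sigmoid(v) for v in pre_activation(params["o"], x, h_prev)]    # output gate
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the cell simply halves its previous state; this makes a convenient check that the gating is wired as described.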

Bidirectional Recurrent Convolutional Neural Network
We observe that a governing word $w_a$ and its child $w_b$ are linked by a dependency relation $r_{ab}$, which makes a difference in meaning. The shortest dependency path is composed of many substructures like "$w_a \xrightarrow{r_{ab}} w_b$", which are hereinafter referred to as "dependency units". Hidden states of words and dependency relations in the SDP are obtained with the two-channel recurrent neural network. The hidden states of $w_a$, $w_b$ and $r_{ab}$ are $h_a$, $h_b$ and $h_{ab}$, and the hidden state of the dependency unit $d_{ab}$ is $[h_a \oplus h_{ab} \oplus h_b]$, where $\oplus$ denotes the concatenation operation. Local features $L_{ab}$ for the dependency unit $d_{ab}$ can be extracted with a convolution layer on top of the two-channel recurrent neural network. Formally, we have

$$L_{ab} = f(W_{con} \cdot [h_a \oplus h_{ab} \oplus h_b] + b_{con})$$

where $W_{con}$ is the weight matrix for the convolution layer, $b_{con}$ is the bias term of the convolution layer, and $f$ is a non-linear activation function (tanh is used in our model). A pooling layer thereafter gathers global information $G$ from the local features of dependency units, which is defined as

$$G = \max_{d=1}^{D} L_d$$

where the max function is an element-wise function, and $D$ is the number of dependency units in the SDP.

The advantage of the two-channel recurrent neural network is its ability to better capture contextual information, adaptively accumulating the context information along the whole path via memory units. However, the recurrent neural network is a biased model, where later inputs are more dominant than earlier inputs. This bias could reduce its effectiveness when it is used to capture features for relation classification, because the entities are located at both ends of the SDP and key components could appear anywhere in an SDP rather than only at the end. We tackle the problem with the bidirectional recurrent convolutional neural network.
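The convolution over dependency units and the element-wise max pooling can be sketched as below. This is a pure-Python illustration with hypothetical function names; it assumes the hidden states of the two channels have already been computed, with one relation state between every pair of neighboring word states.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def unit_features(h_a: Vector, h_ab: Vector, h_b: Vector,
                  W_con: Matrix, b_con: Vector) -> Vector:
    """Local features of one dependency unit: tanh(W_con [h_a; h_ab; h_b] + b_con)."""
    d = h_a + h_ab + h_b  # list concatenation plays the role of the ⊕ operator
    return [math.tanh(sum(w * v for w, v in zip(row, d)) + b)
            for row, b in zip(W_con, b_con)]


def max_pool(local_features: List[Vector]) -> Vector:
    """Element-wise maximum over all dependency units."""
    return [max(column) for column in zip(*local_features)]


def rcnn_representation(word_states: List[Vector], rel_states: List[Vector],
                        W_con: Matrix, b_con: Vector) -> Vector:
    """Global representation G of a path from its word and relation hidden states."""
    if len(rel_states) != len(word_states) - 1:
        raise ValueError("need exactly one relation state between neighboring words")
    units = [
        unit_features(word_states[k], rel_states[k], word_states[k + 1], W_con, b_con)
        for k in range(len(rel_states))
    ]
    return max_pool(units)
```

Because the pooling takes a maximum per feature dimension, the size of the result depends only on the number of rows of `W_con`, not on the length of the path.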
On the basis of this observation, we hypothesize that the SDP is a symmetrical structure. For example, if there is a forward shortest path $\overrightarrow{S}$ which corresponds to relation $R_x(e_1, e_2)$, the backward shortest path $\overleftarrow{S}$ can be obtained by inverting $\overrightarrow{S}$; $\overleftarrow{S}$ corresponds to $R_x(e_2, e_1)$, and both $\overrightarrow{S}$ and $\overleftarrow{S}$ correspond to relation $R_x$.
As shown in Figure 2, two RCNNs pick up information along $\overrightarrow{S}$ and $\overleftarrow{S}$, obtaining global representations $\overrightarrow{G}$ and $\overleftarrow{G}$. A representation with bidirectional information is obtained by concatenating $\overrightarrow{G}$ and $\overleftarrow{G}$. A coarse-grained softmax classifier is used to predict a (K+1)-class distribution $y$. Formally,

$$y = \mathrm{softmax}(W_c \cdot [\overrightarrow{G} \oplus \overleftarrow{G}] + b_c)$$

where $W_c$ is the transformation matrix and $b_c$ is the bias vector. The coarse-grained classifier makes use of the representation with bidirectional information while ignoring the direction of relations, and thereby learns the inherent correlation between instances of the same directed relation with opposite directions, such as $R_x(e_1, e_2)$ and $R_x(e_2, e_1)$. Two fine-grained softmax classifiers are applied to $\overrightarrow{G}$ and $\overleftarrow{G}$ with a linear transformation to give the (2K+1)-class distributions $\overrightarrow{y}$ and $\overleftarrow{y}$ respectively. Formally,

$$\overrightarrow{y} = \mathrm{softmax}(W_f \cdot \overrightarrow{G} + b_f)$$

$$\overleftarrow{y} = \mathrm{softmax}(W_f \cdot \overleftarrow{G} + b_f)$$

where $W_f$ is the transformation matrix and $b_f$ is the bias vector. Classifying $\overrightarrow{S}$ and $\overleftarrow{S}$ separately at the same time can strengthen the model's ability to judge the direction of relations.
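The three classifiers described above can be sketched as follows. This is a pure-Python illustration with hypothetical function names. It assumes that the two fine-grained classifiers share a single transformation matrix and bias, which is how the text reads (one $W_f$ and one $b_f$ are named) but is not stated explicitly.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Affine map W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def brcnn_heads(G_fwd: Vector, G_bwd: Vector,
                W_c: Matrix, b_c: Vector,
                W_f: Matrix, b_f: Vector) -> Tuple[Vector, Vector, Vector]:
    """Return (y, y_fwd, y_bwd): coarse, forward fine and backward fine distributions."""
    y = softmax(linear(W_c, G_fwd + G_bwd, b_c))  # concatenated representation
    y_fwd = softmax(linear(W_f, G_fwd, b_f))      # shared fine-grained weights (assumed)
    y_bwd = softmax(linear(W_f, G_bwd, b_f))
    return y, y_fwd, y_bwd
```

Note the shapes this implies: `W_c` has K+1 rows and twice as many columns as `W_f`, which has 2K+1 rows.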

Training Objective
The (K+1)-class softmax classifier is used to estimate the probability that $\overrightarrow{S}$ and $\overleftarrow{S}$ are of relation $R$. The two (2K+1)-class softmax classifiers are used to estimate the probability that $\overrightarrow{S}$ and $\overleftarrow{S}$ are of relation $\overrightarrow{R}$ and $\overleftarrow{R}$ respectively. For a single data sample, the training objective is the penalized cross-entropy of the three classifiers, given by

$$J = -\sum_{i=1}^{2K+1} \overrightarrow{t}_i \log \overrightarrow{y}_i - \sum_{i=1}^{2K+1} \overleftarrow{t}_i \log \overleftarrow{y}_i - \sum_{i=1}^{K+1} t_i \log y_i + \lambda \lVert \theta \rVert^2$$

where $t \in \mathbb{R}^{K+1}$ and $\overrightarrow{t}, \overleftarrow{t} \in \mathbb{R}^{2K+1}$ are the one-hot representations of the ground truth; $y$, $\overrightarrow{y}$ and $\overleftarrow{y}$ are the estimated probabilities for each class described in section 2.4; $\theta$ is the set of model parameters to be learned, and $\lambda$ is a regularization coefficient.

For decoding (predicting the relation of an unseen sample), the bidirectional model provides the (2K+1)-class distributions $\overrightarrow{y}$ and $\overleftarrow{y}$. The final (2K+1)-class distribution $y_{test}$ is the combination of $\overrightarrow{y}$ and $\overleftarrow{y}$. Formally,

$$y_{test} = \alpha \cdot \overrightarrow{y} + (1 - \alpha) \cdot z(\overleftarrow{y})$$

where $\alpha$ is the fraction of the composition of distributions, which is set to 0.65 according to the performance on the validation dataset. In the implementation of BRCNN, elements at the same position in the two class distributions do not correspond to each other, e.g. Cause-Effect($e_1$, $e_2$) in $\overrightarrow{y}$ should correspond to Cause-Effect($e_2$, $e_1$) in $\overleftarrow{y}$. We therefore apply a function $z$ to transform $\overleftarrow{y}$ into a corresponding forward distribution like $\overrightarrow{y}$.
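The objective and the decoding rule can be sketched together. This is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, gold labels are given as class indices rather than one-hot vectors, and the index layout is an assumption (index 0 is Other, indices 2k-1 and 2k are the two directions of relation k), under which the function $z$ reduces to swapping each pair of neighboring indices.

```python
import math
from typing import List

Vector = List[float]


def cross_entropy(probs: Vector, gold: int) -> float:
    """Cross-entropy against a one-hot target, written with the gold class index."""
    return -math.log(probs[gold])


def brcnn_loss(y_fwd: Vector, y_bwd: Vector, y_coarse: Vector,
               gold_fwd: int, gold_bwd: int, gold_coarse: int,
               theta: Vector, lam: float = 1e-5) -> float:
    """Sum of the three cross-entropies plus an L2 penalty on the parameters."""
    penalty = lam * sum(p * p for p in theta)
    return (cross_entropy(y_fwd, gold_fwd)
            + cross_entropy(y_bwd, gold_bwd)
            + cross_entropy(y_coarse, gold_coarse)
            + penalty)


def swap_directions(dist: Vector) -> Vector:
    """The function z under the assumed layout: swap R(e1,e2) and R(e2,e1) entries."""
    if len(dist) % 2 != 1:
        raise ValueError("expected 2K + 1 entries")
    out = list(dist)
    for j in range(1, len(dist), 2):
        out[j], out[j + 1] = dist[j + 1], dist[j]
    return out


def decode(y_fwd: Vector, y_bwd: Vector, alpha: float = 0.65) -> int:
    """Combine both fine-grained distributions and return the predicted class index."""
    aligned = swap_directions(y_bwd)
    y_test = [alpha * a + (1.0 - alpha) * b for a, b in zip(y_fwd, aligned)]
    return max(range(len(y_test)), key=y_test.__getitem__)
```

The default `alpha = 0.65` and `lam = 1e-5` are the values reported in the text; everything else about the data layout is illustrative.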

Dataset
We evaluated our BRCNN model on the SemEval-2010 Task 8 dataset, which is an established benchmark for relation classification (Hendrickx et al., 2010). The dataset contains 8000 sentences for training, and 2717 for testing. We split 800 samples out of the training set for validation.
The first K=9 relations are directed, whereas the Other class is undirected, so we have (2K+1)=19 different classes for 10 relations. All baseline systems and our model use the official macro-averaged $F_1$-score to evaluate model performance. This official measurement excludes the Other relation.
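A macro-averaged $F_1$ that excludes Other can be sketched as below. This is a simplified illustration and not the official scorer script: it assumes labels are strings such as `"Cause-Effect(e2,e1)"` or `"Other"`, groups both directions under one relation, and counts a prediction as correct only when relation and direction both match. The function names are hypothetical.

```python
from typing import List


def relation_of(label: str) -> str:
    """Strip the direction suffix, e.g. 'Cause-Effect(e2,e1)' -> 'Cause-Effect'."""
    return label.split("(")[0]


def macro_f1(gold: List[str], pred: List[str], other: str = "Other") -> float:
    """Macro-averaged F1 over the relations present in gold, excluding Other."""
    relations = sorted({relation_of(g) for g in gold} - {other})
    scores = []
    for rel in relations:
        # Correct only if the full label (relation and direction) matches.
        tp = sum(1 for g, p in zip(gold, pred) if relation_of(g) == rel and g == p)
        n_pred = sum(1 for p in pred if relation_of(p) == rel)
        n_gold = sum(1 for g in gold if relation_of(g) == rel)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        scores.append(2.0 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because the average is taken per relation rather than per instance, rare relations weigh as much as frequent ones, and mistakes on Other only matter through the precision of the directed relations.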

Hyperparameter Settings
In our experiments, word embeddings were 200-dimensional, as used in Yu et al. (2014), and were trained on Gigaword with word2vec (Mikolov et al., 2013). Embeddings of dependency relations were 50-dimensional and initialized randomly. The hidden layers in each channel had the same number of units as their embeddings (200 or 50). The convolution layer was 200-dimensional. The above values were chosen according to the performance on the validation dataset.
As we can see in Figure 1, a dependency relation $r$ in the SDP is reversed to $r^{-1}$ in the inverse SDP. Experimental results show that the performance of BRCNN is improved if $r$ and $r^{-1}$ correspond to different relation embeddings rather than the same embedding. We notice that dependency relations contain far fewer symbols than the vocabulary contains words, and we initialize the embeddings of dependency relations randomly because they can be adequately tuned during supervised training.
We add an $\ell_2$ penalty for weights with coefficient $10^{-5}$, and dropout of embeddings with rate 0.5. We applied AdaDelta for optimization (Zeiler, 2012), where gradients are computed with an adaptive learning rate.
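For readers unfamiliar with AdaDelta, its update rule can be sketched as follows. This is a pure-Python illustration of the rule from Zeiler (2012), not the training code used here: the function name is hypothetical, the decay rate `rho = 0.95` and `eps = 1e-6` are commonly used values that the text does not report, and the $\ell_2$ penalty is folded into the gradient as $2\lambda\theta$ under the assumption that the penalty is $\lambda \lVert \theta \rVert^2$.

```python
import math
from typing import Dict, List

Vector = List[float]


def adadelta_step(params: Vector, grads: Vector, state: Dict[str, Vector],
                  rho: float = 0.95, eps: float = 1e-6, l2: float = 1e-5) -> Vector:
    """Apply one AdaDelta update and return the new parameters.

    state holds two running averages per parameter:
    'eg2' (squared gradients) and 'edx2' (squared updates), both starting at zero.
    """
    eg2, edx2 = state["eg2"], state["edx2"]
    new_params = []
    for k, (p, g) in enumerate(zip(params, grads)):
        g = g + 2.0 * l2 * p  # gradient of the L2 penalty (assumed form)
        eg2[k] = rho * eg2[k] + (1.0 - rho) * g * g
        # The step size adapts per parameter; no global learning rate is needed.
        dx = -math.sqrt(edx2[k] + eps) / math.sqrt(eg2[k] + eps) * g
        edx2[k] = rho * edx2[k] + (1.0 - rho) * dx * dx
        new_params.append(p + dx)
    return new_params
```

A typical loop creates `state = {"eg2": [0.0] * n, "edx2": [0.0] * n}` once and calls `adadelta_step` after every gradient computation.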

Results
Table 1 compares our BRCNN model with other state-of-the-art methods. The first entry in the table presents the highest performance achieved by traditional feature-based methods. Rink and Harabagiu (2010) fed a variety of handcrafted features to an SVM classifier and achieved an $F_1$-score of 82.2%.
Recent performance improvements on this dataset are mostly achieved with the help of neural networks. Socher et al. (2012) built a recursive neural network on the constituency tree and achieved performance comparable to that of Rink and Harabagiu (2010). Further, they extended their recursive network with matrix-vector interaction and raised the $F_1$-score to 82.4%. Xu et al. (2015b) first introduced a type of gated recurrent neural network (LSTM) into this task and raised the $F_1$-score to 83.7%.
From the perspective of convolution, Zeng et al. (2014) constructed a CNN on the word sequence; they also integrated word position embeddings, which substantially helped the CNN architecture. dos Santos et al. (2015) proposed a similar CNN model, named CR-CNN, by replacing the common softmax cost function with a ranking-based cost function. By diminishing the impact of the Other class, they achieved an $F_1$-score of 84.1%. Along the line of CNNs, Xu et al. (2015a) designed a simple negative sampling method, which introduced additional samples from other corpora like the NYT dataset. Doing so improved the performance to an $F_1$-score of 85.6%. Liu et al. (2015) proposed a convolutional neural network combined with a recursive neural network designed to model the subtrees, and achieved an $F_1$-score of 83.6%.
Without the use of neural networks, Yu et al. (2014) proposed a Factor-based Compositional Embedding Model (FCM), which combined unlexicalized linguistic contexts and word embeddings. They achieved an $F_1$-score of 83.0%.
We make use of three types of information to improve the performance of BRCNN: POS tags, NER features and WordNet hypernyms. Our proposed BRCNN model yields an $F_1$-score of 86.3%, outperforming existing competing approaches. Without using any human-designed features, our model still achieves an $F_1$-score of 85.4%, while the best performance of state-of-the-art methods is 84.1% (dos Santos et al., 2015).

For a fair comparison, hyperparameters are set according to the performance on the validation dataset, as for BRCNN. A CNN with embeddings of words, positions and dependency relations as input achieves an $F_1$-score of 81.8%. An LSTM with only word embeddings as input achieves an $F_1$-score of 76.6%, which indicates that dependency relations in SDPs play an important role in relation classification. The two-channel LSTM, which concatenates the pooling layers of words and dependency relations along the shortest dependency path, achieves an $F_1$-score of 81.5%, which is still lower than the CNN. RCNN captures features from dependency units by combining the advantages of CNN and RNN, and achieves an $F_1$-score of 82.4%.

As shown in Table 3, if we invert the SDP of all relations as input, we observe a performance degradation of 1.2% compared with RCNN. As mentioned in section 3.1, the SemEval-2010 Task 8 dataset contains an undirected class Other in addition to 9 directed relations (18 classes). For the bidirectional model, it is natural that the inverted Other relation is also in the Other class itself. However, the class Other is used to indicate that the relation between two nominals does not belong to any of the 9 directed classes. Therefore, the class Other is very noisy, since it groups many different types of relations with different directions.
On the basis of the analysis above, we only invert the SDP of directed relations. A significant improvement is observed and Bi-RCNN achieves an $F_1$-score of 84.9%. This indicates that bidirectional representations provide more useful information for classifying directed relations. We can see that our model still benefits from the coarse-grained classification, which can help our model learn the inherent correlation between directed relations with opposite directions. Compared with Bi-RCNN, which classifies $\overrightarrow{S}$ and $\overleftarrow{S}$ into 19 classes separately, BRCNN also conducts a 10-class classification (9 directed relations and Other) and improves the $F_1$-score by 0.5%. Beyond the relation classification task, we believe that our bidirectional method is a general technique, which is not restricted to a specific dataset and has the potential to benefit other NLP tasks.

Related Work
Relation classification is an important topic in NLP. Methods for relation classification mainly fall into three classes: feature-based, kernel-based and neural network-based.
In feature-based approaches, different types of features are extracted and fed into a classifier. Generally, three types of features are often used. Lexical features concentrate on the entities of interest, e.g., POS. Syntactic features include chunking, parse trees, etc. Semantic features are exemplified by the concept hierarchy, entity class. Kambhatla (2004) used a maximum entropy model for feature combination. Rink and Harabagiu (2010) collected various features, including lexical, syntactic as well as semantic features.
In kernel-based methods, similarity between two data samples is measured without explicit feature representation. Bunescu and Mooney (2005) designed a kernel along the shortest dependency path between two entities by observing that the relation strongly relies on SDPs. Wang (2008) provided a systematic analysis of several kernels and showed that relation extraction can benefit from combining convolution kernels and syntactic features. Plank and Moschitti (2013) combined structural information and semantic information in a tree kernel. One potential difficulty of kernel methods is that all data information is completely summarized by the kernel function, and thus designing an effective kernel becomes crucial.
Recently, deep neural networks are playing an important role in this task. Socher et al. (2012) introduced a recursive neural network model that assigns a matrix-vector representation to every node in a parse tree, in order to learn compositional vector representations for sentences of arbitrary syntactic type and length.
Convolutional neural networks are widely used in relation classification. Zeng et al. (2014) proposed an approach for relation classification where sentence-level features are learned through a CNN, which has word embedding and position features as its input. In parallel, lexical features were extracted according to given nouns. dos Santos et al. (2015) tackled the relation classification task using a convolutional neural network and proposed a new pairwise ranking loss function, which achieved the state-of-the-art result in SemEval-2010 Task 8. Yu et al. (2014) proposed a Factor-based Compositional Embedding Model (FCM) by deriving sentence-level and substructure embeddings from word embeddings, utilizing dependency trees and named entities. It achieved slightly higher accuracy on the same dataset than Zeng et al. (2014), but only when syntactic information is used.
Nowadays, many works concentrate on extracting features from the SDP based on neural networks. Xu et al. (2015a) learned robust relation representations from SDP through a CNN, and proposed a straightforward negative sampling strategy to improve the assignment of subjects and objects. Liu et al. (2015) proposed a recursive neural network designed to model the subtrees, and CNN to capture the most important features on the shortest dependency path. Xu et al. (2015b) picked up heterogeneous information along the left and right sub-path of the SDP respectively, leveraging recurrent neural networks with long short term memory units. We propose BRCNN to model the SDP, which can pick up bidirectional information with a combination of LSTM and CNN.

Conclusion
In this paper, we proposed a novel bidirectional neural network, BRCNN, to improve the performance of relation classification. The BRCNN model, consisting of two RCNNs, learns features along the SDP and along its inverse at the same time. Information from words and dependency relations is captured by a two-channel recurrent neural network with LSTM units. The features of dependency units in the SDP are extracted by a convolution layer.
We demonstrate the effectiveness of our model by evaluating it on the SemEval-2010 relation classification task. RCNN achieves better performance at learning features along the shortest dependency path than some common neural networks. A significant improvement is observed when BRCNN is used, outperforming state-of-the-art methods.