Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity. In this paper, we explore capsule networks for relation extraction in a multi-instance multi-label learning framework and propose a novel neural approach based on capsule networks with attention mechanisms. We evaluate our method on different benchmarks and show that it improves the precision of the predicted relations. In particular, we show that capsule networks improve relation extraction for multiple entity pairs.


Introduction
This paper focuses on the task of relation extraction. One popular method for relation extraction is the multi-instance multi-label learning framework (MIML) (Surdeanu et al., 2012) with distant supervision, where the mentions of an entity pair are aligned with the relations in Freebase (Bollacker et al., 2008). Recently proposed neural network (NN) models (Zeng et al., 2014; Ye et al., 2017; Wang et al., 2018a) achieve state-of-the-art performance. However, despite the great success of these NNs, some disadvantages remain. First, the existing models focus on, and rely heavily on, the quality of instance representation. Representing a sentence with a single vector is limiting because language is delicate and complex. Second, CNN subsampling fails to retain the precise spatial relationships between higher-level parts, although structural relationships such as positions in a sentence are valuable. Besides, existing aggregation operations that summarize the sentence meaning into a fixed-size vector, such as max or average pooling, lack guidance from task information. Self-attention (Lin et al., 2017) can select task-dependent information. However, the context vectors are fixed once learned (Gong et al., 2018).
More importantly, most state-of-the-art systems can only predict the single most likely relation for a single entity pair. However, it is very common for one sentence to contain multiple entity pairs and describe multiple relations. It is beneficial to consider other relations in the context while predicting the relations (Sorokin and Gurevych, 2017). For example, given the sentence "[Swag It Out] is the official debut single by American [singer] [Zendaya]", our model can predict multiple relations for these entity pairs simultaneously.
In our work, we present a novel architecture based on capsule networks (Sabour et al., 2017) for relation extraction. We regard the aggregation as a routing problem of how to deliver the messages from source nodes to target nodes. This process enables the capsule networks to decide what and how much information needs to be transferred, as well as to identify complex and interleaved features. Furthermore, the capsule networks convert the multi-label classification problem into multiple binary classification problems. We also introduce word-level attention to account for the different contributions of the words. The experimental results show that our model achieves improvements in both single and multiple relation extraction.

Related Work
Neural Relation Extraction: In recent years, NN models have shown superior performance over approaches using hand-crafted features in various tasks. CNN was one of the first deep learning models to be applied to relation extraction, with variants including piecewise CNN (Zeng et al., 2015), instance-level selective attention CNN (Lin et al., 2016), rank CNN (Ye et al., 2017), attention and memory CNN, and syntax-aware CNN (He et al., 2018). Recurrent neural networks (RNNs) are another popular choice and have been used in recent works in the form of attention RNNs (Zhou et al., 2016), context-aware long short-term memory units (LSTMs) (Sorokin and Gurevych, 2017), graph-LSTMs (Peng et al., 2017) and ensemble LSTMs.
Capsule Network: Recently, the capsule network has been proposed to address the representational limitations of CNNs and RNNs. Sabour et al. (2017) replaced the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement. Hinton et al. (2018) proposed a new iterative routing procedure among capsule layers, based on the EM algorithm. For natural language processing tasks, Zhao et al. (2018) explored capsule networks for text classification. Gong et al. (2018) designed two dynamic routing policies to aggregate the outputs of an RNN/CNN encoding layer into a final encoding vector. Wang et al. (2018b) proposed a capsule model based on RNNs for sentiment analysis. To the best of our knowledge, no previous work has investigated the performance of capsule networks in relation extraction.

Methodology
In this section, we first introduce the MIML framework, and then describe the model architecture we propose for relation extraction, which is shown in Figure 1.

Preliminaries
In MIML, the set of text sentences for a single entity pair or for multiple entity pairs (at most two entity pairs in this paper) is denoted by $X = \{x_1, x_2, \ldots, x_n\}$. Assume that there are $E$ predefined relations (including NA) to extract. Formally, for each relation $r$, the prediction target is denoted by $P(r \mid x_1, \ldots, x_n)$.

Neural Architectures
Input Representation: For each sentence $x_i$, we use pretrained word embeddings to project each word token onto the $d_w$-dimensional space. We adopt position features as combinations of the relative distances from the current word to the $M$ entities and encode these distances in $M$ $d_p$-dimensional vectors. For single entity pair relation extraction, $M = 2$; for multiple entity pairs relation extraction, we limit the maximum number of entities in a sentence to four (i.e. two entity pairs). As three entities in one instance are possible when two tuples share a common entity, we set the relative distance to the missing entity to a very large number. Finally, each sentence is transformed into a matrix with one row per word, each row concatenating the word embedding and the $M$ position vectors.
Bi-LSTM Layer: We make use of LSTMs to deeply learn the semantic meaning of a sentence. We concatenate the current memory cell hidden state vectors of the LSTM from the two directions as the output $h_t \in \mathbb{R}^{2B}$ at time step $t$, where $B$ is the LSTM unit size.
We introduce a word-level attention mechanism, as only a few words in a sentence are relevant to the relation expressed (Jat et al., 2018). The scoring function is $g_t = h_t \times A \times r$, where $A \in \mathbb{R}^{E \times E}$ is a square matrix and $r \in \mathbb{R}^{E \times 1}$ is a relation vector. Both $A$ and $r$ are learned. After obtaining $g_t$, we feed the scores to a softmax function to calculate the final importance $\alpha_t = \mathrm{softmax}(g_t)$. Then, we get the representation $\tilde{x}_t = \alpha_t h_t$.
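The word-level attention step can be illustrated with a small plain-Python sketch. It assumes that $A$ is square with the same dimension as $h_t$ and that $r$ has matching length, so that $h_t \times A \times r$ is a scalar; the function name `word_attention` and the toy values are illustrative and not taken from the paper.

```python
import math


def word_attention(H, A, r):
    """Weight each hidden vector h_t by softmax(h_t A r).

    H: list of hidden vectors h_t (each of length D).
    A: D x D matrix, r: vector of length D (assumed shapes).
    """
    # A r is shared by all time steps, so compute it once.
    Ar = [sum(a * r_j for a, r_j in zip(row, r)) for row in A]
    # Scalar score g_t = h_t A r for every word.
    g = [sum(h_i * ar_i for h_i, ar_i in zip(h, Ar)) for h in H]
    # Softmax over the words, shifted by the maximum for stability.
    g_max = max(g)
    exp_g = [math.exp(s - g_max) for s in g]
    total = sum(exp_g)
    alpha = [e / total for e in exp_g]
    # Weighted representation x~_t = alpha_t * h_t.
    x_tilde = [[a * h_i for h_i in h] for a, h in zip(alpha, H)]
    return alpha, x_tilde


# Toy example: three words with 2-dimensional hidden states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
r = [1.0, 0.0]
alpha, x_tilde = word_attention(H, A, r)
```

In this toy setting the first and third words score equally and receive more weight than the second one, since only they have a non-zero component along `r`.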
For a given bag of sentences, learning follows the setting proposed by Zeng et al. (2015), where the sentence with the highest probability of expressing the relation in a bag is selected to train the model in each iteration.
Primary Capsule Layer: Suppose $u_i \in \mathbb{R}^d$ denotes the instantiation parameters of a capsule, where $d$ is the dimension of the capsule. Let $W_b \in \mathbb{R}^{2 \times 2B}$ be the filter shared across different windows. A window slides over each 2-gram vector in the sequence $\tilde{x} \in \mathbb{R}^{L \times 2B}$ with stride 1 to produce a list of capsules $U \in \mathbb{R}^{(L+1) \times C \times d}$, using $C \times d$ filters in total.
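The shape arithmetic of this layer can be sketched in plain Python. The sketch assumes one zero row of padding on each side of the sequence (one way to obtain $L + 1$ windows from 2-grams with stride 1), stores each of the $C \times d$ filters as a flattened 2-gram filter, and applies no nonlinearity. The helper name `primary_capsules` and these choices are illustrative assumptions, not details taken from the paper.

```python
def primary_capsules(x_tilde, filters, C, d):
    """Slide 2-gram filters over x_tilde (L rows of width 2B) with stride 1.

    filters holds C * d flattened filters, each of length 2 * 2B.
    Returns a nested list of shape (L + 1) x C x d.
    """
    width = len(x_tilde[0])
    zero = [0.0] * width
    # Assumption: pad one zero row on each side so that L + 1 windows exist.
    padded = [zero] + x_tilde + [zero]
    capsules = []
    for t in range(len(x_tilde) + 1):
        window = padded[t] + padded[t + 1]  # flattened 2-gram
        outputs = [sum(w * v for w, v in zip(f, window)) for f in filters]
        # Group the C * d filter responses into C capsules of dimension d.
        capsules.append([outputs[c * d:(c + 1) * d] for c in range(C)])
    return capsules


# Toy sizes: L = 3 words, 2B = 2, C = 2 capsule channels, d = 2.
x_tilde = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[0.1 * (k + 1)] * 4 for k in range(4)]  # C * d = 4 filters
U = primary_capsules(x_tilde, filters, C=2, d=2)
```

With these toy sizes `U` has $(L + 1) \times C \times d = 4 \times 2 \times 2$ entries, matching the shape stated above.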
Dynamic Routing: We explore the transformation matrices to generate the prediction vector $\hat{u}_{j|i} \in \mathbb{R}^d$ from a child capsule $i$ to its parent capsule $j$. The transformation matrices share weights $W_c \in \mathbb{R}^{E \times d \times d}$ across the child capsules, where $E$ is the number of relations (parent capsules) in the layer above. Formally, each vote $\hat{u}_{j|i}$ is computed by applying the transformation matrix of parent $j$ to the child capsule $u_i$. The basic idea of dynamic routing is to design a nonlinear map from the prediction vectors of the $H$ child capsules to the $E$ parent capsules $v_j$, where $H = (L + 1) \times C$. Inspired by Zhao et al. (2018), we attempt to use the probability of existence of parent capsules to iteratively amend the connection strength, which is summarized in Algorithm 1.
Algorithm 1 Dynamic Routing Algorithm
1: procedure ROUTING($\hat{u}_{j|i}$, $\hat{a}_{j|i}$, $r$, $l$)
2:   Initialize the logits of coupling coefficients $b_{j|i} = 0$
3:   for $r$ iterations do
4:     for all capsule $i$ in layer $l$ and capsule $j$ in layer $l + 1$ do: compute the coupling coefficients from $b_{j|i}$ and $\hat{a}_{j|i}$
5:     for all capsule $j$ in layer $l + 1$ do: compute the parent capsule $v_j$ from the weighted votes $\hat{u}_{j|i}$
6:     for all capsule $i$ in layer $l$ and capsule $j$ in layer $l + 1$ do: $b_{j|i} = b_{j|i} + \hat{u}_{j|i} \cdot v_j$
The length of the vector $v_j$ represents the probability of each relation. We use a separate margin loss $L_k$ for each relation capsule $k$:
$$L_k = Y_k \max(0, m^+ - \lVert v_k \rVert)^2 + \lambda (1 - Y_k) \max(0, \lVert v_k \rVert - m^-)^2$$
where $Y_k = 1$ if the relation $k$ is present, $m^+ = 0.9$, $m^- = 0.1$ and $\lambda = 0.5$. The total loss can be formulated as:
$$L_{total} = \sum_{k=1}^{E} L_k$$
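To make the routing loop and the loss concrete, the following plain-Python sketch runs routing-by-agreement on toy votes and evaluates the margin loss. Several parts are assumptions rather than details confirmed by the text: the squashing function is the standard one from Sabour et al. (2017), the coupling coefficients are taken as a softmax over parents of the logits scaled by the existence probabilities `a_hat`, and the margin loss is the standard capsule margin loss with the constants stated above. The names `squash`, `routing` and `margin_loss` are illustrative.

```python
import math


def squash(s):
    """Keep the direction of s and map its length into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def softmax(z):
    z_max = max(z)
    e = [math.exp(x - z_max) for x in z]
    total = sum(e)
    return [x / total for x in e]


def routing(u_hat, a_hat, r):
    """u_hat[i][j]: vote of child i for parent j; a_hat[i][j]: its weight."""
    H, E, d = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * E for _ in range(H)]      # logits of coupling coefficients
    v = [[0.0] * d for _ in range(E)]
    for _ in range(r):
        # Assumed form: c[i][j] = a_hat[i][j] * softmax over parents of b[i].
        c = [[a * p for a, p in zip(a_hat[i], softmax(b[i]))]
             for i in range(H)]
        # Parent capsule: squashed weighted sum of the votes.
        v = [squash([sum(c[i][j] * u_hat[i][j][k] for i in range(H))
                     for k in range(d)]) for j in range(E)]
        # Agreement update: b_{j|i} = b_{j|i} + u_hat_{j|i} . v_j
        for i in range(H):
            for j in range(E):
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    lengths = [math.sqrt(sum(x * x for x in v_j)) for v_j in v]
    return v, lengths


def margin_loss(lengths, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-relation margin losses L_k."""
    return sum(y * max(0.0, m_plus - a) ** 2
               + lam * (1 - y) * max(0.0, a - m_minus) ** 2
               for a, y in zip(lengths, labels))


# Two children, two parents: both children agree on parent 0,
# while their votes for parent 1 cancel each other out.
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, -1.0]]]
a_hat = [[1.0, 1.0], [1.0, 1.0]]
v, lengths = routing(u_hat, a_hat, r=3)
loss = margin_loss(lengths, labels=[1, 0])
```

In this toy case the agreeing votes strengthen the coupling to parent 0 over the iterations, so its capsule is longer than that of parent 1, whose conflicting votes cancel.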

Prediction
For single entity pair relation extraction, we calculate the length of the vector $v_j$, which represents the probability of each relation. For multiple entity pairs relation extraction, we choose the relations with the top two probabilities that also exceed a threshold (empirically set to 0.7). As a result, we may get one or two predicted relations $r$. Given an entity pair $(e_1, e_2)$, in order to choose which relation the tuple belongs to, we adopt pretrained embeddings of entities and relations and calculate $\arg\min_k |t - h - r_k|$, where $t$, $h$ are the embeddings of entities $e_1$, $e_2$ respectively and $r_k$ is the relation embedding. The relation whose embedding is closest to the entity embedding difference is the predicted category.
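The two prediction steps can be sketched as follows. This is a minimal illustration, assuming the distance $|t - h - r_k|$ is an L1 norm (the norm is not specified in the text) and using made-up embeddings; the function names `select_relations` and `assign_relation` are hypothetical.

```python
def select_relations(probs, threshold=0.7, top_k=2):
    """Keep at most top_k relations whose probability exceeds the threshold."""
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    return [k for k in ranked[:top_k] if probs[k] > threshold]


def assign_relation(h, t, candidates, rel_emb):
    """Pick the candidate k that minimizes |t - h - r_k| (L1 norm assumed)."""
    def distance(k):
        return sum(abs(t_i - h_i - r_i)
                   for t_i, h_i, r_i in zip(t, h, rel_emb[k]))
    return min(candidates, key=distance)


# Capsule lengths for four relations; two of the top two exceed 0.7.
probs = [0.1, 0.9, 0.8, 0.75]
predicted = select_relations(probs)

# Toy entity and relation embeddings (illustrative values only).
h, t = [0.0, 0.0], [1.0, 0.0]
rel_emb = {1: [0.0, 1.0], 2: [1.0, 0.0]}
best = assign_relation(h, t, predicted, rel_emb)
```

Here relation 2 is assigned to the pair because its embedding equals the difference `t - h` exactly, while relation 1 is further away.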

Experiments
We test our model on the NYT dataset (NYT) developed by Riedel et al. (2010) for single entity pair relation extraction and on the Wikidata dataset (Sorokin and Gurevych, 2017) for multiple entity pairs relation extraction. We exclude sentences longer than $L$. All code is implemented in TensorFlow (Abadi et al., 2016). We adopt the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001, batch size 128, LSTM unit size 300, $L = 120$, $d_p = 5$, $d = 8$, $C = 32$, dropout rate 0.5 and 3 routing iterations.

Practical Performance
NYT dataset (Single entity pair): We utilize the word embeddings released by Lin et al. (2016). The precision-recall curves for different models on the test set are shown in Figure 2. Our model BiLSTM+Capsule achieves results comparable with all baselines, where Mintz refers to Mintz et al. (2009), Hoffmann refers to Hoffmann et al. (2011), MIMLRE refers to Surdeanu et al. (2012), CNN+ATT refers to Zeng et al. (2015), PCNN+ATT refers to Lin et al. (2016) and Rank+ExATT refers to Ye et al. (2017); Memory is a further baseline. We also show the precision numbers for some particular recalls as well as the AUC in Table 1, where our model generally leads to better precision. Interestingly, we observe that our model achieves results comparable to Rank+ExATT in predicting multiple relations, as shown in Figure 3. Consider the entity tuple (South Korea, Seoul), which has two relations: /location/./administrative divisions and /location/./capital. These two relations have the highest scores among all relations in our model, which demonstrates its ability to predict multiple relations.
Wikidata dataset (Multiple entity pairs): We show the precision numbers for some particular recalls as well as the AUC in Table 2, where PCNN+ATT (1) is trained on sentences with two entities and one relation label, and PCNN+ATT (m) is trained on sentences with four entities and two relation labels. We observe that our model exhibits the best performance. Moreover, when predicting the relations present in a sentence, our approach is more convenient, as PCNN+ATT (1) has to predict over all possible pairs of entities in the sentence while our approach can predict multiple relations simultaneously.
Ablation study: To better demonstrate the contribution of the capsule network and the attention mechanism, we remove the primary capsule layer and dynamic routing so that the Bi-LSTM layer is followed by a fully connected layer instead. We also remove the word-level attention separately.
The experimental results on the Wikidata dataset are summarized in Table 3. The "-Word-ATT" row reports the results without word-level attention. According to the table, the drop in precision demonstrates that word-level attention is quite useful. Generally, both proposed strategies contribute to the effectiveness of our model. A nonlinear map is constructed in an iterative manner, ensuring that the output of each capsule is sent to an appropriate parent in the subsequent layer. Dynamic routing may be more effective than strategies such as max-pooling in CNNs, which essentially detects whether a feature is present at any position of the text but loses the spatial information of the feature. Additionally, the capsule network achieves comparable results in predicting multiple relations in the case of a single entity pair, and performs better in the case of multiple entity pairs relation extraction.
Choice of $d$: In the experiments, the larger the dimension of the capsule, the more feature information it can hold. However, a larger dimension increases the computational complexity. We test capsules of different dimensions. The model is trained on two Nvidia GTX1080ti GPUs with 64G RAM and six Intel(R) Core(TM) i7-6850K CPU 3.60GHz. As Table 4 shows, the training time increases with $d$. When $d = 32$, we observe that the loss decreases very slowly and the model is difficult to converge, so we stop training after 2 epochs. We set $d = 8$ empirically to balance precision and training time.
Effects of Iterative Routing: We also study how the number of routing iterations affects the performance on the Wikidata dataset. Table 5 shows the comparison of 1 to 5 iterations. We find that the performance is best when the number of iterations is set to 3. The results indicate that dynamic routing contributes to the performance. Specifically, in the routing algorithm, the logits are updated as $b_{j|i} = b_{j|i} + \hat{u}_{j|i} \cdot v_j$. When the number of iterations is very large, $v_j$ becomes either 0 or 1, which means each underlying capsule is only linked to a single upper capsule. Therefore, the number of iterations should not be too large.

Conclusion
We propose a relation extraction approach based on capsule networks with an attention mechanism. Although we use a Bi-LSTM as the sentence encoder in this paper, other encoding methods, such as convolved n-grams, could be used instead. Experimental results on two benchmarks show that the model improves the precision of the predicted relations.
In the future, we intend to address how to assign predicted relations to multiple entity pairs when two entities have multiple relations, by utilizing prior knowledge such as entity types and by joint training with named entity recognition. We will also try to optimize the model in terms of speed and focus on other problems by leveraging class ties between labels, especially in multi-label learning problems. Besides, dynamic routing could also be useful for improving other natural language processing tasks such as sequence-to-sequence tasks.