Regularized Attentive Capsule Network for Overlapped Relation Extraction

Distantly supervised relation extraction has been widely applied in knowledge base construction because it requires little human effort. However, the automatically established training datasets in distant supervision contain low-quality instances with noisy words and overlapped relations, posing great challenges to accurate relation extraction. To address this problem, we propose a novel Regularized Attentive Capsule Network (RA-CapNet) to better identify highly overlapped relations in each informal sentence. To discover multiple relation features in an instance, we embed multi-head attention into the capsule network as the low-level capsules, where the subtraction of two entity representations acts as a new form of relation query to select salient features regardless of their positions. To further discriminate overlapped relation features, we devise disagreement regularization to explicitly encourage diversity among both the attention heads and the low-level capsules. Extensive experiments on widely used datasets show that our model achieves significant improvements in relation extraction.


Introduction
Relation extraction aims to extract relations between entities in text. Distant supervision, proposed by Mintz et al. (2009), automatically establishes training datasets by assigning relation labels to instances that mention entity pairs from knowledge bases. However, this assignment can label instances wrongly, and various multi-instance learning methods (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012) have been proposed to address the problem. Beyond the wrong labeling problem, each instance in distant supervision is crawled from web pages, so it is informal, contains many noisy words, and can express multiple similar relations. This problem is not well handled by previous approaches and severely hampers the performance of conventional neural relation extractors. To handle it, we must address two challenges: (1) identifying and gathering scattered relation information from low-quality instances; (2) distinguishing multiple overlapped relation features within each instance.
First, the few significant relation words are distributed dispersedly in a sentence, as shown in Figure 1, where words in red brackets are entities and italic words are key to expressing the relations. For instance, the clause "evan bayh son of birch bayh" in S1 is sufficient to express the relation /people/person/children between evan bayh and birch bayh. Salient relation words are few in number and scattered across S1, while the words outside the clause can be regarded as noise. Traditional neural models have difficulty gathering spotted relation features at different positions along the sequence because they use a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN) as the basic relation encoder (Zeng et al., 2015; Liu et al., 2018; Ye and Ling, 2019), which models each sequence word by word and loses rich non-local information for modeling the dependencies of semantic salience. Thus, a well-behaved relation extractor is needed to extract scattered relation features from informal instances.
Second, each instance can express multiple similar relations between two entities. As shown in Figure 1, Changsha and Hunan possess the relations /location/location/contains and /location/province/capital in S2. These relations have similar semantics, which poses great challenges for neural extractors in discriminating them clearly.

[Figure 1: Example of instances from the New York Times (NYT).]

Conventional neural methods are not effective at extracting overlapped relation features because they mix different relation semantics into a single vector by max-pooling (Zeng et al., 2014) or self-attention (Lin et al., 2016). Although Zhang et al. (2019) first propose an attentive capsule network for multi-labeled relation extraction, it treats CNN/RNN outputs as low-level capsules without encouraging diversity, which makes it difficult to distinguish different and overlapped relation features from a single type of semantic capsule. Therefore, a well-behaved relation extractor is needed to discriminate diverse overlapped relation features from different semantic spaces.

To address the above problems, we propose a novel Regularized Attentive Capsule Network (RA-CapNet) to identify highly overlapped relations in the low-quality distant supervision corpus. First, we embed multi-head attention into the capsule network, where the attention vectors from each head are encapsulated as a low-level capsule that discovers relation features in a unique semantic space. Then, to improve multi-head attention in extracting spotted relation features, we devise relation query multi-head attention, which selects salient relation words regardless of their positions; it assigns proper attention scores to salient relation words by calculating the logit similarity between the relation representation and each word representation. Furthermore, we apply disagreement regularization to both the multi-head attention and the low-level capsules, which encourages each head or capsule to discriminate relation features in a distinct semantic space. Finally, the dynamic routing algorithm and a sliding-margin loss are employed to gather diverse relation features and predict multiple specific relations. We evaluate RA-CapNet on two benchmarks.
The experimental results show that our model achieves satisfactory performance over the baselines. Our contributions are summarized as follows:

• We are the first to propose embedding multi-head attention as low-level capsules into the capsule network for distantly supervised relation extraction.

• To improve the ability of multi-head attention to extract scattered relation features, we design relation query multi-head attention.

• To discriminate overlapped relation features, we devise disagreement regularization on both the multi-head attention and the low-level capsules.

• RA-CapNet achieves significant improvements for distantly supervised relation extraction.

Related Work
Distantly supervised relation extraction has been essential for knowledge base construction since Mintz et al. (2009) proposed it. To address the wrong labeling problem in distant supervision, multi-instance and multi-label approaches have been proposed (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012).
With the renaissance of neural networks, a growing number of studies in distant supervision have been proposed to extract precise relation features. Piecewise CNNs with various attention mechanisms have been proposed (Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017). Reinforcement learning and adversarial training have been used to select valid instances for training relation extractors (Feng et al., 2018; Qin et al., 2018b; Qin et al., 2018a). Recently, multi-level noise reduction has been designed (Ye and Ling, 2019). Nevertheless, the above approaches ignore the effect of noisy words and overlapped relation features in each instance. To reduce the impact of noisy words, tree-based methods attempt to obtain the relevant sub-structure of an instance for relation extraction (Xu et al., 2015; Miwa and Bansal, 2016; Liu et al., 2018). To discriminate overlapped relation features, Zhang et al. (2019) apply the capsule network (Sabour et al., 2017) to multi-labeled relation extraction. Inspired by the ability of multi-head attention to model long-term dependencies (Vaswani et al., 2017), Zhang et al. (2020) attempt to reduce multi-granularity noise via multi-head attention in relation extraction.

Methodology
As shown in Figure 2, RA-CapNet consists of three layers: (1) the Feature Encoding Layer, which contains the word encoding layer and the BLSTM encoding layer; (2) the Feature Extracting Layer, which includes relation query multi-head attention and disagreement regularization; (3) the Relation Gathering Layer, which consists of a regularized capsule network with dynamic routing.

Feature Encoding Layer
Each instance is first fed into the encoding layer, which transforms it into distributed representations that neural networks can conveniently process.

Word Encoding Layer
As in (Zeng et al., 2014), the inputs to the relation extractor are word and position tokens, which are first encoded by word embeddings and position embeddings. The j-th input word x_ij in the i-th instance is the concatenation of one word embedding vector x^w_ij ∈ R^k and two position embedding vectors x^p1_ij, x^p2_ij ∈ R^p:

x_ij = [x^w_ij ; x^p1_ij ; x^p2_ij] ∈ R^(k+2p),

where k and p are the dimensions of word vectors and position vectors respectively, and ; denotes the vertical concatenation operation. To simplify notation, we write x_ij as x_j.
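The token encoding above can be sketched as follows. This is a minimal illustration with toy dimensions and random embedding tables standing in for the learned ones; the vocabulary size, distance range, and all concrete values are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

k, p = 4, 2                     # word / position embedding dimensions
vocab_size, max_dist = 10, 7    # toy vocabulary and relative-distance range

word_emb = rng.normal(size=(vocab_size, k))
pos_emb1 = rng.normal(size=(2 * max_dist + 1, p))   # distance to entity 1
pos_emb2 = rng.normal(size=(2 * max_dist + 1, p))   # distance to entity 2

def encode_token(word_id, d1, d2):
    """Concatenate one word vector with two position vectors."""
    return np.concatenate([word_emb[word_id],
                           pos_emb1[d1 + max_dist],   # shift distances to indices
                           pos_emb2[d2 + max_dist]])

x_j = encode_token(word_id=3, d1=-2, d2=5)
assert x_j.shape == (k + 2 * p,)   # x_j lives in R^(k+2p)
```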

BLSTM Encoding Layer
To further encode relation features within the context, we adopt a Bidirectional Long Short-Term Memory network (BLSTM) (Graves, 2013) as our basic relation encoder, which can access both future and past context. The encoding feature vector h_i of the i-th word is the concatenation

h_i = [→h_i ; ←h_i],

where →h_i and ←h_i ∈ R^d are the hidden state vectors of the forward and backward LSTM. Finally, we obtain the sentence encoding H = [h_1, h_2, ..., h_l], where l is the instance length.
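The bidirectional encoding can be sketched as below. To keep the example short, a plain tanh RNN cell stands in for the LSTM (an assumption of this sketch); the point illustrated is the forward/backward concatenation h_i = [→h_i ; ←h_i].

```python
import numpy as np

rng = np.random.default_rng(1)

d_in, d = 8, 5          # input dim (k + 2p) and hidden dim
W = rng.normal(scale=0.1, size=(d, d_in))
U = rng.normal(scale=0.1, size=(d, d))

def run_direction(X):
    """Run a simple recurrence over the sequence in the given order."""
    h, states = np.zeros(d), []
    for x in X:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.array(states)

def bidirectional_encode(X):
    fwd = run_direction(X)              # forward states h→_1 ... h→_l
    bwd = run_direction(X[::-1])[::-1]  # backward states h←_1 ... h←_l
    return np.concatenate([fwd, bwd], axis=-1)   # each h_i in R^(2d)

X = rng.normal(size=(6, d_in))          # toy instance of length l = 6
H = bidirectional_encode(X)
assert H.shape == (6, 2 * d)
```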

Feature Extracting Layer
First, relation query multi-head attention is devised to emphasize spotted relation features from different semantic spaces. Then, disagreement regularization is applied to encourage the diversity of the relation features that each head discovers.

Relation Query Multi-Head Attention
Multi-head attention is useful for modeling long-term dependencies of salient information in context (Vaswani et al., 2017). Based on this mechanism, we propose relation query multi-head attention to improve the ability of multi-head attention to extract spotted and salient relation features regardless of their irregular positions in the instance.
Formally, given an encoded instance H, we use the subtraction of the two entities' hidden states h_en1 and h_en2 as the relation representation, inspired by (Bordes et al., 2013). This relation representation acts as the query vector:

Q = (h_en1 - h_en2) W^Q,

where W^Q ∈ R^(d×d) is a weight matrix. The corresponding key K and value V vectors are defined as:

K = H W^K,   V = H W^V,

where W^K and W^V ∈ R^(d×d) are weight matrices. Afterward, we calculate the logit similarity between the relation query vector and the word representation vectors as attention scores:

energy = Q K^T / √d,

where the energy measures the importance of each word to relation extraction and is leveraged to select salient and spotted relation features along the sequence:

E = softmax(energy) V.

To extract diverse relation features, we employ relation query attention within multi-head attention:

E_m = [head_1; head_2; ...; head_n] W^o,

where W^o ∈ R^(d×d) is a weight matrix and each head attends in its own subspace, capturing various semantic features. After we acquire the output E_m of multi-head attention, a Feed-Forward Network (FFN) is applied to produce H_r.
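The mechanism can be sketched in NumPy as follows. This is a minimal illustration with toy dimensions: the weight matrices, the entity positions, and the per-head weighting of value vectors by the relation-query scores are all assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
l, dm, n_heads = 6, 8, 2        # sentence length, model dim, number of heads
dh = dm // n_heads              # per-head dimension

H = rng.normal(size=(l, dm))    # encoded instance from the BLSTM
h_en1, h_en2 = H[1], H[4]       # entity states (positions assumed for the toy)

W_Q, W_K, W_V, W_o = (rng.normal(scale=0.3, size=(dm, dm)) for _ in range(4))

q = (h_en1 - h_en2) @ W_Q       # relation query from entity subtraction
K, V = H @ W_K, H @ W_V

heads = []
for i in range(n_heads):
    sl = slice(i * dh, (i + 1) * dh)
    # score every word against the relation query, regardless of position
    scores = softmax(q[sl] @ K[:, sl].T / np.sqrt(dh))   # shape (l,)
    heads.append(scores[:, None] * V[:, sl])             # shape (l, dh)

E_m = np.concatenate(heads, axis=-1) @ W_o               # shape (l, dm)
assert E_m.shape == (l, dm)
```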

Disagreement Regularization on Multi-Head Attention
To further discriminate overlapped relation features across heads, we introduce disagreement regularization on multi-head attention. Formally, given n heads Head = [head_1, head_2, ..., head_n] as calculated above, we compute the cosine similarity cos(·,·) between each pair of head vectors head_i and head_j in their value subspaces:

cos(head_i, head_j) = (head_i · head_j) / (‖head_i‖ ‖head_j‖),

where ‖·‖ denotes the L2 norm. The average cosine similarity among all heads is then:

D_sub = (1/n²) Σ_i Σ_j cos(head_i, head_j).

Our goal is to minimize D_sub, which encourages the heads to differ from each other, improving the diversity of the subspaces among the heads. Accordingly, each head can discriminate overlapped relation features more clearly.
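The disagreement term can be sketched as the average pairwise cosine similarity among the head outputs (flattened to vectors for the toy illustration); training minimizes this quantity to push the heads apart. The head shapes and values here are arbitrary stand-ins.

```python
import numpy as np

def disagreement(parts):
    """Average pairwise cosine similarity over a list of arrays."""
    flat = [p.ravel() for p in parts]
    n = len(flat)
    total = 0.0
    for i in range(n):
        for j in range(n):
            a, b = flat[i], flat[j]
            total += a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return total / (n * n)

rng = np.random.default_rng(3)
heads = [rng.normal(size=(6, 4)) for _ in range(3)]   # toy head outputs
D_sub = disagreement(heads)
assert -1.0 <= D_sub <= 1.0 + 1e-9
```

The same function applied to the low-level capsules yields the capsule disagreement term described later, and the two are averaged into the final regularizer.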

Relation Gathering Layer
To form relation-specific features, the relation gathering layer gathers scattered relation features from diverse low-level capsules using a dynamic routing algorithm.

Low-Level Capsules with Disagreement Regularization
The capsule network has proven effective in discriminating overlapped features (Sabour et al., 2017; Zhang et al., 2019). In our model, a capsule is a group of neural vectors taken from one attention head and regularized by a disagreement term; thus, each capsule can capture relation features in a unique semantic space. Concretely, the orientation of the attention vector inside one head indicates one factor of a specific relation, while its length gives the probability that this relational factor exists. We reorganize each attention head of H_r to form a low-level capsule u_i ∈ R^(d_u):

u_i = squash(head_i),   i = 1, ..., t,

where t is the number of low-level capsules, which equals the number of heads, and squash(·) is the squash function

squash(s) = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖),

which shrinks the length of a vector into (0, 1) so that it can express a probability.
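The squash nonlinearity can be written compactly as below; the small epsilon is added for numerical stability near the zero vector and is an assumption of this sketch, not part of the original formulation.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink a vector's length into (0, 1) while keeping its direction."""
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

v = squash(np.array([3.0, 4.0]))   # input vector of length 5
# output length is 25 / 26, i.e. norm2 / (1 + norm2)
assert abs(np.linalg.norm(v) - 25.0 / 26.0) < 1e-6
```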
To encourage the diversity of these capsules, disagreement regularization is also applied to them:

D_cap = (1/t²) Σ_i Σ_j cos(u_i, u_j).

Minimizing D_cap encourages the capsules to differ from each other, improving the diversity of the subspaces among the capsules and discriminating overlapped relation features more clearly. The final disagreement regularization term is the average of the multi-head and capsule terms:

D = (D_sub + D_cap) / 2,

where D only takes effect during training.

High-Level Capsules with Dynamic Routing
After the low-level capsules capture different aspects of semantic information, the high-level capsules r_j ∈ R^(d_r) are produced from them to gather scattered information and form specific relation features:

r_j = squash(Σ_i c_ij W^h_j u_i),

where W^h_j ∈ R^(d_u×d_r) are the parameters of the high-level capsules and c_ij are coupling coefficients determined by the dynamic routing process described in (Sabour et al., 2017).
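Dynamic routing between the capsule layers can be sketched as follows, after Sabour et al. (2017). The transformation tensor W, the capsule sizes, and the three routing iterations are assumptions of this toy sketch.

```python
import numpy as np

def squash(s, eps=1e-9):
    norm2 = np.sum(s * s, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u, W, iterations=3):
    """Route t low-level capsules u to m high-level capsules."""
    u_hat = np.einsum('jik,tk->tji', W, u)   # predictions, shape (t, m, dr)
    b = np.zeros((u.shape[0], W.shape[0]))   # routing logits
    for _ in range(iterations):
        c = softmax(b, axis=1)               # coupling coefficients c_ij
        s = (c[:, :, None] * u_hat).sum(0)   # weighted sum per high-level capsule
        r = squash(s)                        # high-level capsules, shape (m, dr)
        b = b + (u_hat * r[None]).sum(-1)    # increase logits by agreement
    return r

rng = np.random.default_rng(4)
du, dr, t, m = 4, 3, 5, 2                    # toy capsule dimensions and counts
u = rng.normal(size=(t, du))                 # low-level capsules
W = rng.normal(scale=0.3, size=(m, dr, du))  # stand-in for the learned W^h_j
r = dynamic_routing(u, W)
assert r.shape == (m, dr)
```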

Loss Function
The sliding-margin loss used in the capsule network enables the prediction of multiple overlapped relations by summing the loss over both the relations present in and absent from an instance. We integrate this margin loss into our model as follows:

L_j = Y_j max(0, (S + γ) - ‖r_j‖)² + λ (1 - Y_j) max(0, ‖r_j‖ - (S - γ))²,

where γ is the width of the margin, S is a learnable threshold for "no relation" (NA), and λ down-weights the loss for absent relations. Y_j = 1 if the relation corresponding to r_j is present in the sentence and Y_j = 0 otherwise. The final loss is then defined as:

L = Σ_j L_j + β_1 D + β_2 ‖θ‖²,

where β_1 and β_2 are hyperparameters that weight the disagreement regularization and the L2 regularization of all parameters θ. We use Adam (Kingma and Ba, 2014) to minimize the final loss.
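The sliding-margin loss can be sketched as below; the concrete values of S, γ, and λ are toy assumptions (S is learnable in the model but fixed here), and the capsule lengths are made-up inputs.

```python
import numpy as np

def sliding_margin_loss(lengths, labels, S=0.5, gamma=0.4, lam=0.5):
    """lengths: ||r_j|| per relation capsule; labels: Y_j in {0, 1}."""
    # penalize present relations whose capsule length falls below S + gamma
    present = labels * np.maximum(0.0, (S + gamma) - lengths) ** 2
    # penalize absent relations whose capsule length exceeds S - gamma
    absent = lam * (1 - labels) * np.maximum(0.0, lengths - (S - gamma)) ** 2
    return (present + absent).sum()

lengths = np.array([0.95, 0.05, 0.6])   # toy capsule lengths
labels = np.array([1.0, 0.0, 0.0])      # only the first relation is present
loss = sliding_margin_loss(lengths, labels)
# only the third capsule is penalized: 0.5 * (0.6 - 0.1)^2 = 0.125
assert abs(loss - 0.125) < 1e-9
```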

Experiments
Our experiments are devised to demonstrate that RA-CapNet can identify highly overlapped relations of informal instances in distant supervision. In this section, we first introduce the dataset and experimental setup. Then, we evaluate the overall performance of RA-CapNet and the effects of different parts of RA-CapNet. Finally, we present the case study.

Dataset and Experimental Setup
Dataset. To evaluate RA-CapNet, we conduct experiments on two datasets. NYT-10 is a standard dataset constructed by (Riedel et al., 2010), which aligns relational tuples in Freebase (Bollacker et al., 2008) with the New York Times corpus; sentences from 2005-2006 are used for training and sentences from 2007 for testing. NYT-18 is a larger dataset constructed by (Zhang et al., 2020) with the same creation method as NYT-10, crawling contexts from the NYT between 2008 and 2017; its sentences are divided into five parts with the same relation distribution for five-fold cross-validation. The details of the datasets are given in Table 1.

Evaluation Metric. Following (Mintz et al., 2009), we use held-out metrics to evaluate RA-CapNet. The held-out evaluation offers an automatic way to assess models with the precision-recall (PR) curve and precision at the top 100 or 10k predictions (P@100 on NYT-10 and P@10k on NYT-18), where all instances under an entity pair are used to represent its relation.

Parameters. We use Skip-Gram (Mikolov et al., 2013) to pretrain our word embedding matrices; the words of a multi-word entity are concatenated. Grid search and cross-validation are used to tune the important hyperparameters of the network. Our final parameter settings are given in Table 2.

Baselines. To evaluate our model, we select the following methods for comparison: PCNN (Zeng et al., 2015) presents a piecewise CNN for relation extraction.
PCNN+INTRA+INTER (Ye and Ling, 2019) proposes to emphasize truly labeled sentences and bags.
ATT+CAPNET (Zhang et al., 2019) puts forward an attentive capsule network for relation extraction.
QARE+ATT (Zhang et al., 2020) proposes improved multi-head attention with transfer learning.

We compare our method with the baselines on both datasets. The PR curves on NYT-10 and NYT-18 are shown in Figure 4 and Figure 5 respectively. We find that: (1) BGRU+SET performs well on NYT-10 but poorly on NYT-18, demonstrating that BGRU+SET does not handle highly informal instances well, since the complex instances in NYT-18 are difficult for a conventional parser to parse precisely.
(2) RA-CapNet achieves the best PR curve among all baselines on both datasets, with significant improvements, verifying that our model is effective in capturing overlapped and scattered relation features. (3) RA-CapNet outperforms ATT+CAPNET, indicating that relation query multi-head attention and disagreement regularization are useful for overlapped relation extraction.

A detailed comparison of all approaches, including the areas under the PR curves and P@100 on NYT-10 or P@10k on NYT-18, is given in Table 3 and Table 4. From the tables, we find that: (1) RA-CapNet is the first method to raise the PR curve area above 0.5 on NYT-10, while improving it to 0.7 on NYT-18; our model also achieves superior P@100 and P@10k. This further demonstrates the effectiveness of RA-CapNet with multi-instance learning on overlapped relation extraction. (2) CapNet-based models achieve better performance on the highly complex NYT-18 dataset, which results from their capability of handling overlapped relations and complex sentences.

To further evaluate the impacts of the different parts of RA-CapNet, we compare its performance on the NYT-10 dataset under five settings:

Ablation Study
In the future, we will experiment with different forms of regularization terms and their application to other components of our model.