Multi-Level Structured Self-Attentions for Distantly Supervised Relation Extraction

Attention mechanisms are often used in deep neural networks for distantly supervised relation extraction (DS-RE) to distinguish valid from noisy instances. However, the traditional 1-D vector attention model is insufficient for learning the different contexts needed to select valid instances when predicting the relationship for an entity pair. To alleviate this issue, we propose a novel multi-level structured (2-D matrix) self-attention mechanism for DS-RE in a multi-instance learning (MIL) framework using bidirectional recurrent neural networks (BiRNN). In the proposed method, a structured word-level self-attention learns a 2-D matrix where each row vector represents a weight distribution for different aspects of an instance regarding two entities. Targeting the MIL issue, the structured sentence-level attention learns a 2-D matrix where each row vector represents a weight distribution on the selection of different valid instances. Experiments conducted on two publicly available DS-RE datasets show that the proposed framework with the multi-level structured self-attention mechanism significantly outperforms baselines in terms of PR curves, P@N and F1 measures.


Introduction
Relation extraction is a fundamental task in information extraction (IE), which studies the issue of predicting semantic relations between pairs of entities in a sentence (Zelenko et al., 2003; Bunescu and Mooney, 2005; Zhou et al., 2005). One crucial problem in RE is the relative lack of large-scale, high-quality labeled data. In recent years, one commonly used and effective technique for dealing with this challenge is the distant supervision method via knowledge bases (KBs) (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011), which assumes that if one entity pair appearing in some sentences can be observed in a KB with a certain relationship, then these sentences will be labeled as the context of this entity pair and this relationship. The distant supervision strategy is an effective and efficient method for automatically labeling large-scale training data. However, it also introduces a severe mislabelling problem due to the fact that a sentence that mentions two entities does not necessarily express their relation in a KB (Surdeanu et al., 2012; Zeng et al., 2015).
Much research has addressed distantly supervised data and has achieved significant progress, especially with the rapid development of deep neural networks (DNN) for relation extraction in recent years (Zeng et al., 2014, 2015; Lin et al., 2016, 2017a; Wang et al., 2016; Zhou et al., 2016; Ji et al., 2017; Yang et al., 2017; Zeng et al., 2017). DNN models under an MIL framework for DS-RE have become state-of-the-art, replacing statistical methods, such as feature-based and graphical models (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). In the MIL framework for distantly supervised RE, each entity pair often has multiple instances where some are noisy and some are valid. The attention mechanism in DNNs, such as convolutional (CNN) and recurrent neural networks (RNN), is an effective way to select valid instances by learning a weight distribution over multiple instances. However, there are two important representation learning problems in DNN-based distantly supervised RE: (1) Problem I: entity pair-targeted context representation learning from an instance; and (2) Problem II: valid instance selection representation learning over multiple instances. The former can use a word-level attention mechanism to learn a weight distribution over words and then a weighted sentence representation regarding the two entities; the latter can employ a sentence-level attention mechanism to learn a weight distribution over multiple instances so that valid sentences with higher weights are focused on and selected, and noisy instances with lower weights are suppressed.
Both the word-level and sentence-level attention mechanisms in previous work on the RE task are simple 1-D vectors which are learned using the hidden states of the RNN, or via pooling from either the RNNs' hidden states or convolved n-grams (Zeng et al., 2014, 2015; Zhou et al., 2016; Wang et al., 2016; Ji et al., 2017; Yang et al., 2017). The deficiency of the 1-D attention vector is that it only focuses on one or a small number of aspects of the sentence, or on one or a small number of instances (Lin et al., 2017b), with the result that other semantic aspects of the sentence, or other valid sentences, are ignored and cannot be utilised.
Inspired by the structured self-attentive sentence embedding in Lin et al. (2017b), we propose a novel multi-level structured (2-D) self-attention mechanism (MLSSA) in a bidirectional LSTM-based (BiLSTM) (Hochreiter and Schmidhuber, 1997) MIL framework to alleviate these two problems in distantly supervised RE. Regarding Problem I, we propose a 2-D matrix-based word-level attention mechanism, which contains multiple vectors, each focusing on different aspects of the sentence for better context representation learning. In terms of Problem II, we propose a 2-D sentence-level attention mechanism for multiple instance learning, which contains multiple vectors, each focusing on different valid instances for better sentence selection. Here, "structured" indicates that the weight vectors in the learned 2-D matrix try to construct a structural dependency relationship by learning different weight distributions for different contexts or instances given the entity pair. Our structured attention mechanism differs from that of Kim et al. (2017), whose structured attention networks incorporate richer structural distributions and are simple extensions of the basic attention procedure. We verify the proposed framework on two distantly supervised RE datasets, namely the New York Times (NYT) dataset (Riedel et al., 2010) and the DBpedia Portuguese dataset (Batista et al., 2013). Experimental results show that our MLSSA framework significantly outperforms state-of-the-art baseline systems in terms of different evaluation metrics.
The main contributions of this paper include: (1) we propose a novel multi-level structured (2-D) self-attention mechanism for DS-RE which can make full use of input sequences to learn different contexts, without integrating extra resources; (2) we propose a 2-D matrix-based word-level attention for better context representation learning targeting two entities; (3) we propose a 2-D sentence-level attention mechanism over multiple instances to select different valid instances; and (4) we verify the proposed framework on two publicly available distantly supervised datasets.

Related Work
Most existing work on distant supervision data mainly focuses on denoising the data under the MIL strategy by learning a valid sentence representation or features, and then selecting one or more valid instances for relation classification (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016, 2017a; Zhou et al., 2016; Ji et al., 2017; Zeng et al., 2017; Yang et al., 2017). Riedel et al. (2010) and Surdeanu et al. (2012) use a graphical model and MIL to select the valid sentences and classify the relations. However, these models are based on statistical methods and feature engineering, i.e. extracting sentence features using other NLP tools. Zeng et al. (2015) proposed a piece-wise CNN (PCNN) method to automatically learn sentence-level features and select one valid instance for the relation classification. The one-sentence-selection strategy does not make full use of the supervision information among multiple instances. Lin et al. (2016) and Ji et al. (2017) introduce an attention mechanism to the PCNN-based MIL framework to select informative sentences, which outperforms all baseline systems on the NYT data set. However, their attention mechanism is only a sentence-level model without incorporating word-level attention. Zhou et al. (2016) introduce a word-level attention model to the BiLSTM-based MIL framework and obtain significant improvements on the SemEval2010 (Hendrickx et al., 2010) data set. Wang et al. (2016) extend the single word-level attention model to multiple word levels in CNNs to discern patterns in heterogeneous contexts of the input sentence, and achieve the best performance on the SemEval2010 data set. However, these two works were not targeting the distantly supervised RE problem. Yang et al. (2017) experiment with word-level and sentence-level attention models in the bidirectional RNN on the NYT dataset on the basis of the open source DS-RE system, and verify that a two-level attention mechanism achieves the best performance compared to PCNN/CNN models. Both the word-level and sentence-level attention models are 1-D vectors.
From previous work, we can see that the attention mechanism in DNNs has made significant progress on the RE task. However, both word-level and sentence-level attention models are still based on 1-D vectors which have the following insufficiencies: (1) although the 1-D attention model can learn weights for different contexts, it only focuses on one or very few aspects of a single sentence (Lin et al., 2017b), or one or very few instances; (2) in order to allow the attention mechanism to learn more aspects of the sentence, or different instances, extra knowledge needs to be integrated, such as the work in Ji et al. (2017) and Lin et al. (2017a). The former integrates entity descriptions generated from Freebase and Wikipedia as supplementary background knowledge to disambiguate the entity. The latter introduces a multilingual framework which employs a monolingual attention mechanism to utilize the information within monolingual texts, and further uses a cross-lingual attention mechanism to consider the information consistency and complementarity among cross-lingual texts. However, extra resources are difficult to obtain in many practical scenarios.
In order to alleviate the burden of integrating extra knowledge, and make full use of the input sentence (i.e. learning different aspects of context and focusing on different valid instances), we propose a multi-level structured self-attention mechanism in a BiLSTM-based MIL framework without integrating extra resources.

Approach
Distantly supervised RE can be formalised as follows: given an entity pair $(e_1, e_2)$, a bag $G$ containing $J$ instances, and the relation label $r$ for $G$, the goal of the training process is to denoise these instances by selecting valid candidates based on $r$, and the goal of the testing process is to denoise multiple instances by selecting valid candidates to predict the relation $r$ for $G$.
To alleviate the aforementioned two problems, improving the following two representation learning issues is clearly important for a DNN-based RE classifier:
• Entity pair-targeted context representation: The model should have the capability to learn a better context representation from the input sentence targeting the entity pair;
• Instance selection representation: The model should have the capability to learn a better weight distribution over multiple instances to select valid instances regarding an entity pair.
Motivated by these two issues, we propose a multi-level structured self-attention framework.
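To make the bag-level formulation concrete, the following minimal sketch groups distantly labelled sentences into bags. The `Instance` record, the choice of keying bags by (head, tail, relation), and the toy sentences are all assumptions made for illustration; they are not taken from the datasets used in this paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Instance(NamedTuple):
    """One distantly labelled sentence (hypothetical record layout)."""
    head: str
    tail: str
    relation: str
    sentence: str


def build_bags(instances):
    """Group instances into bags G keyed by (head, tail, relation)."""
    bags = defaultdict(list)
    for inst in instances:
        bags[(inst.head, inst.tail, inst.relation)].append(inst.sentence)
    return dict(bags)


# Toy data, invented for illustration only.
toy = [
    Instance("maryland", "baltimore", "/location/location/contains",
             "he grew up in baltimore , maryland ."),
    Instance("maryland", "baltimore", "/location/location/contains",
             "the train from baltimore crossed maryland overnight ."),
    Instance("paris", "rome", "NA",
             "she flew from paris to rome ."),
]

bags = build_bags(toy)
for (head, tail, relation), sentences in bags.items():
    # J is the number of instances in the bag.
    print(head, tail, relation, "J =", len(sentences))
```

Only the first sentence of the Maryland bag plausibly expresses the relation; the second is the kind of noisy instance that the attention mechanisms are meant to down-weight.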

Architecture
The proposed framework consists of three parts as shown in Figure 1. The first part includes the input layer, embedding layer and BiLSTM layer which transform the input sequence at different time steps to LSTM hidden states.
The second part implements the entity pair-targeted context representation learning, including:
• a structured word-level self-attention layer: this generates a set of summation weight vectors (or a 2-D matrix) taking the LSTM hidden states as input. Each vector in the 2-D matrix represents the weights for a different aspect of the input sentence.
• a structured context representation layer: the weight vectors learned by the 2-D word-level self-attention are multiplied with the BiLSTM hidden states. Accordingly, a 2-D matrix or a set of weighted LSTM hidden state vectors, denoted as "$M^{L1}$" in Figure 1, is obtained. Each weighted vector represents a sentence embedding reflecting a different aspect of the sentence targeting the entity pair. By this means, a dependency parsing-like structure of the input sentence can be constructed, obtaining different semantic representations of the sentence for the two entities in question.
• a flattened representation layer: this concatenates each vector in the 2-D matrix of the sentence embedding to one vector. Then, the flattened vector connects to a 1-layer multilayer perceptron (MLP) with ReLU activation function, generating an aggregated sentence representation.
The first and second parts operate on the single instance level, i.e. given a bag $G$ and feeding each instance into the framework, the structured word-level self-attention mechanism will construct $J$ individual structured sentence representations corresponding to the $J$ input instances.
The third part targets the instance selection representation learning issue, and operates on the bag level, i.e. considering weighted context representations of all instances in the bag $G$ and learning probability distributions to distinguish informative from noisy sentences. This part includes:
• a structured sentence-level attention model: this has a similar structure to the structured word-level attention mechanism, except that it generates a set of summation weight vectors for all input instances in the same bag $G$. Each vector is a weight distribution over all instances. Accordingly, the 2-D sentence-level matrix is expected to learn a set of different weight distributions focusing on different informative instances. As a result, informative sentences are expected to contribute more with higher weights, and noisy sentences are expected to contribute less with smaller weights, to the relation classification.
• an averaged sentence-level attention layer: the 2-D sentence-level attention matrix is averaged and converted to a 1-D vector.
• a selection representation layer: the 1-D averaged attention vector is multiplied with the output of the flattened representation layer. Accordingly, a 1-D vector, denoted as "$M^{L2}$" in Figure 1, is obtained which represents an averaged weighted selection representation of multiple sentences.
• an output layer: this connects to a softmax layer and produces a probability distribution corresponding to relation classes.

Structured Word-Level Self-Attention and its Penalisation Function
Given a bag $G = (S_1, S_2, \ldots, S_J)$ containing $J$ instances, and a sentence $S_j$ in $G$ consisting of $N$ tokens, $S_j$ can be represented using a sequence of word embeddings, as in (1):
$$S_j = (e_1, e_2, \ldots, e_N) \tag{1}$$
where $e_i$ is a $d$-dimensional vector for the $i$-th word, and $S_j$ is the $j$-th instance in $G$.
We denote the hidden states of the BiLSTM as in (2):
$$H = (h_1, h_2, \ldots, h_N) \tag{2}$$
where $h_t$ is the concatenation of the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$ at time step $t$. If the size of each unidirectional LSTM is $u$, then $H$ has the size $2u$-by-$N$.
Then, the structured word-level self-attention mechanism is defined as in (3):
$$A^{L1} = \mathrm{softmax}\left(W^{L1}_{s2} \tanh\left(W^{L1}_{s1} H\right)\right) \tag{3}$$
where $L1$ stands for the first-level attention mechanism, i.e. the word level; $W^{L1}_{s1}$ is a weight matrix of size $d^{L1}_a \times 2u$, where $d^{L1}_a$ is a hyper-parameter for the number of neurons in the attention network; $W^{L1}_{s2}$ is a weight matrix of shape $r^{L1} \times d^{L1}_a$, where $r^{L1}$ ($r^{L1} > 1$) is the hyper-parameter giving the number of vectors in the 2-D attention matrix. The value of $r^{L1}$ is chosen according to how many different aspects of the sentence need to be focused on; $A^{L1}$ is the annotation matrix of size $r^{L1} \times N$. We can see that in $A^{L1}$ there are $r^{L1}$ attention vectors for the $N$-token input sentence, each row being a weight distribution over the tokens.
Finally, we compute the $r^{L1}$ weighted sums by multiplying the annotation matrix $A^{L1}$ and the BiLSTM hidden states $H$. The resulting structured sentence representation $M^{L1}$ is (4):
$$M^{L1} = A^{L1} H^{\mathsf{T}} \tag{4}$$
where $\mathsf{T}$ is the transpose operation and $M^{L1}$ has the shape $r^{L1} \times 2u$. It can be seen that the traditional 1-D sentence representation is extended to a 2-D representation ($r^{L1} > 1$).
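The shapes stated above can be checked with a small NumPy sketch of the word-level step. It assumes that the hidden states are stored as columns of a $2u \times N$ matrix, that the softmax is taken over the $N$ tokens, and that no bias terms are used; random values stand in for real BiLSTM outputs and learned weights, and the toy sizes are not the paper's hyper-parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def structured_word_attention(H, W_s1, W_s2):
    """Return (A, M) for one sentence.

    H:    (2u, N) hidden states, one column per token (assumed layout)
    W_s1: (d_a, 2u) first attention weight matrix
    W_s2: (r, d_a)  second attention weight matrix
    A:    (r, N)    each row is a weight distribution over the N tokens
    M:    (r, 2u)   r weighted sums of the hidden states
    """
    A = softmax(W_s2 @ np.tanh(W_s1 @ H), axis=1)
    M = A @ H.T
    return A, M


rng = np.random.default_rng(0)
u, N, d_a, r = 4, 6, 5, 3            # toy sizes for illustration
H = rng.normal(size=(2 * u, N))      # stand-in for BiLSTM hidden states
W_s1 = rng.normal(size=(d_a, 2 * u))
W_s2 = rng.normal(size=(r, d_a))

A_L1, M_L1 = structured_word_attention(H, W_s1, W_s2)
print(A_L1.shape, M_L1.shape)
```

With $r = 1$ this reduces to the usual 1-D attention vector; larger $r$ simply stacks several such distributions, which is why a penalisation term is needed to keep the rows from collapsing onto the same tokens.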
Subsequently, the output of the flattened representation layer for the instance $S_j$ in $G$ is (5):
$$O^{L1}_j = \mathrm{ReLU}\left(W^{L1}_o M^{FT}_{L1} + b^{L1}_o\right) \tag{5}$$
where $W^{L1}_o$ is the weight matrix of shape $v$-by-$(r^{L1} \cdot 2u)$, and $v$ is the number of neurons in the ReLU-based MLP layer; $M^{FT}_{L1}$ is the flattened structured sentence representation, i.e. the concatenation of the rows of $M^{L1}$, with dimension $r^{L1} \cdot 2u$; $b^{L1}_o$ is the bias vector of size $v$; $O^{L1}_j$ is the aggregated sentence representation of the $j$-th instance in the bag $G$, with size $v$.
Then, the output of all instances in $G$ from the flattened representation layer is denoted as in (6):
$$O^{L1} = \left(O^{L1}_1, O^{L1}_2, \ldots, O^{L1}_J\right) \tag{6}$$
where $O^{L1}$ has the shape $v \times J$. As in Lin et al. (2017b), the penalisation term for the structured word-level attention is as in (7):
$$P^{L1} = \left\| A^{L1} \left(A^{L1}\right)^{\mathsf{T}} - I \right\|_F^2 \tag{7}$$
where $\| \cdot \|_F$ is the Frobenius norm of a matrix and $I$ is an identity matrix. Minimising this penalisation term pushes the rows of $A^{L1}$ towards being mutually orthogonal, so that each row in $A^{L1}$ focuses on a single aspect of semantics.
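The flattening step and the penalisation term can be sketched as follows. This is an illustration under assumptions: rows are concatenated in row-major order, the penalty is the squared Frobenius norm used by Lin et al. (2017b), and the two attention matrices at the end are hand-built extremes rather than learned values.

```python
import numpy as np


def flatten_and_project(M, W_o, b_o):
    """Concatenate the r rows of M and apply a one-layer ReLU MLP."""
    m_flat = M.reshape(-1)                       # shape (r * 2u,)
    return np.maximum(0.0, W_o @ m_flat + b_o)   # shape (v,)


def penalisation(A):
    """Squared Frobenius norm of (A A^T - I) for an (r, N) matrix A."""
    r = A.shape[0]
    gram = A @ A.T
    return float(np.sum((gram - np.eye(r)) ** 2))


rng = np.random.default_rng(0)
r, two_u, v = 3, 8, 5                            # toy sizes
M = rng.normal(size=(r, two_u))
W_o = rng.normal(size=(v, r * two_u))
b_o = np.zeros(v)
O_j = flatten_and_project(M, W_o, b_o)

# Each row puts all its weight on a different token: no penalty.
A_diverse = np.eye(3, 6)
# All rows share the same uniform distribution: penalised.
A_redundant = np.full((3, 6), 1.0 / 6.0)
print(penalisation(A_diverse), penalisation(A_redundant))
```

The redundant matrix is penalised both for its overlapping rows (off-diagonal entries of the Gram matrix) and for its diffuse rows (diagonal entries below 1), which matches the stated aim of having each row concentrate on one aspect.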

Structured Sentence-Level Self-Attention and Averaged Selection Representation
Taking $O^{L1}$ as the input to the structured 2-D sentence-level attention model, the annotation matrix $A^{L2}$ is calculated as in (8):
$$A^{L2} = \mathrm{softmax}\left(W^{L2}_{s2} \tanh\left(W^{L2}_{s1} O^{L1}\right)\right) \tag{8}$$
where $W^{L2}_{s1}$ is the weight matrix of size $d^{L2}_a \times v$, and $d^{L2}_a$ is the number of neurons in the attention network; $W^{L2}_{s2}$ is the weight matrix of shape $r^{L2} \times d^{L2}_a$, where $r^{L2}$ ($r^{L2} > 1$) is the hyper-parameter giving the number of vectors in the 2-D sentence-level attention matrix. The $r^{L2}$ vectors are expected to focus on different informative instances for the relation classification; $A^{L2}$ is the sentence-level annotation matrix of size $r^{L2} \times J$. We can see that the traditional 1-D sentence-level attention model is expanded to a multi-vector attention ($r^{L2} > 1$).
Then, we average the 2-D $A^{L2}$ over its $r^{L2}$ rows to obtain a 1-D vector $\bar{A}^{L2}$ of dimension $J$.
Accordingly, we calculate the averaged weighted sum by multiplying $\bar{A}^{L2}$ and the aggregated sentence representation $O^{L1}$, with the resulting instance selection representation $M^{L2}$ being (9):
$$M^{L2} = \bar{A}^{L2} \left(O^{L1}\right)^{\mathsf{T}} \tag{9}$$
where $M^{L2}$ has the size of $v$.
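The probability distribution of the predicted relation type, i.e. the final output for relation prediction, can be calculated as in (10):
$$p(r \mid G) = \mathrm{softmax}\left(W^{L2}_o M^{L2} + b^{L2}_o\right) \tag{10}$$
where $W^{L2}_o$ and $b^{L2}_o$ denote the weight matrix and bias vector of the output layer.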
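The bag-level steps described in this section (sentence-level attention, row averaging, selection, and softmax output) can be sketched in NumPy as below. The sketch assumes the softmax in the attention is taken over the $J$ instances; the output-layer names `W_out` and `b_out`, the number of relations, and all numeric values are hypothetical placeholders rather than the paper's settings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bag_representation(O, W_s1, W_s2):
    """O: (v, J) aggregated sentence representations of one bag.

    Returns the (r, J) annotation matrix, its (J,) row average,
    and the (v,) instance selection representation.
    """
    A = softmax(W_s2 @ np.tanh(W_s1 @ O), axis=1)   # rows sum to 1 over J
    A_bar = A.mean(axis=0)                          # average the r rows
    M = A_bar @ O.T                                 # weighted sum of instances
    return A, A_bar, M


rng = np.random.default_rng(1)
v, J, d_a, r, n_rel = 5, 4, 6, 3, 7                 # toy sizes
O = rng.normal(size=(v, J))
W_s1 = rng.normal(size=(d_a, v))
W_s2 = rng.normal(size=(r, d_a))
A_L2, A_bar, M_L2 = bag_representation(O, W_s1, W_s2)

# Hypothetical output layer: W_out and b_out are illustrative names.
W_out = rng.normal(size=(n_rel, v))
b_out = np.zeros(n_rel)
p = softmax(W_out @ M_L2 + b_out)
print(A_bar.round(3), int(p.argmax()))
```

Because every row of the annotation matrix is a distribution over the bag, their average is again a distribution, so the selection representation remains a convex combination of the instance representations.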

Loss Function and Optimisation
The total loss of the network is the sum of the penalisation term $P^{L1}$, the softmax loss from Eq. (10) and the L2 regularisation loss. We use the ADAM optimiser (Kingma and Ba, 2014) to minimise the loss over mini-batches randomly sampled from the training set.
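As a worked illustration of how the three terms combine for a single bag, consider the sketch below. The weighting coefficients `pen_coeff` and `l2_coeff` are assumptions introduced here (the text only states that the terms are summed), and the optimiser step itself is not shown.

```python
import math


def total_loss(probs, gold, penalty, weights, pen_coeff=1.0, l2_coeff=1e-4):
    """Cross-entropy + attention penalty + L2 regularisation for one bag.

    probs:   predicted relation distribution for the bag
    gold:    index of the gold relation
    penalty: value of the word-level penalisation term
    weights: flat iterable of trainable parameters
    The two coefficients are illustrative assumptions.
    """
    cross_entropy = -math.log(probs[gold])
    l2 = sum(w * w for w in weights)
    return cross_entropy + pen_coeff * penalty + l2_coeff * l2


# Toy numbers, invented for illustration.
loss = total_loss(probs=[0.7, 0.2, 0.1], gold=0, penalty=0.5,
                  weights=[1.0, -2.0], l2_coeff=0.1)
print(round(loss, 4))
```

In a mini-batch setting the same quantity would be averaged over the bags in the batch before the gradient step.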

Datasets
We use two distantly supervised datasets, namely the NYT corpus (NYT) and the DBpedia Portuguese dataset (PT), 2 to verify our method.
In the NYT dataset, there are 53 relationships including a special relation NA which indicates that there is no relation between the two entities. The training set contains 580,888 sentences, 292,484 entity pairs and 19,429 relational facts (Non-NA). The test set contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts (Non-NA). Entity pairs with multiple instances account for 19.24% and 22.57% of the entity pairs in the training set and test set, respectively.
The DBpedia Portuguese dataset is smaller, containing just 10 relationships including a special relation Other. After preprocessing the original dataset, we obtain 96,847 sentences, 85,528 entity pairs and 77,321 relational facts (Non-Other). Entity pairs with multiple instances account for 8.61% of the entity pairs in the whole dataset. As in Batista et al. (2013), we use two different settings for the training and test sets: (1) a manually reviewed subset that contains 602 sentences (PT-MANUAL) as the test set; and (2) a 70%-30% split of the whole data into the training set and test set, respectively (PT-SPLIT).
2 There are several reasons to use the Portuguese dataset: (i) some data sets reported in previous work, such as the KBP data, are not publicly available; and (ii) others, such as the SemEval data sets, are not distantly supervised. Google has also released a dataset (https://github.com/google-research-datasets/relation-extraction-corpus), but it is smaller and only has 4 relation types. For all these reasons, the Portuguese data is a better option to verify our method.

Word Embeddings and Relative Position Features
For the NYT dataset, we use the 200-dimensional word vectors pre-trained using the NYT corpus; for the PT dataset, we use a pre-trained 300-dimensional word vector model. For two-word entities in the data set, we join the words with an underscore to form one token. The word embeddings of unknown words are initialised from a normal distribution with standard deviation 0.05. Similar to previous work, we also use position embeddings specified by entity pairs. They are defined as the combination of the relative distances from the current word to the head and tail entities (Zeng et al., 2014, 2015; Lin et al., 2016).
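The relative position features can be illustrated with a short sketch. The helper name `relative_positions` and the toy sentence are assumptions for illustration; entities are taken to be single tokens, consistent with the underscore joining described above.

```python
def relative_positions(tokens, head, tail):
    """Return one (distance to head, distance to tail) pair per token.

    Distances are signed token offsets; entities are assumed to be
    single tokens that each occur once in the sentence.
    """
    head_idx = tokens.index(head)
    tail_idx = tokens.index(tail)
    return [(i - head_idx, i - tail_idx) for i in range(len(tokens))]


# Toy sentence, invented for illustration.
tokens = "he grew up in baltimore , maryland .".split()
pos = relative_positions(tokens, head="maryland", tail="baltimore")
for token, (to_head, to_tail) in zip(tokens, pos):
    print(f"{token:10s} {to_head:3d} {to_tail:3d}")
```

In a typical implementation, each offset would be shifted to a non-negative index and looked up in its own position-embedding table, and the two resulting vectors would be concatenated with the word embedding; those details are not specified here and may differ in the systems compared.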

Baselines and Our MLSSA Systems
Neural RE systems have become the state of the art, such as CNN-based (Zeng et al., 2014; Lin et al., 2017a), piecewise CNN-based (Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017), and BiLSTM-based (Zhou et al., 2016) models with or without an attention mechanism. In order to carry out a fair comparison, we select CNN+ATT, PCNN+ATT, BiGRU+ATT (bidirectional gated recurrent unit) and BiGRU+2ATT models as baselines on the NYT data, and PCNN+ATT and BiGRU+2ATT as baselines on the PT data, where ATT indicates that the model has a sentence-level attention mechanism, and 2ATT indicates that the model has a 1-D word-level and a 1-D sentence-level attention.
To show the incremental effectiveness of the structured 2-D word-level and 2-D sentence-level self-attention mechanisms, we use two different settings for our MLSSA system: (1) MLSSA-1: this has a 2-D word-level self-attention and a 1-D sentence-level attention, i.e. $A^{L2}$ in Figure 1 is a 1-D vector. This system is used to verify the context representation learning targeting Problem I; (2) MLSSA-2: both the word-level and sentence-level attentions are structured 2-D matrices. This system verifies the instance selection representation learning targeting Problem II.

Experiment Setup and Evaluation Metrics
Following previous work, we use different evaluation metrics on these two datasets. For the NYT dataset:
• Overall evaluation: all training data is used for model training, and all test data is used for evaluation in terms of Precision-Recall (PR) curves;
• P@N evaluation: we select those entity pairs that have more than one instance to carry out the comparison in terms of the precision at N (P@N) measure. As in Lin et al. (2016), there are three settings: (1) One: for each testing entity pair corresponding to multiple instances, we randomly select one sentence to predict the relation; (2) Two: for each testing entity pair with multiple instances, we randomly select two sentences for the relation extraction; and (3) All: for each entity pair having multiple instances, we use all of them to predict the relation. Note that these three selections are only applied to the test set, and we keep all sentences in the training data for model building.
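The P@N protocol above can be sketched as follows. The helper names and the toy predictions are assumptions for illustration: `sample_instances` mimics the One/Two/All test-time selection, and `precision_at_n` ranks predicted facts by confidence and measures the fraction of correct ones among the top N.

```python
import random


def sample_instances(bag, k, rng):
    """One/Two settings: keep k random sentences; All: pass k=None."""
    if k is None or k >= len(bag):
        return list(bag)
    return rng.sample(bag, k)


def precision_at_n(predictions, n):
    """predictions: list of (confidence, is_correct) for predicted facts."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top = ranked[:n]
    return sum(1 for _, correct in top if correct) / len(top)


rng = random.Random(0)
bag = ["sentence a", "sentence b", "sentence c"]
one = sample_instances(bag, 1, rng)
everything = sample_instances(bag, None, rng)

# Toy predictions, invented for illustration.
preds = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.5, False)]
print(precision_at_n(preds, 2), precision_at_n(preds, 4))
```

In the reported experiments N is 100, 200 and 300; the toy list here is much shorter, so small values of N are used.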
For the PT dataset, we use Macro F1 to evaluate system performance.
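For reference, Macro F1 is the unweighted mean of per-class F1 scores. The sketch below is a plain-Python illustration on invented labels; whether the special relation Other is included among the averaged classes is not stated in the text, so the `labels` argument is left to the caller.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration.
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
score = macro_f1(gold, pred, labels=["a", "b"])
print(round(score, 4))
```

Here class "a" has F1 = 2/3 and class "b" has F1 = 4/5, so the macro average is 11/15. Because every class counts equally, rare relations influence the score as much as frequent ones.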

Hyper-parameter Settings
We use cross-validation to determine the hyperparameters of our system regarding two different settings and datasets. The in-common and different parameters for our two systems and two datasets are shown in Table 1.

PR Curves on NYT Dataset
The comparison results for the NYT test set are shown in Figure 2. We have the following observations: (1) BiGRU+ATT outperforms CNN+ATT and PCNN+ATT in terms of the PR curve, showing that it can learn a better semantic representation from the sequential input; (2) BiGRU+2ATT has better overall performance compared to BiGRU+ATT, showing that word-level attention is beneficial to sentence-level attention compared to single-attention models, i.e. the sentence-level attention model can select more informative sentences based on a more reasonable sentence embedding learned by the word-level attention model; (3) MLSSA-1 outperforms all baseline systems in terms of the PR curve, which demonstrates that the structured 2-D word-level attention model can learn a better sentence representation by focusing on different aspects of the sentence, so that the sentence-level attention has a better chance of selecting the most informative sentences; and (4) the PR curve of MLSSA-2 is higher than that of MLSSA-1, demonstrating that the 2-D sentence-level attention model can better select the most informative sentences compared to the 1-D sentence-level attention model targeting those entity pairs with multiple instances.

P@N Evaluation on NYT Dataset
The results on the NYT dataset in terms of P@100, P@200, P@300 and their mean, under the three settings for each model, are shown in Table 2. From the table, we have similar observations to the PR curves: (1) BiGRU+2ATT outperforms CNN+ATT, PCNN+ATT and BiGRU+ATT in most cases in terms of all P@N scores; and (2) MLSSA-1 and MLSSA-2 significantly outperform all baselines for all measures. We observe that MLSSA-1 performs better than MLSSA-2 on tasks One and Two, but worse on All. We infer that this is because we set $r^{L2}$ to 9 in our 2-D sentence-level attention model, while there are only one and two instances for selection in tasks One and Two, so the 2-D matrix cannot demonstrate its full potential. However, in All, many entity pairs have several, or even more than 9, instances, so the model can learn a better 2-D matrix to focus on different instances.
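One way to see why few instances limit the 2-D sentence-level matrix: a softmax over a single instance is always 1, so with one instance all nine attention rows coincide. The sketch below illustrates this with random scores standing in for learned attention logits; it is an illustration of the inference above, not an experiment from the paper.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax for an (r, J) score matrix."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def count_distinct_rows(A, decimals=6):
    """Number of distinct rows after rounding."""
    return len({tuple(row) for row in np.round(A, decimals)})


rng = np.random.default_rng(0)
r = 9                                   # number of attention rows
results = {}
for J in (1, 2, 12):                    # bag sizes: One, Two, and a larger bag
    logits = rng.normal(size=(r, J))    # stand-in for learned scores
    A = softmax_rows(logits)
    results[J] = count_distinct_rows(A)
    print(f"J={J}: {results[J]} distinct attention rows out of {r}")
```

With two instances the rows can differ, but each row has only a single degree of freedom, so nine rows can add little beyond what a 1-D attention already expresses.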

Results on PT Dataset
Based on results from the NYT dataset, we choose PCNN+ATT and BiGRU+2ATT as representative baselines to compare against our MLSSA-1/2 systems on the PT test sets. The results in terms of Macro F1 are shown in Table 3. It can be seen that on both test sets, our MLSSA-2 model achieved the best performance which shows that the structured 2-D word-level and sentence-level self-attention models can be well applied to datasets of a smaller scale and with a smaller ratio of multiple instances.

Examples and Analysis
In order to show the effectiveness of structured self-attention mechanisms, we show some examples. Figure 3 shows the comparison of the word-level attention mechanisms of BiGRU+2ATT and MLSSA-1, reflecting their capability of context representation learning (Problem I). MLSSA-2 has a similar probability distribution to MLSSA-1 in terms of this example.
Pink indicates lower probability and red indicates higher probability. We observe that: (1) BiGRU+2ATT mainly focuses on one word, baltimore. We can see that it pays little attention to the entity word maryland. In this example, the comma implies a semantic relationship location/location/contains for the entity pair (Maryland, Baltimore). However, BiGRU+2ATT allocates quite a small probability to it; and (2) we can see that our model focuses on different words via different attention vectors (9 in total). Words with a red background have a high probability of around 0.98. For rows 5, 6, 8 and 9, the focus is on the BLANK tokens. In both systems, the maximum time step is set to 70, which means that shorter sentences are padded with BLANK tokens and longer sentences are cut off. The last row shows the summation of the 9 annotation vectors, and it constructs a dependency-like context of the relation for the entity pair. The attention on different words is attributed to the penalisation $P^{L1}$, which is optimised to make the attention vectors orthogonal.
Figure 4 shows the comparison of sentence-level attentions between BiGRU+2ATT, MLSSA-1 and MLSSA-2. The first, second and third columns are probability distributions over multiple instances. The entity pair is (vinod khosla, sun microsystems), and their relation is Business/Person/Company. From this figure, we observe that: (1) BiGRU+2ATT allocates high probabilities to Sentences 1 and 2 by learning the context of "a founder of", but does not recognise that "co-founder" is semantically the same as "founder"; and (2) our two models focus almost evenly on all sentences because they express the same semantic concept of "a person is a founder of a company" in terms of the given entity pair. Therefore, the structured self-attention mechanism helps to learn a better representation and select informative sentences.

Conclusion and Future Work
This paper has proposed a multi-level structured self-attention mechanism for distantly supervised RE. In this framework, the traditional 1-D word-level and sentence-level attentions are extended to 2-D structured matrices which can learn different aspects of a sentence, and different informative instances. Experimental results on two distant supervision data sets show that (1) the structured 2-D word-level attention can learn a better sentence representation; (2) the structured 2-D sentence-level attention and averaged selection can perform better selection from multiple instances for relation classification; (3) the proposed framework significantly outperforms state-of-the-art baseline systems for a range of different measures, which verifies its effectiveness on the two representation learning issues. A subsequent manual investigation via examples also shows its effectiveness on the two representation learning issues.
In future work, we will build a domain-specific distant supervision dataset with a higher ratio of multiple instances and compare our system with others. Furthermore, we will consider not using RNNs or CNNs, but a deeper neural network built only on attention for distantly supervised RE, similar to the work in Vaswani et al. (2017).