Neural Relation Extraction via Inner-Sentence Noise Reduction and Transfer Learning

Extracting relations is critical for knowledge base construction and completion, and distant supervised methods are widely used to extract relational facts automatically with existing knowledge bases. However, the automatically constructed datasets contain many low-quality sentences with noisy words, a problem neglected by current distant supervised methods that results in unacceptable precision. To mitigate this problem, we propose a novel word-level distant supervised approach for relation extraction. We first apply a Sub-Tree Parse (STP) to remove noisy words that are irrelevant to relations. Then we construct a neural network that takes the subtree as input and applies entity-wise attention to identify the important semantic features of relational words in each instance. To make our model more robust against noisy words, we initialize our network with a priori knowledge learned from the relevant task of entity classification via transfer learning. We conduct extensive experiments on the New York Times (NYT) and Freebase corpora. Experiments show that our approach is effective and improves the Precision/Recall (PR) curve area from 0.35 to 0.39 over the state-of-the-art work.


Introduction
Relation extraction aims to extract relations between pairs of marked entities in raw texts. Traditional supervised methods are time-consuming because they require large-scale manually labeled data. Thus, Mintz et al. (2009) propose distant supervised relation extraction, in which large numbers of sentences are crawled from web pages of the New York Times (NYT) and labeled automatically with a known knowledge base. The method assumes that if two entities have a relation in a known knowledge base, all instances that mention these two entities will express the same relation. Obviously, this assumption is too strong, since a sentence that mentions the two entities does not necessarily express the relation contained in the knowledge base. As described in Riedel et al. (2010), the assumption leads to the wrong labeling problem. To tackle the wrong labeling problem, various multi-instance learning methods mitigate noise between sentences (Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016). Beyond the wrong labeling problem, distant supervised methods may suffer from the low quality of sentences, which arises because the large-scale dataset is constructed automatically by crawling web pages (Yang et al., 2017). To handle the problem of low-quality sentences, we face two major challenges: (1) reducing word-level noise within sentences; (2) improving the robustness of relation extraction against noise.
To explain the influence of word-level noise within sentences, consider the following sentence as an example: [It is no accident that the main event will feature the junior welterweight champion miguel cotto, a puerto rican, against Paul Malignaggi, an Italian American from Brooklyn.], where Paul Malignaggi and Brooklyn are the two corresponding entities. The subsentence [Paul Malignaggi, an Italian American from Brooklyn.] keeps enough words to express the relation /people/person/place of birth, and the other words can be regarded as noise that may hamper the extractor's performance. Meanwhile, as shown in Figure 1, half of the original sentences are longer than 40 words, which means that there are many irrelevant words inside sentences. In more detail, there are about 12 noisy words in each sentence on average, and 99.4% of sentences in the NYT-10 dataset contain noise. Although the Shortest Dependency Path (SDP) proposed by Xu et al. (2015) tries to get rid of irrelevant words for relation extraction, it is not suitable for such informal sentences. Moreover, word-level attention has been leveraged to alleviate the impact of noisy words (Zhou et al., 2016), but it weakens the importance of entity features for relation extraction. As for the second challenge, a robust model should extract precise relation features even from low-quality sentences containing noisy words. However, previous neural methods often lack robustness because parameters are initialized randomly and are hard to tune with noisy training data, resulting in poor extractor performance. Inspired by Kumagai (2016), we initialize neural networks with a priori knowledge learned from relevant tasks by transfer learning to improve the robustness of the target task. For relation extraction, entity type classification can serve as the relevant task, since entity types provide abundant background knowledge.
For instance, the sentence [Alfead Kahn, the Cornell-University economist who led the fight to deregulate airplanes.] expresses the relation business/person/company, which is hard to identify without the information that Alfead Kahn is a person and Cornell-University is a company. Therefore, type features learned from entity type classification provide proper a priori knowledge for initializing the relation extractor.
In this paper, we propose a novel word-level approach for distant supervised relation extraction that reduces inner-sentence noise and improves robustness against noisy words. To reduce inner-sentence noise, we utilize a novel Sub-Tree Parse (STP) method that removes irrelevant words by intercepting the subtree under the parent of the entities' lowest common ancestor. As shown in Figure 1, the average length of the parsed sentences is much shorter. Furthermore, entity-wise attention is adopted to alleviate the influence of noisy words in the subtree and emphasize task-relevant features. To tackle the second challenge, we initialize our model parameters with a priori knowledge learned from the entity type classification task via transfer learning. The experimental results show that our model achieves satisfactory performance among state-of-the-art works. Our contributions are summarized as follows:
• To handle the problem of low-quality sentences, we propose the STP to remove noisy words from sentences and the entity-wise attention mechanism to enhance the semantic features of relational words.
• We first propose to initialize the neural relation extractor with a priori knowledge learned from entity type classification, which strengthens its robustness against low-quality corpus.
• Our model achieves significant results for distant supervised relation extraction, which improves the Precision/Recall (PR) curve area from 0.35 to 0.39 and increases top 100 predictions by 6.3% over the state-of-the-art work.

Related Work
The distant supervised method plays an increasingly essential role in relation extraction due to its reduced requirement for human labor (Mintz et al., 2009). However, an evident drawback of the method is the wrong labeling problem. Thus, multi-instance and multi-label learning methods are proposed to address this issue (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). Meanwhile, other studies (Angeli et al., 2014; Han and Sun, 2016) incorporate human-designed features and leverage Natural Language Processing (NLP) tools. As neural networks have become widely used, an increasing number of neural approaches have been proposed. Zeng et al. (2015) use a piecewise convolutional neural network with multi-instance learning. Furthermore, selective attention over instances with the neural network is proposed (Lin et al., 2016). Making use of entity descriptions, Ji et al. (2017) assign more precise attention weights. Focusing on the imbalance of datasets, a soft-label method has also been proposed.
Recently, reinforcement learning and adversarial learning are widely used to select the valid instances for relation extraction (Feng et al., 2018;Qin et al., 2018b,a).
However, the above methods ignore inner-sentence noise. To remove irrelevant words, the SDP between entities has proved effective (De Marneffe and Manning, 2008; Chen and Manning, 2014; Xu et al., 2015; Miwa and Bansal, 2016). Nevertheless, in our observation, the SDP handles informal texts poorly (see Section 3.1 for details). Furthermore, word-level attention has been adopted to focus on relational words for relation extraction (Zhou et al., 2016), but it hinders the effect of entity words.
Transfer learning, proposed by Pratt (1993), provides an approach to leverage knowledge extracted from related tasks to enhance the performance of a target task. Furthermore, parameter transfer learning has proved effective for improving the stability of models by initializing model parameters reasonably (Pan and Yang, 2010; Kumagai, 2016).

Methodology
In this section, we present our methodology for distant supervised relation extraction. Figure 2 shows the overall architecture of our model, which is divided into three parts:
Sub-Tree Parser. Input instances are first parsed into dependency parse trees by the Stanford parser (Chen and Manning, 2014). Then the words in the STP and their relative positions are transformed into distributed representations.
Entity-Wise Neural Extractor. Given the representation of each subtree, a Bidirectional Gated Recurrent Unit (BGRU) extracts specific features. Then, entity-wise attention combined with word-level attention is applied to reduce irrelevant features for relation extraction. Finally, sentence-level attention is used to alleviate the influence of wrongly labeled sentences.
Parameter-Transfer Initializer. The transfer learning method pre-trains our model parameters on the task of entity type classification, aiming to boost the performance of relation extraction.

Sub-Tree Parser
Each instance is first put into the dependency parse module (the Stanford parser, https://nlp.stanford.edu/software/lex-parser.shtml) to build the dependency parse tree. Then we tailor the sentences based on the STP method. Finally, we transform the word tokens and position tokens of each instance into distributed representations through embedding matrices.

Sub-Tree Parse
In order to reduce inner-sentence noise and extract relational words, we propose the STP method, which intercepts the subtree of each instance under the parent of the entities' lowest common ancestor. For instance, in Figure 2(b), China and Shanghai are entities connected directly by the appositive relation. On the basis of the STP, the instance [In 1990, he lives in Shanghai, China.] is transformed to [in Shanghai, China.], where in is the parent of the lowest common ancestor of Shanghai and China and is kept as important information for expressing the relation location/location/contain. The words connected by the dashed line, indicating the extracted subtree, are reorganized into their original sequence order to form the network inputs.
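The interception step can be sketched as follows, assuming the dependency parse is given as a list of 1-based head indices (0 marking the root), as in the CoNLL format; the toy parse below is illustrative only, not actual Stanford parser output.

```python
def path_to_root(heads, i):
    """Return the list of nodes from token i up to the root."""
    path = [i]
    while heads[i - 1] != 0:
        i = heads[i - 1]
        path.append(i)
    return path

def subtree_parse(words, heads, e1, e2):
    """Keep the subtree under the PARENT of the two entities' lowest
    common ancestor, preserving original word order (the STP idea)."""
    p1, p2 = path_to_root(heads, e1), path_to_root(heads, e2)
    lca = next(n for n in p1 if n in p2)                   # lowest common ancestor
    root = heads[lca - 1] if heads[lca - 1] != 0 else lca  # its parent (or LCA if already at root)
    keep = {root}
    changed = True
    while changed:                                         # collect all descendants of `root`
        changed = False
        for tok in range(1, len(words) + 1):
            if tok not in keep and heads[tok - 1] in keep:
                keep.add(tok)
                changed = True
    return [words[t - 1] for t in sorted(keep)]

# Toy example: "In 1990 he lives in Shanghai , China ."
words = ["In", "1990", "he", "lives", "in", "Shanghai", ",", "China", "."]
heads = [4, 1, 4, 0, 4, 5, 6, 6, 4]   # hypothetical heads; "in" governs "Shanghai", which governs "China"
print(subtree_parse(words, heads, 6, 8))  # entities: Shanghai (6) and China (8)
```

With these heads, the LCA of Shanghai and China is Shanghai itself, so its parent in is retained along with the whole [in Shanghai, China] subtree, matching the worked example above.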
Within the parse tree, the SDP has been widely used by Chen and Manning (2014) and Xu et al. (2015) to help models focus on relational words. However, in our observation, the SDP is not appropriate when the key relational words do not lie on the SDP. Although additional information (dependency relations between words) can be adopted to enhance the performance of the SDP, we found it has only a minor effect in our experiments. Thus, we do not make use of other types of linguistic information. As Figure 2(b) shows, in the SDP method the original sentence is transformed to [Shanghai China], because Shanghai and China are connected directly in the dependency parse tree; this deletes the keyword in and may confuse the model when extracting relations. Compared with the SDP, the STP method is more appropriate for extracting useful information from informal sentences, where relational words often do not lie on the SDP.

Word and Position Embeddings
The inputs of the network are word and position tokens, which are transformed to distributed representations before they are fed into the neural model. We map the j-th word in the i-th instance to a vector of k dimensions, denoted $x^w_{ij} \in \mathbb{R}^k$, through the Skip-Gram model (Mikolov et al., 2013). Like Zeng et al. (2014), we leverage Pos1 and Pos2 to specify entity pairs, defined as the relative distances of the current word from the head entity and the tail entity. For instance, in Figure 2 the relative distances of lived from Shanghai and China are -2 and -4 respectively. Then the position token of each word is transformed to a vector of l dimensions; the position embeddings are denoted $x^{p1}_{ij} \in \mathbb{R}^l$ and $x^{p2}_{ij} \in \mathbb{R}^l$ respectively. Finally, the input representation $x_{ij}$ is the concatenation of the word embedding and the two position embeddings: $x_{ij} = [x^w_{ij}; x^{p1}_{ij}; x^{p2}_{ij}]$.
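As a concrete toy sketch of this input layer, with random vectors standing in for the Skip-Gram word embeddings and the learned position embeddings:

```python
# Minimal sketch of the input representation: relative positions to the two
# entities, then concatenation of word and position embeddings. The embedding
# tables are random stand-ins; real word vectors come from Skip-Gram.
import random

def relative_positions(n_tokens, head_idx, tail_idx):
    """Pos1/Pos2: distance of each token from the head and tail entity."""
    return [(t - head_idx, t - tail_idx) for t in range(n_tokens)]

k, l = 4, 2                      # word / position embedding sizes (toy values)
rng = random.Random(0)
word_emb = {w: [rng.uniform(-1, 1) for _ in range(k)] for w in ["in", "Shanghai", ",", "China"]}
pos_emb = {d: [rng.uniform(-1, 1) for _ in range(l)] for d in range(-10, 11)}

tokens = ["in", "Shanghai", ",", "China"]
reps = []
for t, (p1, p2) in zip(tokens, relative_positions(len(tokens), 1, 3)):
    reps.append(word_emb[t] + pos_emb[p1] + pos_emb[p2])  # x_ij = [x^w; x^p1; x^p2]

print(len(reps), len(reps[0]))  # 4 tokens, each of dimension k + 2l
```

Each token representation has dimension k + 2l, matching the concatenation described above.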

Entity-Wise Neural Extractor
As shown in Figure 2, we first transform the STP into feature vectors by the BGRU. Next, entity-wise attention combined with the hierarchical-level attention mechanism is applied to enhance the semantic features of each instance.

BGRU over STP
Since the transfer learning and entity-wise attention require the specific features of entities in tree-parsed instances as their input, we adopt the Gated Recurrent Unit (GRU) (Cho et al., 2014) as our base relation extractor, which extracts contextual information for each word at its corresponding position in the sequence. It can be briefly described as $h_{it} = \mathrm{GRU}(x_{it}, h_{i(t-1)})$, where $x_{it}$ is the t-th word representation in the i-th parsed instance as described in the input layer, and $h_{it} \in \mathbb{R}^m$ is the hidden state of the GRU in m dimensions. Furthermore, a BGRU, implementing GRUs in both directions, can access future as well as past context. Under our circumstances, the BGRU combined with the STP can extract semantic and syntactic features adequately. Figure 2(a) shows the processing of the BGRU over the STP. The following equation defines the operation mathematically.
$$h_{it} = \overrightarrow{h}_{it} + \overleftarrow{h}_{it}$$
In the above equation, the t-th word output $h_{it} \in \mathbb{R}^m$ of the BGRU is the element-wise addition of the t-th hidden states of the forward GRU and the backward one.
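The bidirectional combination can be sketched in a few lines; a toy one-dimensional recurrent cell with arbitrary weights stands in for the full GRU here.

```python
import math

def rnn_pass(xs, w=0.5, u=0.3):
    """Toy 1-D recurrent cell (a stand-in for a full GRU) over a sequence."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        out.append(h)
    return out

def bgru(xs):
    """Bidirectional pass: element-wise addition of forward and backward states."""
    fwd = rnn_pass(xs)
    bwd = list(reversed(rnn_pass(list(reversed(xs)))))
    return [f + b for f, b in zip(fwd, bwd)]   # h_t = forward_t + backward_t

states = bgru([1.0, -0.5, 0.25])
print([round(s, 3) for s in states])
```

Each output state mixes left context (forward pass) and right context (backward pass) at the same position, as in the equation above.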

Entity-wise Attention
To reduce noise within sentences, we propose the entity-wise attention mechanism to help our model focus on relational words, especially entity words, for relation extraction. Assume that $H_i$ is the i-th instance matrix consisting of T word vectors $[h_{i1}, h_{i2}, \cdots, h_{iT}]$ produced by the BGRU.
Not all words contribute equally to the representation of the sentence. Entity words are of great importance because they are significantly beneficial to relation extraction. In our model, entity-wise attention assigns a weight $\alpha^e_{it}$ to focus on the target entities and remove noise further. It is defined as follows:
$$\alpha^e_{it} = \begin{cases} 1 & \text{if the } t\text{-th word belongs to the head or tail entity} \\ 0 & \text{otherwise} \end{cases}$$
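A minimal sketch of these hard entity weights follows; the additive combination with toy word-level weights is our illustrative reading of how the two mechanisms can be combined, not a quoted formula.

```python
# Entity-wise weights: a hard 1 on entity tokens, 0 elsewhere.
def entity_wise_weights(tokens, entity_words):
    return [1.0 if t in entity_words else 0.0 for t in tokens]

tokens = ["in", "Shanghai", ",", "China"]
alpha_e = entity_wise_weights(tokens, {"Shanghai", "China"})
alpha_w = [0.1, 0.4, 0.1, 0.4]                 # toy soft word-level weights
combined = [e + w for e, w in zip(alpha_e, alpha_w)]
print(combined)
```

Entity tokens end up with the largest combined weights, so their hidden states dominate the sentence representation.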

Hierarchical-level Attention
To further reduce inner-sentence noise and de-emphasize noisy sentences, we incorporate word-level attention and sentence-level attention as hierarchical-level attention, as introduced in Yang et al. (2016).
Word-level Attention. It assigns an additional weight $\alpha^w_{it}$ to the relational word $h_{it}$ according to its relevance to the relation, as described by Zhou et al. (2016):
$$u_{it} = \tanh(A^w h_{it}), \qquad \alpha^w_{it} = \frac{\exp(u_{it}^{\top} r^w)}{\sum_{t'} \exp(u_{it'}^{\top} r^w)}$$
where $A^w$ is a weight matrix, and the vector $r^w$ can be seen as a high-level representation of a fixed query, "which is the informative word", over the other words.
The i-th sentence representation $S_i \in \mathbb{R}^m$ is computed as a weighted sum of the $h_{it}$, combining the entity-wise and word-level weights:
$$S_i = \sum_{t} (\alpha^e_{it} + \alpha^w_{it})\, h_{it}$$
Sentence-level Attention. After we obtain the instance representations $S_i$, we adopt the selective attention mechanism over instances to de-emphasize noisy sentences (Lin et al., 2016):
$$e_i = S_i A^s r^s, \qquad \alpha^s_i = \frac{\exp(e_i)}{\sum_{j} \exp(e_j)}, \qquad S = \sum_{i} \alpha^s_i S_i$$
where $A^s$ is a weight matrix, $r^s$ is the query vector associated with the relation, and $S \in \mathbb{R}^m$ is the output of the sentence-level attention layer.
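Both attention levels share the same core operation: score each vector against a query, normalize with softmax, and take the weighted sum. A self-contained sketch with made-up vectors:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(vectors, query):
    """Dot-product attention: weighted sum of `vectors` under a query vector.
    Stands in for both levels: word states -> S_i, then sentence vectors -> S."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    alphas = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(a * vec[d] for a, vec in zip(alphas, vectors)) for d in range(dim)]
    return pooled, alphas

# Toy BGRU word states and a toy query vector (both hypothetical)
H = [[0.2, 0.1], [0.9, -0.3], [0.4, 0.4]]
S_i, alphas = attend(H, [1.0, 0.0])
print([round(a, 3) for a in alphas])
```

The second word state gets the largest weight because it aligns best with the query; the learned matrices $A^w$ and $A^s$ would shape these scores in the full model.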

Parameter-Transfer Initializer
The transfer learning method pre-trains our model parameters in the entity type classification task, which in turn contributes to the relation extraction.

Pre-learn the Entity Type
As entity type information plays a significant role in detecting relation types, the entity type classification task is taken as the source task, learned before the relation extraction task. According to Eq. 6, the outputs of the sentence-level attention layer for the head entity and tail entity tasks are $S_{head}$ and $S_{tail}$ respectively. They are ultimately fed into the softmax layer:
$$\hat{p}_i = \mathrm{softmax}(W_i S_i + b_i), \quad i \in \{head, tail\}$$
where $W_i$ and $b_i$ are the weight and bias for the entity type classification task, $\hat{p}_i \in \mathbb{R}^{z_t}$ is the predicted probability of each class, and $z_t$ is the number of entity classes. The loss function of the source task is the negative log-likelihood of the true labels:
$$J_t(\theta_0, \theta_{head}, \theta_{tail}) = -\sum_{i \in \{head, tail\}} \lambda_i\, y_i^{\top} \log \hat{p}_i + \beta \lVert \theta \rVert_2^2$$
where $\lambda_i$ is the weight of each task, $\theta_0$ denotes the shared model parameters, $\theta_{head}$ and $\theta_{tail}$ are the individual parameters for the head and tail entity classification tasks respectively, $y_i \in \mathbb{R}^{z_t}$ is the one-hot vector representing the ground truth, and $\beta$ is the hyper-parameter for L2 regularization.
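The classifier head and its loss can be sketched with toy logits; the values and the L2 term below are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def type_loss(logits, gold_index, l2=0.0, params=()):
    """Negative log-likelihood of the true entity type, plus an L2 penalty."""
    p = softmax(logits)
    return -math.log(p[gold_index]) + l2 * sum(w * w for w in params)

# Toy logits for three hypothetical entity types; the gold type is index 0.
loss = type_loss([2.0, 0.5, -1.0], 0)
print(round(loss, 4))
```

A confident, correct prediction drives the loss toward zero; the regularizer keeps the shared parameters small, which helps them transfer cleanly.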

Train the Relation Extractor
Based on the model pre-trained on the entity type classification task, the relation extractor initializes the shared parameters $\theta_0$ with the best state of the pre-trained model and the independent parameters $\theta_r$ randomly. As in the entity type classification task, the output $S_r$ of the attention layer for the relation extraction task is finally fed into the softmax layer and the loss is calculated by cross entropy:
$$\hat{p} = \mathrm{softmax}(W_r S_r + b_r), \qquad J_r(\theta_0, \theta_r) = -y^{\top} \log \hat{p} + \beta \lVert \theta \rVert_2^2$$
where $W_r$, $b_r$, $y \in \mathbb{R}^{z_r}$, $\hat{p} \in \mathbb{R}^{z_r}$, $\theta_r$ and $\beta$ are defined analogously to the entity type classification task. As shown in Figure 2, the two tasks share all layers except the attention and output layers. Assume that the set of all model parameters is $\theta$. Then $\theta$, $\theta_0$, $\theta_r$, $\theta_{head}$ and $\theta_{tail}$ are related as follows:
$$\theta = \theta_0 \cup \theta_{head} \cup \theta_{tail} \cup \theta_r, \qquad \theta_i = \{A^w_i, r^w_i, A^s_i, r^s_i, W_i, b_i\},\ i \in \{head, tail, r\}$$
where $A^w_i$, $r^w_i$, $A^s_i$, $r^s_i$, $W_i$ and $b_i$ are the parameters in the attention and output layers.

Optimize the Objective Function
At first, we minimize $J_t$ to obtain the best shared-parameter state $\hat{\theta}_0$ for entity type classification. Then we minimize $J_r$ for the best performance of relation extraction under the initialization $\theta_0 = \hat{\theta}_0$. The above process can be summarized as:
$$J(\theta) = \lambda J_t + (1 - \lambda) J_r$$
where $\lambda \in (0, 1)$ is the hyperparameter determining the importance of each task at different training steps. We use the Adam optimizer (Kingma and Ba, 2014) to minimize the objective $J(\theta)$.
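The two-stage parameter-transfer schedule can be sketched as follows; the dicts and parameter names are illustrative stand-ins for real parameter tensors.

```python
# Sketch of the transfer schedule: pre-train shared parameters theta_0 on
# entity typing, then reuse them to initialize the relation extractor while
# its task-specific parameters start fresh.
def pretrain_entity_typing():
    """Stand-in for minimizing J_t; returns the best parameter state."""
    return {"theta_0": {"bgru": [0.1, 0.2]},          # shared encoder weights
            "theta_head": {"W": [0.3]},
            "theta_tail": {"W": [0.4]}}

def init_relation_extractor(pretrained):
    """theta_0 is transferred; theta_r would be randomly initialized."""
    return {"theta_0": dict(pretrained["theta_0"]),   # copied from the source task
            "theta_r": {"W": [0.0]}}                  # fresh task-specific parameters

source = pretrain_entity_typing()
target = init_relation_extractor(source)
print(target["theta_0"] == source["theta_0"])
```

Only the shared encoder state crosses tasks; the head/tail classifier parameters are discarded, mirroring the layer sharing described above.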

Experiments
Our experiments are designed to demonstrate that our model alleviates the influence of word-level noise arising from low-quality sentences. In this section, we first introduce the dataset and evaluation metrics. Next, we describe the parameter settings. Then we evaluate the effects of the STP, entity-wise attention and the parameter-transfer initializer. Finally, we compare our model with state-of-the-art works by several evaluation metrics.

Dataset and Evaluation Metrics
To evaluate the performance of our model, we adopt the widely used NYT-10 dataset developed by Riedel et al. (2010). The NYT-10 dataset is constructed by aligning relational facts in Freebase (Bollacker et al., 2008) with the NYT corpus. There are 53 relations, including a special relation NA, which means that there is no relation between the entity pair in the instance. Meanwhile, all relations in Freebase are defined on head types and tail types, so we can construct datasets for the type prediction tasks from the same dataset. The dataset has 29 head types and 26 tail types. Like previous works, we evaluate our model with the held-out metrics, which compare the relations found by models with those in Freebase. The held-out evaluation provides a convenient way to assess models. We report both the PR curve and Precision at top N predictions (P@N) with various numbers of instances under each entity pair:
One: For each entity pair, we randomly select one instance to represent the relation.
Two: For each entity pair, we randomly select two instances to represent the relation.
All: For each entity pair, we select all instances to represent the relation.
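The P@N metric above can be sketched directly: sort predictions by confidence and measure precision among the top N. The scores and labels below are made up for illustration.

```python
def precision_at_n(scored_preds, n):
    """scored_preds: list of (confidence, is_correct) pairs."""
    top = sorted(scored_preds, key=lambda p: -p[0])[:n]
    return sum(1 for _, ok in top if ok) / n

preds = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(precision_at_n(preds, 3))  # 2 of the top 3 predictions are correct
```

Sweeping N (or a confidence threshold) over the full ranking is also how the PR curve reported here is traced out.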

Experimental Settings
In the experiments, we utilize word2vec to train word embeddings on the NYT corpus. We use cross-validation to tune our model and grid search to determine the model parameters. Grid search is used to select the optimal learning rate lr for the Adam optimizer among {0.1, 0.001, 0.0005, 0.0001}, the GRU size m ∈ {100, 160, 230, 400}, and the position embedding size l ∈ {5, 10, 15, 20}. Table 1 shows all parameters for our task. We follow experienced settings for the other parameters because they have little influence on model performance.
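The grid search over these candidate values can be sketched with itertools.product; the `evaluate` function is a placeholder for the actual cross-validation score.

```python
import itertools

grid = {"lr": [0.1, 0.001, 0.0005, 0.0001],
        "gru_size": [100, 160, 230, 400],
        "pos_emb": [5, 10, 15, 20]}

def evaluate(cfg):
    """Stand-in scoring function; real code would run cross-validation."""
    return -abs(cfg["lr"] - 0.001) - abs(cfg["gru_size"] - 230) / 1000

best = max((dict(zip(grid, vals)) for vals in itertools.product(*grid.values())),
           key=evaluate)
print(best["lr"], best["gru_size"])
```

This exhaustively scores all 64 combinations; in practice each evaluation is a full training run, which is why the candidate sets are kept small.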

Effect of Various Model Parts
In this section, we utilize the PR curve to evaluate the effects of three main parts in our model: the STP, entity-wise attention and the parametertransfer initializer.

Effect of the STP
To demonstrate the effect of the STP, we adopt BGRU with Word-Level Attention (WLA) proposed by Zhou et al. (2016) as our base model. We compare the performance of BGRU, BGRU+STP, and BGRU+SDP. From Figure 3, we observe that the model with the STP performs best, and the SDP model obtains an even worse result than the plain one. The PR curve areas of BGRU+SDP and BGRU are about 0.332 and 0.337 respectively, while BGRU+STP increases the area to 0.366. The results indicate: (1) Our STP can get rid of irrelevant words in each instance and obtain a more precise sentence representation for relation extraction, which proves that the STP module is effective. (2) The SDP method is not appropriate for handling low-quality sentences where key relational words are not in the SDP.

Effect of Entity-Wise Attention

To evaluate the effect of entity-wise attention combined with word-level attention, we utilize BGRU in three settings on our tree-parsed data and the original data. The first setting uses the WLA mechanism only (BGRU). The second replaces WLA with the Entity-Wise Attention (EWA) mechanism (BGRU-WLA+EWA). The third incorporates the two mechanisms (BGRU+EWA). From Table 2 and Figure 4, we observe: (1) Regardless of the dataset employed, BGRU-WLA(+STP)+EWA outperforms BGRU(+STP). More specifically, the PR curve area has a relative improvement of over 2.3%, which demonstrates that entity-wise hidden states in the BGRU present more precise relational features than other word states. (2) BGRU(+STP)+EWA achieves further improvements and outperforms the baseline by over 4.6%, because it considers more information than entity or relational words alone. This indicates that entity words are essential for relation extraction, but they cannot represent the features of the whole sentence without the other words.

Effect of Parameter-Transfer Initializer
To evaluate the effect of the parameter-transfer initializer in our model, we test BGRU under four settings. The first applies BGRU directly to the original dataset. The second tests BGRU combined with Transfer Learning (TL) on the original dataset. The third uses BGRU on our STP dataset. The fourth examines BGRU+TL on our STP dataset.
From Figure 5, we can conclude: (1) Regardless of the dataset used, models with TL achieve better performance, improving the PR curve area by over 4.7%. This demonstrates that transfer learning helps our model become more robust against noise. (2) BGRU+STP+TL achieves the best performance and increases the area to 0.383, while the areas of BGRU, BGRU+STP and BGRU+TL are 0.337, 0.366 and 0.372 respectively. This means that the TL method works well with the STP and can further resist noisy words.

Comparison with Baselines
To evaluate our approach, we select the following six methods as our baselines:
Mintz (Mintz et al., 2009) proposes the human-designed feature model.
BGRU (Zhou et al., 2016) proposes a BGRU with the word-level attention mechanism. As Figure 6 shows, we can observe: (1) BGRU+STP+EWA achieves the best PR curve among the baselines, improving the area to 0.38 compared with 0.33 for PCNN, 0.34 for BGRU and 0.35 for PCNN+ATT. At a recall rate of 0.25, our model still achieves a precision rate above 0.6. This demonstrates that BGRU+STP+EWA is effective, because the STP and entity-wise attention combined with word-level attention reduce inner-sentence noise at a fine-grained level.
(2) Integrated with transfer learning, BGRU+STP+EWA+TL performs much better and increases the PR curve area to 0.392. This means that pre-training provides a better parameter initialization, so the TL model becomes more robust against noisy words. Parameter transfer learning can also be applied to better feature extractors for further improvement.
Following previous works, we adopt P@N as a quantitative indicator to compare our model with the baselines with various numbers of instances under each relational tuple. In Table 3, we report P@100, P@200, P@300 and their mean for each model in the held-out evaluation. We can find: (1) Compared with the baselines, BGRU+STP+EWA+TL achieves the best performance in all test settings, outperforming PCNN+ATT in the three settings by 6.3%, 7.6%, and 7.7% respectively, which demonstrates that the integrated model is the most effective; (2) Our STP and entity-wise attention combined with word-level attention reduce inner-sentence noise effectively, outperforming the baselines by over 5%; (3) Our neural extractor initialized with a priori knowledge learned from entity type classification is more robust against word-level noise, with BGRU+STP+EWA+TL achieving an improvement of 2% over BGRU+STP+EWA.

Conclusion
In this paper, we propose a novel word-level approach for distant supervised relation extraction. It aims at tackling the low-quality corpus by reducing inner-sentence noise and improving the robustness against noisy words. To alleviate the influence of word-level noise, we propose the STP. Meanwhile, entity-wise attention combined with word-level attention helps the model focus more on relational words. Furthermore, parameter transfer learning makes our model more robust against noise by reasonable initialization of parameters. The experimental results show that our model significantly and consistently outperforms the state-of-the-art method.
In the future, we will incorporate the SDP and STP to obtain more precise shortened sentences. Furthermore, we will conduct research on how to utilize entity information to assign more appropriate initial parameters to the relation extractor.