Deep Residual Learning for Weakly-Supervised Relation Extraction

Deep residual learning (ResNet) is a new method for training very deep neural networks that uses identity mappings as shortcut connections. ResNet won the ImageNet ILSVRC 2015 classification task and has achieved state-of-the-art performance in many computer vision tasks. However, the effect of residual learning on noisy natural language processing tasks is still not well understood. In this paper, we design a novel convolutional neural network (CNN) with residual learning and investigate its impact on the task of distantly supervised, noisy relation extraction. Contrary to the popular belief that ResNet only works well for very deep networks, we find that even with 9 layers of CNNs, using identity mapping can significantly improve the performance of distantly-supervised relation extraction.


Introduction
Relation extraction is the task of predicting attributes and relations for entities in a sentence (Zelenko et al., 2003; Bunescu and Mooney, 2005; GuoDong et al., 2005). For example, given the sentence "Barack Obama was born in Honolulu, Hawaii.", a relation classifier aims to predict the relation "bornInCity". Relation extraction is a key component for building relation knowledge graphs, and it is of crucial significance to natural language processing applications such as structured search, sentiment analysis, question answering, and summarization.
A major issue for relation extraction is the lack of labeled training data. In recent years, distant supervision (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012) has emerged as the most popular method for relation extraction: it uses knowledge base facts to select a set of noisy instances from unlabeled data. Among the machine learning approaches for distant supervision, the recently proposed convolutional neural network (CNN) model (Zeng et al., 2014) achieved state-of-the-art performance. Following this success, Zeng et al. (2015) proposed a piecewise max-pooling strategy to improve CNNs. Various attention strategies (Lin et al., 2016; Shen and Huang, 2016) for CNNs have also been proposed, obtaining impressive results. However, most of these neural relation extraction models are relatively shallow CNNs, typically with only one convolutional layer and one fully connected layer, and it has not been clear whether deeper models help distill signals from noisy inputs in this task.
In this paper, we investigate the effects of training deeper CNNs for distantly-supervised relation extraction. More specifically, we design a convolutional neural network based on residual learning (He et al., 2016): we show how one can incorporate word embeddings and position embeddings into a deep residual network, while feeding identity feedback to convolutional layers for this noisy relation prediction task. Empirically, we evaluate on the NYT-Freebase dataset (Riedel et al., 2010) and demonstrate state-of-the-art performance using deep CNNs with identity mapping and shortcuts. In contrast to the popular belief in vision that deep residual networks only work for very deep CNNs, we show that even with moderately deep CNNs, there are substantial improvements over vanilla CNNs for relation extraction. Our contributions are three-fold:
• We are the first to consider deeper convolutional neural networks for weakly-supervised relation extraction using residual learning;
• We show empirically that our deep residual network model outperforms CNNs by a large margin, obtaining state-of-the-art performance;
• Our identity mapping with shortcut feedback approach can be easily applied to any variant of CNNs for relation extraction.

Deep Residual Networks for Relation Extraction
In this section, we describe a novel deep residual learning architecture for distantly supervised relation extraction. Figure 1 describes the architecture of our model.

Vector Representation
Let $x_i$ be the $i$-th word in the sentence and $e_1$, $e_2$ be the two corresponding entities. Each word accesses two embedding look-up tables to obtain its word embedding $WF_i$ and its position embedding $PF_i$. We then concatenate the two embeddings and represent each word as a vector $v_i = [WF_i, PF_i]$.

Word Embeddings
Each representation $v_i$ corresponding to $x_i$ is a real-valued vector. All of the word vectors are encoded in an embedding matrix $V_w \in \mathbb{R}^{d_w \times |V|}$, where $V$ is a fixed-size vocabulary.

Position Embeddings
In relation classification, we focus on finding a relation for entity pairs. Following Zeng et al. (2014), a PF is the combination of the relative distances of the current word to the first entity $e_1$ and the second entity $e_2$. For instance, in the sentence "Steve Jobs is the founder of Apple.", the relative distances from "founder" to $e_1$ (Steve Jobs) and $e_2$ (Apple) are 3 and -2, respectively. We then transform the relative distances into real-valued vectors by looking up a randomly initialized position embedding matrix $V_p \in \mathbb{R}^{d_p \times |P|}$, where $P$ is a fixed-size set of distances. Note that if a word is too far from the entities, it may not be related to the relation. Therefore, we choose a maximum value $e_{max}$ and a minimum value $e_{min}$ for the relative distance.
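The relative-distance computation can be illustrated with a minimal Python sketch. It assumes that an entity's position is the index of its last token and uses hypothetical clipping bounds of -30 and 30 for the minimum and maximum distance; the actual bounds are a model hyperparameter not stated here.

```python
def relative_positions(tokens, e1_index, e2_index, e_min=-30, e_max=30):
    """Return a clipped (distance to e1, distance to e2) pair for every token.

    e1_index / e2_index are the token indices of the two entities.
    e_min / e_max are hypothetical clipping bounds, not values from the paper.
    """
    def clip(distance):
        return max(e_min, min(e_max, distance))

    return [(clip(i - e1_index), clip(i - e2_index)) for i in range(len(tokens))]


tokens = ["Steve", "Jobs", "is", "the", "founder", "of", "Apple", "."]
# Assumption: e1 is anchored at "Jobs" (index 1) and e2 at "Apple" (index 6).
positions = relative_positions(tokens, e1_index=1, e2_index=6)
founder = positions[tokens.index("founder")]
print(founder)  # (3, -2)
```

Clipping keeps the distance set $P$ finite, so each clipped distance can be mapped to one row of the position embedding table.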
In the example shown in Figure 1, it is assumed that $d_w$ is 4 and $d_p$ is 1. There are two position embeddings: one for $e_1$, the other for $e_2$. Finally, we concatenate the word embeddings and position embeddings of all words and denote a sentence of length $n$ (padded where necessary) as a vector

$$v = [v_1, v_2, \ldots, v_n]$$

where each $v_i \in \mathbb{R}^{d}$ with $d = d_w + 2 d_p$.
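As a rough illustration of the look-up and concatenation step, the NumPy sketch below builds the per-word vectors for a toy sentence. The table sizes, the random initialization, and the row-wise storage of embeddings (one row per word or distance, the transpose of the matrices defined above) are assumptions made for the example; only $d_w = 4$ and $d_p = 1$ come from the Figure 1 example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_p = 4, 1                      # sizes used in the Figure 1 example
vocab_size, num_positions = 10, 61   # hypothetical table sizes

word_table = rng.normal(size=(vocab_size, d_w))    # one row per vocabulary word
pos_table = rng.normal(size=(num_positions, d_p))  # one row per clipped distance


def embed_sentence(word_ids, pos1_ids, pos2_ids):
    """Concatenate word and position embeddings into an (n, d_w + 2*d_p) matrix."""
    wf = word_table[word_ids]                          # (n, d_w)
    pf = np.concatenate([pos_table[pos1_ids],          # distance to e1
                         pos_table[pos2_ids]], axis=1) # distance to e2 -> (n, 2*d_p)
    return np.concatenate([wf, pf], axis=1)


# Toy sentence of three words; position ids are distances shifted to be non-negative.
sentence = embed_sentence(np.array([1, 2, 3]),
                          np.array([30, 31, 32]),
                          np.array([28, 29, 30]))
print(sentence.shape)  # (3, 6)
```

Each row is one word vector $v_i$; with two position embeddings per word, its length is $d_w + 2 d_p = 6$ in this example.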

Convolution
Let $v_{i:i+j}$ refer to the concatenation of words $v_i, v_{i+1}, \ldots, v_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hd}$, which is applied to a window of $h$ words to produce a new feature. A feature $c_i$ is generated from a window of words $v_{i:i+h-1}$ by

$$c_i = f(w \cdot v_{i:i+h-1} + b)$$

Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function. This filter is applied to each possible window of words from $v_1$ to $v_n$ to produce the feature $c = [c_1, c_2, \ldots, c_{n-h+1}]$ with $c \in \mathbb{R}^{s}$ ($s = n - h + 1$).
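The sliding-window filter can be sketched in NumPy as follows. This is an illustrative version under two assumptions: the filter is stored as an $h \times d$ array (equivalent to a flat vector of length $hd$), and ReLU is used as the non-linear function $f$, which the text leaves unspecified at this point.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv_feature(v, w, b, f=relu):
    """Apply one filter to every window of h consecutive word vectors.

    v: (n, d) sentence matrix, w: (h, d) filter, b: scalar bias.
    Returns a feature vector with n - h + 1 entries.
    """
    n, h = v.shape[0], w.shape[0]
    # np.sum(w * window) is the dot product of the flattened filter and window.
    return np.array([f(np.sum(w * v[i:i + h]) + b) for i in range(n - h + 1)])


v = np.ones((5, 6))   # toy sentence: n = 5 words, d = 6
w = np.ones((3, 6))   # window size h = 3
c = conv_feature(v, w, b=0.0)
print(c.shape)  # (3,)
```

With $n = 5$ and $h = 3$ the filter fits in $n - h + 1 = 3$ positions, which matches the length of the feature vector stated above.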

Residual Convolution Block
Residual learning connects low-level to high-level representations directly, and tackles the vanishing gradient problem in deep networks. In our model, we design the residual convolution block by applying shortcut connections. Each residual convolutional block is a sequence of two convolutional layers, each followed by a ReLU activation. The kernel size of all convolutions is $h$, with padding such that the new feature has the same size as the original one. There are two convolutional filters $w_1, w_2 \in \mathbb{R}^{h \times 1}$. For the first convolutional layer:

$$\tilde{c}_i = f(w_1 \cdot c_{i:i+h-1} + b_1)$$

For the second convolutional layer:

$$\acute{c}_i = f(w_2 \cdot \tilde{c}_{i:i+h-1} + b_2)$$

Here $b_1$, $b_2$ are bias terms. For the residual learning operation:

$$c = c + \acute{c}$$

For convenience, $c$ on the left-hand side is redefined to denote the output vector of the block. This operation is performed by a shortcut connection and element-wise addition. This block is stacked multiple times in our architecture.
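A minimal NumPy sketch of one residual convolution block is given below. It assumes zero padding to keep the feature length unchanged and operates on a single feature vector with one filter per layer; the function names are illustrative and this is not the TensorFlow implementation used in the experiments.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv_same(c, w, b):
    """1-D convolution of feature vector c with a length-h filter w.

    Zero padding (an assumption) keeps the output the same length as c.
    """
    h = len(w)
    left = (h - 1) // 2
    right = h - 1 - left
    padded = np.pad(c, (left, right))  # pads with zeros by default
    return np.array([np.dot(w, padded[i:i + h]) for i in range(len(c))]) + b


def residual_block(c, w1, b1, w2, b2):
    """Two convolutional layers with ReLU, plus an identity shortcut."""
    c_tilde = relu(conv_same(c, w1, b1))        # first convolutional layer
    c_acute = relu(conv_same(c_tilde, w2, b2))  # second convolutional layer
    return c + c_acute                          # element-wise shortcut addition


c = np.array([1.0, 2.0, 3.0, 4.0])
zero_filter = np.zeros(3)
out = residual_block(c, zero_filter, 0.0, zero_filter, 0.0)
print(out)  # [1. 2. 3. 4.]
```

With all filter weights at zero the block reduces to the identity, which illustrates why the shortcut makes it easy for a stacked block to pass its input through unchanged.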

Max Pooling, Softmax Output
We then apply a max-pooling operation over the feature and take the maximum value $\hat{c} = \max\{c\}$. We have described the process by which one feature is extracted from one filter. We collect all features into one high-level extracted feature $z = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_m]$ (note that here we have $m$ filters). These features are then passed to a fully connected softmax layer whose output is the probability distribution over relations. For regularization, we apply dropout at this layer: instead of using $y = w \cdot z + b$ for output unit $y$ in forward propagation, dropout uses $y = w \cdot (z \circ r) + b$, where $\circ$ is the element-wise multiplication operation and $r \in \mathbb{R}^{m}$ is a "masking" vector of Bernoulli random variables with probability $p$ of being 1. At test time, the learned weight vectors are scaled by $p$ such that $\hat{w} = p w$ and used (without dropout) to score unseen instances.
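The pooling and output steps can be sketched in NumPy as follows. The weight matrix shape (one row per relation), the toy values, and the function names are assumptions for illustration only.

```python
import numpy as np


def softmax(y):
    e = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return e / e.sum()


def max_pool(feature_maps):
    """Take the maximum of each filter's feature vector, giving z with m entries."""
    return np.array([np.max(c) for c in feature_maps])


def output_train(z, W, b, p, rng):
    """Training-time output: mask z with Bernoulli(p) variables before the linear layer."""
    r = rng.binomial(1, p, size=z.shape)
    return softmax(W @ (z * r) + b)


def output_test(z, W, b, p):
    """Test-time output: no masking, weights scaled by p instead."""
    return softmax((p * W) @ z + b)


z = max_pool([np.array([1.0, 5.0, 2.0]), np.array([-1.0, -3.0])])  # m = 2 filters
W = np.array([[0.5, -1.0], [1.0, 2.0], [0.0, 0.3]])                # 3 relations (toy)
b = np.zeros(3)
probs = output_test(z, W, b, p=0.5)
print(z)  # [ 5. -1.]
```

Scaling the weights by $p$ at test time makes the pre-softmax scores equal to their expected value under the random training-time mask.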

Experimental Settings
In this paper, we use the word embeddings released by Lin et al. (2016), which are trained on the NYT-Freebase corpus (Riedel et al., 2010). We tune our model using validation on the training data. The word embeddings are of size 50. The input text is padded to a fixed length of 100. Training is performed with the TensorFlow Adam optimizer, using mini-batches of size 64 and an initial learning rate of 0.001. We initialize our convolutional layers following Glorot and Bengio (2010). The implementation uses TensorFlow 0.11. All experiments are performed on a single NVIDIA Titan X (Pascal) GPU. Table 1 shows all parameters used in the experiments. We experiment with several state-of-the-art baselines and variants of our model.
• CNN-B: Our implementation of the CNN baseline (Zeng et al., 2014), which contains one convolutional layer and one fully connected layer.
• CNN+ATT: CNN-B with attention over instance learning (Lin et al., 2016).
• CNN: Our CNN model which includes one convolutional layer and three fully connected layers.
• CNN-x: A deeper CNN model with x convolutional layers. For example, CNN-9 is a model constructed with 9 convolutional layers (one initial layer plus 4 residual CNN blocks of two layers each, without identity shortcuts) and three fully connected layers.
• ResCNN-x: Our proposed CNN-x model with residual identity shortcuts.
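The naming scheme for CNN-x and ResCNN-x can be made concrete with a small Python helper. It assumes that every model has one initial convolutional layer followed by blocks of two convolutional layers each, which is extrapolated from the CNN-9 example; the function name is hypothetical.

```python
def residual_blocks_for_depth(x, layers_per_block=2):
    """Number of two-layer blocks in a CNN-x / ResCNN-x model.

    Assumes x = 1 + layers_per_block * blocks, as in the CNN-9 example.
    """
    if x < 1 or (x - 1) % layers_per_block != 0:
        raise ValueError("depth must be 1 plus a multiple of layers_per_block")
    return (x - 1) // layers_per_block


for depth in (5, 9, 13):
    print(depth, residual_blocks_for_depth(depth))
```

Under this assumption, the depths 5, 9, and 13 correspond to 2, 4, and 6 blocks.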
We evaluate our models on the widely used, larger NYT-Freebase dataset (Riedel et al., 2010). Note that the ImageNet dataset used by the original ResNet paper (He et al., 2016) has 1.28 million training instances. The NYT-Freebase dataset includes 522K training sentences, which makes it the largest dataset in relation extraction and the only one suitable for training deeper CNNs.

NYT-Freebase Dataset Performance
The advantage of this dataset is its size: there are 522,611 sentences in the training data and 172,448 sentences in the test data, which is enough to support training a deep network. Similar to previous work (Zeng et al., 2015; Lin et al., 2016), we evaluate our model using the held-out evaluation. We report both the aggregate precision/recall curves and Precision@N (P@N).
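For reference, the two reported metrics can be computed from a ranked list of predictions as in the Python sketch below. The input format (a confidence score paired with a flag saying whether the predicted fact is correct) and the toy scores are assumptions for illustration and are not results from our experiments.

```python
def precision_at_n(scored, n):
    """P@N: fraction of correct predictions among the n most confident ones.

    scored: list of (confidence, is_correct) pairs.
    """
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:n]
    return sum(1 for _, ok in top if ok) / len(top)


def precision_recall_curve(scored, total_correct):
    """Return (recall, precision) points, one per cut-off in the ranked list."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    points, hits = [], 0
    for k, (_, ok) in enumerate(ranked, start=1):
        hits += int(ok)
        points.append((hits / total_correct, hits / k))
    return points


# Made-up predictions, used only to exercise the functions.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
print(precision_at_n(toy, 2))                        # 0.5
print(precision_recall_curve(toy, total_correct=2))
```

Because predictions are ranked by confidence, P@N for small N reflects the quality of the most confident predictions, while the precision/recall curve summarizes the whole ranking.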
In Figure 2, we compare the proposed ResCNN model with various CNNs. First, CNNs with multiple fully connected layers obtain very good results, which is a novel finding. Second, the results suggest that deeper CNNs with residual learning help extract signals from noisy distant supervision data. We observe that overfitting occurs when we add more layers: the performance of CNN-9 is much worse than that of CNN. We find that ResNet can solve this problem: ResCNN-9 performs better than CNN-B and CNN and dominates the precision/recall curve overall.
We show the effect of depth in residual networks in Figure 3. We observe that ResCNN-5 is worse than CNN-5 because ResNet does not work well for shallow CNNs, which is consistent with the original ResNet paper. As we increase the network depth, we see that CNN-9 overfits the training data. With residual learning, both ResCNN-9 and ResCNN-13 provide significant improvements. Contrary to the popular belief that ResNet only works well for very deep networks, we find that even with 9 layers of CNNs, using identity mapping can significantly improve performance when learning in a noisy input setting. Intuitively, ResNet helps this task in two respects. First, if the lower, middle, and higher levels learn hidden lexical, syntactic, and semantic representations respectively, it sometimes helps to bypass the syntax and connect the lexical and semantic spaces directly. Second, ResNet tackles the vanishing gradient problem, which decreases the effect of noise in distant supervision data.
In Table 2, we compare the performance of our models to state-of-the-art baselines. We show that our ResCNN-9 outperforms all models that do not select training instances. Even without piecewise max-pooling and instance-based attention, our model is on par with the PCNN+ATT model.
For a more practical evaluation, we compare the results for Precision@N where N is small (1, 5, 10, 20, 50) in Table 3. We observe that our ResCNN-9 model dominates the other models on the predictions made with the highest probability. ResNet helps CNNs focus on the most probable candidates and mitigates the noise effect of distant supervision. We believe that residual connections can be seen as a form of renormalizing the gradients, which prevents the model from overfitting to the noisy distant supervision data.
In our distantly-supervised relation extraction experiments, we have two important observations: (1) We get significant improvements by adding multiple fully connected layers to CNNs.

Conclusion
In this paper, we introduce a deep residual learning method for distantly-supervised relation extraction. We show that deeper convolutional models help distill signals from noisy inputs. With shortcut connections and identity mapping, performance is significantly improved. These results align with a recent study (Conneau et al., 2017), suggesting that deeper CNNs do have positive effects on noisy NLP problems.