Genre Separation Network with Adversarial Training for Cross-genre Relation Extraction

Relation extraction suffers from a dramatic performance decrease when a model is trained on one genre and directly applied to a new genre, due to the distinct feature distributions. Previous studies address this problem by discovering a shared space across genres using manually crafted features, which requires great human effort. To effectively automate this process, we design a genre-separation network, which applies two encoders, one genre-specific and one genre-shared, to explicitly extract genre-specific and genre-agnostic features. We then train a relation classifier using the genre-agnostic features on the source genre and directly apply it to the target genre. Experimental results on three distinct genres of the ACE dataset show that our approach achieves up to a 6.1% absolute F1-score gain over previous methods. By incorporating a set of external linguistic features, our approach outperforms the state-of-the-art by 1.7% absolute F1. We make all programs of our model publicly available for research purposes.


Introduction
Relation extraction aims to identify and categorize the semantic relation between two entity mentions based on their contexts within the sentence. Supervised learning approaches have been shown to be effective on this task. However, as relation extraction highly depends on information about entities and their contexts, a supervised model trained on one genre suffers a dramatic performance decrease when applied to a new genre, due to the distinct contexts of different genres.
Previous studies (Plank and Moschitti, 2013; Nguyen and Grishman, 2014, 2015; Yu et al., 2015) tackle this problem by manually crafting genre-agnostic features, such as word clusters and word embeddings, to train a genre-shared relation extractor. These methods suffer from information loss, since limited human knowledge cannot capture all genre-agnostic features. As depicted in Figure 1, where red rectangles are features shared by the two genres, and blue and green triangles are source and target genre features respectively, Feature Engineering captures only a portion of the genre-agnostic features. Fu et al. (2017), depicted as Feature Projection, apply a domain adversarial neural network to automatically project the source and target genre features into one unified feature space. However, this unnecessarily introduces genre-specific features, which undermines overall performance. (We make all cleaned code and resources publicly available at https://github.com/Garym713/Genre-Separation-Network-for-Relation-Extraction.)
To address these problems, we propose a genre-separation network, which consists of two separate Convolutional Neural Networks (CNNs) that automatically separate genre-specific and genre-agnostic features for each genre, depicted as Genre Separation Network in Figure 1. To avoid information loss during feature encoding, we reconstruct the original input from the two separate feature spaces via a novel reconstruction loss. Then we use an adversarial similarity loss to constrain the genre-agnostic features of both genres into one shared feature space.

Genre Separation Network (GSN)
As shown in Figure 1, our goal is to distinguish the genre-agnostic features (red rectangles) from the genre-specific features (blue triangles and green crosses). Taking the source genre as an example, we apply a source private CNN encoder to the source sentence to generate the source-specific feature representation f^p_s, and a shared CNN encoder to generate the genre-agnostic feature f^c_s. Similarly, we obtain f^p_t and f^c_t from the target private CNN encoder and the shared CNN encoder respectively. To separate f^p_s from f^c_s and f^p_t from f^c_t, we introduce a difference loss following previous studies (Bousmalis et al., 2016; Liu et al., 2017). More details are elaborated below.
Formally, given a source sentence (s, e_1, e_2, r) where s = [w_1, ..., w_n], for each word w_i we generate a multi-type embedding by concatenation:

ṽ_i = [v_i ; p_i ; p̄_i ; t_i ; t̄_i ; c_i ; η_i]

where v_i denotes a pretrained word embedding; p_i and p̄_i are position embeddings (Al-Badrashiny et al., 2017) indicating the distance from w_i to e_1 and e_2 respectively; t_i and t̄_i are entity type embeddings (Ren et al., 2016) of e_1 and e_2; c_i is the chunking embedding; and η_i is a binary digit indicating whether the word lies on the shortest dependency path between e_1 and e_2 (Bunescu and Mooney, 2005; Liu et al., 2015). All these embeddings except the pretrained word embedding are randomly initialized and optimized during training. The input layer is thus a sequence of word representations V = {ṽ_1, ṽ_2, ..., ṽ_n}. We then apply convolution weights W with a bias vector b to each sliding n-gram phrase, i.e., g_j = tanh(W · V_{j:j+n-1} + b). All n-gram representations g_j are further combined into an overall vector representation f by max-pooling.
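A minimal numpy sketch of this input layer and CNN encoder, with toy dimensions and random arrays standing in for the pretrained and learned embedding tables (all names and sizes here are illustrative, not the paper's actual hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 3                           # sentence length, n-gram width

# Toy stand-ins for the embedding tables described above.
word   = rng.standard_normal((m, 50))  # pretrained word embeddings v_i
pos1   = rng.standard_normal((m, 5))   # distance to e1 (p_i)
pos2   = rng.standard_normal((m, 5))   # distance to e2 (p̄_i)
type1  = rng.standard_normal((m, 5))   # entity type of e1 (t_i)
type2  = rng.standard_normal((m, 5))   # entity type of e2 (t̄_i)
chunk  = rng.standard_normal((m, 5))   # chunking embedding c_i
on_sdp = rng.integers(0, 2, (m, 1))    # eta_i: on shortest dependency path?

# Multi-type embedding: concatenate all channels per word.
V = np.concatenate([word, pos1, pos2, type1, type2, chunk, on_sdp], axis=1)
d = V.shape[1]                         # dimensionality of each ṽ_i

# Convolution over sliding n-grams, then max-pooling.
W = rng.standard_normal((n * d, 100)) * 0.1
b = np.zeros(100)
grams = np.stack([V[j:j + n].ravel() for j in range(m - n + 1)])
G = np.tanh(grams @ W + b)             # g_j = tanh(W · V_{j:j+n-1} + b)
f = G.max(axis=0)                      # fixed-size sentence feature f
```

Each of the four encoders in Figure 1 (two private, one shared, applied to both genres) would follow this pattern with its own W and b.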
Once we obtain f^p_s, f^c_s, f^p_t and f^c_t, we compute the difference loss:

L_diff = ||F_s^{c⊤} F_s^p||_F^2 + ||F_t^{c⊤} F_t^p||_F^2

where F^c and F^p are matrices whose rows are the shared and private feature vectors of a batch, and ||·||_F^2 denotes the squared Frobenius norm.
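In numpy, the difference loss can be sketched as follows (batch size and feature dimension are illustrative; the loss is zero exactly when the shared and private feature columns are orthogonal, which is what pushes the two subspaces apart):

```python
import numpy as np

def diff_loss(Fc, Fp):
    # Squared Frobenius norm of Fc^T Fp: penalizes overlap between
    # the shared (Fc) and private (Fp) feature subspaces.
    return np.sum((Fc.T @ Fp) ** 2)

rng = np.random.default_rng(0)
# Rows are per-sentence feature vectors for one batch of each genre.
Fc_s, Fp_s = rng.standard_normal((32, 100)), rng.standard_normal((32, 100))
Fc_t, Fp_t = rng.standard_normal((32, 100)), rng.standard_normal((32, 100))

L_diff = diff_loss(Fc_s, Fp_s) + diff_loss(Fc_t, Fp_t)

# Orthogonal shared/private features incur zero loss.
orthogonal = diff_loss(np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]]))
```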
To limit the genre-agnostic features from the various genres to a shared feature space, we further design a genre adversarial training component. We take the genre-agnostic features from both the source and the target genre as input to a Gradient Reversal Layer (GRL) (Ganin et al., 2016), which acts as an ordinary hidden layer in the forward pass and reverses the gradient in the backward pass to confuse the genre classifier, so that it cannot distinguish whether the input features come from the source genre or the target genre:

L_adv = − (1 / (N_s + N_t)) Σ_{i=1}^{N_s+N_t} [ d_i log d̂_i + (1 − d_i) log(1 − d̂_i) ]

where d_i ∈ {0, 1} indicates whether sample i comes from the source genre or the target genre, and N_s and N_t refer to the number of examples in the source genre and the target genre respectively. The term d̂_i represents the probability that the sample comes from the source genre, which is acquired by a linear function of the genre classifier.
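The adversarial objective itself is a standard binary cross-entropy over the genre classifier's predictions; a sketch is below (the GRL only matters at backpropagation time, where it negates this loss's gradient with respect to the shared encoder; the labels and probabilities here are toy values):

```python
import numpy as np

def genre_adv_loss(d_hat, d):
    # Binary cross-entropy between predicted source-genre probability
    # d_hat and true genre label d (1 = source genre, 0 = target genre).
    eps = 1e-12                        # numerical safety for log(0)
    return -np.mean(d * np.log(d_hat + eps)
                    + (1 - d) * np.log(1 - d_hat + eps))

d = np.array([1, 1, 1, 0, 0])          # N_s = 3 source, N_t = 2 target

perfect  = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # classifier separates genres
confused = np.full(5, 0.5)                       # classifier cannot tell

loss_perfect  = genre_adv_loss(perfect, d)       # ~0: bad for us
loss_confused = genre_adv_loss(confused, d)      # log 2: genres indistinguishable
```

Training drives the shared encoder toward the "confused" regime, i.e., genre-agnostic features.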

Genre Reconstruction
So far, we can separate the features of each genre into two separate feature spaces by optimizing L_diff and L_adv. However, there is no guarantee that the separated feature spaces are actually meaningful. From the equation for L_diff, we can see that f^p_s and f^p_t could easily be optimized to zero if we placed no constraint on them, in which case the model would fail to train. Therefore, we further reconstruct the input sentence from both the genre-specific and the genre-agnostic features.
For each genre, e.g., the source genre, we first sum the genre-specific feature vector f^p_s and the genre-agnostic feature vector f^c_s, i.e., f_s = f^p_s + f^c_s. We take f_s as input to an unpooling layer (Zeiler and Fergus, 2014) followed by a deconvolutional neural network (DcNN) (Xu et al., 2014). The output of the DcNN contains the same number of decoded vectors V* = {ṽ*_1, ṽ*_2, ..., ṽ*_n} as the input V = {ṽ_1, ṽ_2, ..., ṽ_n}. We optimize the DcNN with the following reconstruction loss:

L_recon = (1/n) Σ_{i=1}^{n} ||ṽ_i − ṽ*_i||^2

where n indicates the total number of words in the input sentence, ṽ_i represents the input word representation described in Section 2.2, and ṽ*_i is the corresponding reconstructed vector from the DcNN.
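The reconstruction objective reduces to a mean squared error between the input word representations and the decoder's outputs. A sketch, with the unpooling/DcNN decoder itself elided and random arrays standing in for its output:

```python
import numpy as np

def recon_loss(V, V_star):
    # Mean squared error between input word vectors ṽ_i and the
    # reconstructed vectors ṽ*_i produced by the deconvolutional decoder.
    n = V.shape[0]                     # number of words in the sentence
    return np.sum((V - V_star) ** 2) / n

rng = np.random.default_rng(0)
V = rng.standard_normal((10, 76))      # input word representations
V_star = V + 0.01 * rng.standard_normal(V.shape)  # imperfect reconstruction

loss_exact = recon_loss(V, V)          # perfect reconstruction -> 0
loss_noisy = recon_loss(V, V_star)     # small positive loss
```

Because both f^p and f^c feed the decoder, driving f^p to zero would make reconstruction impossible, which is exactly the degenerate solution this loss rules out.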

Cross Genre Relation Extraction
We next utilize the genre-agnostic features f^c_s from the source genre to train a relation classifier. We first feed f^c_s into a fully connected layer to obtain a dense vector h, then use a linear projection with a softmax as the relation classifier to determine the relation type:

x = softmax(W_r h + b_r), x ∈ R^K

where K is the total number of relation types and x_k represents the probability of the entity pair being classified into category k.
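The classifier head can be sketched as a fully connected layer followed by a softmax over the K relation types (dimensions and weights are illustrative; K = 11 matches the ACE relation inventory used in the experiments):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
K, h_dim = 11, 100                     # relation types, hidden size

f_c_s = rng.standard_normal(100)       # genre-agnostic source features
W_h, b_h = rng.standard_normal((100, h_dim)) * 0.1, np.zeros(h_dim)
W_r, b_r = rng.standard_normal((h_dim, K)) * 0.1, np.zeros(K)

h = np.tanh(f_c_s @ W_h + b_h)         # fully connected layer
x = softmax(h @ W_r + b_r)             # x_k: probability of relation type k
pred = int(np.argmax(x))               # predicted relation type
```

At test time the same shared encoder and classifier are applied unchanged to target-genre sentences.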
We finally linearly combine all the loss functions and jointly optimize the model using SGD (Bottou, 2010).
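Written out, the joint objective is a weighted sum of the four losses above (the relation-classification loss name L_rel and the weights α, β, γ are our illustrative notation; the paper states only that the losses are linearly combined):

```latex
\mathcal{L} \;=\; \mathcal{L}_{rel}
  \;+\; \alpha \,\mathcal{L}_{diff}
  \;+\; \beta  \,\mathcal{L}_{adv}
  \;+\; \gamma \,\mathcal{L}_{recon}
```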

Data and Parameters
We evaluate our approach on the English portion of the ACE2005 dataset (Walker et al., 2006; Ji et al., 2010; Hong et al., 2015; Yu et al., 2016). It covers 6 genres: Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), and Weblogs (wl), and 11 relation types. Following previous work (Nguyen and Grishman, 2015), we use newswire and broadcast news (nw&bn) as training data, half of bc as the development set, and test the model on the remaining half of bc, as well as cts and wl. We conduct the same preprocessing steps as previous work, yielding 43,497 entity pairs in total for training.

Baseline Models
We compare our approach with the following methods: FCM (Gormley et al., 2015) is a feature combination model which composes word embeddings with traditional linguistic features.
Hybrid FCM (Gormley et al., 2015) incorporates many more selected linguistic features compared to FCM.
LRFCM is a feature-compositional model that scales to more features and more labels.
Log-linear & DNN (Nguyen and Grishman, 2015) explores CNN, Bi-GRU, Forward GRU, Backward GRU, and log-linear models for relation extraction. We compare against the performance of the individual models rather than the ensembled models.
CNN+DANN (Fu et al., 2017) utilizes domain adversarial training to automatically extract genre-agnostic features for the source and target genre within one feature space.

Table 2 shows the cross-genre relation extraction performance of the various methods. Our approach significantly outperforms all previous baselines by 1.2%-1.7% (F1). Table 3 presents the results without extra linguistic features (embedding-based features only), where our approach achieves a 2.9%-6.1% absolute gain over the baselines. An ablation test removing one component at a time justifies the contribution of each. The difference and reconstruction components ensure that the features are separated into shared and private spaces, and they can remove redundant genre-specific features to some extent; this explains the significant F-score improvement when only these two components are used. The adversarial training component further encourages the features of each genre from the shared encoder to be close to each other, improving performance further. We also conduct ablation experiments on each feature component. Among the linguistic features used, the entity type and position features contribute the most to performance. For example, relation extraction performance decreases by about 8% when the entity type feature is removed. We find that the entity type feature is vital to keeping the types of the two entity mentions consistent with the hard entity-type constraint that the ACE schema defines for each relation type.

Comparison and Analysis
Among the remaining errors, we notice that our model often fails to predict relations between nested entity mentions. For example, in "Our president has put homeland security in the hands of failed Republican hacks.", our model mistakenly predicts the relation between Republican and failed Republican hacks as None instead of organization-affiliation, due to the lack of context information. We also observe failure cases where the two entities are separated by an extremely wide context, which suggests incorporating dependency-path-based deep neural networks into the framework.

Related Work
Previous studies on cross-genre relation extraction either manually or automatically extract genre-agnostic features (Plank and Moschitti, 2013; Nguyen and Grishman, 2014; Yu et al., 2015; Nguyen and Grishman, 2015), requiring human labor and achieving limited coverage of effective features, or automatically project the source and target genres into one unified feature space and learn genre-shared features (Fu et al., 2017), which inevitably introduces noise from genre-specific features. Compared with these methods, our approach first separates genre-specific features from genre-agnostic features, then automatically extracts meaningful features for cross-genre relation extraction.
Our work is also related to studies on domain separation networks (Bousmalis et al., 2016; Liu et al., 2017; Chen et al., 2017), which explicitly extract features from two separate subspaces: domain-specific and domain-agnostic. We adopt a similar framework for cross-genre relation extraction and introduce a novel reconstruction component which proves well suited to relation extraction.

Conclusions
We propose a genre separation framework for cross-genre relation extraction. Without requiring human-crafted features, this framework can effectively separate genre-specific features from genre-agnostic ones, and automatically extract meaningful features for the task. To ensure the separation of features within each genre and to force the genre-agnostic features from the source and target genres into the same feature space, we design a difference loss and an adversarial training component. Experiments on various genres demonstrate the effectiveness of our framework. In the future, we will extend our framework to cross-lingual and cross-domain information extraction tasks.

Acknowledgements
We thank the anonymous reviewers for their valuable comments and suggestions. This work is supported under Grant No. W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.