Exploiting the Syntax-Model Consistency for Neural Relation Extraction

This paper studies the task of Relation Extraction (RE) that aims to identify the semantic relations between two entity mentions in text. In the deep learning models for RE, it has been beneficial to incorporate the syntactic structures from the dependency trees of the input sentences. In such models, the dependency trees are often used to directly structure the network architectures or to obtain the dependency relations between the word pairs to inject the syntactic information into the models via multi-task learning. The major problem with these approaches is the lack of generalization beyond the syntactic structures in the training data or the failure to capture the syntactic importance of the words for RE. In order to overcome these issues, we propose a novel deep learning model for RE that uses the dependency trees to extract the syntax-based importance scores for the words, serving as a tree representation to introduce syntactic information into the models with greater generalization. In particular, we leverage Ordered-Neuron Long Short-Term Memory Networks (ON-LSTM) to infer the model-based importance scores for every word in the sentences for RE; these scores are then regulated to be consistent with the syntax-based scores to enable syntactic information injection. We perform extensive experiments to demonstrate the effectiveness of the proposed method, leading to the state-of-the-art performance on three RE benchmark datasets.


Introduction
One of the fundamental tasks in Information Extraction (IE) is Relation Extraction (RE), where the goal is to find the semantic relationships between two entity mentions in text. Due to its importance, RE has been studied extensively in the literature. The recent studies on RE have focused on deep learning to develop methods to automatically induce sentence representations from data (Zeng et al., 2014; Nguyen and Grishman, 2015a; Verga et al., 2018). A notable insight in these recent studies is that the syntactic trees of the input sentences (i.e., the dependency trees) can provide effective information for the deep learning models, leading to the state-of-the-art performance for RE recently (Xu et al., 2015; Tran et al., 2019). In particular, the previous deep learning models for RE have mostly exploited the syntactic trees to structure the network architectures according to the word connections presented in the trees (e.g., performing Graph Convolutional Neural Networks (GCN) over the dependency trees). Unfortunately, these models might not be able to generalize well as the tree structures of the training data might significantly differ from those in the test data (i.e., the models are overfit to the syntactic structures in the training data). For instance, in the cross-domain setting for RE, the domains for the training data and test data are dissimilar, often leading to a mismatch between the syntactic structures of the training data and test data. In order to overcome this issue, the overall strategy is to obtain a more general representation of the syntactic trees that can be used to inject the syntactic information into the deep learning models to achieve better generalization for RE.
A general tree representation for RE is presented in (Veyseh et al., 2019) where the dependency trees are broken down into their sets of dependency relations (i.e., the edges) between the words in the sentences (called the edge-based representation). These dependency relations are then used in a multi-task learning framework for RE that simultaneously predicts both the relation between the two entity mentions and the dependency connections between the pairs of words in the input sentences. Although the dependency connections might be less specific to the training data than the whole tree structures, the major limitation of the edge-based representation is that it only captures the pairwise (local) connections between the words and completely ignores the overall (global) importance of the words in the sentences for the RE problem. In particular, some words in a given sentence might carry more useful information for relation prediction in RE than the other words, and the dependency tree for this sentence can help to better identify those important words and assign higher importance scores to them (e.g., choosing the words along the shortest dependency path between the two entity mentions). We expect that introducing such importance information for the words in the deep learning models might lead to improved performance for RE. Consequently, in this work, we propose to obtain an importance score for each word in the sentences from the dependency trees (called the syntax-based importance scores). These will serve as the general tree representation to incorporate the syntactic information into the deep learning models for RE.
How can we employ the syntax-based importance scores in the deep learning models for RE? In this work, we first use the representation vectors for the words from the deep learning models to compute another importance score for each word (called the model-based importance scores). These model-based importance scores are expected to quantify the semantic information that a word contributes to successfully predict the relationship between the input entity mentions. Afterward, we propose to inject the syntax-based importance scores into the deep learning models for RE by enforcing that the model-based importance scores are consistent with the syntactic counterparts (i.e., via the KL divergence). The motivation of the consistency enforcement is to promote the importance scores as the bridge through which the syntactic information can be transmitted to enrich the representation vectors in the deep learning models for RE.
In order to implement this idea, we employ the Ordered-Neuron Long Short-Term Memory Networks (ON-LSTM) (Shen et al., 2019) to compute the model-based importance scores for the words in the sentences for RE. ON-LSTM extends the popular Long Short-Term Memory Networks (LSTM) by introducing two additional gates (i.e., the master forget and input gates) in the hidden vector computation. These new gates control how long each neuron in the hidden vectors should be activated across different time steps (words) in the sentence (i.e., higher-order neurons would be maintained for a longer time). Based on such controlled neurons, the model-based importance score for a word can be determined by the number of active neurons that the word possesses in the operation of ON-LSTM. To our knowledge, this is the first time ON-LSTM is applied for RE in the literature.
One of the issues in the original ON-LSTM is that the master gates and the model-based importance score for each word are only conditioned on the word itself and the left context encoded in the previous hidden state. However, in order to infer the importance for a word in the overall sentence effectively, it is crucial to have a view over the entire sentence (i.e., including the context words on the right). To this end, instead of relying only on the current word, we propose to obtain an overall representation of the sentence that is used as the input to compute the master gates and the importance score for each word in the sentence. This would enrich the model-based importance scores with the context from the entire input sentences, potentially leading to the improved RE performance of the model in this work.
Finally, to further improve the representations learned by the deep learning models for RE, we introduce a new inductive bias to promote the similarity between the representation vectors for the overall sentences and the words along the shortest dependency paths between the two entity mentions. The intuition is that the relation between the two entity mentions of interest in a sentence for RE can be inferred from either the entire sentence or the shortest dependency path between the two entity mentions (due to the demonstrated ability of the shortest dependency path to capture the important context words for RE in the prior work (Bunescu and Mooney, 2005)). We thus expect that the representation vectors for the sentence and the dependency path should be similar (as both capture the semantic relation) and explicitly exploiting such similarity can help the models to induce more effective representations for RE. Our extensive experiments on three benchmark datasets (i.e., ACE 2005, SPOUSE and SciERC) demonstrate the effectiveness of the proposed model for RE, leading to the state-of-the-art performance for these datasets.

Related Work
RE has been traditionally solved by the feature-based or kernel-based approaches (Zelenko et al., 2003; Zhou et al., 2005; Bunescu and Mooney, 2005; Sun et al., 2011; Chan and Roth, 2010; Nguyen and Grishman, 2014; Nguyen et al., 2015c). One of the issues in these approaches is the requirement for extensive feature or kernel engineering effort that hinders the generalization and applicability of the RE models. Recently, deep learning has been applied to address these problems of the traditional RE approaches, producing the state-of-the-art performance for RE. The typical network architectures for RE include Convolutional Neural Networks (Zeng et al., 2014; Nguyen and Grishman, 2015a; dos Santos et al., 2015; Wang et al., 2016), Recurrent Neural Networks (Nguyen and Grishman, 2016; Zhou et al., 2016; Zhang et al., 2017; Nguyen et al., 2019a), and self-attention in Transformer (Verga et al., 2018). The syntactic information from the dependency trees has also been shown to be useful for the deep learning models for RE (Tai et al., 2015; Xu et al., 2015; Liu et al., 2015; Miwa and Bansal, 2016; Peng et al., 2017; Tran et al., 2019; Song et al., 2019; Veyseh et al., 2019). However, these methods tend to generalize poorly to new syntactic structures (e.g., in different domains) due to the direct reliance on the syntactic trees, or fail to exploit the syntax-based importance of the words for RE due to the sole focus on the edges of the dependency trees (Veyseh et al., 2019).

Model
The RE problem can be formulated as a multi-class classification problem. Formally, given an input sentence W = w_1, w_2, ..., w_N where w_t is the t-th word in the sentence W of length N, and two entity mentions of interest at indexes s and o (1 ≤ s < o ≤ N), our goal is to predict the semantic relation between w_s and w_o in W.
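A purely illustrative instance of this formulation is sketched below in Python; the sentence, indexes, and label are made up for illustration and are not taken from the datasets used in this work:

```python
# One RE instance: a sentence, the positions of the two entity mentions, and the gold relation.
example = {
    "W": ["He", "was", "born", "in", "Chicago", "."],  # w_1 ... w_N (N = 6)
    "s": 1,                                            # 1-based index of the first mention ("He")
    "o": 5,                                            # 1-based index of the second mention ("Chicago")
    "relation": "PHYS",                                 # hypothetical gold label for the (w_s, w_o) pair
}
```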
Similar to the previous work on deep learning for RE (Shi et al., 2018; Veyseh et al., 2019), we first transform each word w_t into a representation vector x_t using the concatenation of the three following vectors: (i) the pre-trained word embedding of w_t, (ii) the position embedding vectors (to encode the relative distances of w_t to the two entity mentions of interest w_s and w_o (i.e., t − s and t − o)), and (iii) the entity type embeddings (i.e., the embeddings of the BIO labels for the words to capture the entity mentions present in W). This word-to-vector transformation converts the input sentence W into a sequence of representation vectors X = x_1, x_2, ..., x_N to be consumed by the next neural computations of the proposed model.
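A minimal PyTorch sketch of this input encoder (the module and parameter names are our own illustration; the embedding sizes follow the hyper-parameter section later in the paper):

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, vocab_size, num_entity_tags, max_len,
                 word_dim=300, pos_dim=30, tag_dim=30):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)     # pre-trained word2vec (or BERT vectors)
        self.pos_emb = nn.Embedding(2 * max_len + 1, pos_dim)  # relative distance to a mention, shifted to be non-negative
        self.tag_emb = nn.Embedding(num_entity_tags, tag_dim)  # BIO entity type labels
        self.max_len = max_len

    def forward(self, word_ids, tag_ids, s, o):
        # word_ids, tag_ids: [N]; s, o: (0-based) indexes of the two entity mentions
        N = word_ids.size(0)
        t = torch.arange(N)
        dist_s = self.pos_emb(t - s + self.max_len)  # embeds the relative distance t - s
        dist_o = self.pos_emb(t - o + self.max_len)  # embeds the relative distance t - o
        x = torch.cat([self.word_emb(word_ids), dist_s, dist_o, self.tag_emb(tag_ids)], dim=-1)
        return x  # X = x_1 ... x_N, shape [N, word_dim + 2*pos_dim + tag_dim]
```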
There are three major components in the RE model in this work, namely (1) the CEON-LSTM component (i.e., context-enriched ON-LSTM) to compute the model-based importance scores of the words w_t, (2) the syntax-model consistency component to enforce the similarity between the syntax-based and model-based importance scores, and (3) the similarity component between the representation vectors of the overall sentence and the shortest dependency path.

CEON-LSTM
The goal of this component is to obtain a score for each word w_t that indicates the contextual importance of w_t with respect to the relation prediction between w_s and w_o in W. In this section, we first describe the ON-LSTM model to compute these importance scores (i.e., the model-based scores). A new model (called CEON-LSTM) that integrates the representation of the entire sentence into the cells of ON-LSTM will be presented afterward.
ON-LSTM: Long Short-Term Memory Networks (LSTM) (Hochreiter and Schmidhuber, 1997) have been widely used in Natural Language Processing (NLP) due to their natural mechanism to obtain abstract representations for a sequence of input vectors (Nguyen and Nguyen, 2018b, 2019). Given the input representation vector sequence X = x_1, x_2, ..., x_N, LSTM produces a sequence of hidden vectors H = h_1, h_2, ..., h_N using the standard recurrent functions at each time step (word) w_t (assuming the zero vector for h_0), in which f_t, i_t, and o_t are called the forget, input, and output gates (respectively).
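For reference, the recurrence referred to here is the standard LSTM cell, restated below (σ is the sigmoid function, ∘ is element-wise multiplication, and the W, U, b matrices/vectors are the cell parameters):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\hat{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \hat{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```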
In order to compute the importance score for each word w_t, ON-LSTM introduces into the mechanism of LSTM two additional gates, i.e., the master forget gate f̃_t and the master input gate î_t (Shen et al., 2019). These gates are computed and integrated into the LSTM cell using the cummax activation function, defined as cummax(x) = cumsum(softmax(x)).
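Following the ON-LSTM formulation of Shen et al. (2019), restated here in this paper's notation (so the exact parameterization should be read as a reconstruction rather than a quote), the master gates and the cell update are:

```latex
\begin{aligned}
\tilde{f}_t &= \operatorname{cummax}(W_{\tilde{f}} x_t + U_{\tilde{f}} h_{t-1} + b_{\tilde{f}}) \\
\hat{i}_t &= 1 - \operatorname{cummax}(W_{\hat{i}} x_t + U_{\hat{i}} h_{t-1} + b_{\hat{i}}) \\
\omega_t &= \tilde{f}_t \circ \hat{i}_t \\
c_t &= (f_t \circ \omega_t + \tilde{f}_t - \omega_t) \circ c_{t-1} + (i_t \circ \omega_t + \hat{i}_t - \omega_t) \circ \hat{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
```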
The forget and input gates in LSTM (i.e., f_t and i_t) are different from the master forget and input gates in ON-LSTM (i.e., f̃_t and î_t) as the gates in LSTM assume that the neurons/dimensions in their hidden vectors are equally important and that these neurons are active at every step (word) in the sentence. This is in contrast to the master gates in ON-LSTM that impose a hierarchy over the neurons in the hidden vectors and limit the activity of the neurons to only a portion of the words in the sentence (i.e., higher-ranking neurons would be active for more words in the sentence). Such hierarchy and activity limitation are achieved via the function cummax(x) that aggregates the softmax output of the input vector x along the dimensions. The output of cummax(x) can be seen as the expectation of some binary vector of the form (0, ..., 0, 1, ..., 1) (i.e., involving two consecutive segments: the 0's segment and the 1's segment). At one step, the 1's segment in the gate vectors represents the neurons that are activated at that step. In ON-LSTM, a word w_i is more contextually important than another word w_j if the master gates for w_i have more active neurons than those for w_j. Consequently, in order to compute the importance score for the word w_t, we can rely on the number of active neurons in the master gates, which can be estimated via the sum of the weights of the neurons in the master gates in ON-LSTM. Following (Shen et al., 2019), we employ the master forget gate in ON-LSTM to compute the importance scores for the words in this work. Specifically, let f̃_t = f̃_t1, f̃_t2, ..., f̃_tD be the weights for the neurons/dimensions in the master forget gate vector (i.e., D is the dimension of the gate vectors). The model-based importance score mod_t for the word w_t ∈ W is then obtained by: mod_t = 1 − Σ_{i=1..D} f̃_ti. For convenience, we also use H = h_1, h_2, ..., h_N to denote the hidden vectors returned from the application of ON-LSTM over the input representation vectors X.
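A small PyTorch sketch of the two pieces just described (the cummax activation and the score computed from the master forget gate); whether the sum over dimensions is further normalized by D is left open here, so the snippet follows the formula above literally:

```python
import torch
import torch.nn.functional as F

def cummax(x, dim=-1):
    # cummax(x) = cumsum(softmax(x)): a soft version of a binary vector (0, ..., 0, 1, ..., 1).
    return torch.cumsum(F.softmax(x, dim=dim), dim=dim)

def model_importance(master_forget):
    # master_forget: [N, D] master forget gate values f~_t for the N words.
    # Implements mod_t = 1 - sum_i f~_{ti} from the text above; returns a score per word.
    return 1.0 - master_forget.sum(dim=-1)  # shape [N]
```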

Introducing Sentence Context into ON-LSTM
One limitation of the ON-LSTM model is that it only relies on the representation vector of the current word x_t and the hidden vector for the left context (encoded in h_{t−1}) to compute the master gate vectors, and thus the model-based importance score, for the word w_t. However, this score computation mechanism might not be sufficient for RE as the importance score for w_t might also depend on the context information on the right (e.g., the appearance of some word on the right might make w_t less important for the relation prediction between w_s and w_o). Consequently, in this work, we propose to first obtain a representation vector x̂_t = g(x_1, x_2, ..., x_N) that has the context information about the entire sentence W (i.e., both the left and right context for the current word w_t). Afterward, x̂_t will replace the input representation vector x_t in the computation for the master gates and importance score at step t of ON-LSTM (i.e., in the formulas for f̃_t and î_t in Equation 2). In this way, the model-based importance score for w_t will be able to condition on the overall context in the input sentence.
In this work, we obtain the representation vector x̂_t for each step t of ON-LSTM based on the weighted sum of the transformed vectors of the input representation sequence x_1, x_2, ..., x_N, where the weight α_ti for the term with x_i is computed with an attention mechanism. Note that in this attention, we use the ON-LSTM hidden vector h_{t−1} from the previous step as the query vector to compute the attention weight for each word. The rationale is to enrich the attention weights for the current step with the context information from the previous steps (i.e., encoded in h_{t−1}), leading to the contextualized input representation x̂_t with richer information for the master gates and importance score computations in ON-LSTM. The proposed ON-LSTM with the enriched input vectors x̂_t is called CEON-LSTM (i.e., Context-Enriched ON-LSTM) in this work.
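One plausible form of this weighted sum that is consistent with the description (the projection matrices W_q, W_k, and W_v below are our assumption about the parameterization) is:

```latex
\begin{aligned}
\hat{x}_t &= \sum_{i=1}^{N} \alpha_{ti} \, (W_v x_i) \\
\alpha_{ti} &= \frac{\exp\!\big((W_q h_{t-1})^{\top} (W_k x_i)\big)}{\sum_{j=1}^{N} \exp\!\big((W_q h_{t-1})^{\top} (W_k x_j)\big)}
\end{aligned}
```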

Syntax-Model Consistency
As mentioned in the introduction, the role of the model-based importance scores obtained from CEON-LSTM is to serve as the bridge to inject the information from the syntactic structures of W into the representation vectors of the deep learning models for RE. In particular, we first leverage the dependency tree of W to obtain another importance score syn_t for each word w_t ∈ W (i.e., the syntax-based importance score). Similar to the model-based scores, the syntax-based scores are expected to measure the contextual importance of w_t with respect to the relation prediction for w_s and w_o. Afterward, we introduce a constraint to encourage the consistency between the model-based and syntax-based importance scores (i.e., mod_t and syn_t) for the words by minimizing the KL divergence L_import between the normalized scores. The intuition is to exploit this consistency to supervise the model-based importance scores from the models with the syntax-based importance scores from the dependency trees. As the model-based importance scores are computed from the master gates with the active and inactive neurons in CEON-LSTM, this supervision allows the syntactic information to directly influence the internal computation/structure of the cells in CEON-LSTM, potentially generating representation vectors with better syntax-aware information for RE.
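Under the natural reading of this constraint (both score sequences are softmax-normalized over the words of the sentence; the direction of the KL divergence is our assumption), L_import can be written as:

```latex
\begin{aligned}
syn'_t &= \frac{\exp(syn_t)}{\sum_{i=1}^{N}\exp(syn_i)}, \qquad
mod'_t = \frac{\exp(mod_t)}{\sum_{i=1}^{N}\exp(mod_i)} \\
L_{import} &= \mathrm{KL}(syn' \,\|\, mod') = \sum_{t=1}^{N} syn'_t \log\frac{syn'_t}{mod'_t}
\end{aligned}
```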
To obtain the syntax-based importance scores, we draw motivation from the previous work on RE where the shortest dependency paths between the two entity mentions of interest have been shown to capture many important context words for RE. Specifically, for the sentence W, we first retrieve the shortest dependency path DP between the two entity mentions w_s and w_o, and the length T of the longest path between any pair of words in the dependency tree of W. The syntax-based importance score syn_t for the word w_t ∈ W is then computed as the difference between T and the length of the shortest path between w_t and some word in DP in the dependency tree (i.e., the words along DP will have the score of T). On the one hand, these syntax-based importance scores are able to capture the importance of the words that is customized for the relation prediction between w_s and w_o. This is better suited for RE than the direct use of the edges of the dependency trees in (Veyseh et al., 2019), which is agnostic to the entity mentions of interest and fails to encode the importance of the words for RE. On the other hand, the syntax-based importance scores syn_t represent a relaxed form of the original dependency tree that might have a better chance to generalize over different data and domains for RE than the prior work (i.e., the models that directly fit themselves to the whole syntactic structures and run the risk of overfitting to the structures in the training data).
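The following sketch (using networkx; the function name and input format are illustrative) computes these syntax-based scores exactly as described: T is the length of the longest shortest path in the tree, and each word's score is T minus its distance to the nearest word on DP:

```python
import networkx as nx

def syntax_importance(edges, s, o, num_words):
    """Compute syn_t for every word. edges: (head, dependent) pairs of the dependency
    tree with 0-based word indexes; s, o: indexes of the two entity mentions."""
    tree = nx.Graph(edges)  # treat the dependency tree as an undirected graph
    # T: length of the longest shortest path between any pair of words (the tree diameter).
    T = max(l for lengths in dict(nx.all_pairs_shortest_path_length(tree)).values()
            for l in lengths.values())
    # DP: the words on the shortest dependency path between the two entity mentions.
    DP = set(nx.shortest_path(tree, s, o))
    scores = []
    for t in range(num_words):
        dist = min(nx.shortest_path_length(tree, t, p) for p in DP)  # 0 if w_t lies on DP
        scores.append(T - dist)  # words along DP receive the maximal score T
    return scores
```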

Sentence-Dependency Path Similarity
In this component, we seek to further improve the representation vectors in the proposed deep learning model for RE by introducing a novel constraint to maximize the similarity between the representation vectors for the overall input sentence W and the words along the shortest dependency path DP (i.e., inductive bias). The rationale for this bias is presented in the introduction.
In order to implement this idea, we first obtain the representation vectors R_W and R_DP for the sentence W and the words along DP (respectively) by applying the max-pooling operation over the CEON-LSTM hidden vectors h_1, h_2, ..., h_N for the words in W and DP: R_W = max-pooling_{w_i ∈ W}{h_i} and R_DP = max-pooling_{w_i ∈ DP}{h_i}. In the next step, we promote the similarity between R_W and R_DP by explicitly minimizing their negative cosine similarity (we also tried the KL divergence and the mean squared error for this purpose, but cosine similarity achieved better performance), i.e., adding a term L_path into the overall loss function.
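Given the description above, this term is simply the negative cosine similarity between the two pooled vectors:

```latex
L_{path} = -\cos(R_W, R_{DP}) = -\frac{R_W \cdot R_{DP}}{\lVert R_W \rVert \, \lVert R_{DP} \rVert}
```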

Prediction
Finally, in the prediction step, following the prior work (Veyseh et al., 2019), we employ an overall representation vector V to predict the relation between w_s and w_o in W. Note that V involves the information at different abstraction levels for W, i.e., the raw input level with x_s and x_o, the abstract representation level with h_s and h_o from CEON-LSTM, and the overall sentence vector R_W. In our model, V is fed into a feed-forward neural network with a softmax layer in the end to estimate the probability distribution P(.|W, w_s, w_o) over the possible relations for W. The negative log-likelihood is then used as the loss function for the model: L_label = − log P(y|W, w_s, w_o) (y is the gold relation label for w_s and w_o in W). Eventually, the overall loss function of the model in this work combines L_label, L_import, and L_path, where α and β are trade-off parameters. The model is trained with shuffled mini-batching.
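Concretely, the natural reading of the text gives the following two formulas (the concatenation order in V is our assumption):

```latex
\begin{aligned}
V &= [x_s; x_o; h_s; h_o; R_W] \\
L &= L_{label} + \alpha L_{import} + \beta L_{path}
\end{aligned}
```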

Datasets and Hyper-parameters
We evaluate the models in this work using three benchmark datasets, i.e., ACE 2005, SPOUSE, and SciERC. For ACE 2005, we follow the setting of the prior work (Yu et al., 2015) for compatible comparison. There are 6 different domains in this dataset, i.e., (bc, bn, cts, nw, un, and wl), covering text from news, conversations and web blogs. Following the prior work, the union of the domains bn and nw (called news) is used as the training data (called the source domain); half of the documents in bc are reserved for the development data, and the remainder (cts, wl and the other half of bc) serves as the test data (called the target domains). This data separation facilitates the evaluation of the cross-domain generalization of the models due to the domain difference between the training and test data. The SPOUSE dataset was recently introduced by (Hancock et al., 2018), involving 22,195 sentences for the training data, 2,796 sentences for the validation data, and 2,697 sentences for the test data. Each sentence in this dataset contains two marked person names (i.e., the entity mentions) and the goal is to identify whether the two people mentioned in the sentence are spouses.
Finally, the SciERC dataset (Luan et al., 2018) annotates 500 scientific abstracts for the entity mentions along with the coreferences and relations between them. For RE, this dataset provides 3,219 sentences in the training data, 455 sentences in the validation data and 974 sentences in the test data.
We fine-tune the hyper-parameters for the models in this work on the validation data of the ACE 2005 dataset. The best parameters suggested by this process include: 30 dimensions for the position embeddings and entity type embeddings, 200 hidden units for the CEON-LSTM model and all the other hidden vectors in the model (i.e., the hidden vectors in the final feed-forward neural network (with 2 layers) and the intermediate vectors in the weighted sum for x̂_t), 1.0 for both loss trade-off parameters α and β, and 0.001 for the initial learning rate with the Adam optimizer. The batch size is set to 50. Finally, we use either the uncontextualized word2vec word embeddings (with 300 dimensions) or the hidden vectors in the last layer of the BERT base model (with 768 dimensions) (Devlin et al., 2019) to obtain the pre-trained word embeddings for the sentences. We find it better to keep BERT fixed (i.e., not fine-tuned) in the experiments. Note that besides this section, we provide some additional analysis for the models in the Appendix.

Comparison with the state of the art
We first compare the proposed model (called CEON-LSTM) with the baselines on the popular ACE 2005 dataset. In particular, the four following groups of RE models from the prior work on the ACE 2005 dataset are chosen for comparison: (i) Feature-based models: These models hand-design linguistic features for RE, i.e., FCM, Hybrid FCM, LRFCM, and SVM (Yu et al., 2015; Hendrickx et al., 2010).
(ii) Deep sequence-based models: These models employ deep learning architectures based on the sequential order of the words in the sentences for RE, i.e., log-linear, CNN, Bi-GRU, Forward GRU, Backward GRU (Nguyen and Grishman, 2016), and CNN+DANN (Fu et al., 2017).
(iii) Adversarial learning model: This model, called GSN, attempts to learn the domain-independent features for RE (Shi et al., 2018).
(iv) Deep structure-based models: These models use dependency trees either as the input features or as the graphs to structure the network architectures in the deep learning models. The state-of-the-art models of this type include AGGCN (Attention Guided GCN), C-GCN, SACNN, and DRPC. Note that we obtain the performance of these models on the considered datasets using the actual implementations released by the original papers.
Most of the prior RE work on the ACE 2005 dataset uses the uncontextualized word embeddings (i.e., word2vec) for the initial word representation vectors. In order to achieve a fair comparison with the baselines, we first show the performance of the models (i.e., the F1 scores) on the ACE 2005 test datasets when word2vec is employed for the pre-trained word embeddings in Table 1. The first observation from the table is that the deep structure-based models (e.g., C-GCN, DRPC) are generally better than the deep sequence-based models (e.g., CNN, Bi-GRU) and the feature-based models with large performance gaps. This demonstrates the benefits of the syntactic structures that can provide useful information to improve the performance of the deep learning models for RE. We will thus focus on these deep structure-based models in the following experiments. Among all the models, we see that the proposed model CEON-LSTM is significantly better than all the baseline models over different test domains/datasets. In particular, CEON-LSTM is 1.38% and 3.1% better than DRPC and SACNN (respectively) on the average F1 scores over the different test datasets. These performance improvements are significant with p < 0.01 and clearly demonstrate the effectiveness of the proposed CEON-LSTM model for RE.
In order to further compare CEON-LSTM with the baselines, Table 2 presents the performance of the models when the words are represented by the contextualized word embeddings (i.e., BERT). For this case, we also report the performance of the recent BERT-based model (i.e., Entity-Aware BERT (EA-BERT)) on the ACE 2005 dataset. Comparing the models in Table 2 with the counterparts in Table 1, it is clear that the contextualized word embeddings can significantly improve the deep structure-based models for RE. More importantly, similar to the case with word2vec, we see that the proposed model CEON-LSTM still significantly outperforms all the baseline models with large performance gaps and p < 0.01, further testifying to the benefits of the CEON-LSTM model in this work. Finally, in order to demonstrate the generalization of the proposed model over the other datasets, we show the performance of the models on the two other datasets in this work (i.e., SPOUSE and SciERC) using either word2vec or BERT as the word embeddings in Table 3. The results clearly confirm the effectiveness of CEON-LSTM as it is significantly better than all the other models over different datasets and word embedding settings.

Ablation Study
The Effect of the Model Components: There are three major components in the proposed model: (1) the introduction of the overall sentence representation x̂_t into the ON-LSTM cells (called SCG - Sentence Context for Gates), (2) the consistency constraint for the syntax-based and model-based importance scores (called SMC - Syntax-Model Consistency), and (3) the similarity constraint for the representation vectors of the overall sentence and the shortest dependency path (called SDPS - Sentence-Dependency Path Similarity). In order to evaluate the contribution of these components to the overall model CEON-LSTM, we incrementally remove these components from CEON-LSTM and evaluate the performance of the remaining model. Table 4 reports the performance of the models on the ACE 2005 development dataset.
It is clear from the table that all the components are necessary for the proposed model as excluding any of them hurts the performance significantly. It is also evident that removing more components results in larger performance drops, thus demonstrating the complementary nature of the three proposed components in this work.
The Variants for CEON-LSTM: We study several variants of SCG, SMC, and SDPS in CEON-LSTM to demonstrate the effectiveness of the designed mechanisms. In particular, we consider the following alternatives for CEON-LSTM: (i) Bi-ON-LSTM: Instead of employing the attention-based representation vectors x̂_t to capture the context of the entire input sentence for the model-based importance scores in SCG, we run two unidirectional ON-LSTM models (i.e., the forward and backward ON-LSTM) to compute the forward and backward importance scores for each word in W. The final model-based importance score for each word is then the average of the corresponding forward and backward scores.
(ii) SA-ON-LSTM: In this method, instead of using the hidden vector h_{t−1} as the query vector to compute the attention weight α_ti in Equation 3 for SCG, we utilize the input representation vector x_t for w_t as the query vector (i.e., replace h_{t−1} with x_t in Equation 3). Consequently, SA-ON-LSTM is basically a composed model where we first run the self-attention (SA) model (Vaswani et al., 2017) over X. The results are then fed into ON-LSTM to obtain the model-based importance scores mod_t.
(iii) CE-LSTM: This aims to explore the effectiveness of ON-LSTM for our model. In CE-LSTM, we replace the ON-LSTM network with the usual LSTM model in CEON-LSTM. The SMC component is not included in this case as the LSTM model cannot infer the importance scores.
(iv) EP-ON-LSTM: Before this work, the DRPC model in (Veyseh et al., 2019) achieved the state-of-the-art performance on ACE 2005. Both DRPC and CEON-LSTM apply a more general representation of the dependency trees in a deep learning model (i.e., avoiding the direct use of the original trees to improve generalization). To illustrate the benefit of the importance score representation for SMC, EP-ON-LSTM replaces the importance score representation for the dependency trees in CEON-LSTM with the dependency edge representation in DRPC. In particular, we replace the term L_import in the overall loss function (i.e., Equation 6) with the dependency edge prediction loss (using the ON-LSTM hidden vectors) in DRPC for EP-ON-LSTM.
(v) SP-CEON-LSTM: This model removes the SDPS component and includes the representation vector of the dependency path DP (i.e., R_DP) in the final representation V for relation prediction. We consider both retaining and excluding the sentence representation R_W in V in this case. This model seeks to show that using R_DP to encourage the similarity with R_W is more effective than employing R_DP directly in V. Table 5 reports the performance of these CEON-LSTM variants on the ACE 2005 development dataset. As we can see from the table, all the considered variants have significantly worse performance than CEON-LSTM (with p < 0.005). This clearly helps to justify the designs of the components SCG, SMC and SDPS for CEON-LSTM in this work.
Baseline for the Model-Based Importance Scores: One of the contributions in our work is to employ the gates in the cells of ON-LSTM to obtain the model-based importance scores that are then used to promote the consistency with the syntax-based importance scores (i.e., in the SMC component). In order to demonstrate the effectiveness of the master cell gates for obtaining the model-based importance scores, we evaluate a typical baseline where the model-based importance score mod_i for w_i ∈ W is computed directly from the hidden vector h_i of CEON-LSTM (i.e., by feeding h_i into a feed-forward neural network with a sigmoid activation function in the end). The model-based importance scores obtained in this way then replace the importance scores from the cell gates and are used in the SMC component of CEON-LSTM in the usual way (i.e., via the KL divergence in L_import). Note that we also tried alternatives to the KL divergence in L_import (i.e., the mean square error and the cosine similarity between the syntax-based and model-based importance scores), but the KL divergence produced the best results for both CEON-LSTM and HIS-CEON-LSTM on the development data. The resulting model is called HIS-CEON-LSTM.
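A minimal sketch of this baseline scorer (the layer sizes and names are our own illustration) would be:

```python
import torch.nn as nn

class HiddenStateImportance(nn.Module):
    """Baseline: compute mod_i directly from the CEON-LSTM hidden vector h_i."""
    def __init__(self, hidden_dim=200):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # sigmoid activation at the end, as described above
        )

    def forward(self, hidden):                   # hidden: [N, hidden_dim]
        return self.scorer(hidden).squeeze(-1)   # mod scores, shape [N]
```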

Conclusion
We introduce a new deep learning model for RE (i.e., CEON-LSTM) that features three major proposals. First, we represent the dependency trees via the syntax-based importance scores for the words in the input sentences for RE. Second, we propose to incorporate the overall sentence representation vectors into the cells of ON-LSTM, allowing it to compute the model-based importance scores more effectively. We also devise a novel mechanism to project the syntactic information into the computation of ON-LSTM via promoting the consistency between the syntax-based and model-based importance scores. Finally, we present a novel inductive bias for the deep learning models that exploits the similarity of the representation vectors for the whole input sentences and the shortest dependency paths between the two entity mentions for RE. Extensive experiments are conducted to demonstrate the benefits of the proposed model. We achieve the state-of-the-art performance on three datasets for RE. In the future, we plan to apply CEON-LSTM to other related NLP tasks (e.g., Event Extraction, Semantic Role Labeling) (Nguyen et al., 2016a; Nguyen and Grishman, 2018a).

A Analysis
In order to provide more insights into the performance of the proposed model, we examine the examples for which DRPC predicts the relations incorrectly while CEON-LSTM makes the correct predictions (called the DRPC-failure examples; see Table 8). We find that these examples often involve the two entity mentions of interest with long distance from each other in the input sentences. For these examples, the dependency paths between the two entity mentions tend to be very helpful or crucial for RE as they can capture the important context words (thus eliminating the irrelevant ones). This allows the models to learn effective representations to correctly predict the relations in the sentences for RE. As DRPC only retains the dependency edges in the dependency trees separately (i.e., the local tree representations), it cannot directly capture such dependency paths, thereby failing to predict the relations for the DRPC-failure examples with long distances between the entities. This is in contrast to CEON-LSTM that exploits the global representations of the trees with the importance scores based on the distances of the words to the dependency paths. As the dependency paths can still be inferred from this global representation, CEON-LSTM can benefit from this information to successfully perform RE for the sentences in the DRPC-failure examples.

Table 8: Examples of the DRPC-failure sentences with their gold relation labels.

Sentence: Some Arab countries also want to play a role in the stability operation in Iraq but are reluctant to send troops because of political, religious and ethnic considerations, the official said.
Relation: ORG-AFF

Sentence: Some suggested that Russian President Vladimir Putin will now be scrambling to contain the damage to his once-budding friendship with US President George W. Bush because he was poorly advised by his intelligence and defense aides.
Relation: PER-SOC

Sentence: Other countries including the Philippines, South Korea, Qatar and Australia agreed to send other help such as field hospitals, engineers, explosive ordnance disposal teams or nuclear, biological and chemical weapons experts.
Relation: (missing)

Sentence: US diplomats have hinted in recent weeks that Washington's anger with European resistance to the campaign was focused more on Paris - and to a lesser extent Berlin - than it was with Moscow.
Relation: PART-WHOLE

Sentence: In Montreal, "Stop the War" a coalition of more than 190 groups, said as many as 200,000 people turned out, though police refused to give a figure.
Relation: PHYS

Sentence: Although the crossing has, in principle, been open for movement between the two territories - while being frequently closed by Israeli for reasons rarely explained - the Palestinian section has been manned by Israel for more than two years.
Relation: (missing)