Knowledge-Enhanced Natural Language Inference Based on Knowledge Graphs

Natural Language Inference (NLI) is a vital task in natural language processing. It aims to identify the logical relationship between two sentences. Most existing approaches make such inferences based on semantic knowledge learned from the training corpus; background knowledge is rarely adopted, or is limited to a few specific types. In this paper, we propose a novel Knowledge Graph-enhanced NLI (KGNLI) model to leverage background knowledge stored in knowledge graphs for NLI. The KGNLI model consists of three components: a semantic-relation representation module, a knowledge-relation representation module, and a label prediction module. Unlike previous methods, the proposed KGNLI model can flexibly combine various kinds of background knowledge. Experiments on four benchmarks, SNLI, MultiNLI, SciTail, and BNLI, validate the effectiveness of our model.


Introduction
Natural Language Inference (NLI) is a fundamental yet challenging task for natural language understanding. NLI aims to determine whether the logical relationship between a premise p and a hypothesis h is entailment, neutral, or contradiction. NLI requires reasoning and inference abilities, which are crucial for artificial intelligence systems.
Recent years have witnessed large improvements in NLI models thanks to the release of several large-scale corpora, such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). These datasets enable various deep learning models to achieve state-of-the-art performance (Chen et al., 2017; Chen et al., 2018; Pan et al., 2018; Ghaeini et al., 2018).
Most existing NLI models are based on cross-sentence attention (Chen et al., 2017). They learn alignments according to the attention between premises and hypotheses. The resulting alignments and semantic representations of the sentences are aggregated and fed into a multilayer feed-forward neural network to judge the logical relationship. However, in most of these models, only semantic knowledge between premises and hypotheses is utilized.
Background knowledge can be utilized to facilitate inference, as shown in Fig. 1. For the sentences in the left part of Fig. 1, the logical relationship between the Premise and the Hypothesis largely depends on the relationship between "piano" and "music". However, the latter is not explicitly expressed in the sentences themselves. The right part of Fig. 1 shows a part of a large knowledge graph, in which the paths between "piano" and "music" represent background knowledge that can be used to determine the relationship of the sentence pair. Previous work (Chen et al., 2018) shows that NLI models can benefit from leveraging external knowledge, but it considers only a restricted set of knowledge types. How to flexibly incorporate a variety of background knowledge in NLI remains a challenging task.
In this paper, we propose a Knowledge Graph-enhanced Natural Language Inference (KGNLI) model. KGNLI enhances NLI performance by introducing background knowledge stored in knowledge graphs. More specifically, KGNLI first extracts entities such as subjects, predicates, and objects from the given sentence pair, then learns a knowledge-relation representation based on a predetermined knowledge graph that contains these entities as nodes. In addition, KGNLI learns a semantic-relation representation between the given sentences with a Bi-directional Long Short-Term Memory (BiLSTM) network. Finally, KGNLI combines these two representations and feeds them into a multilayer perceptron to determine the relationship label.

Figure 1: The role of knowledge graphs. Knowledge graphs can provide background knowledge for the NLI problem. For the sentence pair in the figure, their relationship largely depends on the relationship of entities "piano" and "music", which can be learned from the paths in the knowledge graph.
In order to evaluate our model, we conduct experiments on four datasets: SNLI, MultiNLI, SciTail, and BNLI. Our model achieves competitive results on all four datasets. On SciTail and BNLI, where knowledge is crucial for inference (Glockner et al., 2018), our model achieves large improvements over the baselines. We further conduct ablation tests to validate the effectiveness and necessity of each component of the proposed model.

Related Work

Traditional NLI models are trained on small-scale datasets and are typically natural logic-based or co-occurrence statistics-based: the former identifies inferences through lexical and syntactic features (MacCartney and Manning, 2008), while the latter considers statistical features (Glickman and Dagan, 2005).
The emergence of large-scale datasets, such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), has stimulated the research and development of new models based on deep neural networks. Various architectures have been proposed to capture the interaction and soft alignment between sentences. For instance, ESIM (Chen et al., 2017) considers recursive architectures in both local inference modeling and inference composition, CAFE (Tay et al., 2018) propagates compressed alignment features to upper layers to enhance representation learning, and DMAN (Pan et al., 2018) adopts reinforcement learning with discourse markers to improve performance. Although these models have achieved state-of-the-art results on NLI-related tasks, they rely only on semantic relationships learned from the training corpus; in other words, no external knowledge is explicitly utilized to facilitate inference.

Knowledge Enhanced NLI
To the best of our knowledge, the only neural model adopting external knowledge is KIM (Chen et al., 2018). This model incorporates basic knowledge about synonymy, antonymy, hypernymy, hyponymy, and co-hyponymy to help model soft alignments between sentence pairs. However, it can only deal with a fixed number of knowledge types, and pre-assigned scores for relationships are needed before training, which limits its applications in practice.

Knowledge Graph
A knowledge graph is a large knowledge base storing relational knowledge in a graph structure. Knowledge is formatted as triples: a triple (h, r, t) indicates that head entity h and tail entity t are connected by relation r. Many open-source knowledge graphs can be easily employed in a variety of applications, such as WordNet (Miller, 1995), Freebase (Bollacker et al., 2008), and Concept Graph (Cheng et al., 2015; Wu et al., 2012). Knowledge graphs have proved helpful in various natural language processing tasks, such as machine reading comprehension, language modeling (Logan et al., 2019), and question answering (Xiong et al., 2019). In this paper, a knowledge graph is employed to provide external background knowledge.

Methodology
Given a pair of sentences, premise p and hypothesis h, the goal of NLI is to predict a label y that indicates the logical relationship between sentences p and h. The set of labels includes entailment (h can be logically deduced from p), neutral (p and h do not have any logical relationship), and contradiction (p and h cannot be true simultaneously).
The proposed model consists of three major components, as shown in Fig. 2. The novel knowledge-relation representation module builds the relationship between p and h based on background knowledge, while the semantic-relation representation module captures the semantic relationship between the sentences. Finally, a multilayer perceptron merges both knowledge and semantic relationships and predicts the label.

Knowledge-Relation Representation
To build the relationship between p and h based on background knowledge, we propose a novel knowledge-relation representation module. In this paper, we assume that the relationship between sentences is determined by the relationship of their subjects, predicates, and objects. The architecture of this module is shown in Fig. 3.

Sub-graph of background relationship
In the following section, we denote the subject pair, predicate pair, and object pair of p and h as (p_S, h_S), (p_P, h_P), and (p_O, h_O), respectively. For each sentence pair, the sub-graph of background relationship for the subject pair (p_S, h_S) is extracted by finding paths between the entities denoting p_S and h_S in a predetermined knowledge graph KG via random walks. The sub-graphs for the predicate pair (p_P, h_P) and the object pair (p_O, h_O) are constructed in the same way. Fig. 3 shows three paths in the sub-graph of the object pair (piano, music): "piano - has subevent - playing piano - causes - music", "piano - is a - instrument - used for - music", and "piano - related to - orchestra - related to - classical - is a - music". Counting both entities and relations, the lengths of the first two paths are 5, while that of the last one is 7.

Figure 3: External Knowledge Encoding. For entities "piano" and "music", we first find the paths between them in the knowledge graph and update their embeddings using graph neural networks. Then we learn the embeddings of the paths. In this way, we encode the knowledge into pooled embeddings.
In this paper, we extract subjects, predicates, and objects of sentences based on their syntax trees, consider paths of length up to L, and limit the total number of paths in each sub-graph to N.
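The path extraction step can be sketched as follows. This is a minimal illustration using a deterministic, depth-limited breadth-first enumeration in place of the random-walk extraction described above; the `find_paths` helper and its dictionary-based graph representation are assumptions for illustration, and path length is counted over both entities and relations, matching the example above.

```python
from collections import deque

def find_paths(graph, source, target, max_len=10, max_paths=10):
    """Enumerate paths from source to target in a relational graph.

    graph maps an entity to a list of (relation, neighbor) pairs.
    A path alternates entities and relations, e.g.
    ['piano', 'has subevent', 'playing piano', 'causes', 'music'].
    Paths with more than max_len elements are discarded; at most
    max_paths paths are returned.
    """
    paths = []
    # Each queue item is a partial path ending at some entity.
    queue = deque([[source]])
    while queue and len(paths) < max_paths:
        path = queue.popleft()
        head = path[-1]
        if head == target and len(path) > 1:
            paths.append(path)
            continue
        if len(path) >= max_len:
            continue
        for relation, neighbor in graph.get(head, []):
            if neighbor not in path:  # avoid revisiting entities (cycles)
                queue.append(path + [relation, neighbor])
    return paths
```

On the toy graph of Fig. 3, this recovers the "piano - has subevent - playing piano - causes - music" and "piano - is a - instrument - used for - music" paths.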

Knowledge Embedding
With the help of the sub-graphs, we can learn the knowledge-based relationship of sentence pairs. First, the knowledge-based embeddings for entities in the sub-graphs are learned. Denote the knowledge embedded vectors of p and h as a^k_S, a^k_P, a^k_O and b^k_S, b^k_P, b^k_O, where S, P, and O index the subject, predicate, and object as above, and the superscript k indicates that the embeddings are learned based on knowledge graphs. We initialize entity embeddings with pre-trained vectors generated by TransE (Bordes et al., 2013). We then update these embeddings over the sub-graphs using a graph neural network. For an entity p_i, we retrieve all of its neighboring relations in the sub-graph and encode the neighboring knowledge into its embedding through the propagation rule

a^k_i = γ a^k_i + (1 − γ) σ( Σ_{(e,r) ∈ S_{p_i}} φ_p(e, r) W [a^k_e ; a^k_r] ),

where a^k_i and b^k_j are the propagated entity embeddings, γ is a tradeoff parameter, and S_{p_i} represents the set of all (e, r) pairs of p_i, where e is a neighbor of p_i and r is the relation between p_i and e. σ(·) is the activation function, W is a transformation matrix, [· ; ·] denotes the concatenation operator, and φ_p(e, r) is an attention score over (e, r), calculated based on the embedding of the entity and its neighbors. For the premise, φ_p(e, r) is computed as

φ_p(e, r) = exp( (a^k_i)^T W [a^k_e ; a^k_r] ) / Σ_{(e', r') ∈ S_{p_i}} exp( (a^k_i)^T W [a^k_{e'} ; a^k_{r'}] ),

where a^k_e and a^k_r are the embeddings of entity e and relation r initialized by TransE. For the hypothesis, the attention score φ_h(e, r) is calculated in the same way.
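A single propagation step of this kind can be sketched in NumPy. This is a simplified illustration, not the authors' implementation: the exact attention scoring function is an assumption (the paper states only that attention is computed from the entity and its neighbors), tanh stands in for the unspecified activation σ, and the `propagate` name is hypothetical.

```python
import numpy as np

def propagate(a_i, neighbors, W, gamma=0.5):
    """One propagation step for a single entity embedding a_i (shape (d,)).

    neighbors: list of (e_emb, r_emb) pairs -- embeddings of the
    neighboring entities and connecting relations (each shape (d,)),
    e.g. initialized by TransE.  W: (d, 2d) transformation matrix.
    Each (e, r) pair is scored by the dot product between a_i and
    W[e; r], normalized with a softmax; the attention-weighted sum is
    passed through tanh and mixed with the old embedding via gamma.
    """
    pairs = np.stack([np.concatenate([e, r]) for e, r in neighbors])  # (k, 2d)
    transformed = pairs @ W.T                                         # (k, d): W[e; r]
    scores = transformed @ a_i                                        # (k,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                                # phi(e, r)
    aggregated = np.tanh(attn @ transformed)                          # sigma(sum phi W[e; r])
    return gamma * a_i + (1.0 - gamma) * aggregated
```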

Relationship Representation
We get the relationship representation based on the representation of paths in the corresponding sub-graph. Denote the i-th path between the subject pair (p_S, h_S) as

l^S_i = (p_S, r_1, e_1, r_2, e_2, ..., h_S),

where r_j and e_j denote the j-th relation and entity on the path. The paths l^P_i and l^O_i of the predicate and object pairs are defined in the same way.
We encode each path sequence with a BiLSTM. Relation representations are obtained by averaging the representations of all paths:

ω^p_S = (1/N) Σ_{i=1}^{N} BiLSTM(l^S_i),

where ω^p_S, ω^p_P, and ω^p_O denote the relation representations between subjects, predicates, and objects, respectively. As BiLSTM inputs, we use the updated entity embeddings from the propagation step and the relation embeddings from TransE.

Knowledge Composition
We use a composition layer to merge the relationships of the subject pair, predicate pair, and object pair. Apart from the relation representations ω^p_S, ω^p_P, and ω^p_O, we also consider the correlation among them by using their element-wise product:

v^k = G^k([ω^p_S ; ω^p_P ; ω^p_O ; ω^p_S ⊙ ω^p_P ⊙ ω^p_O]),

where v^k is the composed representation and G^k is a feed-forward neural network with ReLU as the activation function.
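The composition layer can be sketched as follows. This is a minimal sketch assuming a one-layer feed-forward network for G^k and a particular concatenation order (the exact feature ordering is an assumption); the `compose_knowledge` name is hypothetical.

```python
import numpy as np

def compose_knowledge(w_s, w_p, w_o, G_k):
    """Merge the subject, predicate, and object relation vectors
    (each shape (d,)) into the composed knowledge representation v_k.

    The three vectors are concatenated together with their
    element-wise product, which captures the correlation among them,
    and passed through a linear layer G_k of shape (out_dim, 4 * d)
    followed by ReLU.
    """
    features = np.concatenate([w_s, w_p, w_o, w_s * w_p * w_o])  # (4d,)
    return np.maximum(G_k @ features, 0.0)                       # ReLU
```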

Semantic-Relation Representation
To capture the semantic relationship between premise p and hypothesis h, we follow a widely adopted framework to get the relationship representations (Chen et al., 2017). In the following, we denote the sentences as p = [p_1, p_2, ..., p_m] and h = [h_1, h_2, ..., h_n], where p_i and h_j are words, 1 ≤ i ≤ m, 1 ≤ j ≤ n, and m and n are the lengths of premise p and hypothesis h.

Semantic Embedding
We first initialize words into embeddings based on the pre-trained word vectors GloVe (Pennington et al., 2014), and encode each sentence with a BiLSTM to obtain the semantic embeddings a^s for p and b^s for h.

Local Inference
The computation of local inference information between two sentences is based on their semantic embeddings a^s and b^s. No external knowledge is involved in this part. First, a soft alignment layer computes the similarity between words. For premise embedding a^s_i and hypothesis embedding b^s_j, their similarity is

E_ij = (a^s_i)^T b^s_j,

and all E_ij form the co-attention matrix E ∈ R^{m×n}. Next, we compute local relevance information according to E. For the premise, the relevant semantics in the hypothesis are encoded into a context vector a^c based on the co-attention matrix E and the hypothesis semantic embedding b^s:

a^c_i = Σ_{j=1}^{n} [ exp(E_ij) / Σ_{j'=1}^{n} exp(E_ij') ] b^s_j.

The context vector b^c that encodes the relevant semantics in the premise is calculated in the same way. Local inference information is then enhanced by computing the difference and the element-wise product for (a^s, a^c) and (b^s, b^c):

a^m_i = G([a^s_i ; a^c_i ; a^s_i − a^c_i ; a^s_i ⊙ a^c_i]),
b^m_j = G([b^s_j ; b^c_j ; b^s_j − b^c_j ; b^s_j ⊙ b^c_j]),

where a^m and b^m are the enhanced embeddings for p and h, and G is a non-linear function. We set it as a one-layer feed-forward neural network with ReLU as the activation function.
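The soft alignment and enhancement steps can be sketched in NumPy. This follows the standard ESIM local inference pattern (Chen et al., 2017); for brevity the sketch returns the enhanced features before the non-linear projection G, and the `local_inference` name is hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_inference(a_s, b_s):
    """ESIM-style local inference.

    a_s: (m, d) premise semantic embeddings.
    b_s: (n, d) hypothesis semantic embeddings.
    Returns the enhanced matrices a_m (m, 4d) and b_m (n, 4d).
    """
    E = a_s @ b_s.T                      # (m, n) co-attention matrix
    a_c = softmax(E, axis=1) @ b_s       # (m, d) premise context vectors
    b_c = softmax(E, axis=0).T @ a_s     # (n, d) hypothesis context vectors
    a_m = np.concatenate([a_s, a_c, a_s - a_c, a_s * a_c], axis=1)
    b_m = np.concatenate([b_s, b_c, b_s - b_c, b_s * b_c], axis=1)
    return a_m, b_m
```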

Semantic Composition
A composition layer is employed to learn the types of local inference relationship at the sentence level. The composed vectors a^v_i and b^v_j for the premise and hypothesis are computed by a BiLSTM over the enhanced embeddings. The resulting vectors a^v and b^v are fed into a pooling layer that computes both average and max pooling for the premise and the hypothesis. The semantic relationship v^s between p and h is

v^s = G^s([mean(a^v) ; max(a^v) ; mean(b^v) ; max(b^v)]),

where G^s is a feed-forward neural network with ReLU as the activation function.
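The pooling step can be sketched as follows. This is a minimal illustration of the average/max pooling and concatenation only; the BiLSTM composition and the final projection G^s are omitted, and the `compose_semantic` name is hypothetical.

```python
import numpy as np

def compose_semantic(a_v, b_v):
    """Average- and max-pool the composed vectors a_v (m, d) and
    b_v (n, d) of both sentences over the time axis, then concatenate
    the four pooled vectors into a single (4d,) feature vector."""
    return np.concatenate([a_v.mean(axis=0), a_v.max(axis=0),
                           b_v.mean(axis=0), b_v.max(axis=0)])
```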

Label Prediction
The label prediction layer is designed to determine the overall logical relationship between the two sentences. The semantic representation v^s, the knowledge representation v^k, and the correlation between them, captured by their element-wise product, are concatenated and transformed by a multilayer perceptron. The perceptron classifier predicts the label y as entailment, contradiction, or neutral.

Experiments

We evaluate our model on the following four datasets.
• SNLI Stanford Natural Language Inference (Bowman et al., 2015) is extracted from Flickr30k corpus. It is the largest corpus for NLI tasks, with more than 570k human annotated sentence pairs.
• MultiNLI Multi-Genre Natural Language Inference (Williams et al., 2018) is also a large-scale corpus containing 433k sentence pairs. In MultiNLI, the development/test sets whose genres appear in training set are referred to as "matched" dataset, and "mismatched" otherwise.
• SciTail is a small-scale dataset constructed from multiple-choice science exams and web sentences (Khot et al., 2018). It contains 24k sentence pairs and only classifies sentences into two relationships: entailment and neutral. SciTail is a difficult benchmark for NLI (Tay et al., 2018).
• BNLI is a dataset constructed based on SNLI (Glockner et al., 2018). In BNLI, the premises are taken from the SNLI training set, and hypotheses are generated by replacing a single word within the premise by a different word. Though much simpler and smaller than the SNLI dataset, the performance on BNLI is substantially worse across models trained on SNLI (Glockner et al., 2018).

Implementation Details
We use Concept Graph (Cheng et al., 2015; Wu et al., 2012) as the external knowledge graph, as it has the largest coverage of the datasets used in this paper. We limit both the path length and the number of paths to 10. We initialize word embeddings with GloVe (Pennington et al., 2014) and entity embeddings with TransE (Bordes et al., 2013). All embedding dimensions are set to 300. Dropout with rate 0.5 is applied between layers to avoid overfitting. The optimizer is Adam with batch size 32 and learning rate 0.0004. We use early stopping according to the per-epoch accuracy on the validation set.
To extract subjects, predicates, and objects of sentences, we employ their syntax trees (Rusu et al., 2007). Some datasets provide hand-annotated syntax trees; for the others, we use StanfordNLP (Qi et al., 2018) to generate them. For the subject, we employ breadth-first search to select the first descendant of the NP subtree that is a noun; subjects are selected from nodes labeled NN, NNP, NNPS, or NNS. The deepest verb descendant of the VP subtree is taken as the predicate; predicates are chosen from verbs labeled VB, VBD, VBG, VBN, VBP, or VBZ. Objects are found in three different subtrees, PP, NP, and ADJP, which are siblings of the VP subtree containing the predicate. In both NP and PP we search for the first noun, while in ADJP we treat the first adjective as the object. Finally, we stem the subjects, predicates, and objects to match entities in the knowledge graph.

Table 1 shows the results on the SNLI benchmark; we do not consider ensemble models in this paper. Our model achieves the best result. According to Glockner et al. (2018), inference on the SNLI dataset may not require much external knowledge, so results are not affected significantly by it, which explains why our model produces results similar to the baselines. Performance on MultiNLI is similar to that on SNLI for the same reason (Glockner et al., 2018): although the proposed model does not surpass the baselines by a large margin, it still achieves the best results among all models, as shown in Table 2. Performance on SciTail is shown in Table 3. Our model achieves the state-of-the-art result with a large margin of improvement over ESIM, which considers only semantic knowledge. As SciTail consists of more factual sentences than the SNLI and MultiNLI datasets (Tay et al., 2018), background knowledge plays a more important role in the inference.
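The subject/predicate/object extraction described earlier in this section can be roughly illustrated over a flat POS-tagged sentence. This is a simplification: the actual procedure searches the constituency tree (NP, VP, PP, ADJP subtrees), whereas this sketch scans a flat tag sequence, and the `extract_spo` name is hypothetical.

```python
def extract_spo(tagged):
    """Pick (subject, predicate, object) from a POS-tagged sentence
    given as a list of (word, tag) pairs.

    Subject: first noun in the sentence.  Predicate: last verb seen
    before the object (approximating the deepest verb of the VP).
    Object: first noun or adjective after the predicate.
    """
    nouns = {'NN', 'NNP', 'NNPS', 'NNS'}
    verbs = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}
    subject = predicate = obj = None
    for word, tag in tagged:
        if subject is None and tag in nouns:
            subject = word
        elif subject is not None and obj is None and tag in verbs:
            predicate = word
        elif predicate is not None and obj is None and (tag in nouns or tag == 'JJ'):
            obj = word
    return subject, predicate, obj
```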
The results show that the proposed model KGNLI can capture external knowledge and use it effectively.

Table 3: Accuracy (%) on SciTail.
Model | Accuracy
ESIM (Chen et al., 2017) | 70.6
DGEM (Khot et al., 2018) | 77.3
CAFE (Tay et al., 2018) | 83.3
KGNLI | 84.3
The BNLI dataset is employed to test knowledge usage. For a sentence pair in BNLI, only one word of the hypothesis h differs from the premise p. As a result, BNLI is highly biased towards the contradiction relation. In practice, BNLI is used as the test set, while training is performed on SNLI, MultiNLI, and SciTail. The performance difference between using BNLI and SNLI as the test set is denoted as ∆. As shown in Table 4, under this setting, the result of our model, denoted as "original setting", is similar to that of ESIM. This is because the subjects, predicates, and objects are almost the same within a BNLI sentence pair. To further test the performance of our model, we conduct a new experiment under another setting, named the "unique-word setting". For BNLI, the two differing words of a sentence pair (p, h) are extracted; an example is given in Table 5. We treat these words as keywords, remove the knowledge composition layer, and directly set the composed vector to the relation vector of these keywords. For the other datasets, we choose keywords among subjects, predicates, and objects. As indicated by Table 4, our model achieves the best performance among all the models. This experiment also shows that the proposed model can capture background knowledge and utilize it to improve performance.

Table 6 shows the results of the ablation study on the SciTail dataset, in which parts of sentences are removed in turn. The results partially validate that these sentence parts are crucial in NLI-related tasks.

Conclusion and Future work
This paper proposed a knowledge-enhanced NLI model based on knowledge graphs, which introduces background knowledge into the NLI model. For a sentence pair, the proposed model learns a knowledge-relation representation based on paths in a knowledge graph and a semantic-relation representation through a BiLSTM. These two representations are then merged by a feed-forward neural network to predict the relationship label. Experimental results validated the effectiveness of the proposed model. As future work, we aim to investigate how to identify the keywords in sentence pairs that determine their relationship.