Bridging Knowledge Gaps in Neural Entailment via Symbolic Models

Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. We focus on filling these knowledge gaps in the Science Entailment task by leveraging an external structured knowledge base (KB) of science facts. Our new architecture combines standard neural entailment models with a knowledge lookup module. To facilitate this lookup, we propose a fact-level decomposition of the hypothesis, and verify the resulting sub-facts against both the textual premise and the structured KB. Our model, NSnet, learns to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperforms a simpler combination of the two predictions by 3% and the base entailment model by 5%.


Introduction
Textual entailment, a key challenge in natural language understanding, is a sub-problem in many end tasks such as question answering and information extraction. In one of the earliest works on entailment, the PASCAL Recognizing Textual Entailment Challenge, Dagan et al. (2005) define entailment as follows: text (or premise) P entails a hypothesis H if typically a human reading P would infer that H is most likely true. They note that this informal definition is "based on (and assumes) common human understanding of language as well as common background knowledge".
While current entailment systems have achieved impressive performance by focusing on the language understanding aspect, these systems, especially recent neural models (e.g., Parikh et al., 2016; Khot et al., 2018), do not directly address the need for filling knowledge gaps by leveraging common background knowledge. Figure 1 illustrates an example of P and H from SciTail, a recent science entailment dataset (Khot et al., 2018), that highlights the challenge of knowledge gaps: sub-facts of H that aren't stated in P but are universally true. In this example, an entailment system that is strong at filling lexical gaps may align large blood vessel with major artery to help conclude that P entails H. Such a system, however, would equally well (but incorrectly) conclude that P entails a hypothetical variant H' of H where artery is replaced with vein. A typical human, on the other hand, could bring to bear a piece of background knowledge, that aorta is a major artery (not a vein), to break the tie.
P: The aorta is a large blood vessel that moves blood away from the heart to the rest of the body.
H (entailed): Aorta is the major artery carrying recently oxygenated blood away from the heart.
H' (not entailed): Aorta is the major vein carrying recently oxygenated blood away from the heart.
Figure 1: Knowledge gap: Aorta is a major artery (not a vein). Large blood vessel soft-aligns with major artery but also with major vein.
Motivated by this observation, we propose a new entailment model that combines the strengths of the latest neural entailment models with a structured knowledge base (KB) lookup module to bridge such knowledge gaps. To enable KB lookup, we use a fact-level decomposition of the hypothesis, and verify each resulting sub-fact against both the premise (using a standard entailment model) and the KB (using a structured scorer). The predictions from these two modules are combined using a multi-layer "aggregator" network. Our system, called NSnet, achieves 77.9% accuracy on SciTail, substantially improving over the baseline neural entailment model, and is comparable to the structured entailment model proposed by Khot et al. (2018).

Figure 2: Neural-symbolic learning in NSnet. The bottom layer has QA and their supporting text in SciTail, and the knowledge base (KB). The middle layer has three modules: Neural Entailment (blue), and Symbolic Matcher and Symbolic Lookup (red). The top layer takes the outputs (black and yellow) and intermediate representations from the middle modules, and is trained hierarchically with the final labels. All modules and the aggregator are jointly trained in an end-to-end fashion.

Neural-Symbolic Learning
A general solution for combining neural and symbolic modules remains a challenging open problem. As a step towards this, we present a system in the context of neural entailment that demonstrates a successful integration of the KB lookup model and simple overlap measures, opening up a path to achieve a similar integration in other models and tasks. The overall system architecture of our neural-symbolic model for textual entailment is presented in Figure 2. We describe each layer of this architecture in more detail in the following sub-sections.

Inputs
We decompose the hypothesis and identify relevant KB facts in the bottom "inputs" layer (Fig. 2).
Hypothesis Decomposition: To identify knowledge gaps, we must first identify the facts stated in the hypothesis $h = (h_1, h_2, \ldots)$. We use ClausIE (Del et al., 2013) to break $h$ into sub-facts. ClausIE tuples need not be verb-mediated, and ClausIE generates multiple tuples derived from conjunctions, leading to higher recall than alternatives such as Open IE (Banko et al., 2007).
Knowledge Base (KB): To verify these facts, we use the largest available clean knowledge base for the science domain (Dalvi et al., 2017), with 294K simple facts, as input to our system. The knowledge base contains subject-verb-object (SVO) tuples with short, one- or two-word arguments (e.g., hydrogen; is; element). Using these simple facts ensures that the KB is only used to fill the basic knowledge gaps and not to directly prove the hypothesis irrespective of the premise.

KB Retrieval: The large number of tuples in the knowledge base makes it infeasible to evaluate each hypothesis sub-fact against the entire KB. Hence, we retrieve the top-100 relevant knowledge tuples, $K$, for each sub-fact based on a simple Jaccard word overlap score.
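The retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace tokenizer, the function names (`jaccard`, `retrieve_top_k`), and the choice to score a sub-fact against all words of a KB tuple are assumptions made here for clarity.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper does not specify one.
    return text.lower().split()


def retrieve_top_k(sub_fact, kb_tuples, k=100):
    """Return the k KB tuples with the highest word overlap with the sub-fact."""
    query = tokenize(sub_fact)
    scored = [(jaccard(query, tokenize(" ".join(t))), t) for t in kb_tuples]
    # sorted() is stable, so ties keep their original KB order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


# Toy KB of (subject, verb, object) tuples, invented for illustration.
kb = [
    ("hydrogen", "is", "element"),
    ("aorta", "is", "artery"),
    ("cell", "has", "wall"),
]
top = retrieve_top_k("aorta is a major artery", kb, k=2)
```

With the toy KB, the aorta tuple shares three of five distinct words with the sub-fact and is ranked first; the hydrogen tuple shares only "is".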

Modules
We use a Neural Entailment model to compute the entailment score based on the premise, as well as two symbolic models, Symbolic Matcher and Symbolic Lookup, to compute entailment scores based on the premise and the KB respectively (middle layer in Fig. 2).
Neural Entailment: We use a simple neural entailment model, Decomposable Attention (Parikh et al., 2016), one of the state-of-the-art models on the SNLI entailment dataset (Bowman et al., 2015). However, our architecture can just as easily use any other neural entailment model. We initialize the model parameters by training it on the Science Entailment dataset. Given the sub-facts from the hypothesis, we use this model to compute an entailment score $n(h_i, p)$ from the premise to each sub-fact $h_i$.

Symbolic Matcher
In our initial experiments, we noticed that the neural entailment models would often either get distracted by similar words in the distributional space (false positives) or completely miss an exact mention of $h_i$ in a long premise (false negatives). To mitigate these errors, we define a Symbolic Matcher model that compares exact words in $h_i$ and $p$ via a simple asymmetric bag-of-words overlap score, $m(h_i, p)$. One could instead use more complex symbolic alignment methods such as integer linear programming (Khashabi et al., 2016; Khot et al., 2017).
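One plausible form of such an asymmetric overlap score is sketched below. The normalizer is an assumption: here the overlap is divided by the number of distinct sub-fact words, which makes the score asymmetric in its two arguments, but the exact definition used by the matcher is not shown in this text. The function name `overlap_score` and the whitespace tokenization are likewise illustrative.

```python
def overlap_score(sub_fact, premise):
    """Fraction of distinct sub-fact words that occur verbatim in the premise.

    Assumption: normalizing by the sub-fact vocabulary is one way to obtain
    an asymmetric bag-of-words score; the source does not give the normalizer.
    """
    h = set(sub_fact.lower().split())
    p = set(premise.lower().split())
    return len(h & p) / len(h) if h else 0.0


premise = "the aorta is a large blood vessel"
forward = overlap_score("aorta is artery", premise)   # 2 of 3 sub-fact words match
backward = overlap_score(premise, "aorta is artery")  # 2 of 7 premise words match
```

Swapping the arguments changes the score, which is what distinguishes this measure from the symmetric Jaccard similarity used for KB retrieval.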
Symbolic Lookup: This module verifies the presence of the hypothesis sub-fact $h_i$ in the retrieved KB tuples $K$, by comparing the sub-fact to each tuple and taking the maximum score. Each field in the KB tuple $kb_j$ is scored against the corresponding field in $h_i$ (e.g., subject to subject) and averaged across the fields. To compare a field, we use a simple word-overlap based Jaccard similarity score, $\mathrm{Sim}(a, b) = \frac{|a \cap b|}{|a \cup b|}$. The lookup match score for the entire sub-fact and KB fact is:
$$\mathrm{Sim}_f(h_i, kb_j) = \frac{1}{|F|} \sum_{k \in F} \mathrm{Sim}(h_i[k], kb_j[k])$$
where $F$ denotes the tuple fields (subject, verb, object), and the final lookup module score for $h_i$ is:
$$l(h_i) = \max_{kb_j \in K} \mathrm{Sim}_f(h_i, kb_j)$$
Note that the Symbolic Lookup module assesses whether a sub-fact of H is universally true. Neural models, via embeddings, are quite strong at mediating between P and H. The goal of the KB lookup module is to complement this strength, by verifying universally true sub-facts of H that may not be stated in P (e.g., "aorta is a major artery" in our motivating example).
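The lookup score described above (field-wise Jaccard similarity, averaged over the fields, then maximized over the retrieved tuples) can be sketched as follows. The function names, the representation of a sub-fact as a (subject, verb, object) tuple of strings, and the whitespace tokenization are assumptions for illustration.

```python
def jaccard(a, b):
    """Word-overlap Jaccard similarity |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def field_similarity(sub_fact, kb_tuple):
    """Average Jaccard similarity over aligned fields (subject, verb, object)."""
    sims = [
        jaccard(h_field.lower().split(), kb_field.lower().split())
        for h_field, kb_field in zip(sub_fact, kb_tuple)
    ]
    return sum(sims) / len(sims)


def lookup_score(sub_fact, retrieved_tuples):
    """Maximum field similarity over the retrieved KB tuples (0.0 if none)."""
    return max((field_similarity(sub_fact, t) for t in retrieved_tuples), default=0.0)


# Toy example based on the aorta case; the KB tuples are invented here.
sub_fact = ("aorta", "is", "major artery")
retrieved = [("aorta", "is", "artery"), ("aorta", "is", "vein")]
score = lookup_score(sub_fact, retrieved)
```

In this toy case the artery tuple scores (1 + 1 + 0.5) / 3 while the vein tuple scores (1 + 1 + 0) / 3, so the lookup prefers the artery reading, which is the tie-breaking behaviour the motivating example calls for.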

Aggregator Network
For each sub-fact $h_i$, we now have three scores: $n(h_i, p)$ from the neural model, $m(h_i, p)$ from the symbolic matcher, and $l(h_i)$ from the symbolic lookup model. The task of the Aggregator network is to combine these to produce a single entailment score. However, we found that using only the final predictions from the three modules was not effective. Inspired by recent work on skip/highway connections (He et al., 2016; Srivastava et al., 2015), we supplement these scores with intermediate, higher-dimensional representations from two of the modules.
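From the Symbolic Lookup model, we use the representation of each sub-fact, $h^{enc}_i = \mathrm{Enc}(h_i)$, obtained by averaging word embeddings (Pennington et al., 2014), and the individual similarity scores over the top-100 KB tuples, $emb_i = [\ldots, \mathrm{Sim}_f(h_i, kb_j), \ldots]$. From the neural entailment model, we use the intermediate representation of both the hypothesis sub-fact and the premise text from the final layer (before the softmax computation).
We define a hybrid layer that takes as input a simple concatenation of these representation vectors from the different modules, $in(h_i, p)$. The hybrid layer is a single-layer MLP for each sub-fact $h_i$ that outputs a sub-representation $out_i = \mathrm{MLP}(in(h_i, p))$. A compositional layer then uses a two-layer MLP over a concatenation of the hybrid layer outputs from the different sub-facts, $\{h_1, \ldots, h_I\}$, to produce the final label. Finally, we use the cross-entropy loss to train the Aggregator network jointly with the representations in the neural entailment and symbolic lookup models, in an end-to-end fashion. We refer to this entire architecture as the NSnet network.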
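The forward pass of the two aggregator layers can be sketched in plain Python as below. This is a shape-level illustration under several assumptions that are not stated in the text: ReLU activations, hybrid-layer weights shared across sub-facts, zero-padding up to a fixed maximum number of sub-facts, random untrained weights, and a toy input size of 8. The layer sizes (50 for the hybrid layer, 50 and 2 for the compositional layer, at most 5 sub-facts) follow the values reported in the appendix.

```python
import math
import random


def init_layer(rng, n_in, n_out):
    """Randomly initialized dense layer, returned as (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(sub_fact_inputs, hybrid, comp_hidden, comp_out, max_sub_facts, hybrid_size):
    """Hybrid layer per sub-fact, then a two-layer MLP over the concatenation."""
    outs = []
    for features in sub_fact_inputs[:max_sub_facts]:
        outs.extend(relu(dense(features, hybrid)))  # out_i = MLP(in(h_i, p))
    # Assumption: zero-pad when a hypothesis has fewer sub-facts than the maximum.
    outs.extend([0.0] * (max_sub_facts * hybrid_size - len(outs)))
    hidden = relu(dense(outs, comp_hidden))
    return softmax(dense(hidden, comp_out))  # distribution over the two labels


rng = random.Random(0)
INPUT_SIZE, HYBRID_SIZE, COMP_SIZE, MAX_SUB_FACTS = 8, 50, 50, 5
hybrid = init_layer(rng, INPUT_SIZE, HYBRID_SIZE)
comp_hidden = init_layer(rng, HYBRID_SIZE * MAX_SUB_FACTS, COMP_SIZE)
comp_out = init_layer(rng, COMP_SIZE, 2)
# Three sub-facts, each with a toy concatenated feature vector in(h_i, p).
sub_fact_inputs = [[rng.random() for _ in range(INPUT_SIZE)] for _ in range(3)]
probs = aggregate(sub_fact_inputs, hybrid, comp_hidden, comp_out, MAX_SUB_FACTS, HYBRID_SIZE)
```

Training with cross-entropy loss and back-propagation into the neural and lookup modules is omitted; in practice this would be written in an automatic-differentiation framework.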
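To assess the effectiveness of the aggregator network, we also use a simpler baseline model, Ensemble, that works as follows. For each sub-fact $h_i$, it combines the three module scores using a probabilistic OR, treating each score as a probability of entailment, to estimate the probability that $h_i$ is entailed, i.e., $P(h_i) = 1 - (1 - n(h_i, p))(1 - m(h_i, p))(1 - l(h_i))$. We average the probabilities from all the sub-facts to get the final entailment probability.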
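The Ensemble baseline can be sketched as follows, based on the description in the appendices (probabilistic OR for the per-sub-fact decision, then averaging with a 0.5 threshold). The function names, the use of `>=` at the threshold, and the toy scores are assumptions for illustration.

```python
def probabilistic_or(scores):
    """Probability that at least one module predicts entailment."""
    p_none = 1.0
    for s in scores:
        p_none *= 1.0 - s
    return 1.0 - p_none


def ensemble_predict(sub_fact_scores, threshold=0.5):
    """Average the per-sub-fact OR probabilities and apply a threshold.

    sub_fact_scores holds one (neural, matcher, lookup) triple per sub-fact.
    """
    probs = [probabilistic_or(triple) for triple in sub_fact_scores]
    average = sum(probs) / len(probs)
    return average, average >= threshold


# Toy scores: the first sub-fact is weakly supported by two modules,
# the second is supported by none.
average, entailed = ensemble_predict([(0.5, 0.5, 0.0), (0.0, 0.0, 0.0)])
```

Because the OR saturates as soon as any one module is confident, averaging over sub-facts is what keeps a single well-supported sub-fact from deciding the whole hypothesis.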

Experiments
We use the SciTail dataset (Khot et al., 2018) for our experiments, which contains 27K entailment examples with an 87.3%/4.8%/7.8% train/dev/test split. The premise and hypothesis in each example are natural sentences authored independently of each other and of the entailment task, which makes the dataset particularly challenging. We focused mainly on the SciTail dataset, since other crowd-sourced datasets large enough for training contain limited linguistic variation (Gururangan et al., 2018), leading to limited gains achievable via external knowledge.

Results
Table 1 summarizes the validation and test accuracies of various models on the SciTail dataset. The DecompAttn model achieves 74.3% on the test set but drops by 1.6% when the hypotheses are decomposed. The Ensemble approach uses the same hypothesis decomposition and is able to recover 2.1 percentage points by using the KB. The end-to-end NSnet network is able to further improve the score by 3.1% and is statistically significantly (at p-value 0.05) better than the baseline neural entailment model. The model is marginally better than DGEM, a graph-based entailment model proposed by the authors of the SciTail dataset.
We show significant gains over our base entailment model by using an external knowledge base, which are comparable to the gains achieved by DGEM through the use of hypothesis structure. These are orthogonal approaches, and one could replace the base DecompAttn model with DGEM or more recent models (Tay et al., 2017; Yin et al., 2018).
In Table 2, we evaluate the impact of the Symbolic Matcher and Symbolic Lookup modules on the best reported model. As we see, removing the symbolic matcher, despite its simplicity, results in a 3.2% drop. Also, the KB lookup model is able to fill some knowledge gaps, contributing 2.1% to the final score. Together, these symbolic matching models contribute 4% to the overall score.
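As a consistency check on the Table 1 figures quoted in the text, and assuming the reported changes are absolute percentage points on the test set applied in sequence, the numbers chain as follows:

```latex
\begin{align*}
\text{DecompAttn (original hypotheses)} &: 74.3 \\
\text{DecompAttn (decomposed hypotheses)} &: 74.3 - 1.6 = 72.7 \\
\text{Ensemble} &: 72.7 + 2.1 = 74.8 \\
\text{NSnet} &: 74.8 + 3.1 = 77.9
\end{align*}
```

The final value agrees with the 77.9% accuracy stated in the Introduction. Under the same assumption, NSnet is 5.2 points above the decomposed-hypothesis DecompAttn model and 3.6 points above DecompAttn on the original hypotheses.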

Qualitative Analysis
Table 3: A few randomly selected test-set examples comparing symbolic-only, neural-only, Ensemble, and NSnet inference. For the symbolic-only model, the most similar knowledge from the knowledge base is shown in parentheses. The first two examples show cases where knowledge helps fill a gap that the neural model cannot. The third example shows a case where NSnet predicts correctly while Ensemble fails.

Premise: plant cells possess a cell wall , animals never .
Hypothesis: a cell wall is found in a plant cell but not in an animal cell .
Sub-fact of hypothesis | neural only | symbolic only | Ensemble | NSnet
a cell wall is found in a plant cell but not in an animal cell | F (0.47) | T (0.07) (cell is located in animal) | T (0.50) | -

Premise: the pupil is a hole in the iris that allows light into the eye .
Hypothesis: the pupil of the eye allows light to enter .
Sub-fact of hypothesis | neural only | symbolic only | Ensemble | NSnet
the pupil of the eye allows light to enter | F (0.43) | T (0.12) (light enter eye) | T (0.50) | -

Premise: binary fission in various single-celled organisms ( left ) .
Hypothesis: binary fission is a form of cell division in prokaryotic organisms that produces identical offspring .

Related Work
... takes advantage of both systems, often called neural-symbolic learning (Garcez et al., 2015). Various neural-symbolic models have been proposed for question answering (Liang et al., 2016) and causal explanations (Kang et al., 2017). We focus on end-to-end training of these models specifically for textual entailment. Contemporaneous to this work, Chen et al. (2018) have incorporated knowledge bases within the attention and composition functions of a neural entailment model, while Kang et al. (2018) generate adversarial examples using symbolic knowledge (e.g., WordNet) to train a robust entailment model. We focused on integrating knowledge bases via a separate symbolic model to fill the knowledge gaps.

Conclusion
We proposed a new entailment model that attempts to bridge knowledge gaps in textual entailment by incorporating structured knowledge base lookup into standard neural entailment models. Our architecture, NSnet, can be trained end-to-end, and achieves a 5% improvement on SciTail over the baseline neural model used here. The methodology can be easily applied to more complex entailment models (e.g., DGEM) as the base neural entailment model. Accurately identifying the sub-facts from a hypothesis is a challenging task in itself, especially when dealing with negation. Improvements to the fact decomposition should further help improve the model.

A Model Hyper-parameters
Our hyper-parameters, tuned on the development set, are: Adam optimizer with learning rate 0.05, maximum gradient norm 5.0, batch size 32, embedding size 300, and feedforward hidden layer size 200 with dropout rate 0.1. The maximum vocabulary size is set to 30,000, but our dataset has a smaller vocabulary. The models are trained with either the original hypotheses or the sub-facts generated by ClausIE.
We test dropout ratios of 0.5, 0.75, 0.9, and 1.0, and encoders based on either GloVe averaging or an LSTM. The maximum length of a question and of a knowledge sentence is 25, and the maximum length of a supporting sentence in the SciTail dataset is 40 (see footnote 5). The hidden size of the hybrid layer is 50, and the hidden sizes of the compositional layer are 50 and 2. The maximum number of knowledge tuples per question is 100, and the maximum number of sub-questions per question is 5. The average number of sub-questions per question produced by ClausIE (Del et al., 2013) is around 3.5. The learning rate is 0.05 and the maximum gradient norm is 5.0 with the Adam optimizer, and the batch size is 32. We train our neural methods and the NSnet network for up to 25 epochs, select the best model on the validation set, and report test accuracy for that model.
Based on the grid search over the hyper-parameters, our best Ensemble model uses the EmbOver matcher on GloVe embeddings without tuple structure, probabilistic OR for hybrid decisions, and averaging with a 0.5 threshold for compositional decisions.
The best NSnet model uses the WordOver matcher on GloVe encoding with tuple structure, no dropout, and sub-question training with neural models.
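For reference, the reported settings can be collected into a single configuration object. The values below are taken from this appendix; the class name and field names are hypothetical and chosen only for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSnetConfig:
    # Field names are illustrative; values are those reported in Appendix A.
    optimizer: str = "adam"
    learning_rate: float = 0.05
    max_grad_norm: float = 5.0
    batch_size: int = 32
    embedding_size: int = 300
    feedforward_hidden_size: int = 200
    dropout_rate: float = 0.1
    max_vocab_size: int = 30_000
    max_hypothesis_len: int = 25         # questions and knowledge sentences
    max_premise_len: int = 40            # supporting sentences in SciTail
    hybrid_hidden_size: int = 50
    compositional_hidden_sizes: tuple = (50, 2)
    max_kb_tuples: int = 100             # retrieved knowledge per question
    max_sub_facts: int = 5               # sub-questions per question
    max_epochs: int = 25


config = NSnetConfig()
```

Freezing the dataclass keeps a run's settings immutable, so a grid search would create a new instance per setting, for example with `dataclasses.replace(config, dropout_rate=0.5)`.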

B Additional Experiments
For further analysis, we study the effect of different matchers with and without tuple structure, and of different encoders (see Figure 3). The left figure shows test accuracies of the symbolic (red) and NSnet (green) methods across three different lookup matchers (EmbeddingAverage, EmbeddingOverlap, WordOverlap) and whether tuple structure is considered (light) or not (dark). In most cases, EmbeddingOverlap, which takes advantage of both EmbeddingAverage and WordOverlap, ...

C Observation on Ensemble Model Design
For the Ensemble network, we evaluated both OR and AND aggregation functions and reported the best model. The use of AND is indeed intuitive. However, in addition to the empirical support for OR, the use of ClausIE to generate sub-facts makes probabilistic OR a somewhat better fit, for the following reason. ClausIE tries to generate every possible proposition in a sentence, erring on the side of higher recall at the cost of lower precision. This makes it unlikely that one will find good support for all generated sub-facts, which results in poor performance when using AND aggregation.


Table 3 shows a few randomly selected examples from the test set. The first two examples show cases where the symbolic models help change the neural alignment's prediction (F) to the correct prediction (T) in our proposed Ensemble or NSnet models. The third example shows a case where the NSnet architecture learns a better combination of the neural and symbolic methods to correctly identify the entailment relation while Ensemble fails to do so.

Footnote 5: The science entailment dataset has long premise sentences.

Table 1: Entailment accuracies on the SciTail dataset. NSnet substantially improves upon its base model and marginally outperforms DGEM.

Table 2: Ablation: Both Symbolic Lookup and Symbolic Matcher have significant impact on NSnet performance.