End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions

This work deals with SciTail, a natural entailment challenge derived from a multi-choice question answering problem. The premises and hypotheses in SciTail were generated with no awareness of each other, and were not specifically aimed at the entailment task. This makes it more challenging than other entailment datasets and more directly useful to the end task, question answering. We propose DEISTE (deep explorations of inter-sentence interactions for textual entailment) for this entailment task. Given word-to-word interactions between the premise-hypothesis pair (P, H), DEISTE consists of: (i) a parameter-dynamic convolution to make important words in P and H play a dominant role in the learnt representations; and (ii) a position-aware attentive convolution to encode the representation and position information of the aligned word pairs. Experiments show that DEISTE gets ≈5% improvement over the prior state of the art and that DEISTE pretrained on SciTail generalizes well on RTE-5.

Table 1: Examples of four premises for the hypothesis "Earth rotates on its axis once times in one day" in the SCITAIL dataset. Right column (label): "1" means entail, "0" otherwise.

Premise | Label
It rotates on its axis once every 60 Earth days. | 0
Earth orbits Sun, and rotates once per day about axis. | 1
Large entailment benchmarks such as SNLI have generated much work based on deep neural networks due to their large size. However, these benchmarks were mostly derived independently of any NLP problems. Therefore, the premise-hypothesis pairs were composed under the constraint of predefined rules and the language skills of humans. As a result, while top-performing systems push forward the state of the art, they do not necessarily learn to support language inferences that emerge commonly and naturally in real NLP problems.
In this work, we study SCITAIL (Khot et al., 2018), a challenging, end-task oriented entailment benchmark. SCITAIL is reformatted from a multi-choice question answering problem. All hypotheses H were obtained by rewriting (question, correct answer) pairs; all premises P are relevant web sentences collected by an information retrieval (IR) method; the (P, H) pairs were then annotated via crowdsourcing. Table 1 shows examples. By this construction, a substantial performance gain on SCITAIL can be turned into better QA performance (Khot et al., 2018). Khot et al. (2018) report that SCITAIL challenges neural entailment models that show outstanding performance on SNLI, e.g., the Decomposable Attention Model (Parikh et al., 2016) and Enhanced LSTM (Chen et al., 2017).
We propose DEISTE for SCITAIL. Given the word-to-word inter-sentence interactions between a premise-hypothesis pair (P, H), DEISTE explores three strategies: (a) which words in P and H are important and should dominate the learnt representations; (b) for any word in one of (P, H), how to find the best aligned word in the other sentence, so that we know their connection is indicative of the final decision; (c) for a window of words in P or H, whether the locations of their best aligned words in the other sentence provide clues. As Figure 1 illustrates, the premise "in this incident, the cop (C) shot (S) the thief (T)" is more likely to entail the hypothesis "Ĉ . . . Ŝ . . . T̂" than "T̂ . . . Ŝ . . . Ĉ", where X̂ is the word that best matches X.

Figure 1: The motivation of considering alignment positions in TE. The same color in (premise, hypothesis) means the two words are best aligned.

Our model DEISTE is implemented in a convolutional neural architecture (LeCun et al., 1998). Specifically, DEISTE consists of (i) a parameter-dynamic convolution for exploration strategy (a); and (ii) a position-aware attentive convolution for exploration strategies (b) and (c). In experiments, DEISTE outperforms prior top systems by ≈5%. Perhaps even more interestingly, the model pretrained on SCITAIL generalizes well on RTE-5.

Method
To start, a sentence S (S ∈ {P, H}) is represented as a sequence of n_s hidden states. Figure 2 depicts the basic principle of DEISTE in modeling the premise-hypothesis pair (P, H) with feature maps P and H, respectively. First, P and H interact at a fine granularity: comparing every pair (p_i, h_j) yields an interaction matrix I ∈ R^{n_p×n_h}. We now elaborate DEISTE's exploration strategies (a), (b) and (c) over the interaction results I.
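As a concrete illustration, the interaction matrix I can be sketched as follows. The text does not pin down the comparison function, so cosine similarity is assumed here; all function and variable names are illustrative, not from the paper.

```python
import numpy as np

def interaction_matrix(P, H):
    """Pairwise interaction scores I in R^{n_p x n_h}.

    P: (n_p, d) hidden states of the premise.
    H: (n_h, d) hidden states of the hypothesis.
    The comparison function is assumed to be cosine similarity;
    the text only states that every pair (p_i, h_j) is compared.
    """
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    return Pn @ Hn.T  # I[i, j] = cos(p_i, h_j)

# toy usage
P = np.random.randn(5, 8)   # premise with n_p = 5 words, d = 8
H = np.random.randn(7, 8)   # hypothesis with n_h = 7 words
I = interaction_matrix(P, H)
print(I.shape)  # (5, 7)
```

Any similarity (e.g., dot product) could be substituted without changing the rest of the pipeline, since the later strategies only read rows of I.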

Parameter-dynamic convolution
Intuitively, important words should be expressed more intensively than other words in the learnt representation of a sentence. However, the importance of words within a specific sentence does not depend on the sentence alone. E.g., Yin and Schütze (2017b) found that in question-aware answer sentence selection, well-matched words are more important, while in textual entailment, hardly-matched words are more important.
In this work, we try to make the semantics of those important words dominate in the output representations of a convolution encoder.
Given a local window of hidden states in the feature map P, e.g., three adjacent ones p_{i−1}, p_i and p_{i+1}, a conventional convolution learns a higher-level representation r for this trigram:

  r = tanh(W · [p_{i−1}, p_i, p_{i+1}] + b)    (2)

For simplicity, we neglect the bias term b and split the multiplication of the big matrix, W · [p_{i−1}, p_i, p_{i+1}], into three parts; then r can be formulated as:

  r = tanh(p̂_{i−1} ⊕ p̂_i ⊕ p̂_{i+1})

where ⊕ means element-wise addition; W_{−1}, W_0, W_{+1} ∈ R^{d×d}, and their concatenation equals the W in Equation 2; p̂_i is set to W_0 · p_i, so p̂_i can be seen as the projection of p_i into a new space by the parameters W_0 (analogously, p̂_{i−1} = W_{−1} · p_{i−1} and p̂_{i+1} = W_{+1} · p_{i+1}). In this formulation, the three projected outputs contribute equally in the addition.

The convolution encoder shares the parameters [W_{−1}, W_0, W_{+1}] across all trigrams, so we cannot expect those parameters to express the importance of p̂_{i−1}, p̂_i or p̂_{i+1} in the output representation r. Instead, we formulate the idea as follows:

  m_i = tanh(α_{i−1} · p̂_{i−1} ⊕ α_i · p̂_i ⊕ α_{i+1} · p̂_{i+1})

in which the three scalars α_{i−1}, α_i and α_{i+1} indicate the importance scores of p̂_{i−1}, p̂_i and p̂_{i+1}, respectively; in our work, they are derived from the interaction scores in I. Since α_i · p̂_i = α_i · W_0 · p_i = W_{0,i} · p_i, the original shared parameter W_0 is mapped to a dynamic parameter W_{0,i}, which is specific to the input p_i. We refer to this as parameter-dynamic convolution, which enables our system DEISTE to highlight important words in higher-level representations.
Finally, a max-pooling layer is stacked over {m_i} to get the representation for the pair (P, H), denoted r_dyn.
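A minimal NumPy sketch of the parameter-dynamic convolution follows. The exact derivation of the importance scores α from I is not fully specified in this excerpt, so they are passed in as given; the zero-padding at the sentence boundaries and all names are our assumptions.

```python
import numpy as np

def parameter_dynamic_conv(P, alpha, Wm1, W0, Wp1):
    """Sketch of the parameter-dynamic convolution.

    P:     (n_p, d) premise hidden states.
    alpha: (n_p,) importance score per position (assumed precomputed
           from the interaction matrix I).
    Wm1, W0, Wp1: (d, d) the three slices of the filter matrix W.
    Returns r_dyn, the max-pooled (d,) representation over all m_i.
    """
    n_p, d = P.shape
    # zero-pad so every position has a left and right neighbor
    Pp = np.vstack([np.zeros(d), P, np.zeros(d)])
    ap = np.concatenate([[0.0], alpha, [0.0]])
    ms = []
    for i in range(1, n_p + 1):
        # m_i = tanh(a_{i-1} W_{-1} p_{i-1} + a_i W_0 p_i + a_{i+1} W_{+1} p_{i+1})
        m = np.tanh(ap[i - 1] * Wm1 @ Pp[i - 1]
                    + ap[i] * W0 @ Pp[i]
                    + ap[i + 1] * Wp1 @ Pp[i + 1])
        ms.append(m)
    return np.max(np.stack(ms), axis=0)  # max-pooling over positions

d, n_p = 8, 5
rng = np.random.default_rng(0)
P = rng.standard_normal((n_p, d))
alpha = rng.random(n_p)
r_dyn = parameter_dynamic_conv(P, alpha,
                               *(rng.standard_normal((d, d)) for _ in range(3)))
print(r_dyn.shape)  # (8,)
```

Note how scaling p̂_i by α_i is equivalent to using the position-specific filter W_{0,i} = α_i W_0, which is the sense in which the filter parameters become "dynamic".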

Position-aware attentive convolution
Our position-aware attentive convolution, shown in Figure 3, aims to encode the representations as well as the positions of the best word alignments.
Representation. Given the interaction scores in I, the representation p̃_i of all soft matches for hidden state p_i in P is the weighted average of all hidden states in H:

  p̃_i = Σ_j softmax(I_{i,:})_j · h_j

For p_i, we first retrieve the index x_i of the best-matched word in H by:

  x_i = argmax_j I_{i,j}

then embed the discrete {x_i} into a randomly-initialized continuous space:

  z_i = M[x_i]

where M ∈ R^{n_h×d_m}; n_h is the length of H and d_m is the dimensionality of the position embeddings. Now, at the three positions [i−1, i, i+1] in P, the original hidden states [p_{i−1}, p_i, p_{i+1}] are concatenated vector-wise with the position embeddings [z_{i−1}, z_i, z_{i+1}], giving a new sequence of hidden states [c_{i−1}, c_i, c_{i+1}]. As a result, a position i in P has hidden state c_i, left context c_{i−1}, right context c_{i+1} and the representation of the soft-aligned words in H, i.e., p̃_i. A convolution then works at position i in P as:

  n_i = tanh(W · [c_{i−1}, c_i, c_{i+1}, p̃_i] + b)

As Figure 3 shows, the position-aware attentive convolution finally stacks a standard max-pooling layer over {n_i} to get the representation for the pair (P, H), denoted r_pos.
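The steps above can be sketched in NumPy as follows. The softmax normalization of I and the filter input [c_{i−1}, c_i, c_{i+1}, p̃_i] are assumptions consistent with the surrounding prose ("weighted average" and the listed inputs at position i); all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_aware_attentive_conv(P, H, I, M, W, b):
    """Sketch of the position-aware attentive convolution.

    P: (n_p, d), H: (n_h, d), I: (n_p, n_h) interaction scores.
    M: (n_h, d_m) randomly initialized position embeddings.
    W: (d_out, 3*(d + d_m) + d), b: (d_out,) filter parameters,
    assumed to act on [c_{i-1}, c_i, c_{i+1}, p~_i].
    """
    n_p, d = P.shape
    ptilde = softmax(I, axis=1) @ H          # soft matches p~_i
    x = I.argmax(axis=1)                     # index x_i of best match in H
    C = np.hstack([P, M[x]])                 # c_i = [p_i; z_i]
    Cp = np.vstack([np.zeros(C.shape[1]), C, np.zeros(C.shape[1])])
    ns = []
    for i in range(1, n_p + 1):
        inp = np.concatenate([Cp[i - 1], Cp[i], Cp[i + 1], ptilde[i - 1]])
        ns.append(np.tanh(W @ inp + b))
    return np.max(np.stack(ns), axis=0)      # r_pos via max-pooling

rng = np.random.default_rng(1)
d, d_m, d_out, n_p, n_h = 8, 4, 10, 5, 7
P, H = rng.standard_normal((n_p, d)), rng.standard_normal((n_h, d))
I = P @ H.T
M = rng.standard_normal((n_h, d_m))
W = rng.standard_normal((d_out, 3 * (d + d_m) + d))
r_pos = position_aware_attentive_conv(P, H, I, M, W, np.zeros(d_out))
print(r_pos.shape)  # (10,)
```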
Overall, DEISTE generates a representation r_dyn through the parameter-dynamic convolution and a representation r_pos through the position-aware attentive convolution. Finally, the concatenation [r_dyn, r_pos] is fed to a logistic regression classifier.

Table 3: DEISTE vs. baselines on SNLI. DEISTE_SCITAIL has exactly the same system layout and hyperparameters as the one reported on SCITAIL in Table 2; DEISTE_tune tunes hyperparameters on the SNLI dev set. State-of-the-art refers to (Peters et al., 2018). Ensemble results are not considered.
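The final classification step described above can be sketched as a concatenation followed by a sigmoid scorer (a stand-in for the binary case of the logistic regression classifier; w and b are illustrative, untrained parameters):

```python
import numpy as np

def deiste_score(r_dyn, r_pos, w, b):
    """Concatenate the two representations and apply logistic
    regression (binary sketch; w, b are classifier parameters)."""
    r = np.concatenate([r_dyn, r_pos])       # [r_dyn, r_pos]
    return 1.0 / (1.0 + np.exp(-(w @ r + b)))  # P(entail)

r_dyn, r_pos = np.ones(4), np.zeros(6)
p = deiste_score(r_dyn, r_pos, np.zeros(10), 0.0)
print(p)  # 0.5
```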
Training setup. All words are initialized by 300D Word2Vec (Mikolov et al., 2013) embeddings, and are fine-tuned during training. The whole system is trained by AdaGrad (Duchi et al., 2011). Other hyperparameter values include: learning rate 0.01, d m =50 for position embeddings M, hidden size 300, batch size 50, filter width 3.
Baselines. (i) Decomposable Attention Model (Decomp-Att) (Parikh et al., 2016): develops attention mechanisms to decompose the problem into subproblems that can be solved in parallel. (ii) Enhanced LSTM (ESIM) (Chen et al., 2017): enhances an LSTM by encoding syntax and semantics from parsing information. (iii) Ngram Overlap: an overlap baseline, considering unigrams, one-skip bigrams and one-skip trigrams. (iv) DGEM (Khot et al., 2018): a decomposed graph entailment model, the current state of the art. (v) AttentiveConvNet (Yin and Schütze, 2017a): our top-performing textual entailment system on the SNLI dataset, equipped with an RNN-style attention mechanism in convolution. In addition, to check whether SCITAIL can be easily resolved by features from only premises or only hypotheses (like the problem of SNLI shown by Gururangan et al. (2018)), we put a vanilla CNN (convolution & max-pooling) over merely the hypothesis or the premise to derive the pair label.

Table 2 presents results on SCITAIL. (i) Our model DEISTE has a big improvement (∼5%) over DGEM, the best baseline. Interestingly, AttentiveConvNet performs very competitively, surpassing DGEM by 0.8% on test. These two results show the effectiveness of attentive convolution. DEISTE, equipped with a parameter-dynamic convolution and a more advanced position-aware attentive convolution, clearly gets a big plus. (ii) The ablation shows that all three aspects we explore from the inter-sentence interactions contribute; "position" encoding is less important than "dyn-conv" and "representation". Without "representation", the system performs much worse. This is in line with the result of the AttentiveConvNet baseline.
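For concreteness, here is one plausible reading of the Ngram Overlap baseline, with "one-skip" taken to mean at most one skipped token between gram elements and the overlap measured as the fraction of hypothesis n-grams found in the premise (both are our assumptions; the paper does not spell out this feature design):

```python
def skip_ngrams(tokens, n, skip):
    """All n-grams allowing up to `skip` skipped tokens between
    adjacent gram elements (our reading of "one-skip" n-grams)."""
    grams = set()
    def rec(start, gram):
        if len(gram) == n:
            grams.add(tuple(gram))
            return
        for nxt in range(start, min(start + skip + 1, len(tokens))):
            rec(nxt + 1, gram + [tokens[nxt]])
    for i in range(len(tokens)):
        rec(i + 1, [tokens[i]])
    return grams

def overlap_features(premise, hypothesis):
    """Unigram / one-skip bigram / one-skip trigram overlap of the
    hypothesis with the premise (recall-style proxy features)."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    feats = []
    for n in (1, 2, 3):
        gp, gh = skip_ngrams(p, n, 1), skip_ngrams(h, n, 1)
        feats.append(len(gp & gh) / max(len(gh), 1))
    return feats

f = overlap_features("Earth rotates on its axis once per day",
                     "Earth rotates on its axis")
print(f)
```

Since the hypothesis here is a contiguous prefix of the premise, all three overlap features evaluate to 1.0.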
To further study the systems and datasets, Table 3 gives the performance of DEISTE and the baselines on SNLI. We see that DEISTE achieves competitive performance on SNLI.
Comparing Tables 2 and 3, the baselines "hypothesis only" and "premise only" show similar yet distinct phenomena on SCITAIL and SNLI. Both SNLI and SCITAIL can be handled relatively well by looking at only one of {premise, hypothesis}: "premise only" gets 73.4% accuracy on SCITAIL, even higher than two more complicated baselines (ESIM-600D and Decomp-Att), and "hypothesis only" gets 68.7% accuracy on SNLI, more than 30% higher than the "majority" and "premise only" baselines. Notice the contrast: SNLI "prefers" the hypothesis, SCITAIL "prefers" the premise. For SNLI, this is not surprising, as the crowdworkers tend to construct the hypotheses in SNLI by some regular rules (Gururangan et al., 2018). The phenomenon in SCITAIL is left to explore in future work.
Error Analysis. Table 4 gives examples of errors. We explain them as follows.
Language conventions: The pair #1 uses a dash "-" to indicate a definition sentence for "Front"; the pair #2 uses "A (or B)" to denote the equivalence between A and B. This challenge is expected to be handled by rules (Iftene and Moruz, 2009).

Ambiguity: The pair #3 appears to pose a similar challenge to the pair #2. We guess the annotators treated "· · · a vertebral column or backbone" and "· · · the backbone (or vertebral column)" as the same convention, which may be debatable.

Complex discourse relation: The premise in the pair #4 has an "or" structure. In the pair #5, "a molecule made of · · ·" defines the concept "Ethane" instead of "hydrocarbon". Both cases require the model to comprehend the discourse relation.

Table 4: Examples of errors. "G/P": gold label / predicted label.
1. (P) Front - The boundary between two different air masses. (H) In weather terms, the boundary between two air masses is called front. [G/P: 1/0; challenge: language conventions]
2. (P) . . . the notochord forms the backbone (or vertebral column). (H) Backbone is another name for the vertebral column. [G/P: 1/0; challenge: language conventions]
3. (P) · · · animals with a vertebral column or backbone and animals without one. (H) Backbone is another name for the vertebral column. [G/P: 1/0; challenge: ambiguity]
4. (P) Heterotrophs get energy and carbon from living plants or animals (consumers) or from dead organic matter (decomposers). (H) Mushrooms get their energy from decomposing dead organisms. [G/P: 0/1; challenge: discourse relation]
5. (P) Ethane is a simple hydrocarbon, a molecule made of two carbon and six hydrogen atoms. (H) Hydrocarbons are made of one carbon and four hydrogen atoms. [G/P: 0/1; challenge: discourse relation]
6. (P) . . . the SI unit . . . for force is the Newton (N) and is defined as (kg·m/s^−2). (H) Newton (N) is the SI unit for weight. [G/P: 0/1; challenge: beyond text]
Knowledge beyond text: The main challenge in the pair #6 is to distinguish between "weight" and "force", which requires physical knowledge that is beyond the text and beyond the expressivity of word embeddings.

Transfer to RTE-5. One main motivation for exploring the SCITAIL problem is that it is an end-task oriented TE task. A natural question is thus how well the trained model can be transferred to other end-task oriented TE tasks. In Table 5, we take the models pretrained on SCITAIL and SNLI and test them on RTE-5. Clearly, the model pretrained on SNLI has not learned anything useful for RTE-5: its performance of 46.0% is even worse than the majority baseline. The model pretrained on SCITAIL, in contrast, demonstrates much more promising generalization: 60.2% vs. 46.0%.

Related Work
Automatically learning inter-sentence word-to-word interactions or alignments was first studied in recurrent neural networks (Elman, 1990). Rocktäschel et al. (2016) employ neural word-to-word attention for the SNLI task. Wang and Jiang (2016) propose match-LSTM, an extension of the attention mechanism in (Rocktäschel et al., 2016), with more fine-grained matching and accumulation. Cheng et al. (2016) present a new LSTM equipped with a memory tape. Other work following this attentive-matching idea includes the Bilateral Multi-Perspective Matching model (Wang et al., 2017) and Enhanced LSTM (Chen et al., 2016), etc.
In addition, convolutional neural networks (LeCun et al., 1998), equipped with attention mechanisms, also perform competitively in TE. Yin et al. (2016) implement attention in the pooling phase so that important hidden states are pooled with higher probabilities. Yin and Schütze (2017a) further develop the attention idea in CNNs, so that an RNN-style attention mechanism is integrated into the convolution filters. This is similar to our position-aware attentive convolution. We instead explore a way to make use of the position information of alignments for reasoning.
Attention mechanisms in both RNNs and CNNs make use of sentence interactions. Our work achieves a deep exploration of those interactions, in order to guide representation learning in TE.

Summary
This work proposed DEISTE to deal with an end-task oriented textual entailment task, SCITAIL. Our model enables a comprehensive utilization of inter-sentence interactions. DEISTE outperforms competitive systems by big margins.