Asynchronous Deep Interaction Network for Natural Language Inference

Natural language inference (NLI) aims to predict whether a hypothesis sentence can be inferred from a premise sentence. Existing methods typically frame the reasoning problem as a semantic matching task, in which both sentences are encoded and interact symmetrically and in parallel. However, in the process of reasoning, the roles of the two sentences are clearly different, and the sentence pairs for NLI are asymmetrical corpora. In this paper, we propose an asynchronous deep interaction network (ADIN) to complete the task. ADIN is a neural network structure stacked with multiple inference sub-layers, and each sub-layer consists of two local inference modules arranged in an asymmetrical manner. Different from previous methods, this model deconstructs the reasoning process and implements asynchronous and multi-step reasoning. Experimental results show that ADIN achieves competitive performance and outperforms strong baselines on three popular benchmarks: SNLI, MultiNLI, and SciTail.


Introduction
Natural language inference (NLI) is a pivotal and fundamental task in natural language understanding and artificial intelligence. The goal of NLI is to predict whether a premise sentence can infer another hypothesis sentence. As illustrated in Table 1, logical relationships between the two sentences include entailment (if the premise is true, then the hypothesis must be true), contradiction (if the premise is true, then the hypothesis must be false), and neutral (neither entailment nor contradiction).
As a core task, NLI has been studied from various aspects by conventional approaches (MacCartney and Manning, 2008; Heilman and Smith, 2010). Thanks to the release of the largest publicly available corpus, the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015), neural network-based models have also been successfully used for this task (Parikh et al., 2016; Chen et al., 2016; Tay et al., 2018; Duan et al., 2018). These methods typically treat the premise sentence and the hypothesis sentence equally, learn an alignment of sub-phrases in both sentences symmetrically and in parallel, and fuse local information to make a global decision at the sentence level. They all frame the inference problem as a semantic matching task and ignore the reasoning process.
However, different from a simple semantic matching task, reasoning should be asynchronous and fully interpretable. Moreover, the sentence pairs for NLI are asymmetrical corpora, i.e., $I(a, b) \neq I(b, a)$. Considering the first example in Table 1, the premise sentence can infer the hypothesis sentence, but the hypothesis sentence cannot infer the premise sentence. The inference process intuitively needs to consider the relationship between the two sentences in sequential order. According to the actual inference process, we argue that the model should first obtain the inferential information to model the hypothesis sentence, based on the premise sentence, and then model the premise sentence, based on the new representation of the hypothesis sentence.
Figure 1: The overall view of our model. The left part is the main framework of this work. The dashed lines refer to the copy operation. The asynchronous inference layer is stacked with N inference sub-layers. In the first sub-layer of the asynchronous inference layer, the input is from the representation layer. Subsequently, the input of each sub-layer comes from the previous sub-layer. The right part is the detailed structure of the local inference module, taking $\hat{h}^b_j$ as an example.
In this paper, we propose an asynchronous deep interaction network (ADIN) to achieve the reasoning. This model is stacked with multiple inference sub-layers to implement the multi-step reasoning, and each sub-layer consists of two local inference modules in an asymmetrical manner to simulate the asynchronous and interpretable reasoning process. In a local inference module, we update the sentence representation by using the local inference information, based on the attention of the other sentence. Lastly, we combine the inference information between the two sentences to make a global decision.
To demonstrate the effectiveness of our model, we evaluate it on three popular benchmarks: SNLI, MultiNLI, and SciTail. The experimental results on these three data sets reveal that our method achieves competitive performance.
The main contributions of this work can be summarized as follows:
• We break away from the matching architecture, which interacts with the information between the two sentences for alignment, and propose an asynchronous deep interaction network to achieve asynchronous and multi-step reasoning.
• We deconstruct the reasoning process between the two sentences, and the process can be analyzed step-by-step.
• The experimental results on three highly competitive benchmark datasets demonstrate that our model can achieve better performance than other strong baselines.

Approach
We define natural language inference as a classification task that predicts the relation y ∈ Y for a given pair of sentences, where Y = {entailment, contradiction, neutral}. In this work, we propose an asynchronous deep interaction network (ADIN) to complete this task. The overall architecture of the model is illustrated in the left part of Figure 1.
Our sentence inference architecture, ADIN, is composed of the following three components: (1) the information representation layer converts the two sentences into semantic representations; (2) the asynchronous inference layer produces new representations for the two sentences, based on the inference information; and (3) the interaction and prediction layer determines the overall inference relationship between a premise and a hypothesis.

Local Inference Module
Given two natural sentences a and b, $H^a = \{h^a_i \mid h^a_i \in \mathbb{R}^k, i = 1, 2, ..., m\}$ and $H^b = \{h^b_j \mid h^b_j \in \mathbb{R}^k, j = 1, 2, ..., n\}$ denote their k-dimensional representations, respectively, where m and n denote the lengths of the two sentences. Here, we implement a general reasoning process in which the module captures the relevance between the two sentences and then incorporates the inferential information into the new representation of sentence b, based on sentence a. First, we compute a co-attention matrix $E \in \mathbb{R}^{m \times n}$ to capture the relevance between the two sentences; each element $E_{i,j} \in \mathbb{R}$ indicates the relevance between the i-th word of sentence a and the j-th word of sentence b. Formally, the co-attention matrix could be computed as: where $W \in \mathbb{R}^{s \times k}$, $P \in \mathbb{R}^s$, and $\odot$ denotes the element-wise product operation. Then, we get a-guided attentive vectors for sentence b: In order to enhance the interaction further, we combine the original vector and the a-guided attentive vector for sentence b. More formally: where [·; ·; ·; ·] refers to the concatenation operation. In Equation 4, we first calculate the difference and the element-wise product of the original vector $h^b_j$ and its a-guided attentive vector. We get the new representation containing a-guided inferential information for sentence b: where LayerNorm(·) is layer normalization (Ba et al., 2016). The result $\hat{H}^b$ is a 2D tensor that has the same shape as $H^b$, and we refer to the whole inferential module as: As described, the inferential module captures the relevance between the two sentences and incorporates the inferential information into the new representation of sentence b, based on sentence a.
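The steps of the local inference module can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' exact formulation: the tanh nonlinearity inside the co-attention score, the softmax normalization over the words of sentence a, and the projection matrix `U` with ReLU that maps the concatenated 4k-dimensional vector back to k dimensions are all assumptions, and every function and variable name is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def local_inference(H_a, H_b, W, P, U):
    """Return an a-guided representation of sentence b with the same shape as H_b.

    H_a: (m, k), H_b: (n, k), W: (s, k), P: (s,), U: (4k, k).
    """
    # Co-attention matrix E (m, n): one score per word pair, computed from the
    # element-wise product of the two word vectors (tanh is an assumption).
    pair = H_a[:, None, :] * H_b[None, :, :]           # (m, n, k)
    E = np.tanh(pair @ W.T) @ P                         # (m, n)

    # Softmax over the words of a (axis 0), so that each column sums to 1.
    E = E - E.max(axis=0, keepdims=True)                # numerical stability
    alpha = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True)

    # a-guided attentive vector for every word of b: weighted sum of rows of H_a.
    H_b_att = alpha.T @ H_a                             # (n, k)

    # Concatenate original, attentive, difference, and element-wise product.
    fused = np.concatenate(
        [H_b, H_b_att, H_b - H_b_att, H_b * H_b_att], axis=-1
    )                                                   # (n, 4k)

    # Project back to k dimensions (assumed) and apply layer normalization.
    return layer_norm(np.maximum(fused @ U, 0.0))       # (n, k)

rng = np.random.default_rng(0)
m, n, k, s = 5, 7, 8, 6
H_a, H_b = rng.normal(size=(m, k)), rng.normal(size=(n, k))
W, P, U = rng.normal(size=(s, k)), rng.normal(size=s), rng.normal(size=(4 * k, k))
H_b_new = local_inference(H_a, H_b, W, P, U)
print(H_b_new.shape)  # (7, 8)
```

The output keeps the shape of `H_b`, which is what allows the module to be applied repeatedly and in either direction.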

Information Representation Layer
The information representation layer converts each word or phrase in the sentences into a vector representation and constructs the representation matrix for the sentences. We combine multi-level features as the sentence representation. Each token is represented as a vector by using pre-trained word embeddings such as GloVe (Pennington et al., 2014), word2vec (Mikolov et al., 2013), and fastText (Joulin et al., 2016). More syntactical and lexical information can also be incorporated into the feature vector. For ADIN, we use a concatenation of word embedding, character embedding, and syntactical features as the sentence representation. The word embedding is obtained by mapping each token to a high-dimensional vector space with pre-trained word vectors (300D GloVe 840B), and the word embedding is updated during training. Character-level embedding can alleviate out-of-vocabulary (OOV) problems and capture helpful morphological information. As in Kim et al. (2016) and Lee et al. (2016), we filter the character embedding with a 1D convolution kernel. The character convolutional feature maps are then max-pooled over the time dimension for each token to obtain a vector.
As in , the syntactical features consist of one-hot part-of-speech (POS) tagging feature and binary exact match (EM) feature. For one sentence, the EM value is activated if the same word is found in the other sentence.
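As a small illustration of the binary exact match feature, the sketch below flags every token that also occurs in the other sentence. It assumes whitespace tokenization of lowercased text and exact string equality (no stemming or lemmatization); the function name is hypothetical.

```python
def exact_match(sentence, other):
    """Return one 0/1 flag per token: 1 if the token also occurs in `other`."""
    other_vocab = set(other)
    return [1 if token in other_vocab else 0 for token in sentence]

premise = "a dog is jumping for a frisbee in the snow".split()
hypothesis = "an animal is playing with a plastic toy".split()

em_premise = exact_match(premise, hypothesis)
em_hypothesis = exact_match(hypothesis, premise)
print(em_premise)     # [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(em_hypothesis)  # [0, 0, 1, 0, 0, 1, 0, 0]
```

Each flag would then be appended to the feature vector of the corresponding token.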
Next, ADIN adopts a bidirectional Long Short-Term Memory network (Bi-LSTM) (Graves and Schmidhuber, 2005) to model the internal temporal interaction in both directions of the sentences. Consider a premise sentence p and a hypothesis sentence q, for which we have obtained the multi-level feature representations. Suppose the lengths of p and q are m and n, respectively. These multi-level feature representations are then passed to a Bi-LSTM encoder to obtain the context-dependent hidden state matrices, i.e., $H^p = \{h^p_i \mid h^p_i \in \mathbb{R}^d, i = 1, 2, ..., m\}$ and $H^q = \{h^q_j \mid h^q_j \in \mathbb{R}^d, j = 1, 2, ..., n\}$, where d is the dimension of the Bi-LSTM's hidden state.

Asynchronous Inference Layer
Recently, along with the development of deep learning methods, some neural attention-based models have also been successfully used for NLI (Rocktäschel et al., 2015; Parikh et al., 2016; Duan et al., 2018). However, these methods typically frame the inference problem as a semantic matching task and ignore the reasoning process: the premise sentence and the hypothesis sentence are encoded and interact symmetrically and in parallel.
In this paper, we utilize the local inference module to deconstruct the reasoning process and achieve asynchronous and multi-step reasoning for NLI. To model the multi-step reasoning habit, this model is stacked with N inference sub-layers that capture the logical relationship between the two sentences step by step. In each inference sub-layer, two inferential modules perform two asynchronous inference processes, respectively.
Concretely, in the t-th inference sub-layer, given the representations of the two sentences computed in the previous sub-layer, we first obtain the inferential information to update the representation of the hypothesis sentence, based on the premise sentence. Next, the model incorporates the inferential information into the premise sentence, based on the new representation of the hypothesis sentence.
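The order of the two updates inside each sub-layer can be sketched as below. The function `infer` is only a simplified stand-in for the local inference module (plain dot-product attention with a residual connection and layer normalization), and the sketch leaves out the Bi-LSTM that the ablation study mentions inside the asynchronous inference layer. All names are hypothetical.

```python
import numpy as np

def infer(H_src, H_tgt):
    """Stand-in local inference module: update H_tgt using attention over H_src."""
    scores = H_src @ H_tgt.T                                  # (len_src, len_tgt)
    scores = scores - scores.max(axis=0, keepdims=True)       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    updated = H_tgt + alpha.T @ H_src                         # (len_tgt, k)
    mean = updated.mean(axis=-1, keepdims=True)
    std = updated.std(axis=-1, keepdims=True)
    return (updated - mean) / (std + 1e-6)

def asynchronous_inference(H_p, H_q, num_sublayers=3):
    """Stack sub-layers; each one updates the hypothesis first, then the premise."""
    V_p, V_q = H_p, H_q
    for _ in range(num_sublayers):
        V_q = infer(V_p, V_q)   # step 1: premise -> hypothesis
        V_p = infer(V_q, V_p)   # step 2: updated hypothesis -> premise
    return V_p, V_q

rng = np.random.default_rng(0)
H_p = rng.normal(size=(10, 16))   # premise: 10 words, 16-dimensional states
H_q = rng.normal(size=(8, 16))    # hypothesis: 8 words
V_p, V_q = asynchronous_inference(H_p, H_q)
print(V_p.shape, V_q.shape)  # (10, 16) (8, 16)
```

The key point is that step 2 consumes the output of step 1, so the two sentences are never updated in parallel from the same inputs.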

Interaction and Prediction Layer
To extract a proper representation for each sentence, we apply mean pooling and max pooling to each of them. Formally: The pooled representations of the two sentences p and q are then combined in various ways in the interaction layer, and the final feature vector r for the inference is obtained as follows: Finally, based on the aggregated feature r, we use a multi-layer perceptron (MLP) classifier to predict the label: where $W_r$, $b_r$, $W_v$, and $b_v$ are trainable parameters. The entire model is trained end-to-end by optimizing the standard multi-class cross-entropy loss.
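A minimal sketch of this layer is given below. The exact way the pooled vectors are combined is not recoverable from the text, so the interaction features used here (both vectors, their difference, and their element-wise product) and the tanh hidden layer are assumptions; function names and the label index are hypothetical.

```python
import numpy as np

def pool(V):
    """Concatenate mean pooling and max pooling over the word axis."""
    return np.concatenate([V.mean(axis=0), V.max(axis=0)])

def predict(V_p, V_q, W_r, b_r, W_v, b_v):
    """Return a probability distribution over the labels for one sentence pair."""
    v_p, v_q = pool(V_p), pool(V_q)                        # (2k,) each
    # Assumed interaction features: both vectors, difference, and product.
    r = np.concatenate([v_p, v_q, v_p - v_q, v_p * v_q])   # (8k,)
    hidden = np.tanh(W_r @ r + b_r)                        # hidden layer (tanh assumed)
    logits = W_v @ hidden + b_v
    exp = np.exp(logits - logits.max())                    # stable softmax
    return exp / exp.sum()

def cross_entropy(probs, label):
    """Multi-class cross-entropy loss for a single example."""
    return -np.log(probs[label])

rng = np.random.default_rng(0)
k, hidden_dim, num_labels = 16, 32, 3
V_p, V_q = rng.normal(size=(10, k)), rng.normal(size=(8, k))
W_r, b_r = rng.normal(size=(hidden_dim, 8 * k)) * 0.1, np.zeros(hidden_dim)
W_v, b_v = rng.normal(size=(num_labels, hidden_dim)) * 0.1, np.zeros(num_labels)

probs = predict(V_p, V_q, W_r, b_r, W_v, b_v)
loss = cross_entropy(probs, label=0)
```

Pooling makes the sentence vectors independent of sentence length, so premises and hypotheses of different lengths yield feature vectors r of the same size.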

Experiments
In this section, we present the evaluation of our model. We first perform a quantitative evaluation, comparing our model with other strong baselines. We then conduct some qualitative analyses to understand how ADIN achieves asynchronous and multi-step inference between the premise sentence and the hypothesis sentence.

Dataset
We evaluate our model on three popular benchmarks: the Stanford Natural Language Inference (SNLI) corpus, the Multi-Genre NLI Corpus (MultiNLI), and SciTail. Detailed statistics of these datasets are shown in Table 2.
SNLI is a collection of 570k human-written sentence pairs based on image captioning, supporting the task of natural language inference (Bowman et al., 2015). The labels are entailment, neutral, and contradiction. We use the data splits provided by Bowman et al. (2015).
MultiNLI is a newer dataset for NLI that contains 433k sentence pairs. Similar to SNLI, each pair is labeled with one of the following relationships: entailment, contradiction, or neutral. We compare on two test sets (matched and mismatched), which represent in-domain and out-of-domain performance. We use the data split provided with the corpus.
SciTail (Khot et al., 2018) is a newly released dataset that poses a binary entailment classification task constructed from science questions. This is the first entailment set that is created solely from natural sentences that already exist independently "in the wild" rather than sentences authored specifically for the entailment task. We use the same data split as Khot et al. (2018).

Models for Comparing
To analyze the effectiveness of our model, we evaluate the following traditional and state-of-the-art methods as baselines on the above three datasets:
• DecompAtt (Parikh et al., 2016) is a simple model that decomposes the problem into parallelizable attention computations.
• ESIM (Chen et al., 2016) is a previous state-of-the-art model for the natural language inference (NLI) task. It is a sequential model that incorporates the chain LSTM and the tree LSTM to infer local information between two sentences.
• BiMPM (Wang et al., 2017) combines two sentence encoders and employs a multi-perspective matching mechanism in sentence pair modeling tasks.
• DIIN (Gong et al., 2017) is a class of neural network architectures that achieves a high-level understanding of the sentence pair by hierarchically extracting semantic features from the interaction space. The model uses word-by-word dimension-wise alignment tensors to encode the high-order alignment relationship between sentence pairs.
• DGEM (Khot et al., 2018) is an entailment model that exploits structure from the hypothesis only. This model shows the value of a structured representation of just the hypothesis for NLI.
• MwAN (Tan et al., 2018) is a multiway attention network that applies multiple attention functions to model the matching between a pair of sentences.
• CAFE (Tay et al., 2018) compares and compresses alignment pairs using factorization layers, which leverage the rich history of standard machine learning literature to achieve this task.
• AF-DMN (Duan et al., 2018) stacks multiple computational blocks in its matching layer to better learn the interaction of the sentence pair.
• KIM is a neural network-based NLI model that can benefit from external knowledge. The model is capable of leveraging external knowledge in its co-attention, local inference collection, and inference composition components.

Experiment Configurations
Hyper-parameters may influence the performance of a neural network-based model. For all three datasets, there are 3 inference sub-layers in the asynchronous inference layer. An Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999 is used to optimize all trainable parameters. The initial learning rate is set to 0.001 and is halved when the accuracy on the dev set decreases. We also apply dropout (Srivastava et al., 2014) to all MLPs to avoid over-fitting, and the dropout rate is set to 0.2. For preprocessing, we only tokenize the sentences and lowercase the tokens.
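The learning-rate schedule can be illustrated with a few lines of Python. The sketch assumes that "decreases" means the dev accuracy is lower than at the previous evaluation (rather than lower than the best value so far); the accuracy values are made up purely for illustration and are not results from this work.

```python
def update_learning_rate(lr, dev_accuracy, previous_dev_accuracy):
    """Halve the learning rate when dev accuracy drops; otherwise keep it."""
    if previous_dev_accuracy is not None and dev_accuracy < previous_dev_accuracy:
        return lr / 2.0
    return lr

lr = 0.001          # initial learning rate from the configuration above
previous = None
# Hypothetical dev accuracies after successive evaluations (illustration only).
for accuracy in [0.80, 0.84, 0.83, 0.85, 0.85, 0.84]:
    lr = update_learning_rate(lr, accuracy, previous)
    previous = accuracy
print(lr)  # 0.00025
```

In this toy run the accuracy drops twice (0.84 to 0.83 and 0.85 to 0.84), so the learning rate is halved twice.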

Ensemble
The ensemble strategy is an effective method to improve model accuracy. Following Wang et al. (2017), our ensemble model averages the probability distributions from five individual ADIN models, which have identical architectures but different parameter initializations.
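The averaging step can be sketched as follows. The five probability distributions are invented for illustration, and the label order is an assumption.

```python
def ensemble_average(distributions):
    """Average per-model probability distributions over the labels."""
    num_models = len(distributions)
    num_labels = len(distributions[0])
    return [sum(d[i] for d in distributions) / num_models for i in range(num_labels)]

LABELS = ["entailment", "contradiction", "neutral"]  # assumed label order

# Hypothetical outputs of five independently initialized models for one pair.
model_outputs = [
    [0.70, 0.10, 0.20],
    [0.60, 0.15, 0.25],
    [0.40, 0.10, 0.50],
    [0.55, 0.05, 0.40],
    [0.65, 0.10, 0.25],
]

averaged = ensemble_average(model_outputs)
prediction = LABELS[max(range(len(LABELS)), key=averaged.__getitem__)]
print(prediction)  # entailment
```

Note that the third model alone would predict neutral; averaging lets the other four models outvote it.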

Quantitative Results
We use accuracy to evaluate the performance of ADIN and other models on the SNLI, MultiNLI, and SciTail datasets. Table 3 shows the results of different models on the training and test sets of SNLI. In Table 3, the first category of methods are single models and the second category of methods are ensemble models. Our model, ADIN, achieves state-of-the-art performance on the competitive leaderboard. In this table, KIM is a neural network-based NLI model that can benefit from external knowledge, while the other strong baselines encode both sentences and let them interact symmetrically and in parallel.
In Table 4, the first category of methods are single models and the second category of methods are ensemble models. On MultiNLI, we compare on two test sets (matched and mismatched), which represent in-domain and out-of-domain performance. ADIN significantly outperforms ESIM, a strong baseline, on both test sets. An ensemble of ADIN models also achieves competitive results on the MultiNLI dataset.
As illustrated in Table 5, our model outperforms the baselines and achieves an accuracy of 84.6% on the test set of the SciTail dataset. These empirical results demonstrate the effectiveness of our proposed ADIN model on the challenging SciTail dataset.
For the results on all three datasets, we conduct Student's paired t-test. For SNLI and MultiNLI, the p-value of the significance test between the results of our model and AF-DMN is less than 0.01 and 0.05, respectively. For SciTail, the p-value of the significance test between the results of our model and CAFE is also less than 0.01. These results further support the effectiveness of our model.
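For readers unfamiliar with the test, the paired t statistic can be computed as in the sketch below. The text does not state which observations are paired, so the example simply uses two short lists of made-up numbers; they are not results from this work. The p-value would then be read from a t distribution with the returned degrees of freedom.

```python
import math

def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1

# Toy paired measurements (illustration only).
system_a = [3.0, 5.0, 7.0]
system_b = [2.0, 3.0, 4.0]
t_value, dof = paired_t_statistic(system_a, system_b)
```

Here the differences are 1, 2, and 3, with mean 2 and sample variance 1, so the statistic equals 2 / sqrt(1/3) = 2·sqrt(3) with 2 degrees of freedom.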

Model Analysis
To better understand the performance of ADIN, we analyze the effect of each key component of the proposed model on the SNLI dataset. Table 6 shows the performance with different numbers of asynchronous inference sub-layers. As the number of sub-layers increases from 1 to 3, the performance increases on both the development set and the test set. As the level of reasoning deepens, the model captures more inferential information. Because of the computational cost, we set the number of sub-layers to 3 on SNLI and the other two datasets.
In Table 7, we show the results of an ablation study on our base model. After removing the Bi-LSTM in the asynchronous inference layer, the model performance decreases by 0.3 percentage points on the test set. Furthermore, we study the effect of the two inferential modules in one asynchronous inference sub-layer. Without the first inferential module, that is, without the reasoning process from premise to hypothesis, the model performance decreases sharply by 0.8 percentage points. In contrast, removing the second module decreases the test accuracy by 0.5 percentage points. The setting "exchanged inference order" indicates that we first obtain the inferential information to model the premise sentence, and then model the hypothesis sentence. The performance of the model is reduced to 88.3% after exchanging the inference order between the two sentences. The above three experiments reflect that the two modules are not equally important for the inference and that the sentence pairs for NLI are asymmetrical corpora. In the last comparative experiment, we explore the role of multi-level features. We remove the character embedding and syntactical features and keep only the word embedding as the representation. The test accuracy is reduced to 88.2%.
Figure 2: Premise: "A dog is jumping for a Frisbee in the snow." Hypothesis: "An animal is playing with a plastic toy." (2) Gradient scale of $V^1_p$, $V^1_q$ on the first asynchronous inference sub-layer. (3) Gradient scale of $V^2_p$, $V^2_q$ on the second asynchronous inference sub-layer. (4) Gradient scale of $V^3_p$, $V^3_q$ on the third asynchronous inference sub-layer. Darker color corresponds to a higher scale of gradient, and implies a higher contribution to the final prediction.

Case study
To visually demonstrate the validity of the model, we do a qualitative study using the first example in Table 1.
$H^p$, $H^q$ are the hidden states at the representation layer of the premise sentence and the hypothesis sentence, and $V^t_p$, $V^t_q$ are the hidden states at the t-th asynchronous inference sub-layer. For a hidden state $h^p_i$ of word $p_i$, we can calculate the gradient scale $\left\| \partial J / \partial h^p_i \right\|_2$ to show its contribution to the final prediction, where J is the cross-entropy loss. Figure 2 gives a visualization of the contribution of every word to the final prediction. As we can see, some phrases (like "jumping for a Frisbee" and "playing with a plastic toy") instead of isolated words (like "Frisbee" and "toy") become more focused after an asynchronous inference layer. The results imply that ADIN could capture some higher-level patterns. As the level of reasoning deepens, the model captures more inferential information.
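The gradient scale can be illustrated on a toy classifier whose gradient is easy to derive by hand. The model below (logits computed from the sum of tanh-transformed hidden states) is a hypothetical stand-in chosen only so that the analytic gradient is short; it is not the ADIN architecture, and all names are assumptions.

```python
import numpy as np

def loss(H, W, label):
    """Cross-entropy J of a toy classifier: logits = W @ sum_i tanh(h_i)."""
    logits = W @ np.tanh(H).sum(axis=0)
    logits = logits - logits.max()
    return -(logits[label] - np.log(np.exp(logits).sum()))

def gradient_scale(H, W, label):
    """Return ||dJ/dh_i||_2 for every word i (analytic gradient of `loss`)."""
    logits = W @ np.tanh(H).sum(axis=0)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[label] -= 1.0                            # dJ/dlogits = softmax - one_hot
    grad_pooled = W.T @ probs                      # gradient w.r.t. the pooled vector
    grad_H = grad_pooled * (1.0 - np.tanh(H) ** 2) # chain rule through tanh, (m, d)
    return np.linalg.norm(grad_H, axis=1)          # one scale per word

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))      # 6 words, 4-dimensional hidden states
W = rng.normal(size=(3, 4))      # 3 labels
scales = gradient_scale(H, W, label=0)
print(scales.shape)  # (6,)
```

Words with a larger value in `scales` influence the loss more strongly, which is the quantity the heat map in Figure 2 visualizes for each sub-layer.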

Related Work
As a long-standing problem in NLP research, natural language inference (or textual entailment recognition) has been widely investigated for many years. Conventional works on NLI rely on handcrafted features such as syntactic information, n-gram overlap, and so on (Bowman et al., 2015; Heilman and Smith, 2010).
Benefiting from the development of deep learning and the availability of large-scale annotated datasets (Bowman et al., 2015), neural network-based models have also been successfully used for this task. Two categories of neural network-based models have been developed for this problem. The first set of models is sentence encoding-based: it aims to find a vector representation for each sentence and classifies the relation by using the concatenation of the two vector representations (Bowman et al., 2016; Mou et al., 2015). However, this kind of framework ignores the interaction between the two sentences.
The other set of models uses cross-sentence features or inter-sentence attention from one sentence to another, and is hence referred to as a matching-aggregation framework. Parikh et al. (2016) use attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. Chen et al. (2016) propose a sequential model for the NLI task that incorporates the chain LSTM and the tree LSTM to infer local information between two sentences. Gong et al. (2017) propose a class of neural network architectures that achieves a high-level understanding of the sentence pair by hierarchically extracting semantic features from the interaction space. Tan et al. (2018) propose a multiway attention network that designs four attention functions to match words in corresponding sentences, aggregates the matching information from each function, and combines the information from all functions to obtain the final representation. Tay et al. (2018) compare and compress alignment pairs using factorization layers, which leverage the rich history of standard machine learning literature to achieve this task. AF-DMN (Duan et al., 2018) stacks multiple computational blocks in its matching layer to better learn the interaction of the sentence pair. KIM is capable of leveraging external knowledge in its co-attention, local inference collection, and inference composition components to improve performance. These methods all frame the inference problem as a semantic matching task and ignore the reasoning process. Different from the above methods, ADIN is a neural network structure stacked with multiple asynchronous inference sub-layers, and each sub-layer consists of two local inference modules arranged in an asymmetrical manner. This model deconstructs the reasoning process and implements asynchronous and multi-step reasoning.

Conclusions
In this paper, we propose an asynchronous deep interaction network (ADIN) for natural language inference. To simulate the human reasoning process, ADIN is stacked with multiple asynchronous inference sub-layers, and each sub-layer consists of two inferential modules arranged in an asymmetrical manner. The model deconstructs the reasoning process and implements asynchronous and multi-step reasoning. We evaluate our model on three popular benchmarks: SNLI, MultiNLI, and SciTail. The experimental results show that ADIN achieves competitive performance and outperforms strong baselines.