DR-BiLSTM: Dependent Reading Bidirectional LSTM for Natural Language Inference

We present a novel deep learning architecture to address the natural language inference (NLI) task. Existing approaches mostly rely on simple reading mechanisms for independent encoding of the premise and hypothesis. Instead, we propose a novel dependent reading bidirectional LSTM network (DR-BiLSTM) to efficiently model the relationship between a premise and a hypothesis during encoding and inference. We also introduce a sophisticated ensemble strategy to combine our proposed models, which noticeably improves final predictions. Finally, we demonstrate how the results can be improved further with an additional preprocessing step. Our evaluation shows that DR-BiLSTM obtains the best single model and ensemble model results achieving the new state-of-the-art scores on the Stanford NLI dataset.


Introduction
Natural Language Inference (NLI; a.k.a. Recognizing Textual Entailment, or RTE) is an important and challenging task for natural language understanding (MacCartney and Manning, 2008). The goal of NLI is to identify the logical relationship (entailment, neutral, or contradiction) between a premise and a corresponding hypothesis. Table 1 shows a few example relationships from the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015).
Recently, NLI has received a lot of attention from researchers, especially due to the availability of large annotated datasets like SNLI (Bowman et al., 2015). Various deep learning models have been proposed that achieve successful results for this task (Gong et al., 2017; Wang et al., 2017; Chen et al., 2017; Yu and Munkhdalai, 2017a; Parikh et al., 2016; Zhao et al., 2016; Sha et al., 2016). Most of these existing NLI models use attention mechanisms to jointly interpret and align the premise and hypothesis. Such models use simple reading mechanisms to encode the premise and hypothesis independently. However, such a complex task requires explicit modeling of dependency relationships between the premise and the hypothesis during the encoding and inference processes to prevent the network from losing relevant, contextual information. In this paper, we refer to such strategies as dependent reading.

Table 1: Examples from the SNLI dataset.
P (a): A senior is waiting at the window of a restaurant that serves sandwiches.
H (b) | Relationship
A person waits to be served his food. | Entailment
A man is looking to order a grilled cheese sandwich. | Neutral
A man is waiting in line for the bus. | Contradiction
(a) P, Premise. (b) H, Hypothesis.

There are some alternative reading mechanisms available in the literature (Sha et al., 2016; Rocktäschel et al., 2015) that consider dependency aspects of the premise-hypothesis relationships. However, these mechanisms have two major limitations:
• So far, they have only explored dependency aspects during the encoding stage, while ignoring their benefit during inference.
• Such models only consider encoding a hypothesis depending on the premise, disregarding the dependency aspects in the opposite direction.
We propose a dependent reading bidirectional LSTM (DR-BiLSTM) model to address these limitations. Given a premise u and a hypothesis v, our model first encodes them considering dependency on each other (u|v and v|u). Next, the model employs a soft attention mechanism to extract relevant information from these encodings. The augmented sentence representations are then passed to the inference stage, which uses a similar dependent reading strategy in both directions, i.e. u → v and v → u. Finally, a decision is made through a multi-layer perceptron (MLP) based on the aggregated information.
Our experiments on the SNLI dataset show that DR-BiLSTM achieves the best single model and ensemble model performance, improving over the previous state-of-the-art single and ensemble models by 0.4% and 0.3%, respectively. Furthermore, we demonstrate the importance of a simple preprocessing step performed on the SNLI dataset. Evaluation results show that such preprocessing allows our single model to achieve the same accuracy as the state-of-the-art ensemble model and improves our ensemble model to outperform the state-of-the-art ensemble model by a margin of 0.7%. Finally, we perform an extensive analysis to clarify the strengths and weaknesses of our models.

Related Work
Early studies use small datasets while leveraging lexical and syntactic features for NLI (MacCartney and Manning, 2008). The recent availability of large-scale annotated datasets (Bowman et al., 2015; Williams et al., 2017) has enabled researchers to develop various deep learning-based architectures for NLI. Parikh et al. (2016) propose an attention-based model (Bahdanau et al., 2014) that decomposes the NLI task into sub-problems to solve them in parallel. They further show the benefit of adding intra-sentence attention to input representations. Chen et al. (2017) explore sequential inference models based on chain LSTMs with attentional input encoding and demonstrate the effectiveness of syntactic information. We also use similar attention mechanisms. However, our model is distinct from these models as they do not benefit from dependent reading strategies. Rocktäschel et al. (2015) use a word-by-word neural attention mechanism while Sha et al. (2016) propose re-read LSTM units by considering the dependency of a hypothesis on the information of its premise (v|u) to achieve promising results. However, these models suffer from weak inferencing methods because they disregard the dependency aspects from the opposite direction (u|v). Intuitively, when a human judges a premise-hypothesis relationship, s/he might consider back-and-forth reading of both sentences before coming to a conclusion. Therefore, it is essential to encode the premise-hypothesis dependency relations from both directions to optimize the understanding of their relationship. Wang et al. (2017) propose a bilateral multi-perspective matching (BiMPM) model, which resembles the concept of matching a premise and hypothesis from both directions. Their matching strategy is essentially similar to our attention mechanism that utilizes relevant information from the other sentence for each word sequence. They use methods similar to those of Chen et al. (2017) for encoding and inference, without any dependent reading mechanism.
Although NLI is well studied in the literature, the potential of dependent reading and interaction between a premise and hypothesis is not rigorously explored. In this paper, we address this gap by proposing a novel deep learning model (DR-BiLSTM). Experimental results demonstrate the effectiveness of our model.

Model
Our proposed model (DR-BiLSTM) is composed of the following major components: input encoding, attention, inference, and classification. Let $u = [u_1, \cdots, u_n]$ and $v = [v_1, \cdots, v_m]$ be the given premise with length n and hypothesis with length m, respectively, where $u_i, v_j \in \mathbb{R}^{r}$ are r-dimensional word embedding vectors. The task is to predict a label y that indicates the logical relationship between premise u and hypothesis v.

Input Encoding
RNNs are a natural solution for variable-length sequence modeling; consequently, we utilize a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) for encoding the given sentences. For ease of presentation, we only describe how we encode u depending on v. The same procedure is utilized for the reverse direction (v|u).
To dependently encode u, we first process v using the BiLSTM. Then we read u through the BiLSTM that is initialized with the final states (memory cell and hidden state) of the previous reading. Here we represent a word (e.g. $u_i$) and its context depending on the other sentence (e.g. v). Equations 1 and 2 formally represent this component.

$$\bar{v}, s_v = \mathrm{BiLSTM}(v, 0), \qquad \hat{u}, - = \mathrm{BiLSTM}(u, s_v) \tag{1}$$
$$\bar{u}, s_u = \mathrm{BiLSTM}(u, 0), \qquad \hat{v}, - = \mathrm{BiLSTM}(v, s_u) \tag{2}$$
where $\{\bar{u} \in \mathbb{R}^{n \times 2d}, \hat{u} \in \mathbb{R}^{n \times 2d}, s_u\}$ and $\{\bar{v} \in \mathbb{R}^{m \times 2d}, \hat{v} \in \mathbb{R}^{m \times 2d}, s_v\}$ are the independent reading sequences, dependent reading sequences, and BiLSTM final states of independent reading of u and v, respectively. Note that "−" in these equations means that we do not care about the associated variable and its value. BiLSTM inputs are the word embedding sequences and initial state vectors. $\hat{u}$ and $\hat{v}$ are passed to the next layer as the output of the input encoding component. The proposed encoding mechanism yields a richer representation for both premise and hypothesis by taking the history of each other into account. Using a max or average pooling over the independent and dependent readings does not further improve our model. This was expected since dependent reading produces more promising and relevant encodings.
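The dependent reading idea (read one sentence, then read the other starting from the first reading's final states) can be illustrated with a small, untrained toy BiLSTM in plain Python. This is only a sketch under stated assumptions: the helper names (`make_lstm`, `lstm_read`, `bilstm`, `dependent_read`), the random weights, the tiny dimensions, and the sharing of one set of parameters across both sentences are illustrative choices, not the trained model.

```python
import math
import random

def make_lstm(input_dim, hidden_dim, seed):
    """Random, untrained weights for one LSTM direction (illustration only)."""
    rng = random.Random(seed)
    rows = 4 * hidden_dim               # input, forget, output gates + candidate
    cols = input_dim + hidden_dim + 1   # +1 column acts as the bias
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_read(weights, xs, state):
    """Run an LSTM over xs from state=(h, c); return all hidden states and the final state."""
    h, c = state
    d = len(h)
    outputs = []
    for x in xs:
        z = x + h + [1.0]
        pre = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
        i = [sigmoid(a) for a in pre[0:d]]
        f = [sigmoid(a) for a in pre[d:2 * d]]
        o = [sigmoid(a) for a in pre[2 * d:3 * d]]
        g = [math.tanh(a) for a in pre[3 * d:4 * d]]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs, (h, c)

def bilstm(params, xs, state):
    """state=(forward_state, backward_state); outputs have 2*d values per position."""
    fwd_w, bwd_w = params
    fwd_out, fwd_state = lstm_read(fwd_w, xs, state[0])
    bwd_out, bwd_state = lstm_read(bwd_w, xs[::-1], state[1])
    outputs = [f + b for f, b in zip(fwd_out, bwd_out[::-1])]
    return outputs, (fwd_state, bwd_state)

def zero_state(d):
    return (([0.0] * d, [0.0] * d), ([0.0] * d, [0.0] * d))

def dependent_read(params, first, second, d):
    """Encode `first` depending on `second`: read `second` from a zero state,
    then read `first` starting from the final states of that reading."""
    _independent, s_second = bilstm(params, second, zero_state(d))
    dependent, _ = bilstm(params, first, s_second)
    return dependent

r, d = 3, 2                                        # toy sizes, not the paper's 300 and 450
params = (make_lstm(r, d, seed=0), make_lstm(r, d, seed=1))
u = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]            # toy premise embeddings (n = 2)
v = [[0.5, -0.2, 0.1], [0.3, 0.3, 0.3], [-0.4, 0.2, 0.0]]  # toy hypothesis (m = 3)
u_hat = dependent_read(params, u, v, d)            # u | v
v_hat = dependent_read(params, v, u, d)            # v | u
u_bar, _ = bilstm(params, u, zero_state(d))        # independent reading of u
```

The only difference between `u_bar` and `u_hat` is the initial state, which is what lets the encoding of each word carry information about the other sentence.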

Attention
We employ a soft alignment method to associate the relevant sub-components between the given premise and hypothesis. In deep learning models, this is often achieved with a soft attention mechanism. Here we compute the unnormalized attention weights as the similarity of hidden states of the premise and hypothesis with Equation 3 (energy function).
$$e_{ij} = \hat{u}_i \hat{v}_j^{\top}, \quad i \in [1, n],\ j \in [1, m] \tag{3}$$
where $\hat{u}_i$ and $\hat{v}_j$ are the dependent reading hidden representations of u and v, respectively, which are computed earlier in Equations 1 and 2. Next, for each word in either premise or hypothesis, the relevant semantics in the other sentence is extracted and composed according to $e_{ij}$. Equations 4 and 5 provide formal and specific details of this procedure.
$$\tilde{u}_i = \sum_{j=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})} \hat{v}_j, \quad i \in [1, n] \tag{4}$$
$$\tilde{v}_j = \sum_{i=1}^{n} \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{kj})} \hat{u}_i, \quad j \in [1, m] \tag{5}$$
where $\tilde{u}_i$ represents the extracted relevant information of $\hat{v}$ by attending to $\hat{u}_i$ while $\tilde{v}_j$ represents the extracted relevant information of $\hat{u}$ by attending to $\hat{v}_j$.
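As a rough illustration of this soft alignment, the sketch below computes dot-product energies between two sequences of hidden states and uses softmax-normalized weights to build, for each position, a weighted sum over the other sentence. The function names and the toy vectors are assumptions made for the example; the dot-product similarity and softmax normalization follow the usual soft attention formulation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(u_hat, v_hat):
    """Return (u_tilde, v_tilde) for hidden states u_hat (n x k) and v_hat (m x k)."""
    n, m, k = len(u_hat), len(v_hat), len(u_hat[0])
    # Unnormalized attention weights (energy function).
    e = [[dot(u_hat[i], v_hat[j]) for j in range(m)] for i in range(n)]
    u_tilde = []
    for i in range(n):
        w = softmax(e[i])                  # normalize over hypothesis positions
        u_tilde.append([sum(w[j] * v_hat[j][t] for j in range(m)) for t in range(k)])
    v_tilde = []
    for j in range(m):
        w = softmax([e[i][j] for i in range(n)])  # normalize over premise positions
        v_tilde.append([sum(w[i] * u_hat[i][t] for i in range(n)) for t in range(k)])
    return u_tilde, v_tilde

# Toy hidden states (illustrative values only).
u_hat = [[1.0, 0.0], [0.0, 1.0]]
v_hat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_tilde, v_tilde = soft_align(u_hat, v_hat)
```

Each row of `u_tilde` is a convex combination of the rows of `v_hat`, so it has the same dimensionality as the hidden states it summarizes.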
To further enrich the collected attentional information, a trivial next step would be to pass the concatenation of the tuples $(\hat{u}_i, \tilde{u}_i)$ or $(\hat{v}_j, \tilde{v}_j)$, which provides a linear relationship between them. However, the model would suffer from the absence of similarity and closeness measures. Therefore, we calculate the difference and element-wise product for the tuples $(\hat{u}_i, \tilde{u}_i)$ and $(\hat{v}_j, \tilde{v}_j)$ that represent the similarity and closeness information, respectively (Chen et al., 2017; Kumar et al., 2016).
The difference and element-wise product are then concatenated with the computed vectors, $(\hat{u}_i, \tilde{u}_i)$ or $(\hat{v}_j, \tilde{v}_j)$, respectively. Finally, a feedforward neural layer with ReLU activation function projects the concatenated vectors from an 8d-dimensional vector space into a d-dimensional vector space (Equations 6 and 7). This helps the model to capture deeper dependencies between the sentences besides lowering the complexity of vector representations.
$$p_i = \mathrm{ReLU}\left([\hat{u}_i, \tilde{u}_i, \hat{u}_i - \tilde{u}_i, \hat{u}_i \odot \tilde{u}_i]\, W_p + b_p\right) \tag{6}$$
$$q_j = \mathrm{ReLU}\left([\hat{v}_j, \tilde{v}_j, \hat{v}_j - \tilde{v}_j, \hat{v}_j \odot \tilde{v}_j]\, W_p + b_p\right) \tag{7}$$
Here $\odot$ stands for element-wise product while $W_p \in \mathbb{R}^{8d \times d}$ and $b_p \in \mathbb{R}^{d}$ are the trainable weights and biases of the projector layer, respectively.
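A minimal sketch of this composition step, assuming plain Python lists for vectors: the hidden state and its attended counterpart are concatenated with their difference and element-wise product, and the result is projected with a ReLU layer. The helper names and the toy weights are hypothetical; with d = 1 the hidden states have 2d = 2 entries and the concatenation has 8d = 8.

```python
def match_features(x, x_tilde):
    """Concatenate x, x_tilde, their difference, and their element-wise product."""
    diff = [a - b for a, b in zip(x, x_tilde)]
    prod = [a * b for a, b in zip(x, x_tilde)]
    return x + x_tilde + diff + prod

def project(features, W, b):
    """ReLU(features @ W + b), where W has len(features) rows and len(b) columns."""
    out = []
    for col in range(len(b)):
        z = sum(features[row] * W[row][col] for row in range(len(features))) + b[col]
        out.append(max(0.0, z))
    return out

# Toy example with d = 1 (illustrative values, not trained weights).
u_hat_i = [1.0, 2.0]
u_tilde_i = [3.0, 0.5]
features = match_features(u_hat_i, u_tilde_i)      # 8 values
W_p = [[1.0] for _ in range(8)]                    # 8d x d
b_p = [0.0]
p_i = project(features, W_p, b_p)
```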

Inference
During this phase, we use another BiLSTM to aggregate the two sequences of computed matching vectors, p and q, from the attention stage (Section 3.2). This aggregation is performed in a sequential manner to avoid losing the effect of latent variables that might rely on the sequence of matching vectors.
Instead of aggregating the sequences of matching vectors individually, we propose a similar dependent reading approach for the inference stage. We employ a BiLSTM reading process (Equations 8 and 9) similar to the input encoding step discussed in Section 3.1. But rather than passing just the dependent reading information to the next step, we feed both independent reading ($\bar{p}$ and $\bar{q}$) and dependent reading ($\hat{p}$ and $\hat{q}$) to a max pooling layer, which selects maximum values from each sequence of independent and dependent readings ($\bar{p}_i$ and $\hat{p}_i$) as shown in Equations 10 and 11. The main intuition behind this architecture is to maximize the inferencing ability of the model by considering both independent and dependent readings.
$$\bar{q}, s_q = \mathrm{BiLSTM}(q, 0), \qquad \hat{p}, - = \mathrm{BiLSTM}(p, s_q) \tag{8}$$
$$\bar{p}, s_p = \mathrm{BiLSTM}(p, 0), \qquad \hat{q}, - = \mathrm{BiLSTM}(q, s_p) \tag{9}$$
$$\tilde{p} = \mathrm{MaxPooling}(\bar{p}, \hat{p}) \tag{10}$$
$$\tilde{q} = \mathrm{MaxPooling}(\bar{q}, \hat{q}) \tag{11}$$
Here $\{\bar{p} \in \mathbb{R}^{n \times 2d}, \hat{p} \in \mathbb{R}^{n \times 2d}, s_p\}$ and $\{\bar{q} \in \mathbb{R}^{m \times 2d}, \hat{q} \in \mathbb{R}^{m \times 2d}, s_q\}$ are the independent reading sequences, dependent reading sequences, and BiLSTM final states of independent reading of p and q, respectively. BiLSTM inputs are the word embedding sequences and initial state vectors. Finally, we convert $\tilde{p} \in \mathbb{R}^{n \times 2d}$ and $\tilde{q} \in \mathbb{R}^{m \times 2d}$ to fixed-length vectors with pooling, $U \in \mathbb{R}^{4d}$ and $V \in \mathbb{R}^{4d}$. As shown in Equations 12 and 13, we employ both max and average pooling and describe the overall inference relationship with concatenation of their outputs.
$$U = [\mathrm{MaxPooling}(\tilde{p}), \mathrm{AvgPooling}(\tilde{p})] \tag{12}$$
$$V = [\mathrm{MaxPooling}(\tilde{q}), \mathrm{AvgPooling}(\tilde{q})] \tag{13}$$
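The two pooling steps of the inference stage can be sketched as follows, assuming the independent and dependent readings are already available as lists of equal-length vectors. The function names and the toy numbers are illustrative assumptions.

```python
def elementwise_max(independent, dependent):
    """Per position and per dimension, keep the larger of the two readings."""
    return [[max(a, b) for a, b in zip(ind, dep)]
            for ind, dep in zip(independent, dependent)]

def pool_over_time(sequence):
    """Concatenate max pooling and average pooling over the sequence positions."""
    dims = range(len(sequence[0]))
    max_pool = [max(step[k] for step in sequence) for k in dims]
    avg_pool = [sum(step[k] for step in sequence) / len(sequence) for k in dims]
    return max_pool + avg_pool

# Toy readings with 2 positions and 2 dimensions (illustrative values only).
p_bar = [[1.0, 4.0], [0.0, 2.0]]        # independent reading
p_hat = [[3.0, 1.0], [2.0, 2.0]]        # dependent reading
p_tilde = elementwise_max(p_bar, p_hat)
U = pool_over_time(p_tilde)             # twice as long as one position vector
```

Because max and average pooling are concatenated, the fixed-length vector is twice as wide as a single position of the pooled sequence, which matches the 2d to 4d growth described above.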

Classification
Here, we feed the concatenation of U and V ([U, V ]) into a multilayer perceptron (MLP) classifier that includes a hidden layer with tanh activation and softmax output layer. The model is trained in an end-to-end manner.
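A minimal sketch of such a classifier, assuming one tanh hidden layer followed by a softmax over three labels. The label order, the weight layout (one weight vector per output unit), and the toy parameters are assumptions for illustration only.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]   # assumed label order

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def affine(x, weights, bias):
    """weights holds one weight vector per output unit."""
    return [sum(a * w for a, w in zip(x, row)) + b for row, b in zip(weights, bias)]

def classify(U, V, W1, b1, W2, b2):
    """Feed [U, V] through a tanh hidden layer and a softmax output layer."""
    features = U + V
    hidden = [math.tanh(z) for z in affine(features, W1, b1)]
    probs = softmax(affine(hidden, W2, b2))
    return LABELS[probs.index(max(probs))], probs

# Toy parameters: 4 input features, 2 hidden units, 3 output classes.
U, V = [0.2, -0.1], [0.4, 0.3]
W1 = [[0.5, -0.5, 0.1, 0.0], [0.0, 0.3, -0.2, 0.4]]
b1 = [0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b2 = [0.0, 0.0, 0.0]
label, probs = classify(U, V, W1, b1, W2, b2)
```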

Experiments and Evaluation

Dataset
The Stanford Natural Language Inference (SNLI) dataset contains 570K human-annotated sentence pairs. The premises are drawn from the Flickr30k (Plummer et al., 2015) corpus, and then the hypotheses are manually composed for each relationship class (entailment, neutral, contradiction, and -). The "-" class indicates that there is no consensus decision among the annotators; consequently, we remove these pairs during training and evaluation, following the literature. We use the same data split as provided in Bowman et al. (2015) to report comparable results with other models.

Experimental Setup
We use pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize our word embedding vectors. All hidden states of BiLSTMs during input encoding and inference have 450 dimensions (r = 300 and d = 450). The weights are learned by minimizing the log-loss on the training data via the Adam optimizer (Kingma and Ba, 2014). The initial learning rate is 0.0004. To avoid overfitting, we use dropout (Srivastava et al., 2014) with a rate of 0.4 for regularization, which is applied to all feedforward connections. During training, the word embeddings are updated to learn effective representations for the NLI task. We use a fairly small batch size of 32 to provide more exploration power to the model. Our observations indicate that using larger batch sizes hurts the performance of our model.
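For reference, the reported hyperparameters can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the setup described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the experimental setup (field names are illustrative)."""
    embedding_dim: int = 300        # r, GloVe 840B vectors
    hidden_dim: int = 450           # d, BiLSTM hidden size
    learning_rate: float = 0.0004   # initial learning rate for Adam
    dropout_rate: float = 0.4       # applied to feedforward connections
    batch_size: int = 32
    optimizer: str = "adam"
    train_embeddings: bool = True   # embeddings are updated during training

    @property
    def projection_input_dim(self) -> int:
        # The projection layer maps an 8d-dimensional vector to d dimensions.
        return 8 * self.hidden_dim

config = TrainingConfig()
```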

Ensemble Strategy
Ensemble methods use multiple models to obtain better predictive performance. Previous works typically utilize trivial ensemble strategies by either using majority votes or averaging the probability distributions over the same model with different initialization seeds (Wang et al., 2017; Gong et al., 2017). By contrast, we use weighted averaging of the probability distributions where the weight of each model is learned through its performance on the SNLI development set. Furthermore, the differences between our models in the ensemble originate from: 1) variations in the number of dependent readings (i.e. 1 and 3 rounds of dependent reading), 2) projection layer activation (tanh and ReLU in Equations 6 and 7), and 3) different initialization seeds.
Figure 2: Performance of n ensemble models reported for training (red, top), development (blue, middle), and test (green, bottom) sets of SNLI. For n number of models, the best performance on the development set is used as the criterion to determine the final ensemble. The best performance on the development set (89.22%) is observed using 6 models and is henceforth considered as our final DR-BiLSTM (Ensemble) model.
The main intuition behind this design is that the effectiveness of a model may depend on the complexity of a premise-hypothesis instance. For a simple instance, a simple model could perform better than a complex one, while a complex instance may need further consideration toward disambiguation. Consequently, using models with different rounds of dependent readings in the encoding stage should be beneficial. Figure 2 demonstrates the observed performance of our ensemble method with different numbers of models. The performance of the models is reported based on the best obtained accuracy on the development set. We also study the effectiveness of other ensemble strategies, e.g. majority voting and averaging the probability distributions. However, our ensemble strategy performs the best among them (see Section 1 in the supplementary material for additional details).
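Weighted averaging of probability distributions can be sketched as below. How the weights are learned from development set performance is not spelled out here, so the sketch assumes the simplest option, weights proportional to each model's development accuracy; that choice, the function names, and the toy numbers are all assumptions.

```python
def ensemble_weights(dev_accuracies):
    """Assumed scheme: weights proportional to development set accuracy."""
    total = sum(dev_accuracies)
    return [acc / total for acc in dev_accuracies]

def weighted_average(distributions, weights):
    """Combine per-model class distributions into one distribution."""
    n_classes = len(distributions[0])
    return [sum(w * dist[c] for w, dist in zip(weights, distributions))
            for c in range(n_classes)]

# Toy example: two models, three classes (made-up numbers for illustration).
dev_accuracies = [0.5, 0.5]
model_outputs = [[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]]
weights = ensemble_weights(dev_accuracies)
combined = weighted_average(model_outputs, weights)
prediction = combined.index(max(combined))
```

Plain averaging is the special case where all weights are equal, which is the baseline strategy the weighted scheme is compared against.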

Preprocessing
We perform a trivial preprocessing step on SNLI to recover some out-of-vocabulary words found in the development set and test set. Note that our vocabulary contains all words that are seen in the training set, so there is no out-of-vocabulary word in it. The SNLI dataset is not immune to human errors, specifically, misspelled words. We noticed that misspelling is the main reason for some of the observed out-of-vocabulary words. Consequently, we simply fix the unseen misspelled words using Microsoft spell-checker (other approaches like edit distance can also be used). Moreover, while dealing with an unseen word during evaluation, we try to: 1) replace it with its lower case, or 2) split the word when it contains a "-" (e.g. "marsh-like") or starts with "un" (e.g. "unloading"). If we still cannot find the word in our vocabulary, we consider it as an unknown word. In the next subsection, we demonstrate the importance and impact of such trivial preprocessing (see Section 2 in the supplementary material for additional details).

Results
Table 2 shows the accuracy of the models on training and test sets of SNLI. The first row represents a baseline classifier presented by Bowman et al. (2015) that utilizes handcrafted features. All other listed models are deep-learning based. The gap between the traditional model and deep learning models demonstrates the effectiveness of deep learning methods for this task. We also report the estimated human performance on the SNLI dataset, which is the average accuracy of five annotators in comparison to the gold labels (Gong et al., 2017). It is noteworthy that recent deep learning models surpass the human performance in the NLI task.
As shown in Table 2, previous deep learning models (rows 2-19) can be divided into three categories: 1) sentence encoding based models (rows 2-7), 2) single inter-sentence attention-based models (rows 8-16), and 3) ensemble inter-sentence attention-based models (rows 17-19). We can see that inter-sentence attention-based models perform better than sentence encoding based models, which supports our intuition. Natural language inference requires a deep interaction between the premise and hypothesis. Inter-sentence attention-based approaches can provide such interaction while sentence encoding based models fail to do so.
To further improve the performance of NLI systems, researchers have built ensemble models. Previously, ensemble systems obtained the best performance on SNLI by a large margin. Table 2 shows that our proposed single model achieves competitive results compared to these reported ensemble models. Our ensemble model considerably outperforms the current state-of-the-art by obtaining 89.3% accuracy.
Up until this point, we discussed the performance of our models where we have not considered preprocessing for recovering the out-of-vocabulary words. In Table 2, "DR-BiLSTM (Single) + Process", and "DR-BiLSTM (Ensem.) + Process" represent the performance of our models on the preprocessed dataset. We can see that our preprocessing mechanism leads to further improvements of 0.4% and 0.3% on the SNLI test set for our single and ensemble models respectively. In fact, our single model ("DR-BiLSTM (Single) + Process") obtains the state-of-the-art performance over both reported single and ensemble models by performing a simple preprocessing step. Furthermore, "DR-BiLSTM (Ensem.) + Process" outperforms the existing state-of-the-art remarkably (0.7% improvement). For more comparison and analyses, we use "DR-BiLSTM (Single)" and "DR-BiLSTM (Ensemble)" as our single and ensemble models in the rest of the paper.
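The out-of-vocabulary fallback used in the preprocessing step (lower-casing, then splitting on "-" or a leading "un", then falling back to an unknown token) can be sketched as follows. The spell-checking stage is omitted, and the function names, the `<unk>` token, and the way split pieces are looked up individually are assumptions for illustration.

```python
def lookup(token, vocab):
    """Return the token (or its lower-cased form) if it is in the vocabulary, else None."""
    if token in vocab:
        return token
    if token.lower() in vocab:
        return token.lower()
    return None

def recover_tokens(word, vocab, unk="<unk>"):
    """Map an unseen word to known vocabulary tokens where possible."""
    hit = lookup(word, vocab)
    if hit is not None:
        return [hit]
    if "-" in word:
        parts = [p for p in word.split("-") if p]      # e.g. "marsh-like"
    elif word.lower().startswith("un") and len(word) > 2:
        parts = [word[:2], word[2:]]                   # e.g. "unloading"
    else:
        parts = []
    resolved = [lookup(p, vocab) or unk for p in parts]
    return resolved or [unk]

# Toy vocabulary (illustrative only).
vocab = {"dog", "marsh", "like", "un", "loading"}
examples = {w: recover_tokens(w, vocab) for w in ["Dog", "marsh-like", "unloading", "xyzzy"]}
```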

Ablation and Configuration Study
We conducted an ablation study on our model to examine the importance and effect of each major component. Then, we study the impact of BiLSTM dimensionality on the performance on the development set and training set of SNLI. We investigate all settings on the development set of the SNLI dataset.
Table 3 shows the ablation study results on the development set of SNLI along with the statistical significance test results in comparison to the proposed model, DR-BiLSTM. We can see that every modification leads to a new model whose difference from DR-BiLSTM is statistically significant (p-value < 0.001, chi-square test). Table 3 shows that removing any part from our model hurts the development set accuracy, which indicates the effectiveness of these components. Among all components, three have noticeable influence: max pooling, difference in the attention stage, and dependent reading.
Most importantly, the last four study cases in Table 3 (rows 8-11) verify the main intuitions behind our proposed model. They illustrate the importance of our proposed dependent reading strategy, which leads to significant improvement, specifically in the encoding stage. We are convinced that the importance of dependent reading in the encoding stage originates from its ability to focus on more important and relevant aspects of the sentences due to its prior knowledge of the other sentence during the encoding procedure.
Figure 3 shows the behavior of the proposed model accuracy on the training set and development set of SNLI. Since the models are selected based on the best observed development set accuracy during the training procedure, the training accuracy curve (red, top) is not strictly increasing. Figure 3 demonstrates that we achieve the best performance with 450-dimensional BiLSTMs. In other words, using BiLSTMs with lower dimensionality causes the model to suffer from the lack of space for capturing proper information and dependencies. On the other hand, using higher dimensionality leads to overfitting, which hurts the performance on the development set. Hence, we use 450-dimensional BiLSTMs in our proposed model.

Analysis
We first investigate the performance of our models categorically. Then, we show a visualization of the energy function in the attention stage (Equation 3) for an instance from the SNLI test set.
To qualitatively evaluate the performance of our models, we design a set of annotation tags that can be extracted automatically. This design is inspired by the reported annotation tags in Williams et al. (2017). The specifications of our annotation tags are as follows:
• High Overlap: premise and hypothesis sentences share more than 70% of their tokens.
• Long Sentence: either sentence is longer than 20 tokens.
• Regular Sentence: premise or hypothesis length is between 5 and 20 tokens.
• Short Sentence: either sentence is shorter than 5 tokens.
• Negation: negation is present in a sentence.
• Quantifier: either of the sentences contains one of the following quantifiers: much, enough, more, most, less, least, no, none, some, any, many, few, several, almost, nearly.
• Belief: either of the sentences contains one of the following belief verbs: know, believe, understand, doubt, think, suppose, recognize, forget, remember, imagine, mean, agree, disagree, deny, promise.
Table 4 shows the frequency of the aforementioned annotation tags in the SNLI test set along with the performance (accuracy) of ESIM (Chen et al., 2017), DR-BiLSTM (Single), and DR-BiLSTM (Ensemble). Table 4 can be divided into four major categories: 1) gold label data, 2) word overlap, 3) sentence length, and 4) occurrence of special words. We can see that DR-BiLSTM (Ensemble) performs the best in all categories, which matches our expectation. Moreover, DR-BiLSTM (Single) performs noticeably better than ESIM in most of the categories except "Entailment", "High Overlap", and "Long Sentence", for which our model is not far behind (gaps of 0.2%, 0.5%, and 0.9%, respectively). It is noteworthy that DR-BiLSTM (Single) performs better than ESIM in more frequent categories. Specifically, the performance of our model in "Neutral", "Negation", and "Quantifier" categories (improvements of 1.4%, 3.5%, and 1.9%, respectively) indicates the superiority of our model in understanding and disambiguating complex samples. Our investigations indicate that ESIM generates somewhat uniform attention for most of the word pairs while our model could effectively attend to specific parts of the given sentences and provide more meaningful attention. In other words, the dependent reading strategy enables our model to achieve meaningful representations, which leads to better attention and further gains on categories like Negation and Quantifier sentences (see Section 3 in the supplementary material for additional details).
Table 4: Performance (accuracy) of ESIM (Chen et al., 2017), DR-BiLSTM (DR(S)), and Ensemble DR-BiLSTM (DR(E)) on the SNLI test set.
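The annotation tags listed above are simple enough to extract automatically; a hedged sketch follows. The quantifier and belief verb lists are taken from the tag definitions, while the tokenizer (lower-cased whitespace splitting with punctuation stripped), the overlap measure (shared unique tokens over all unique tokens), the negation word list, and exact matching without lemmatization are assumptions made for the example.

```python
import string

QUANTIFIERS = {"much", "enough", "more", "most", "less", "least", "no", "none",
               "some", "any", "many", "few", "several", "almost", "nearly"}
BELIEF_VERBS = {"know", "believe", "understand", "doubt", "think", "suppose",
                "recognize", "forget", "remember", "imagine", "mean", "agree",
                "disagree", "deny", "promise"}
NEGATION_WORDS = {"not", "never", "nobody", "nothing"}   # assumed list

def tokenize(sentence):
    """Assumed tokenizer: lower-case, split on whitespace, strip punctuation."""
    tokens = (t.strip(string.punctuation) for t in sentence.lower().split())
    return [t for t in tokens if t]

def annotation_tags(premise, hypothesis):
    p, h = tokenize(premise), tokenize(hypothesis)
    both = p + h
    tags = set()
    shared, union = set(p) & set(h), set(p) | set(h)
    if union and len(shared) / len(union) > 0.7:          # assumed overlap measure
        tags.add("High Overlap")
    lengths = (len(p), len(h))
    if any(n > 20 for n in lengths):
        tags.add("Long Sentence")
    if any(5 <= n <= 20 for n in lengths):
        tags.add("Regular Sentence")
    if any(n < 5 for n in lengths):
        tags.add("Short Sentence")
    if any(t in NEGATION_WORDS or t.endswith("n't") for t in both):
        tags.add("Negation")
    if any(t in QUANTIFIERS for t in both):
        tags.add("Quantifier")
    if any(t in BELIEF_VERBS for t in both):
        tags.add("Belief")
    return tags

tags = annotation_tags(
    "A senior is waiting at the window of a restaurant that serves sandwiches.",
    "A man is waiting in line for the bus.",
)
```

A pair can carry several tags at once, for example both "Regular Sentence" and "Short Sentence" when the two sentences fall into different length ranges.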
Finally, we show a visualization of the normalized attention weights (energy function, Equation 3) of our model in Figure 4. We show a sentence pair, where the premise is "Male in a blue jacket decides to lay the grass.", and the hypothesis is "The guy in yellow is rolling on the grass.", and its logical relationship is contradiction. Figure 4 indicates the model's ability in attending to critical pairs of words like <Male, guy>, <decides, rolling>, and <lay, rolling>. Finally, high attention between {decides, lay} and

Conclusion
We propose a novel natural language inference model (DR-BiLSTM) that benefits from a dependent reading strategy and achieves state-of-the-art results on the SNLI dataset. We also introduce a sophisticated ensemble strategy and illustrate its effectiveness through experimentation. Moreover, we demonstrate the importance of a simple preprocessing step on the performance of our proposed models. Evaluation results show that the preprocessing step allows our DR-BiLSTM (single) model to outperform all previous single and ensemble methods. Similar superior performance is also observed for our DR-BiLSTM (ensemble) model. We show that our ensemble model outperforms the existing state-of-the-art by a considerable margin of 0.7%. Finally, we perform an extensive analysis to demonstrate the strengths and weaknesses of the proposed model, which would pave the way for further improvements in this domain.