Testing the Generalization Power of Neural Network Models across NLI Benchmarks
Preprint. Work in progress.

Neural network models have been very successful in natural language inference, with the best models reaching 90% accuracy in some benchmarks. However, the success of these models turns out to be largely benchmark specific. We show that models trained on a natural language inference dataset drawn from one benchmark fail to perform well in others, even if the notion of inference assumed in these benchmarks is the same or similar. We train six high-performing neural network models on different datasets and show that each of them generalizes poorly when we replace the original test set with a test set taken from another corpus designed for the same task. In light of these results, we argue that most of the current neural network models are not able to generalize well in the task of natural language inference. We find that using large pre-trained language models helps with transfer learning when the datasets are similar enough. Our results also highlight that the current NLI datasets do not cover the different nuances of inference extensively enough.


Introduction
Natural Language Inference (NLI) has attracted considerable interest in the NLP community and, recently, a large number of neural network-based systems have been proposed to deal with the task. These approaches can usually be categorized into: a) sentence encoding models, and b) other neural network models. Both have been very successful, with the state of the art on the SNLI and MultiNLI datasets being 90.1% (Kim et al., 2018) and 86.7% (Devlin et al., 2018) respectively. However, a big question with respect to these systems is their ability to generalize outside the specific datasets they are trained and tested on. Recently, Glockner et al. (2018) have shown that state-of-the-art NLI systems break considerably easily when, instead of being tested on the original SNLI test set, they are tested on a test set which is constructed by taking premises from the training set and creating several hypotheses from them by changing at most one word within the premise. The results show a very significant drop in accuracy for three of the four systems. The system that was most difficult to break and had the smallest loss in accuracy was the system by Chen et al. (2018), which utilizes external knowledge taken from WordNet (Miller, 1995).
In this paper we show that NLI systems that have been very successful on specific NLI benchmarks fail to generalize when trained on a specific NLI dataset and then tested across different NLI benchmarks. The results we get are in line with Glockner et al. (2018), showing that the generalization capability of the individual NLI systems is very limited. What is more, they further show that the only system that was less prone to breaking in Glockner et al. (2018) breaks in the experiments we have conducted as well.

Related Work
The ability of NLI systems to generalize, and related skepticism, has been raised in a recent paper by Glockner et al. (2018). There, the authors show that the accuracy of state-of-the-art NLI systems, in cases where some kind of external lexical knowledge is needed, drops dramatically when the SNLI test set is replaced by a test set where the premise and the hypothesis are identical except for at most one word. Other work also recognizes the generalization problem that comes with training on datasets like SNLI, which tend to be homogeneous, with little linguistic variation, and proposes to train NLI models better by making use of adversarial examples. Gururangan et al. (2018) show that datasets like SNLI and MultiNLI contain unintentional annotation artifacts which help neural network models in classification. On a theoretical and methodological level, there is discussion on the nature of various NLI datasets, as well as the definition of what counts as NLI and what does not. For example, Chatzikyriakidis et al. (2017) present an overview of the most standard datasets for NLI and show that the definitions of inference in each of them are actually quite different.

Data
We chose three different datasets for the experiments: SNLI, MultiNLI and SICK. All of them have been designed for NLI involving three-way classification, and they use the same three labels: entailment, neutral and contradiction. We did not include any datasets with two-way classification, e.g. SciTail. As SICK is a relatively small dataset with only approximately 10k sentence pairs, we did not use it as training data in any experiment. We also trained our models with a combined SNLI + MultiNLI training set.
All the experimental combinations are listed in Table 1. Examples from the selected datasets are provided in Table 2. We describe the three datasets in more detail below.
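As a rough illustration of the set-up described above, the following sketch enumerates train/test pairings from the training and test sets named in the text. Table 1 is not reproduced here, so the full cross product is an assumption and the actual list of experiments may be a subset of it. The function names are hypothetical.

```python
from itertools import product

# Training sets named in the text; SICK is left out because it is too small.
TRAIN_SETS = ["SNLI", "MultiNLI", "SNLI + MultiNLI"]
# Test sets named in the text; MultiNLI-m is the matched development set.
TEST_SETS = ["SNLI", "MultiNLI-m", "SICK"]


def experiment_grid(train_sets, test_sets):
    """Return every (train, test) pairing as a list of tuples."""
    return list(product(train_sets, test_sets))


def is_cross_corpus(train, test):
    """True when the test corpus contributed nothing to the training data."""
    train_parts = {part.strip() for part in train.split("+")}
    test_corpus = test[:-2] if test.endswith("-m") else test
    return test_corpus not in train_parts


for train, test in experiment_grid(TRAIN_SETS, TEST_SETS):
    kind = "cross-corpus" if is_cross_corpus(train, test) else "same-corpus"
    print(f"train on {train:16s} test on {test:11s} ({kind})")
```

The same-corpus pairings serve as the baseline against which the cross-corpus accuracy is compared.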

SNLI
The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) is a dataset of 570k human-written sentence pairs manually labeled with the labels entailment, contradiction, and neutral. The dataset is divided into training (550,152 pairs), development (10,000 pairs) and test sets (10,000 pairs). The premise sentences in SNLI are image captions taken from the Flickr30k corpus (Young et al., 2014).

MultiNLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus (Williams et al., 2018) is a broad-coverage corpus for NLI, consisting of 433k human-written sentence pairs labeled with entailment, contradiction and neutral. Unlike the SNLI corpus, which draws the premise sentence from image captions, MultiNLI consists of sentence pairs from ten distinct genres of both written and spoken English. The dataset is divided into training (392,702 pairs), development (20,000 pairs) and test sets (20,000 pairs).
Only five genres are included in the training set. The development and test sets have been divided into matched and mismatched, where the former includes only sentences from the same genres as the training data, and the latter includes sentences from the remaining genres not present in the training data. We used the matched development set (MultiNLI-m) for the experiments.
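The matched/mismatched distinction can be sketched as a simple filter on genre. This is a minimal illustration, assuming each example carries a "genre" field; the function name, the example ids and the genre values below are placeholders chosen for illustration, not a description of how the corpus was actually built.

```python
def split_matched_mismatched(examples, training_genres):
    """Split examples by whether their genre also occurs in the training data."""
    training_genres = set(training_genres)
    matched = [ex for ex in examples if ex["genre"] in training_genres]
    mismatched = [ex for ex in examples if ex["genre"] not in training_genres]
    return matched, mismatched


# Toy examples; the genre names and ids are illustrative placeholders.
examples = [
    {"id": 1, "genre": "fiction"},
    {"id": 2, "genre": "letters"},
    {"id": 3, "genre": "telephone"},
]
matched, mismatched = split_matched_mismatched(examples, {"fiction", "telephone"})
print([ex["id"] for ex in matched])     # ids whose genre is seen in training
print([ex["id"] for ex in mismatched])  # ids whose genre is unseen in training
```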

SICK
SICK (Marelli et al., 2014) is a dataset that was originally constructed to test compositional distributional semantics (DS) models. The dataset contains 9,840 examples pertaining to logical inference (negation, conjunction, disjunction, apposition, relative clauses, etc.). However, its focus is on distributional semantic approaches, and it therefore normalises several cases that DS is not expected to account for. The dataset consists of approximately 10k test pairs annotated for inference (three-way) and relatedness. It is constructed by taking pairs of sentences from a random subset of the 8K ImageFlickr dataset and the SemEval 2012 STS MSRVideo Description dataset.

Model and Training Details
Table 2: Examples from the selected datasets.

entailment
  SICK
    Premise: A person, who is riding a bike, is wearing gear which is black
    Hypothesis: A biker is wearing gear which is black
  SNLI
    Premise: A young family enjoys feeling ocean waves lap at their feet.
    Hypothesis: A family is at the beach.
  MultiNLI
    Premise: Kal tangled both of Adrin's arms, keeping the blades far away.
    Hypothesis: Adrin's arms were tangled, keeping his blades away from Kal.

contradiction
  SICK
    Premise: There is no man wearing a black helmet and pushing a bicycle
    Hypothesis: One man is wearing a black helmet and pushing a bicycle
  SNLI
    Premise: A man with a tattoo on his arm staring to the side with vehicles and buildings behind him.
    Hypothesis: A man with no tattoos is getting a massage.
  MultiNLI
    Premise: Also in Eustace Street is an information office and a cultural center for children, The Ark.
    Hypothesis: The Ark, a cultural center for kids, is located in Joyce Street.

neutral
  SICK
    Premise: A little girl in a green coat and a boy holding a red sled are walking in the snow
    Hypothesis: A child is wearing a coat and is carrying a red sled near a child in a green and black coat
  SNLI
    Premise: An old man with a package poses in front of an advertisement.
    Hypothesis: A man poses in front of an ad for beer.
  MultiNLI
    Premise: Enthusiasm for Disney's Broadway production of The Lion King dwindles.
    Hypothesis: The broadway production of The Lion King was amazing, but audiences are getting bored.

We perform experiments with five models, two using sentence encoding approaches and three using cross-sentence approaches. For sentence encoding models, we have chosen a simple one-layer bidirectional LSTM with max pooling (BiLSTM-max) with a hidden size of 600D per direction, used e.g. in InferSent (Conneau et al., 2017), and HBMP (Talman et al., 2018). For the other models, we have chosen ESIM (Chen et al., 2017), which includes cross-sentence attention, and KIM (Chen et al., 2018), which has cross-sentence attention and utilizes external knowledge. We also selected one model involving a pretrained language model, namely ESIM + ELMo. All of the models perform well on the SNLI dataset, reaching near state-of-the-art accuracy in the sentence encoding and the other category respectively. KIM is particularly interesting in this context as it performed significantly better than other models in the Breaking NLI experiment conducted by Glockner et al. (2018).
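To make the max pooling step of BiLSTM-max concrete, here is a minimal pure-Python sketch. It assumes the forward and backward hidden states have already been computed by the LSTM (which is omitted), and uses toy two-dimensional states instead of the 600 dimensions per direction mentioned above. The function names are hypothetical and this is not the authors' implementation.

```python
def concat_directions(forward_states, backward_states):
    """Concatenate forward and backward hidden states at each time step."""
    return [f + b for f, b in zip(forward_states, backward_states)]


def max_pool_over_time(hidden_states):
    """Take the element-wise maximum over all time steps."""
    if not hidden_states:
        raise ValueError("need at least one time step")
    return [max(dimension) for dimension in zip(*hidden_states)]


# Toy hidden states: 3 time steps, 2 dimensions per direction.
forward = [[0.1, -2.0], [0.5, -3.0], [-1.0, 0.0]]
backward = [[0.3, 0.2], [-0.4, 0.9], [0.0, 0.1]]
sentence_vector = max_pool_over_time(concat_directions(forward, backward))
print(sentence_vector)  # one fixed-size vector, regardless of sentence length
```

The pooled vector has twice the per-direction hidden size, so its length does not depend on the number of tokens in the sentence.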
For BiLSTM-max we used the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 5e-4. The learning rate was decreased by a factor of 0.2 after each epoch if the model did not improve. We used a batch size of 64. Dropout of 0.1 was used between the layers of the multilayer perceptron classifier, except before the last layer. The models were evaluated on the development data after each epoch and training was stopped if the development loss increased for more than 3 epochs. The model with the highest development accuracy was selected for testing. The BiLSTM-max models were initialized with pretrained 300-dimensional GloVe 840B word embeddings (Pennington et al., 2014), which were fine-tuned during training. Our implementation of BiLSTM-max was done in PyTorch.
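The learning-rate decay, early stopping and model selection described above can be sketched as a small simulation over per-epoch development metrics. Several details are assumptions: "did not improve" is read as no new best development accuracy, the decay is read as multiplying the learning rate by 0.2, and the stopping rule is read as more than 3 consecutive epochs in which the development loss rose. The function name and the toy metrics are hypothetical.

```python
def simulate_training_schedule(dev_metrics, initial_lr=5e-4, decay=0.2, patience=3):
    """Replay the schedule on a list of (dev_loss, dev_accuracy) per epoch."""
    if not dev_metrics:
        raise ValueError("need at least one epoch of development metrics")
    lr = initial_lr
    best_acc, best_epoch = float("-inf"), None
    prev_loss, rising_epochs = None, 0
    learning_rates = []
    for epoch, (loss, acc) in enumerate(dev_metrics):
        learning_rates.append(lr)  # learning rate used during this epoch
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch  # checkpoint kept for testing
        else:
            lr *= decay  # assumption: decay whenever accuracy fails to improve
        # Count consecutive epochs in which the development loss went up.
        if prev_loss is not None and loss > prev_loss:
            rising_epochs += 1
        else:
            rising_epochs = 0
        prev_loss = loss
        if rising_epochs > patience:
            break
    return {
        "best_epoch": best_epoch,
        "stopped_epoch": epoch,
        "learning_rates": learning_rates,
    }


# Made-up development metrics, for illustration only.
toy_metrics = [
    (0.50, 0.80), (0.45, 0.82), (0.47, 0.81), (0.48, 0.81),
    (0.49, 0.80), (0.50, 0.80), (0.51, 0.79), (0.40, 0.90),
]
result = simulate_training_schedule(toy_metrics)
print(result["best_epoch"], result["stopped_epoch"])
```

In this toy run the loss rises for four consecutive epochs, so training stops before the last two entries are ever seen, and the checkpoint with the best development accuracy up to that point is the one selected.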
For HBMP, ESIM and KIM we used the original implementations as well as the default settings and hyperparameter values as described in Talman et al. (2018), Chen et al. (2017) and Chen et al. (2018) respectively, adjusting only the vocabulary size based on the dataset used. For ESIM + ELMo we used the AllenNLP PyTorch implementation with the default settings and hyperparameter values.

Experimental Results
Our experiments show that, while all five models perform well when the test set is taken from the same corpus as the training and development set, accuracy drops significantly when the test data is drawn from a separate corpus, the average drop in accuracy being 25.4 points across all experiments.
The accuracy drops the most when a model is tested on SICK. The drop in this case is between 19.0 and 28.9 points when trained on MultiNLI, between 31.6 and 33.7 points when trained on SNLI and between 31.1 and 33.0 points when trained on SNLI + MultiNLI. This result was somewhat expected, as the method of constructing the sentence pairs was different, and there is therefore too much difference in the kind of sentence pairs between the training and test sets for the models to be able to transfer what they have learned to the test examples. However, the drop in accuracy was not expected to be that dramatic.
The most surprising result was that the accuracy of all models drops significantly even in the set-up where the models were trained on MultiNLI and tested on SNLI (7.9-11.1 points). This is surprising as both of these datasets have been constructed with a similar data collection method using the same definition of inference. In contrast to the findings of Glockner et al. (2018), utilizing external knowledge did not improve the model's generalization capability, as KIM performed equally poorly across all dataset combinations. Including a pretrained language model also did not improve the results significantly.
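The drops reported in this section are differences in accuracy, in percentage points, between a same-corpus test set and a test set from another corpus. A minimal sketch of that computation follows. The function names are hypothetical, and the labels are made-up toy data, not results from these experiments.

```python
def accuracy(predictions, gold_labels):
    """Percentage of predictions that match the gold labels."""
    if len(predictions) != len(gold_labels) or not gold_labels:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)


def accuracy_drop(same_corpus_acc, cross_corpus_acc):
    """Drop in percentage points relative to the same-corpus baseline."""
    return same_corpus_acc - cross_corpus_acc


def average_drop(drops):
    """Mean drop over a collection of experiments."""
    return sum(drops) / len(drops)


# Made-up toy labels for two test sets, for illustration only.
same_gold = ["entailment", "neutral", "contradiction", "neutral"]
same_pred = ["entailment", "neutral", "contradiction", "entailment"]
cross_gold = ["neutral", "neutral", "entailment", "contradiction"]
cross_pred = ["neutral", "entailment", "entailment", "neutral"]

drop = accuracy_drop(accuracy(same_pred, same_gold), accuracy(cross_pred, cross_gold))
print(drop)
```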

Conclusion
In this paper we have shown that neural network models for NLI fail to generalize across different NLI benchmarks. We experimented with five state-of-the-art models covering both sentence encoding approaches and cross-sentence attention models. For all the systems, the accuracy drops by between 7.9 and 33.7 points (the average drop being 25.4 points) when testing with a test set drawn from a separate corpus from that of the training data, as compared to when the test and training data are splits from the same corpus.
Our findings, together with previous negative findings, e.g. by Glockner et al. (2018) and Gururangan et al. (2018), indicate that the current state-of-the-art neural network models fail to capture the semantics of NLI in a way that will enable them to generalize across different NLI situations. The results indicate two issues to be taken into consideration: a) models trained on datasets that cover only a fraction of what NLI is will fail when tested on datasets that test for a slightly different definition of inference. This is evident when we move from the SNLI to the SICK dataset. b) NLI is to some extent also genre/context dependent. Training on SNLI and testing on MultiNLI gives worse results than vice versa. This can be seen as an indication that training on multiple genres helps. However, this is still not enough given that, even when training on MultiNLI and testing on SNLI, accuracy drops significantly. Further work is required on better data resources as well as on better neural network models to tackle these issues.