When data permutations are pathological: the case of neural natural language inference

Consider two competitive machine learning models, one considered state-of-the-art and the other a competitive baseline. Suppose that just by permuting the examples of the training set, say by reversing the original order, by shuffling, or by mini-batching, you could report substantially better or worse performance for the system of your choice, by multiple percentage points. In this paper, we illustrate this scenario for a trending NLP task: Natural Language Inference (NLI). We show that for the two central NLI corpora today, the learning process of neural systems is far too sensitive to permutations of the data. In doing so, we reopen the question of how to judge a good neural architecture for NLI given the available datasets, and perhaps, further, the soundness of the NLI task itself in its current state.


Introduction
There is increased interest today in the detection of information quality: whether a statement is true or false, or equivalently, whether one statement (the premise) is entailed by, contradicts or has no relation to another statement (the hypothesis). This is the Natural Language Inference task. The timely development of the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and more recently the Multi-Genre NLI (MULTI-NLI) corpus (Williams et al., 2017) has led to a steady increase in contributions to research on NLI. Recent research, however, indicates that these central datasets are, to a large degree, trivially annotated (Gururangan et al., 2018). In this paper, we give evidence of an unrelated problem.
Deep neural network approaches provide the state-of-the-art today for this task. We show a pathological sensitivity of these systems to permutations of the training set. This calls into question the soundness of the task in its current state, and the corresponding development and benchmarking of neural NLI systems. Given the approximate iterative optimisation methods required to induce models, common practice in statistical learning is to first randomly shuffle the training set. This serves to offset unwanted bias due to accidental ordering of the examples (for example, with respect to time or class). Probably because of this, there is very little literature on, or understanding of, the effect of order among the training examples: strict ordering of examples has simply been known to be undesirable. However, as neural network approaches increasingly dominate in performance across many NLP tasks, the notion of random shuffling has become overshadowed by that of computational efficiency. This practice seems to lead back to the experimental parameters of Sutskever et al. (2014), who demonstrate that grouping examples of similar sentence length into batches halves training time. But what is the effect on accuracy? Indeed, a curiosity of neural network models induced over the NLI datasets is the ease with which rather unstable results can be obtained through simple permutations of the example order of the training set.
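The length-based batching practice referred to above can be illustrated with a minimal sketch (our own illustrative code, not drawn from any of the cited systems); note how it necessarily imposes a non-random order on the training examples:

```python
import random

def length_bucketed_batches(examples, batch_size):
    """Group examples of similar length into mini-batches (cf. Sutskever
    et al., 2014), reducing padding and thus training time. Only the
    order of the batches is shuffled, not the examples themselves."""
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # batch-level shuffling only
    return batches
```

Within each batch, examples remain sorted by length, so the effective example order seen by the optimiser is far from a uniform random shuffle.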
In this paper, we present an investigation into these questions. In particular, we consider two simple but competitive neural network topologies in order to investigate the effect of the training example order of these datasets on performance. One of these achieves close to state-of-the-art results on SNLI and state-of-the-art results on MULTI-NLI (among simple systems, cf. Section 4.1), while the other is a simpler variant of the first, characterised by fewer parameters. Out of the standard deep learning, standard machine learning and other plausible orderings of the dataset, we show that only the original ordering of the training examples leads to state-of-the-art model induction. We also show that the gap in performance between the two neural architectures described here generally drops under all other permutations.

NLI task and datasets
The NLI task input is two sentences, a = (a_1, ..., a_{l_a}) and b = (b_1, ..., b_{l_b}), of lengths l_a and l_b respectively. Each a_i (resp. b_j) for i in [l_a] (resp. j in [l_b]) corresponds to the word embedding of dimension d for the i-th (resp. j-th) word. The task dataset consists of labeled pairs of sentences, {(a^(n), b^(n), y^(n))}_{n=1}^N, where y^(n) in {entailment, contradiction, neutral} is the class.
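Concretely, a single hypothetical instance of this input format can be sketched as follows (the dimensions match the hyperparameters used later in the paper; the values here are random placeholders, not real data):

```python
import numpy as np

# A toy NLI instance: each sentence is a sequence of d-dimensional
# word embeddings, paired with one of three class labels.
d = 300                      # embedding dimension (GloVe 840B in this paper)
la, lb = 7, 5                # premise / hypothesis lengths
a = np.random.randn(la, d)   # premise   a = (a_1, ..., a_{l_a})
b = np.random.randn(lb, d)   # hypothesis b = (b_1, ..., b_{l_b})
y = "entailment"             # y in {entailment, contradiction, neutral}
```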
We use two central datasets for our study here: SNLI and MULTI-NLI (Bowman et al., 2015; Williams et al., 2017), containing over 570K and 433K sentence pairs respectively.

Our neural network architectures
For the research presented in this paper, we choose two related and relatively simple neural network architectures, corresponding roughly to Parikh et al. (2016) and to a simplified version of Chen et al. (2017), each consisting of five components. Note that we also downloaded Chen et al. (2017)'s code and ran it over the datasets; since it did not reach the accuracy reported in the paper on our run (87.77% instead of the reported 88%), and since it took several hours longer than our implementation to run (+7 hours), we concentrated on our own implementation. We also attempted to run Gong et al. (2017)'s system; on our runs, the system halted without completing after several hours. Moreover, there are reports on the difficulty of getting that architecture to achieve the 88% accuracy reported in the original paper, with Mirakyan et al. (2018) reporting that their re-implementation could only achieve 86.38%. This performs significantly worse than our best system, which we present now.
(1) Pre-projection. To compensate approximately for not updating the original embeddings during learning, we first carry out a preliminary projection of the embeddings, to the same number of dimensions, using a feed-forward network.
(2) Embedding projection. We further project the embeddings via either a simple feed-forward (FF) layer with a ReLU activation function or a bidirectional LSTM (BiLSTM) layer. The result is then sent to the attention component. This corresponds to Parikh et al. (2016)'s computationally efficient approximation of the vector product before soft-alignment.
(3) Attention. The attention mechanism, first introduced by Bahdanau et al. (2015), is based on a matrix of all-pairs scores between the elements of the two sequences, e_ij = a'_i · b'_j, where a'_i and b'_j denote the projected representations from component (2). Following Parikh et al. (2016) and later attention-based models for NLI, we represent the importance of a_i with respect to b as the normalised sum β_i = Σ_j (exp(e_ij) / Σ_k exp(e_ik)) b'_j, which is then projected down to the original embedding dimension. The same is done for each b_j with respect to a. (4) Aggregation. The resulting sequence of vectors for each sentence is aggregated into a single sentence vector, by summation or by an LSTM (see the instantiated architectures below). (5) Prediction. Finally, we feed a vector concatenation of both sentence vectors as input to a component consisting of three feed-forward layers with dropout and regularisation, followed by a linear softmax layer for prediction.
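The normalised attention sum in component (3) can be sketched as follows. This is a minimal numpy illustration of the soft-alignment step only; it assumes the learned projections have already been applied and omits the down-projection:

```python
import numpy as np

def soft_align(a_proj, b_proj):
    """Decomposable-attention-style soft alignment: score every pair
    (a_i, b_j) by a dot product, then represent each a_i by the
    softmax-normalised weighted sum of the b_j vectors."""
    e = a_proj @ b_proj.T                          # (la, lb) all-pairs scores
    w = np.exp(e - e.max(axis=1, keepdims=True))   # numerically stable softmax
    w /= w.sum(axis=1, keepdims=True)              # rows sum to 1
    beta = w @ b_proj                              # (la, d): b-summary per a_i
    return beta
```

The symmetric computation (each b_j attended over a) is the same call with the arguments swapped.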
Instantiated architectures. The two topologies we adopt for this study consist of the above components, with embedding projection and aggregation as follows:
• FF/SUM, in which the embedding projection is instantiated with a feed-forward network and aggregation is carried out through vector summation, and
• Bi/LSTM, in which the embedding projection is instantiated with a BiLSTM and aggregation is carried out via an LSTM.
Other hyperparameters. We use 300-dimensional GloVe embeddings trained on the Common Crawl 840B tokens dataset (Pennington et al., 2014), which remain fixed during training. Out-of-vocabulary (OOV) words are represented as zero vectors. We use a 0.2 dropout rate and L2 regularisation, applied in all feed-forward layers. We optimise the network with a categorical cross-entropy loss using the RMSprop optimizer with ρ set to 0.9, a learning rate of 0.001 and a batch size of 512, and we use early stopping when accuracy on the development set has not improved for 4 epochs.
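The early-stopping rule above amounts to the following logic (our own sketch of the stopping criterion, not the paper's training code):

```python
def early_stopping(dev_accuracies, patience=4):
    """Return the epoch at which training halts: stop once `patience`
    epochs have passed with no improvement in development accuracy."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(dev_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch      # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch                        # patience exhausted
    return len(dev_accuracies) - 1              # ran out of epochs
```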
State-of-the-art for simple systems. The models are simple in that no information beyond word embeddings is taken as input: for example, no POS-tags or syntactic relations (see Section 4.1). Our Bi/LSTM system also currently sets the state-of-the-art for simple systems on the MULTI-NLI dataset (see Section 4.1).

Related work
Previous work is related either in terms of the neural architectures for NLI (Section 4.1), or in terms of work on training data permutations in learning (Section 4.2).

State-of-the-art NLI
There are different types of neural network systems in the literature with respect to the simplicity of the input data required for modeling and the interdependence of the internal modules. In this work, we only consider simple system approaches that use only word embeddings (no character representations, POS-tags, word positions, syntactic trees, external resources, etc.) and consist only of interdependent modules (not ensembles). We make no claims regarding linguistically enriched or ensemble systems, but make the straightforward hypothesis that the conclusions presented here can be extended to cover more complex frameworks as well, especially given that our systems are strongly competitive with, or even better performing than, two highly complex state-of-the-art neural systems (cf. Section 3).
For simple systems, the state-of-the-art on SNLI is currently set by Sha et al. (2016) at 87.5%. They use a standard BiLSTM to read the premise, and propose a Bi-rLSTM to read the hypothesis. Their proposed rLSTM ("re-read" LSTM) unit takes the attention vector of one sentence as an inner state while reading the other sentence. The output of the standard BiLSTM is also taken as the general input of the bidirectional rLSTM. The next best published simple system, Parikh et al. (2016) at 86.3% accuracy, introduced the use of the attention mechanism for the NLI task in the form in which it is generally used today.

Example permutation in learning
Morishita et al. (2017) explored the effect of mini-batching on the learning of Neural Machine Translation models, carrying out their experiments on two datasets (two language pairs). In particular, they studied, among other things, the strategies of sorting by the length of (1) the source sentence, (2) the target sentence, or (3) both. They empirically compared the efficiency of these strategies on two translation tasks and found that some strategies in wide use are not necessarily optimal in terms of accuracy and convergence. In contrast to the work described here, however, one of their sorting strategies produced the best results, though no comparison was made with the original ordering of the examples. By contrast, we show that for NLI, on the two available datasets, the best results are achieved only with a non-canonical (semi-non-random) ordering of the data, namely the original one. In addition to each of these permutations, we consider the reversal of the order (indicated by the suffix -r). In order not to exhaust the test set, we generate the results for the data permutations on the development set. These are given in Table 2.

Discussion. For both datasets, we observe that all other training example permutations result in a substantial drop in performance, by approximately 3-4% on SNLI and 1-6% on MULTI-NLI. Even a simple reversal of the original order leads to a substantial drop. Shuffling the data consistently provides the strongest alternative training condition to the original ordering.
Moreover, the difference in performance between the two architectures is generally much lower on all other permutations of the training data, calling into question the significance of the more complex components. These observations apply both to random shuffling (as advised in statistical learning practice) and to ordering the data by length (as advised in deep learning practice for NLP).
For an analysis of the sorting permutations, we looked into whether hypothesis sentence lengths differed by class to such an extent that the dataset became sorted by class label. We cannot include the results here due to space constraints. However, we observed that ordering by class results in small (and quite interesting) drops in performance, as long as the original order is otherwise preserved. If we first randomly shuffle the examples before ordering them by class, drops in performance similar to those in Table 2 result.
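The kinds of alternative orderings compared in this section can be generated reproducibly; the following is a hypothetical Python sketch (the permutation names and the `hypothesis` field are our own illustrative choices, not the paper's experimental code):

```python
import random

def permutations_of(train, seed=0):
    """Return several orderings of a training set: the original order,
    its reversal (suffix -r), a seeded random shuffle, and sortings by
    hypothesis sentence length."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    by_len = lambda ex: len(ex["hypothesis"])
    return {
        "original":   list(train),
        "original-r": list(reversed(train)),
        "shuffle":    shuffled,
        "sorted":     sorted(train, key=by_len),
        "sorted-r":   sorted(train, key=by_len, reverse=True),
    }
```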

Conclusions
We have shown that models induced over the SNLI and MULTI-NLI datasets are greatly affected by the permutation of the training data instances at hand: recommended statistical learning or deep learning engineering strategies for ordering the training examples result in significantly, and even substantially, worse performance on these datasets. Our models are simple (no information beyond word embedding representations of sentences, and no ensembles), but strongly competitive with, or better performing than, both SOTA (re-)implementations of much more complex neural systems. We make the straightforward hypothesis that these observations will extend to more complex models; we leave this to be verified by future work.