Learning Robust Representations of Text

Deep neural networks have achieved remarkable results across many language processing tasks; however, these methods are highly sensitive to noise and adversarial attacks. We present a regularization-based method for limiting network sensitivity to its inputs, inspired by ideas from computer vision, thus learning models that are more robust. Empirical evaluation over a range of sentiment datasets with a convolutional neural network shows that, compared to a baseline model and the dropout method, our method achieves superior performance over noisy inputs and out-of-domain data.

However, deep models are often overconfident for noisy test instances, making them susceptible to adversarial attacks (Nguyen et al., 2015; Tabacof and Valle, 2016). Goodfellow et al. (2014) argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, due to neural models being intentionally designed to behave in a mostly linear manner to facilitate optimization. Fawzi et al. (2015) provided a theoretical framework for analyzing the robustness of classifiers to adversarial perturbations, and also showed linear models are usually not robust to adversarial noise.
In this work, we present a regularization method which makes deep learning models more robust to noise, inspired by Rifai et al. (2011). The intuition behind the approach is to stabilize predictions by minimizing the ability of features to perturb predictions, based on high-order derivatives. Rifai et al. (2011) introduced contractive autoencoders based on similar ideas, using the Frobenius norm of the Jacobian matrix as a penalty term to extract robust features.
Also related, Martens (2010) investigated a second-order optimization method based on a Hessian-free approach for training deep auto-encoders. Where our proposed approach differs is that we train models using first-order derivatives of the training loss as part of a regularization term, necessitating second-order derivatives for computing the gradient. We empirically demonstrate the effectiveness of the model over text corpora with increasing amounts of artificial masking noise, using a range of sentiment analysis datasets (Pang and Lee, 2008) with a convolutional neural network model (Kim, 2014). In this, we show that our method is superior to dropout (Srivastava et al., 2014) and a baseline method using MAP training.

Training for Robustness
Our method introduces a regularization term during training to ensure model robustness. We develop our approach based on a general class of parametric models, with the following structure. Let $x$ be the input, which is a sequence of (discrete) words, represented by a fixed-size vector of continuous values, $h$. A transfer function takes $h$ as input and produces an output distribution, $y_{\mathrm{pred}}$. Training proceeds using stochastic gradient descent to minimize a loss function $L$, measuring the difference between $y_{\mathrm{pred}}$ and the truth $y_{\mathrm{true}}$.
The purpose of our work is to learn neural models which are more robust to strange or invalid inputs. When small perturbations are applied on $x$, we want the prediction $y_{\mathrm{pred}}$ to remain stable. Text can be highly variable, allowing for the same information to be conveyed with different word choice, different syntactic structures, typographical errors, stylistic changes, etc. This is a particular problem in transfer learning scenarios such as domain adaptation, where the inputs in distinct domains are drawn from related, but different, distributions. A good model should be robust to these kinds of small changes to the input, and produce reliable and stable predictions.
Next we discuss methods for learning models which are robust to variations in the input, before providing details of the neural network model used in our experimental evaluation.

Conventional Regularization and Dropout
Conventional methods for learning robust models include $\ell_1$ and $\ell_2$ regularization (Ng, 2004), and dropout (Srivastava et al., 2014).
In fact, Wager et al. (2013) showed that the dropout regularizer is first-order equivalent to an $\ell_2$ regularizer applied after scaling the features. Dropout is also equivalent to "Follow the Perturbed Leader" (FPL) which perturbs exponential numbers of experts by noise and then predicts with the expert of minimum perturbed loss for online learning robustness (van Erven et al., 2014). Given its popularity in deep learning, we take dropout to be a strong baseline in our evaluation.
The key idea behind dropout is to randomly zero out units, along with their connections, from the network during training, thus limiting the extent of co-adaptation between units. We apply dropout on the representation vector $h$, denoted $\hat{h} = \mathrm{dropout}_{\beta}(h)$, where $\beta$ is the dropout rate. Similarly to our proposed method, training with dropout requires gradient-based search for the minimizer of the loss $L$.
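As an illustration of dropout applied to the representation vector, the following minimal sketch zeroes each unit of h independently with probability β. The function name `dropout`, the example vector, and the rescaling of surviving units by 1/(1 − β) (so-called inverted dropout) are assumptions made for the example; the text above does not specify them.

```python
import random


def dropout(h, beta, rng):
    """Zero each unit of h independently with probability beta.

    Surviving units are rescaled by 1 / (1 - beta) so that the expected
    activation is unchanged. This rescaling is an assumption of the sketch.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    keep = 1.0 - beta
    # rng.random() is uniform on [0, 1), so a unit is dropped with probability beta.
    return [x / keep if rng.random() >= beta else 0.0 for x in h]


rng = random.Random(0)          # fixed seed so the example is repeatable
h = [0.5, -1.2, 3.0, 0.7]       # toy representation vector
h_hat = dropout(h, beta=0.5, rng=rng)
```

At test time the function would simply not be called, leaving h unchanged.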
We also use dropout to generate noise in the test data as part of our experimental simulations, as we will discuss later.

Robust Regularization
Our method is inspired by the work on adversarial training in computer vision (Goodfellow et al., 2014).
In image recognition tasks, small distortions that are indiscernible to humans can significantly distort the predictions of neural networks (Szegedy et al., 2014). The intuition behind our regularization method is that, when noise is applied to the data, the variation of the output should be kept lower than the noise. We adapt this idea from Rifai et al. (2011) and develop the Jacobian regularization method.
The proposed regularization method works as follows. Conventional training seeks to minimise the difference between $y_{\mathrm{true}}$ and $y_{\mathrm{pred}}$. However, in order to make our model robust against noise, we also want to minimize the variation of the output when noise is applied to the input. That is to say, when perturbations are applied to the input, there should be as little perturbation in the output as possible. Formally, the perturbations of the output can be written as $p_y = M(x + p_x) - M(x)$, where $x$ is the input, $p_x$ is the vector of perturbations applied to $x$, $M$ expresses the trained model, $p_y$ is the vector of perturbations generated by the model, and the output distribution is $y = M(x)$. Therefore, for small $p_x$, a first-order expansion gives $p_y \approx \frac{\partial y}{\partial x} \, p_x$, and

$$\mathrm{distance}\left( \lim_{p_x \to 0} p_y / p_x,\; 0 \right) = \left\lVert \frac{\partial y}{\partial x} \right\rVert_F .$$
In other words, minimising local noise sensitivity is equivalent to minimising the Frobenius norm of the Jacobian matrix of partial derivatives of the model outputs with respect to its inputs.
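The relationship between output perturbations and the Jacobian can be checked numerically. The sketch below is illustrative only: `model` is a hypothetical toy stand-in for M (a fixed linear map followed by a softmax, with made-up weights), and the Jacobian is estimated by central differences rather than by the analytic derivatives a real implementation would use.

```python
import math


def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def model(x):
    """Hypothetical toy stand-in for M: fixed linear map followed by softmax."""
    W = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.0]]
    b = [0.1, -0.1]
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(z)


def jacobian(f, x, eps=1e-6):
    """Central-difference estimate of J[i][j] = d f(x)[i] / d x[j]."""
    columns = []
    for j in range(len(x)):
        hi, lo = list(x), list(x)
        hi[j] += eps
        lo[j] -= eps
        columns.append([(a - c) / (2 * eps) for a, c in zip(f(hi), f(lo))])
    return [list(row) for row in zip(*columns)]  # transpose columns into rows


def frobenius_norm(J):
    return math.sqrt(sum(v * v for row in J for v in row))


x = [0.2, -0.4, 1.0]
J = jacobian(model, x)
p_x = [1e-4, -2e-4, 5e-5]                        # small input perturbation
p_y = [a - c for a, c in zip(model([xi + d for xi, d in zip(x, p_x)]), model(x))]
p_y_linear = [sum(Jij * d for Jij, d in zip(row, p_x)) for row in J]
sensitivity = frobenius_norm(J)                  # the quantity the penalty would shrink
```

For perturbations this small, `p_y` and `p_y_linear` should agree closely, which is the sense in which the Frobenius norm of the Jacobian summarises local noise sensitivity.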
To minimize the effect of perturbation noise, our method involves an additional term in the loss function, in the form of the derivative of loss $L$ with respect to hidden layer $h$. Note that while in principle we could consider robustness to perturbations in the input $x$, the discrete nature of $x$ adds additional mathematical complications, and thus we defer this setting for future work. Combining the elements, the new loss function can be expressed as

$$\mathcal{L} = L + \lambda \cdot \mathrm{distance}\left( \frac{\partial L}{\partial h},\; 0 \right) \qquad (1)$$

where $\lambda$ is a weight term, and distance takes the form of the $\ell_2$ norm. The training objective in Equation (1) supports gradient optimization, but note that it requires the calculation of second-order derivatives of $L$ during back propagation, arising from the $\partial L / \partial h$ term. Henceforth we refer to this method as robust regularization.
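To make the regularised objective concrete, the sketch below evaluates the loss plus λ times the ℓ2 norm of ∂L/∂h for a softmax output layer with cross-entropy loss, where the gradient has the closed form ∂L/∂h = Wᵀ(p − y). The function name `robust_loss`, the toy weights, and the use of a closed-form gradient are assumptions for illustration; a practical implementation would more likely rely on an automatic differentiation library, which would also supply the second-order terms needed to train the parameters.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def robust_loss(h, W, b, label, lam):
    """Return (L + lam * ||dL/dh||_2, L, dL/dh) for a softmax output layer.

    h: hidden vector; W: one weight row per label; b: one bias per label.
    """
    z = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]
    p = softmax(z)
    ce = -math.log(p[label])                                  # cross-entropy L
    err = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    # dL/dh = W^T (p - y) for softmax with cross-entropy
    grad_h = [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(h))]
    penalty = math.sqrt(sum(g * g for g in grad_h))           # l2 norm of dL/dh
    return ce + lam * penalty, ce, grad_h


W = [[0.5, -1.0, 0.2], [-0.3, 0.4, 0.9]]   # toy parameters, two labels
b = [0.0, 0.1]
h = [1.0, 2.0, -0.5]
total, ce, grad_h = robust_loss(h, W, b, label=0, lam=0.01)
```

Because the penalty is itself a function of a gradient, differentiating `total` with respect to W and b involves second derivatives of the cross-entropy, which is the extra training cost noted above.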

Convolutional Network
For the purposes of this paper, we focus exclusively on convolutional neural networks (CNNs), but stress that the method is compatible with other neural architectures and other types of parametric models (not just deep neural networks).The CNN used in this research is based on the model proposed by Kim (2014), and is outlined below.
Let $S$ be the sentence, consisting of $n$ words $\{w_1, w_2, \cdots, w_n\}$. A look-up table is applied to $S$, made up of word vectors $e_{w_i} \in \mathbb{R}^m$ corresponding to each word $w_i$, where $m$ is the word vector dimensionality. Thus, sentence $S$ can be represented as a matrix $E_S \in \mathbb{R}^{m \times n}$ by concatenating the word vectors, $E_S = \bigoplus_{i=1}^{n} e_{w_i}$. A convolutional layer combined with a number of wide convolutional filters is applied to $E_S$. Specifically, the $k$-th convolutional filter operator $\mathrm{filter}_k$ involves a weight vector $w_k \in \mathbb{R}^{m \times t_k}$, which works on every $t_k$-sized window of $E_S$, and is accompanied by a bias term $b \in \mathbb{R}$. The filter operator is followed by the non-linear function $F$, a rectified linear unit (ReLU), followed by a max-pooling operation, to generate a hidden activation $h_k = \mathrm{MaxPooling}(F(\mathrm{filter}_k(E_S; w_k, b)))$. Multiple filters with different window sizes are used to learn different local properties of the sentence. We concatenate all the hidden activations $h_k$ to form a hidden layer $h$, with size equal to the number of filters. Details of parameter settings can be found in Section 3.2.
The feature vector $h$ is fed into a final softmax layer with a linear transform to generate a probability distribution over labels,

$$y_{\mathrm{pred}} = \mathrm{softmax}(w \cdot h + b),$$

where $w$ and $b$ are parameters. Finally, the model minimizes the cross-entropy loss between the ground-truth and the model prediction, $L = \mathrm{CrossEntropy}(y_{\mathrm{true}}, y_{\mathrm{pred}})$, for which we use stochastic gradient descent.
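The forward pass described above can be sketched in a few lines. This is a simplified illustration, not the model used in the experiments: it uses narrow convolution (no padding) rather than wide convolution, stores the sentence as a list of n word vectors rather than an m × n matrix, and all names and toy weights (`filter_activation`, `predict`, `filters`, `W_out`, `b_out`) are hypothetical.

```python
import math


def relu(v):
    return max(0.0, v)


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def filter_activation(E, w, b):
    """h_k = MaxPooling(ReLU(filter_k(E; w, b))) for a single filter.

    E: list of n word vectors (each of dimension m).
    w: list of t weight rows (each of dimension m), one row per window position.
    Narrow convolution is assumed, so the sentence must have at least t words.
    """
    t = len(w)
    if len(E) < t:
        raise ValueError("sentence is shorter than the filter window")
    responses = []
    for start in range(len(E) - t + 1):
        window = E[start:start + t]
        s = b + sum(wv * ev
                    for w_row, e_row in zip(w, window)
                    for wv, ev in zip(w_row, e_row))
        responses.append(relu(s))
    return max(responses)                      # max-pooling over window positions


def predict(E, filters, W_out, b_out):
    """Concatenate one activation per filter into h, then apply a softmax layer."""
    h = [filter_activation(E, w, b) for w, b in filters]
    z = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(W_out, b_out)]
    return h, softmax(z)


# Toy sentence of n = 4 words with m = 2 dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
filters = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.0),                 # window size t = 2
    ([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], -1.0),    # window size t = 3
]
W_out = [[1.0, -1.0], [-1.0, 1.0]]
b_out = [0.0, 0.0]
h, y_pred = predict(E, filters, W_out, b_out)
```

With these toy values the hidden layer has one entry per filter, matching the statement that the size of h equals the number of filters.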

Datasets and Experimental Setups
We experiment on the following datasets, following Kim (2014):
• MR: Sentence polarity dataset (Pang and Lee, 2008)
In each case, we evaluate using classification accuracy.

Noisifying the Data
Unlike conventional evaluation, we corrupt the test data with noise in order to evaluate the robustness of our model. We assume that when dealing with short text such as Twitter posts, it is common to see unknown words due to typos, abbreviations and sociolinguistic marking of different types (Han and Baldwin, 2011; Eisenstein, 2013). To simulate this, we apply word-level dropout noise to each document, by randomly replacing words with a unique sentinel symbol. This is applied to each word with probability $\alpha \in \{0, 0.1, 0.2, 0.3\}$.
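The noising procedure is simple to sketch. In the example below, the sentinel string `"<unk>"`, the function name `add_word_dropout_noise`, and the example sentence are hypothetical choices made for illustration; the text only specifies that each word is replaced by a unique sentinel symbol with probability α.

```python
import random

SENTINEL = "<unk>"   # hypothetical sentinel symbol; the actual symbol is not specified


def add_word_dropout_noise(tokens, alpha, rng):
    """Replace each token independently with SENTINEL with probability alpha."""
    return [SENTINEL if rng.random() < alpha else tok for tok in tokens]


rng = random.Random(13)                                   # fixed seed for repeatability
tokens = "the movie was surprisingly good".split()
noisy = add_word_dropout_noise(tokens, alpha=0.3, rng=rng)
```

Setting alpha to 0 leaves the document in its original form, which corresponds to the noise-free test condition.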
We also experimented with adding different levels of Gaussian noise to the sentence embeddings $E_S$, but found the results to be largely consistent with those for word dropout noise, and therefore we have omitted these results from the paper.
To directly test the robustness under a more realistic setting, we additionally perform cross-domain evaluation, where we train a model on one dataset and apply it to another. For this, we use the pairing of MR and CR, where the first dataset is based on movie reviews and the second on product reviews, but both use the same label set. Note that there is a significant domain shift between these corpora, due to the very nature of the items reviewed.

Word Vectors and Hyper-parameters
To set the hyper-parameters of the CNN, we follow the guidelines of Zhang and Wallace (2015), setting word embeddings to $m = 300$ dimensions and initialising based on word2vec pre-training (Mikolov et al., 2013). Words not in the pre-trained vector table were initialized randomly by the uniform distribution $U([-0.25, 0.25)^m)$. The window sizes of filters ($t$) are set to 3, 4, 5, with 128 filters for each size, resulting in a hidden layer dimensionality of $384 = 128 \times 3$. We use the Adam optimizer (Kingma and Ba, 2015) for training.
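As a small worked example of these settings, the sketch below draws a random vector for an out-of-vocabulary word and recomputes the hidden layer size. The constant and function names are hypothetical, and the seed is arbitrary.

```python
import random

EMBEDDING_DIM = 300              # m
WINDOW_SIZES = (3, 4, 5)         # t
FILTERS_PER_SIZE = 128


def init_unknown_word_vector(m, rng, low=-0.25, high=0.25):
    """Draw each coordinate independently and uniformly from [low, high)."""
    return [low + (high - low) * rng.random() for _ in range(m)]


rng = random.Random(42)
vector = init_unknown_word_vector(EMBEDDING_DIM, rng)
hidden_size = FILTERS_PER_SIZE * len(WINDOW_SIZES)   # 128 filters x 3 window sizes
```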

Results and Discussions
The results for word-level dropout noise are presented in Table 1. In general, increasing the word-level dropout noise leads to a drop in accuracy for all four datasets; however, the relative drop-off in accuracy for Robust Regularization is less than for Word Dropout, and in 15 out of 16 cases (four noise levels across the four datasets), our method achieves the best result. Note that this includes the case of $\alpha = 0$, where the test data is left in its original form, which shows that Robust Regularization is also an effective means of preventing overfitting in the model.
For each dataset, we also evaluated based on the combination of Word Dropout and Robust Regularization using the fixed parameters $\beta = 0.5$ and $\lambda = 10^{-2}$, which are overall the best individual settings. The combined approach performs better than either individual method for the highest noise levels tested across all datasets. This indicates that Robust Regularization acts in a complementary way to Word Dropout.
Table 2 presents the results of the cross-domain experiment, whereby we train a model on MR and test on CR, and vice versa, to measure the robustness of the different regularization methods in a more real-world setting. Once again, we see that our regularization method is superior to word-level dropout and the baseline CNN, and the techniques combined do very well, consistent with our findings for synthetic noise.

Running Time
Our method requires second-order derivatives, and thus is a little slower at training time. Figure 1 is a plot of the training and test accuracy at varying points during training over SST.
We can see that the runtime until convergence is only slightly slower for Robust Regularization than standard training, at roughly 30 minutes on a two-core CPU (one fold) with standard training vs. 35-40 minutes with Robust Regularization. The convergence time for Robust Regularization is comparable to that for Word Dropout.

Conclusions
In this paper, we present a robust regularization method which explicitly minimises a neural model's sensitivity to small changes in its hidden representation. Based on evaluation over four sentiment analysis datasets using convolutional neural networks, we found our method to be both superior and complementary to conventional word-level dropout under varying levels of noise, and in a cross-domain evaluation.
For future work, we plan to apply our regularization method to other models and tasks to determine how generally applicable our method is.Also, we will explore methods for more realistic linguistic noise, such as lexical, syntactic and semantic noise, to develop models that are robust to the kinds of data often encountered at test time.

Figure 1: Time-accuracy evaluation over the different combinations of Word Dropout (dropout) and Robust Regularization (robust reg) over SST, without injecting noise.

Table 1: Accuracy (%) with increasing word-level dropout across the four datasets. For each dataset, we apply four levels of noise $\alpha \in \{0, 0.1, 0.2, 0.3\}$; the best result for each combination of $\alpha$ and dataset is indicated in bold. The Baseline model is a simple CNN model without regularization. The last model combines dropout and our method with fixed parameters $\beta$ and $\lambda$ as indicated.

Table 2: Accuracy under cross-domain evaluation; the best result for each dataset is indicated in bold.