Robust Training under Linguistic Adversity

Deep neural networks have achieved remarkable results across many language processing tasks; however, they have been shown to be susceptible to overfitting and highly sensitive to noise, including adversarial attacks. In this work, we propose a linguistically-motivated approach for training robust models, based on exposing the model to corrupted text examples at training time. We consider several flavours of linguistically plausible corruption, including lexical semantic and syntactic methods. Empirically, we evaluate our method with a convolutional neural model across a range of sentiment analysis datasets. Compared with a baseline and the dropout method, our method achieves better overall performance.


Introduction
Deep learning has achieved state-of-the-art results across a range of computer vision (Krizhevsky et al., 2012), speech recognition (Graves et al., 2013), and natural language processing tasks (Bahdanau et al., 2015; Kalchbrenner et al., 2014; Bitvai and Cohn, 2015). However, deep models tend to be overconfident in their predictions over noisy test instances, including adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015). A range of methods have been proposed to train models to be more robust, such as injecting noise into the data and hidden layers (Jiang et al., 2009), dropout (Srivastava et al., 2014), and the incorporation of explicit regularization terms into the training objective (Ng, 2004; Li et al., 2016).
In this work, we propose a linguistically-motivated method customised to text applications, based on injecting different kinds of word- and sentence-level linguistic noise into the input text, inspired by adversarial examples (Goodfellow et al., 2015). Our method has its origins in computer vision, where it has been shown that small pixel perturbations indiscernible to humans can significantly distort the predictions of state-of-the-art deep models (Szegedy et al., 2014; Nguyen et al., 2015), an observation that has been harnessed in recent work on adversarial training (Goodfellow et al., 2015). This kind of noise is cheap to generate for images and is transferable between different models, but it is less clear how to generate analogous textual noise while preserving the fidelity of the training data, due to text being discrete and sequential in nature, with latent syntactic structure. Based on the same linguistic intuition, adversarial evaluation for natural language processing models was proposed by Smith (2012). Adversarial learning for text, such as perceptron learning (Søgaard, 2013) and unsupervised estimation methods (Smith and Eisner, 2005), has also been studied in the language area. Word embeddings learned from WORD2VEC (Mikolov et al., 2013) and GLOVE (Pennington et al., 2014) are now widely used as input to language processing models; however, these representations are highly susceptible to noise. For example, Figure 1 shows that as we add adversarial noise η = ∇_x Loss(x, y, θ) to WORD2VEC representations, classification accuracy for a convolutional model (Kim, 2014) over a sentiment classification task (Pang and Lee, 2008) drops appreciably, such that with only 1% perturbations, a state-of-the-art model drops to the level of a random classifier.
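To make the embedding-space perturbation concrete, the following is a minimal NumPy sketch of adding gradient-direction noise to an embedding matrix. The relative scaling by each row's norm (so that eps = 0.01 corresponds to a 1% perturbation) is our illustrative assumption, not a detail specified above.

```python
import numpy as np

def perturb_embeddings(E, grad, eps=0.01):
    """Add adversarial noise in the direction of the loss gradient.

    E: (vocab, dim) embedding matrix; grad: dLoss/dE of the same shape.
    eps scales the perturbation relative to each row's norm, so
    eps=0.01 corresponds to a 1% perturbation of each embedding.
    """
    # Normalise the gradient row-wise so eps controls relative magnitude.
    g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
    g_norm[g_norm == 0] = 1.0  # avoid division by zero for zero gradients
    direction = grad / g_norm
    e_norm = np.linalg.norm(E, axis=1, keepdims=True)
    return E + eps * e_norm * direction
```

In a real attack, `grad` would come from backpropagating the classifier's loss to the input embeddings; here it is simply an argument.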
Word embeddings are not an intuitive representation of human language, and it is not immediately clear how to generate adversarial noise over the raw text input without affecting the fidelity of the data. In human-to-human textual communication such as chat and microblogs, humans are remarkably resilient to "noise", in terms of typos, lexical and syntactic disfluencies, and the large variety of semantically-equivalent ways of expressing the same content (Han and Baldwin, 2011; Eisenstein, 2013; Baldwin et al., 2013; Pavlick and Callison-Burch, 2016). These observations inspire this work, in which we propose a training strategy based on the explicit generation of linguistic corruption over the source training instances, to train robust text models. Empirically, we demonstrate the effectiveness of our method over a range of sentiment analysis datasets using a state-of-the-art convolutional neural network model (Kim, 2014). In doing so, we show that our method is superior to both a baseline and dropout (Srivastava et al., 2014) using MAP training. 1

Generating Text Noise
Our method involves the explicit generation of several kinds of linguistic corruption, to train more robust deep models. The first question is how to generate the linguistic noise, focusing on English for the purposes of this paper. We focus on the generation of two classes of text noise: (1) syntactic noise; and (2) semantic noise. 2

Syntactic Noise The first class of linguistic noise is syntactic, focusing on the syntactic structure of the input, either through explicit parsing and generation using a deep linguistic parser, or through sentence compression.
For the deep linguistic parser, we use the LinGO English Resource Grammar ("ERG": Copestake and Flickinger (2000)) with the ACE parser, based on pyDelphin. 3 The ERG supports both parsing and generation, via the semantic formalism of Minimal Recursion Semantics ("MRS": Copestake et al. (2005)). To generate paraphrases with the ERG, we simply parse a given input, select the preferred parse using a pretrained parse selection model (Oepen et al., 2002), and exhaustively generate from the resultant MRS. We then use uniform random sampling to select from the generator outputs, which can number in the thousands of variants. To handle unknown words during parsing and generation, we use POS mapping and introduce a unique relation for each unknown word, which we use to substitute the unknown word back into the generation output. In practice, the primary sources of "noise" introduced by the ERG are topicalisation, adjective ordering, fronting of adverbial phrases, and relativisation of modifiers.
The second approach to syntactic noise is based on sentence compression ("COMP": Knight and Marcu (2000)), which aims to "trim" an input of peripheral content, while maintaining grammaticality and preserving the syntax of the original as much as possible. While the state-of-the-art in sentence compression is based on deep learning methods such as recurrent neural networks (Filippova et al., 2015), we implement a simple parser-based model, due to the lack of large-scale annotated data for training, and the fact that a relative lack of precision in the output may ultimately help our method. First, we parse the sentence using the Stanford CoreNLP constituency parser (Chen and Manning, 2014). Next, we model the conditional probability of deleting a sub-tree C with label S given its parent node with label R as p(C | S, R) = p(C, S, R) / Σ_C′ p(C′, S, R), trained on the sentence compression corpora of Clarke and Lapata (2006), 4 made up of a few hundred labelled instances.
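The deletion model above can be estimated by simple counting over the annotated compressions. The sketch below assumes a hypothetical pre-processed event format of (child_label, parent_label, deleted) triples; a real implementation would extract these by aligning each parsed source sentence with its compressed version.

```python
from collections import Counter

def deletion_probs(events):
    """Estimate p(delete | child_label, parent_label) from a corpus of
    (child_label, parent_label, deleted) events, where `deleted` records
    whether the annotator dropped that sub-tree in the compression.
    The event format is a hypothetical pre-processing of the corpus.
    """
    kept, dropped = Counter(), Counter()
    for child, parent, deleted in events:
        (dropped if deleted else kept)[(child, parent)] += 1
    probs = {}
    for key in set(kept) | set(dropped):
        d, k = dropped[key], kept[key]
        probs[key] = d / (d + k)  # relative frequency of deletion
    return probs
```

At corruption time, each sub-tree of a parsed training sentence would then be deleted with the corresponding probability.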
Semantic Noise The second class of linguistic noise is semantic noise. Semantic noise is more subtle than syntactic noise, as we must be careful not to affect the fidelity of the original labels, which can readily occur with full paraphrasing or abstractive summarisation. As such, we focus on lexical substitution of near-synonyms of words in the original text, and experiment with two methods for generating near-synonyms.
Our approach to generating semantic noise proceeds as follows. First, we apply filters to identify words which should not be candidates for lexical substitution, namely words which are parts of named entities or function words. As such, we use the Stanford CoreNLP POS tagger and named entity recogniser (Finkel et al., 2005; Chen and Manning, 2014), and identify "substitutable words" as those which are nouns, verbs, adjectives or adverbs, and not part of a named entity. For each substitutable word w, we generate the set of substitution candidates s(w). For each candidate w_i ∈ {w} ∪ s(w), we allow the original word to be preserved with p(w_i) = α, and share the remaining 1 − α proportionally to the language model score based on substituting w_i into the original text. For this, we use the pre-trained US English language model from the CMU Sphinx Speech Recognition toolkit. 5 Finally, we sample from the probability distribution {p(w_i) : w_i ∈ {w} ∪ s(w)} for each substitutable word w to generate a semantically-corrupted version of the original.
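The sampling step can be sketched as follows; `lm_score` is a stand-in for the language model scoring described above, and the inverse-transform sampling details are our own illustrative choice.

```python
import random

def sample_substitute(word, candidates, lm_score, alpha=0.5, rng=random):
    """Sample a replacement for `word`: keep the original with
    probability alpha, and share the remaining 1 - alpha among the
    candidates in proportion to their language-model scores.
    `lm_score` is a hypothetical callable returning a non-negative
    score for the sentence with the candidate substituted in.
    """
    if not candidates:
        return word
    scores = [lm_score(c) for c in candidates]
    total = sum(scores)
    if total == 0:
        return word
    r = rng.random()
    if r < alpha:
        return word  # preserve the original word
    # Map the remaining mass (alpha, 1] onto the candidates.
    r = (r - alpha) / (1.0 - alpha) * total
    for cand, s in zip(candidates, scores):
        r -= s
        if r <= 0:
            return cand
    return candidates[-1]
```

Setting alpha = 0.5 gives the "lo" noise rate and alpha = 0 the "hi" rate described later in the paper.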
We experiment with two approaches to generating the substitution candidates. The first is based on Princeton WordNet ("WN": Miller et al. (1990)), over all synsets that a given substitutable word occurs in, using the NLTK API (Bird, 2006). The second is based on the "counterfitting" method of Mrkšić et al. (2016) ("CFIT"), whereby word embeddings from WORD2VEC are projected based on a supervised objective function which penalises similarity between antonym pairs, and rewards similarity between synonym pairs, as trained on 10k English news sentences from WMT14 (Bojar et al., 2014).
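For the WN variant, candidate generation amounts to pooling lemmas across all synsets containing the word. The helper below operates on plain lemma lists to stay self-contained; with NLTK one would obtain these as `[[l.name() for l in s.lemmas()] for s in wordnet.synsets(word)]`.

```python
def lemma_candidates(word, synsets):
    """Given the synsets a word occurs in (each a list of lemma
    strings, as produced by e.g. NLTK's WordNet API), collect
    substitution candidates: every lemma from every synset,
    excluding the word itself."""
    candidates = set()
    for lemmas in synsets:
        for lemma in lemmas:
            name = lemma.replace('_', ' ')  # WordNet joins MWEs with "_"
            if name.lower() != word.lower():
                candidates.add(name)
    return sorted(candidates)
```

The CFIT variant would instead take the nearest neighbours of the word in the counter-fitted embedding space.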
Word Dropout As a standard approach to training robust models, we use word dropout (Srivastava et al., 2014; Pham et al., 2014). Dropout can be viewed as a method for zeroing out noise, and is first-order equivalent to an L2 regularizer applied after feature scaling (Wager et al., 2013).
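As a point of comparison, word dropout over the raw token sequence can be sketched in a few lines; the 0.2 default rate and the `<unk>` placeholder are illustrative choices, not settings from the experiments.

```python
import random

def word_dropout(tokens, rate=0.2, unk="<unk>", rng=random):
    """Word-level dropout: independently replace each token with an
    UNK placeholder with probability `rate`."""
    return [unk if rng.random() < rate else t for t in tokens]
```

Unlike the COMP corruption, this makes no attempt to preserve grammaticality or to target low-content tokens.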

Method    Example
Original  The cat sat on the mat .
ERG       On the mat sat the cat .
COMP      The cat sat on __ mat __
WN        The *kat* sat on the *flatness* .
CFIT      The *pet* *stood* *onto* the mat .

Table 1: Examples of generated sentences across the four proposed methods. Modified words are marked with *asterisks* and omitted words are denoted with "__".

Table 1 shows an example sentence and sample corrupted outputs after applying each type of linguistic noise. The ERG seldom changes words, and instead tends to reorder them based on syntactic alternations. COMP performs like word dropout in that it tends to remove tokens with low semantic content, while still generating complete sentences. WN and CFIT both modify the text only at the word level, based on near-synonyms and words with similar semantic function, respectively.

Models and Training
We evaluate our methods on several sentence classification tasks, using a convolutional neural network ("CNN") model (Kim, 2014). Note that our method corrupts the input directly, and is thus easily transferable to other classes of models (e.g., other deep learning or linear models).

Convolutional Neural Network
The CNN operates at the sentence level, by first embedding each word using a lookup table, with the embeddings stacked into the sentence matrix E_S. A 1-d convolutional layer is then applied to E_S, which applies a series of filters over each window of t words, with each filter employing a rectifier activation function. Max-pooling is applied over each set of filter outputs to produce a fixed-size sentence representation. 6 The sentence vector is fed into a final softmax layer to generate a probability distribution over classification labels.
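The forward pass just described (convolution over word windows, rectifier, max-pooling over time, softmax) can be sketched in NumPy as follows; all shapes and parameter values are illustrative, not the trained model.

```python
import numpy as np

def text_cnn_forward(E_S, filters, W_out, b_out):
    """One forward pass of the sentence CNN sketched above.

    E_S: (seq_len, dim) sentence matrix; filters: list of
    (t, dim, n_filters) weight tensors, one per window size t;
    W_out, b_out: parameters of the final softmax layer.
    """
    pooled = []
    for W in filters:
        t = W.shape[0]
        # Slide a window of t words; one feature per filter per window.
        feats = np.stack([
            np.einsum('td,tdf->f', E_S[i:i + t], W)
            for i in range(E_S.shape[0] - t + 1)])
        # Rectifier, then max-pooling over time.
        pooled.append(np.maximum(feats, 0.0).max(axis=0))
    h = np.concatenate(pooled)        # fixed-size sentence vector
    logits = h @ W_out + b_out
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()                 # distribution over labels
```

Bias terms in the convolution are omitted for brevity; a trained model would also learn them.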
The model is trained to minimise the cross-entropy between the ground-truth and the model prediction, using the Adam optimizer (Kingma and Ba, 2015) with learning rate 10^−4 and a batch size of 128. We initialise the embeddings with the m = 300 dimensional Google pre-trained WORD2VEC word embeddings (Mikolov et al., 2013). Words not in the pre-trained vocabulary are initialised randomly using a uniform distribution U([−0.25, 0.25)^m).
Injecting Noise during Training Our proposed method involves corrupting the training input with adversarial noise of various kinds. All the methods are non-deterministic, involving random sampling, and are applied afresh every epoch, such that each time an instance is processed, it will have a different input form. 7 The two semantic approaches (WN and CFIT) support configurable noise rates, in terms of the proportion of substitutable words that are corrupted. Accordingly, we experiment with two thresholds on the random variable for substitution of each word: low ("lo"; α = 0.5) and high ("hi"; α = 0). Besides the above methods, which each employ a single noise type, we experiment with a combination (COMB) of the four different noise types (ERG + COMP + WN_lo + CFIT_lo), by uniformly randomly choosing one of the four methods for noise generation each time we process a training instance.
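The per-epoch corruption loop, including the uniform choice among noise types in the COMB setting, reduces to a few lines; `noisers` is a list of the corruption functions (ERG, COMP, WN, CFIT), treated abstractly here.

```python
import random

def corrupt_epoch(instances, noisers, rng=random):
    """Re-corrupt the training set for one epoch: for each instance,
    uniformly choose one of the noise generators (the COMB setting)
    and apply it afresh, so every epoch sees a different input form.
    `noisers` is a list of callables mapping a sentence to a noisy
    version of itself."""
    return [rng.choice(noisers)(x) for x in instances]
```

Passing a single-element `noisers` list recovers the single-noise-type settings.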
Datasets We experiment on the following datasets:
• MR: sentence polarity dataset from movie reviews (Pang and Lee, 2008) 8
• CR: customer review dataset (Hu and Liu, 2004) 9
• Subj: subjectivity dataset (Pang and Lee, 2005) 8
• SST: Stanford Sentiment Treebank, using the 2-class configuration (Socher et al., 2013) 10

We evaluate using classification accuracy, based on both in-domain evaluation 11 and a cross-domain setting, in which we evaluate a model trained on MR and tested on CR, and vice versa. This last setting characterises a realistic application scenario, where robustness to vocabulary shift and other differences in the input is paramount.

7 Using a single application of noise is less effective, but still yields improvements over baseline methods including dropout.
8 https://www.cs.cornell.edu/people/pabo/movie-review-data/
9 http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
10 http://nlp.stanford.edu/sentiment/
11 Where there is no pre-defined training/test split for a given dataset, we use 10-fold cross validation. See Kim (2014) for more details on the datasets and evaluation settings.

Experimental Results and Analysis

Table 2 presents the results of training with different sources of linguistic corruption in the in-domain and cross-domain settings. In general, the proposed methods perform better than the baseline and dropout, and semantic noise using WN achieves consistent improvements across all settings. The COMB method uniformly outperforms the other methods for all in-domain evaluations, indicating that the improvements from training with different types of noise are orthogonal. Note that improvements are smaller on SST and MR than on CR and Subj for all methods. Almost every method improves over word dropout, except counter-fitting at a high noise level. Also surprising is the fact that dropout shows no improvement over standard training, and is overall mildly detrimental.
Our intuition for why WN consistently outperforms the baseline methods and other single sources of noise is that it sometimes performs similarly to dropout, in replacing common words with rare ones, and sometimes substitutes frequent words for frequent words, leading to better generalisation in the word embeddings. To test this hypothesis, we computed nearest neighbours in the word embedding space for both the baseline method and the WN method. For example, the top-3 nearest neighbours for superior in CR are exceptional, excellent and unmatched for WN, while for the baseline, they are inferior, exceptional and excellent. That is, similar to the intuition behind counter-fitting, the method appears to learn to differentiate between synonyms and antonyms, in a manner which is sensitised to the target domain.
Although similar in function to WN, the counter-fitting based method performs unexpectedly poorly. This appears to be a consequence of the training of these embeddings, namely that the corpus was much smaller than that used for the WORD2VEC training, and consequently coverage on our corpora was substantially lower, leading to the approach making inappropriate substitutions and not aiding model robustness.
Table 2: Accuracy (%) of the CNN, in four in-domain settings and two cross-domain settings, with word dropout ("dropout") or linguistic corruption based on different sources of syntactic and semantic corruption. The best result for each dataset is indicated in bold.

Sentence compression was found to be highly effective. To illustrate by example, the sentence Player has a problem with dual-layer dvd's such as Alias seasons 1 and season 2 is compressed into has a problem with dual-layer dvd, which preserves the key information that we expect to be useful for model learning. This allows the model to better learn the components of the input that are predictive of sentiment. Syntactic paraphrasing (ERG) tends primarily to corrupt the word order, with fewer lexical substitutions. Thus, the model is less prone to overfitting to local n-gram features, and focuses on learning words and phrases that are genuinely predictive of sentiment.

Conclusions
In this paper, we present a training method that corrupts training examples with linguistic noise, in order to learn more robust models. Based on evaluation over several sentiment analysis datasets with convolutional neural networks, we show that this method outperforms standard training and dropout, both for in-domain and out-of-domain application. Our approach has widespread potential to also benefit other types of discriminative models, across a range of other language processing tasks.