Word-level Textual Adversarial Attacking as Combinatorial Optimization

Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Textual adversarial attacking is challenging because text is discrete and a small perturbation can bring significant change to the original input. Word-level attacking, which can be regarded as a combinatorial optimization problem, is a well-studied class of textual attack methods. However, existing word-level attack models are far from perfect, largely because unsuitable search space reduction methods and inefficient optimization algorithms are employed. In this paper, we propose a novel attack model, which incorporates a sememe-based word substitution method and a particle swarm optimization-based search algorithm to solve the two problems separately. We conduct exhaustive experiments to evaluate our attack model by attacking BiLSTM and BERT on three benchmark datasets. Experimental results demonstrate that our model consistently achieves much higher attack success rates and crafts higher-quality adversarial examples than baseline methods. Further experiments also show that our model has higher transferability and can bring more robustness enhancement to victim models through adversarial training. All the code and data of this paper can be obtained at https://github.com/thunlp/SememePSO-Attack.


Introduction
Adversarial attacks use adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015), which are maliciously crafted by perturbing the original input, to fool deep neural networks (DNNs). Extensive studies have demonstrated that DNNs are vulnerable to adversarial attacks, e.g., minor modification to highly poisonous phrases can easily deceive Google's toxic comment detection systems (Hosseini et al., 2017). From another perspective, adversarial attacks are also used to improve the robustness and interpretability of DNNs (Wallace et al., 2019). In the field of natural language processing (NLP), which widely employs DNNs, practical systems such as spam filtering (Stringhini et al., 2010) and malware detection (Kolter and Maloof, 2006) have been broadly deployed, but at the same time concerns about their security are growing. Therefore, research on textual adversarial attacks is becoming increasingly important.

Textual adversarial attacking is challenging. Different from images, a truly imperceptible perturbation on text is almost impossible because of its discrete nature. Even the slightest character-level perturbation can either (1) change the meaning and, worse still, the true label of the original input, or (2) break its grammaticality and naturality. A change of true label makes the adversarial attack invalid. For example, supposing an adversary changes "she" to "he" in an input sentence to attack a gender identification model, although the victim model alters its prediction result, this is not a valid attack. And adversarial examples with broken grammaticality and naturality (i.e., poor quality) can be easily defended (Pruthi et al., 2019).

* Indicates equal contribution. Yuan developed the method, designed and conducted most experiments; Fanchao formalized the task, designed some experiments and wrote the paper; Chenghao made the original research proposal, performed human evaluation and conducted some experiments.
† Work done during internship at Tsinghua University.
‡ Corresponding author.
Various textual adversarial attack models have been proposed (Wang et al., 2019a), ranging from character-level flipping (Ebrahimi et al., 2018) to sentence-level paraphrasing (Iyyer et al., 2018). Among them, word-level attack models, mostly word substitution-based models, perform comparatively well on both attack efficiency and adversarial example quality.
Word-level adversarial attacking is actually a problem of combinatorial optimization (Wolsey and Nemhauser, 1999), as its goal is to craft adversarial examples which can successfully fool the victim model using a limited vocabulary. In this paper, as shown in Figure 1, we break this combinatorial optimization problem down into two steps including (1) reducing search space and (2) searching for adversarial examples.
The first step is aimed at excluding invalid or low-quality potential adversarial examples and retaining the valid ones with good grammaticality and naturality. The most common manner is to pick some candidate substitutes for each word in the original input and use their combinations as the reduced discrete search space. However, existing attack models either disregard this step (Papernot et al., 2016) or adopt unsatisfactory substitution methods that do not perform well in the trade-off between quality and quantity of the retained adversarial examples (Alzantot et al., 2018;Ren et al., 2019). The second step is supposed to find adversarial examples that can successfully fool the victim model in the reduced search space. Previous studies have explored diverse search algorithms including gradient descent (Papernot et al., 2016), genetic algorithm (Alzantot et al., 2018) and greedy algorithm (Ren et al., 2019). Some of them like gradient descent only work in the white-box setting where full knowledge of the victim model is required. In real situations, however, we usually have no access to the internal structures of victim models. As for the other black-box algorithms, they are not efficient and effective enough in searching for adversarial examples.
These problems negatively affect the overall attack performance of existing word-level adversarial attacking. To solve them, we propose a novel black-box word-level adversarial attack model that reforms both steps. In the first step, we design a word substitution method based on sememes, the minimum semantic units, which can retain more potential valid adversarial examples with high quality. In the second step, we present a search algorithm based on particle swarm optimization (Eberhart and Kennedy, 1995), which is very efficient and performs better in finding adversarial examples. We conduct exhaustive experiments to evaluate our model. Experimental results show that, compared with baseline models, our model not only achieves the highest attack success rate (e.g., 100% when attacking BiLSTM on IMDB) but also possesses the best adversarial example quality and comparable attack validity. We also conduct decomposition analyses to manifest the advantages of the two parts of our model separately. Finally, we demonstrate that our model has the highest transferability and can bring the most robustness improvement to victim models through adversarial training.

Background
In this section, we first briefly introduce sememes, and then we give an overview of the classical particle swarm optimization algorithm.

Sememes
In linguistics, a sememe is defined as the minimum semantic unit of human languages (Bloomfield, 1926). The meaning of a word can be represented by the composition of its sememes.
In the field of NLP, sememe knowledge bases are built to utilize sememes in practical applications, where sememes are generally regarded as semantic labels of words (as shown in Figure 1). HowNet (Dong and Dong, 2006) is the most well-known one. It annotates over one hundred thousand English and Chinese words with a predefined set of about 2,000 sememes. Its sememe annotations are sense-level, i.e., each sense of a (polysemous) word is annotated with sememes separately. With the help of HowNet, sememes have been successfully applied to many NLP tasks including word representation learning (Niu et al., 2017), sentiment analysis (Fu et al., 2013), semantic composition (Qi et al., 2019), sequence modeling (Qin et al., 2019), reverse dictionaries (Zhang et al., 2019b), etc.

Particle Swarm Optimization
Inspired by social behaviors such as bird flocking, particle swarm optimization (PSO) is a metaheuristic, population-based evolutionary computation paradigm (Eberhart and Kennedy, 1995). It has proved effective in solving optimization problems such as image classification (Omran et al., 2004), part-of-speech tagging (Silva et al., 2012) and text clustering (Cagnina et al., 2014). Empirical studies have shown that it is more efficient than some other optimization algorithms such as the genetic algorithm (Hassan et al., 2005).
PSO exploits a population of interacting individuals to iteratively search for the optimal solution in the specific space. The population is called a swarm and the individuals are called particles. Each particle has a position in the search space and moves with an adaptable velocity.
Formally, when searching in a D-dimensional continuous space S ⊆ R^D with a swarm containing N particles, the position and velocity of each particle can be represented by x^n ∈ S and v^n ∈ R^D respectively, n ∈ {1, · · · , N}. Next we describe the PSO algorithm step by step.
(1) Initialize. At the very beginning, each particle is randomly initialized with a position x^n in the search space and a velocity v^n. Each dimension of the initial velocity satisfies v^n_d ∈ [−V_max, V_max], d ∈ {1, · · · , D}.
(2) Record. Each position in the search space corresponds to an optimization score. The position a particle has reached with the highest optimization score is recorded as its individual best position. The best position among the individual best positions of all the particles is recorded as the global best position.
(3) Terminate. If the current global best position has achieved the desired optimization score, the algorithm terminates and outputs the global best position as the search result.
(4) Update. Otherwise, the velocity and position of each particle are updated according to its current position and individual best position, together with the global best position. The updating formulae are

v^n_d = ω v^n_d + c_1 × r_1 × (p^n_d − x^n_d) + c_2 × r_2 × (p^g_d − x^n_d),
x^n_d = x^n_d + v^n_d,

where ω is the inertia weight, p^n_d and p^g_d are the d-th dimensions of the n-th particle's individual best position and the global best position respectively, c_1 and c_2 are acceleration coefficients (positive constants that control how fast the particle moves towards its individual best position and the global best position), and r_1 and r_2 are random coefficients. After updating, the algorithm goes back to the Record step.
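To make the four steps concrete, here is a minimal, self-contained sketch of classical continuous PSO in Python, maximizing a toy score function. The hyper-parameter values (ω = 0.7, c1 = c2 = 2.0, V_max = 0.5) are common textbook defaults chosen for illustration, not values from this paper.

```python
import random

def pso(score, dim, n_particles=20, v_max=0.5, omega=0.7,
        c1=2.0, c2=2.0, iters=100, target=-1e-6, seed=0):
    """Minimal continuous PSO maximizing `score` over [-1, 1]^dim.

    Illustrative sketch of the Initialize/Record/Terminate/Update loop;
    all hyper-parameter values here are arbitrary, not from the paper.
    """
    rng = random.Random(seed)
    # (1) Initialize: random positions and velocities.
    xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(dim)]
          for _ in range(n_particles)]
    # (2) Record: individual best positions and the global best position.
    p_best = [x[:] for x in xs]
    p_score = [score(x) for x in xs]
    g = max(range(n_particles), key=lambda n: p_score[n])
    g_best, g_score = p_best[g][:], p_score[g]
    for _ in range(iters):
        # (3) Terminate once the global best is good enough.
        if g_score >= target:
            break
        # (4) Update velocities and positions, then re-record.
        for n in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[n][d] = (omega * vs[n][d]
                            + c1 * r1 * (p_best[n][d] - xs[n][d])
                            + c2 * r2 * (g_best[d] - xs[n][d]))
                vs[n][d] = max(-v_max, min(v_max, vs[n][d]))
                xs[n][d] += vs[n][d]
            s = score(xs[n])
            if s > p_score[n]:
                p_best[n], p_score[n] = xs[n][:], s
                if s > g_score:
                    g_best, g_score = xs[n][:], s
    return g_best, g_score

# Maximize -(x^2 + y^2); the optimum is at the origin with score 0.
best, best_score = pso(lambda x: -(x[0] ** 2 + x[1] ** 2), dim=2)
```

The swarm quickly concentrates near the origin; the global best score only ever improves, which is what the Record step guarantees.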

Methodology
In this section, we detail our word-level adversarial attack model. It incorporates two parts, namely the sememe-based word substitution method and PSO-based adversarial example search algorithm.

Sememe-based Word Substitution Method
The sememes of a word are supposed to accurately depict the meaning of the word (Dong and Dong, 2006). Therefore, the words with the same sememe annotations should have the same meanings, and they can serve as the substitutes for each other. Compared with other word substitution methods, mostly including word embedding-based (Sato et al., 2018), language model-based (Zhang et al., 2019a) and synonym-based methods (Samanta and Mehta, 2017;Ren et al., 2019), the sememe-based word substitution method can achieve a better trade-off between quality and quantity of substitute words.
For one thing, although the word embedding- and language model-based substitution methods can find as many substitute words as we want simply by relaxing the restrictions on embedding distance and language model prediction score, they inevitably introduce many inappropriate and low-quality substitutes, such as antonyms and semantically related but not similar words, into adversarial examples, which might break the semantics, grammaticality and naturality of the original input. In contrast, the sememe-based and, of course, the synonym-based substitution methods do not have this problem.
For another, compared with the synonym-based method, the sememe-based method can find more substitute words and, in turn, retain more potential adversarial examples, because HowNet annotates sememes for all kinds of words. The synonym-based method, however, depends on thesauri like WordNet (Miller, 1995), which provide no synonyms for many words such as proper nouns, and in which the number of a word's synonyms is very limited. An empirical comparison of different word substitution methods is given in Section 4.6.
In our sememe-based word substitution method, to preserve grammaticality, we only substitute content words 1 and restrict the substitutes to having the same part-of-speech tags as the original words. Considering polysemy, a word w can be substituted by another word w * only if one of w's senses has the same sememe annotations as one of w * 's senses. When making substitutions, we conduct lemmatization to enable more substitutions and delemmatization to avoid introducing grammatical mistakes.
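A toy sketch of the substitution rule just described — same part-of-speech tag, and one sense of the candidate sharing an exactly matching sememe annotation with one sense of the original word. The words, senses and sememes below are invented for illustration (real annotations come from HowNet), and lemmatization is omitted for brevity.

```python
# Toy sense-level sememe annotations and POS tags (illustrative only;
# real annotations come from HowNet, which this sketch does not load).
SEMEMES = {
    "happy":  [{"glad"}],                # one sense, annotated {glad}
    "joyful": [{"glad"}],
    "pie":    [{"food", "baked"}],
    "cake":   [{"food", "baked"}],
    "bank":   [{"institution", "money"}, {"land", "waterside"}],  # polysemous
    "shore":  [{"land", "waterside"}],
}
POS = {"happy": "ADJ", "joyful": "ADJ", "pie": "NOUN",
       "cake": "NOUN", "bank": "NOUN", "shore": "NOUN"}

def sememe_substitutes(word):
    """Words sharing a full sense-level sememe annotation with `word`,
    restricted to the same part-of-speech tag."""
    subs = set()
    for sense in SEMEMES.get(word, []):
        for cand, cand_senses in SEMEMES.items():
            if cand == word or POS.get(cand) != POS.get(word):
                continue
            if any(sense == cand_sense for cand_sense in cand_senses):
                subs.add(cand)
    return subs
```

Note how polysemy is handled: "bank" substitutes with "shore" through its waterside sense even though its money-institution sense matches nothing.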

PSO-based Adversarial Example Search Algorithm
Before presenting our algorithm, we first explain what the concepts in the original PSO algorithm correspond to in the adversarial example search problem. Different from the original PSO, the search space of word-level adversarial example search is discrete. A position in the search space corresponds to a sentence (or an adversarial example), and each dimension of a position corresponds to a word. Formally, the d-th dimension of a position takes as its value either the original word w_d or one of its substitutes. The optimization score of a position is the target label's prediction probability given by the victim model, where the target label is the desired classification result for an adversarial attack. Taking a binary classification task as an example, if the true label of the original input is "positive", the target label is "negative", and vice versa. In addition, a particle's velocity now relates to the position change probability, i.e., v^n_d determines how probable it is that w^n_d is substituted by another word.

Next we describe our algorithm step by step. First, for the Initialize step, since we expect the adversarial examples to differ from the original input as little as possible, we do not make random initialization. Instead, we randomly substitute one word of the original input to determine the initial position of a particle. This operation is actually the mutation operation of the genetic algorithm, which has also been employed in some studies on discrete PSO (Higashi and Iba, 2003). We repeat mutation N times to initialize the positions of N particles. Each dimension of each particle's velocity is randomly initialized between −V_max and V_max.
For the Record step, our algorithm is the same as the original PSO algorithm. For the Terminate step, the termination condition is that the victim model predicts the target label for any of the current adversarial examples.
For the Update step, considering the discreteness of the search space, we follow Kennedy and Eberhart (1997) to adapt the updating formula of velocity to

v^n_d = ω v^n_d + (1 − ω) × [I(p^n_d, x^n_d) + I(p^g_d, x^n_d)],

where ω is still the inertia weight, and I(a, b) is defined as

I(a, b) = 1 if a = b, and −1 if a ≠ b.

Following Shi and Eberhart (1998), we let the inertia weight decrease as the number of iterations increases, aiming to make the particles highly dynamic to explore more positions in the early stage and gather around the best positions quickly in the final stage. Specifically,

ω = (ω_max − ω_min) × (T − t) / T + ω_min,

where 0 < ω_min < ω_max < 1, and T and t are the maximum and current numbers of iterations.

The updating of positions also needs to be adjusted to the discrete search space. Inspired by Kennedy and Eberhart (1997), instead of making addition, we adopt a probabilistic method to update the position of a particle towards the best positions. We design a two-step position updating procedure. In the first step, a new movement probability P_i is introduced, with which a particle determines whether it moves to its individual best position as a whole. Once a particle decides to move, the change of each dimension of its position depends on the same dimension of its velocity, specifically with the probability sigmoid(v^n_d). No matter whether a particle has moved towards its individual best position or not, it is processed in the second step, in which each particle determines whether to move to the global best position with another movement probability P_g. The change of each position dimension again relies on sigmoid(v^n_d). P_i and P_g vary with iteration to enhance search efficiency by adjusting the balance between local and global search, i.e., encouraging particles to explore more space around their individual best positions in the early stage and to search for better positions around the global best position in the final stage. Formally,

P_i = P_max − (t / T) × (P_max − P_min),
P_g = P_min + (t / T) × (P_max − P_min),

where 0 < P_min < P_max < 1.

Table 1: Details of datasets and the accuracy of victim models on them. "#Class" means the number of classes. "Avg. #W" signifies the average sentence length (number of words). "Train", "Val" and "Test" denote the instance numbers of the training, validation and test sets respectively. "BiLSTM %ACC" and "BERT %ACC" mean the classification accuracy of BiLSTM and BERT.
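The linear schedules for the inertia weight and the two movement probabilities can be sketched as follows: ω decreases from ω_max to ω_min, P_i decreases and P_g increases over iterations. The default values 0.8/0.2 follow the experimental settings reported later; the function names are our own.

```python
def inertia_weight(t, T, w_min=0.2, w_max=0.8):
    """omega decreases linearly from w_max (at t = 0) to w_min (at t = T),
    so particles explore early and converge late."""
    return (w_max - w_min) * (T - t) / T + w_min

def move_probs(t, T, p_min=0.2, p_max=0.8):
    """P_i decays (less local search late); P_g grows (more global search
    late). Both vary linearly with the iteration count t in 0..T."""
    p_i = p_max - t / T * (p_max - p_min)
    p_g = p_min + t / T * (p_max - p_min)
    return p_i, p_g
```

At t = 0 the swarm favors local exploration (P_i = 0.8, P_g = 0.2); by t = T the balance is reversed.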
Besides, to enhance the search in unexplored space, we apply mutation to each particle after the update step. To avoid excessive modification, mutation is conducted with the probability

P_m(x^n) = max(0, 1 − k × E(x^n, x^o) / D),

where k is a positive constant, x^o represents the original input, and E measures the word-level edit distance (the number of different words between two sentences). E(x^n, x^o) / D is defined as the modification rate of an adversarial example. After mutation, the algorithm returns to the Record step.
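Putting the pieces together, the following is an illustrative, self-contained sketch of the discrete PSO search: mutation-based initialization, the discrete velocity update with I(·,·), two-step probabilistic position updating gated by sigmoid(v), and post-update mutation. The toy victim model, vocabulary and hyper-parameter defaults are invented for demonstration; this is not the paper's implementation.

```python
import math
import random

def attack(words, substitutes, victim_prob, n=8, T=30, v_max=1.0,
           w=(0.2, 0.8), p=(0.2, 0.8), k=2.0, seed=0):
    """Sketch of the discrete PSO search. `victim_prob(sentence)` returns
    the target label's probability; the attack succeeds once it exceeds
    0.5. Hyper-parameter defaults here are illustrative."""
    rng = random.Random(seed)
    D = len(words)
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    I = lambda a, b: 1.0 if a == b else -1.0

    def mutate(x):  # substitute one randomly chosen substitutable word
        x = x[:]
        d = rng.choice([i for i in range(D) if substitutes[i]])
        x[d] = rng.choice(substitutes[d])
        return x

    # Initialize: mutate the original input once per particle.
    xs = [mutate(list(words)) for _ in range(n)]
    vs = [[rng.uniform(-v_max, v_max) for _ in range(D)] for _ in range(n)]
    p_best = [x[:] for x in xs]
    p_score = [victim_prob(x) for x in xs]
    g = max(range(n), key=lambda i: p_score[i])
    g_best, g_score = p_best[g][:], p_score[g]

    for t in range(T):
        if g_score > 0.5:               # Terminate: attack succeeded
            return g_best
        omega = (w[1] - w[0]) * (T - t) / T + w[0]
        p_i = p[1] - t / T * (p[1] - p[0])
        p_g = p[0] + t / T * (p[1] - p[0])
        for i in range(n):
            for d in range(D):          # discrete velocity update
                vs[i][d] = omega * vs[i][d] + (1 - omega) * (
                    I(p_best[i][d], xs[i][d]) + I(g_best[d], xs[i][d]))
            # Two-step position update: individual best, then global best.
            for best, prob in ((p_best[i], p_i), (g_best, p_g)):
                if rng.random() < prob:
                    for d in range(D):
                        if rng.random() < sig(vs[i][d]):
                            xs[i][d] = best[d]
            # Mutation, less likely for heavily modified examples.
            edits = sum(a != b for a, b in zip(xs[i], words))
            if rng.random() < max(0.0, 1 - k * edits / D):
                xs[i] = mutate(xs[i])
            s = victim_prob(xs[i])      # Record
            if s > p_score[i]:
                p_best[i], p_score[i] = xs[i][:], s
                if s > g_score:
                    g_best, g_score = xs[i][:], s
    return g_best if g_score > 0.5 else None

# Toy demonstration: the victim's target-label probability is the fraction
# of "flip" words present, so the attack must substitute both positions.
flips = {"bad", "poor", "dull"}
words = "the movie was good and fun".split()
subs = [[], [], [], ["bad", "poor"], [], ["dull"]]
adversarial = attack(words, subs,
                     lambda sent: sum(word in flips for word in sent) / 2)
```

The returned sentence differs from the input only at the two substitutable positions, mirroring how the search stays close to the original input.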

Experiments
In this section, we conduct comprehensive experiments to evaluate our attack model on the tasks of sentiment analysis and natural language inference.

Datasets and Victim Models
For sentiment analysis, we choose two benchmark datasets: IMDB (Maas et al., 2011) and SST-2 (Socher et al., 2013). Both are binary sentiment classification datasets, but the average sentence length of SST-2 (17 words) is much shorter than that of IMDB (234 words), which renders attacks on SST-2 more challenging. For natural language inference (NLI), we use the popular Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). Each instance in SNLI comprises a premise-hypothesis sentence pair and is labelled with one of three relations: entailment, contradiction or neutral.
As for victim models, we choose two widely used universal sentence encoding models, namely bidirectional LSTM (BiLSTM) with max pooling (Conneau et al., 2017) and BERT BASE (BERT) (Devlin et al., 2019). For BiLSTM, its hidden states are 128-dimensional, and it uses 300-dimensional pre-trained GloVe (Pennington et al., 2014) word embeddings. Details of the datasets and the classification accuracy results of the victim models are listed in Table 1.

Baseline Methods
We select two recent open-source word-level adversarial attack models as the baselines, which are typical and involve different search space reduction methods (step 1) and search algorithms (step 2).
The first baseline method (Alzantot et al., 2018) uses the combination of restrictions on word embedding distance and language model prediction score to reduce search space. As for search algorithm, it adopts genetic algorithm, another popular metaheuristic population-based evolutionary algorithm. We use "Embedding/LM+Genetic" to denote this baseline method.
The second baseline (Ren et al., 2019) chooses synonyms from WordNet (Miller, 1995) as substitutes and designs a saliency-based greedy algorithm as the search algorithm. We call this method "Synonym+Greedy". This baseline model is very similar to another attack model TextFooler (Jin et al., 2019), which has extra semantic similarity checking when searching adversarial examples. But we find the former performs better in almost all experiments, and thus we only select the former as a baseline for comparison.
In addition, to conduct decomposition analyses of the different methods in the two steps separately, we combine the different search space reduction methods (Embedding/LM, Synonym and our sememe-based substitution method, Sememe) with the different search algorithms (Genetic, Greedy and our PSO).

Table 2: Details of evaluation metrics. "Auto" and "Human" represent automatic and human evaluations respectively. "Higher" and "Lower" mean the higher/lower the metric, the better a model performs.

Experimental Settings
For our PSO, V_max is set to 1, ω_max and ω_min are set to 0.8 and 0.2, P_max and P_min are also set to 0.8 and 0.2, and k in Equation (6) is set to 2. All these hyper-parameters have been tuned on the validation set. For the baselines, we use their recommended hyper-parameter settings. For the two population-based search algorithms, Genetic and PSO, we set the maximum number of iterations (T in Section 3.2) to 20 and the population size (N in Section 3.2) to 60, the same as Alzantot et al. (2018).

Evaluation Metrics
To improve evaluation efficiency, we randomly sample 1,000 correctly classified instances from the test sets of the three datasets as the original input to be perturbed. For SNLI, only the hypotheses are perturbed. Following Alzantot et al. (2018), we restrict the length of the original input to 10–100 words, exclude out-of-vocabulary words from the substitute sets, and discard adversarial examples with modification rates higher than 25%.
We evaluate the performance of attack models including their attack success rates, attack validity and the quality of adversarial examples. The details of our evaluation metrics are listed in Table 2.
(1) The attack success rate is defined as the percentage of the attacks which craft an adversarial example to make the victim model predict the target label.
(2) The attack validity is measured by the percentage of valid attacks among successful attacks, where the adversarial examples crafted by valid attacks have the same true labels as the original input.

(3) For the quality of adversarial examples, we divide it into four parts: modification rate, grammaticality, fluency and naturality. Grammaticality is measured by the increase rate of the grammatical error numbers of adversarial examples compared with the original input, where we use LanguageTool to obtain the grammatical error number of a sentence. We utilize language model perplexity (PPL) to measure fluency, with the help of GPT-2 (Radford et al., 2019). Naturality reflects whether an adversarial example is natural and indistinguishable from human-written text.
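For concreteness, the modification rate and the 25% filtering threshold from the evaluation settings can be computed as follows. This is a minimal sketch for substitution-only attacks, where the original and adversarial sentences have equal length; the function names are our own.

```python
def modification_rate(original, adversarial):
    """Fraction of words changed between two equal-length word sequences
    (word-level edit distance divided by sentence length; valid for
    substitution-only attacks)."""
    assert len(original) == len(adversarial)
    changed = sum(o != a for o, a in zip(original, adversarial))
    return changed / len(original)

def keep_example(original, adversarial, max_rate=0.25):
    """Evaluation filter: discard adversarial examples whose modification
    rate exceeds 25%."""
    return modification_rate(original, adversarial) <= max_rate
```

For example, changing one word in an eight-word sentence gives a modification rate of 0.125, which passes the filter; changing four words gives 0.5, which does not.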
We evaluate attack validity and adversarial example naturality only on SST-2, by human evaluation with the help of Amazon Mechanical Turk. We randomly sample 200 adversarial examples, and ask the annotators to make a binary sentiment classification and give a naturality score (1, 2 or 3; the higher the better) for each adversarial example and original input. More annotation details are given in Appendix A.

Attack Performance
Attack Success Rate. The attack success rate results of all the models are listed in Table 3. We observe that our attack model (Sememe+PSO) achieves the highest attack success rates on all three datasets (especially the harder SST-2 and SNLI) and both victim models, proving the superiority of our model over the baselines. It attacks BiLSTM/BERT on IMDB with notable 100.00%/98.70% success rates, which clearly demonstrates the vulnerability of DNNs. By comparing the three word substitution methods (search space reduction methods) and the three search algorithms, we find that Sememe and PSO consistently outperform their counterparts. Further decomposition analyses are given in a later section.

Validity and Adversarial Example Quality
We evaluate the attack validity and adversarial example quality of our model together with the two baseline methods (Embedding/LM+Genetic and Synonym+Greedy). The results of automatic and human evaluations are displayed in Tables 4 and 5 respectively. Note that the human evaluations, including attack validity and adversarial example naturality, are conducted on SST-2 only. We find that in terms of the automatic evaluations of adversarial example quality, including modification rate, grammaticality and fluency, our model consistently outperforms the two baselines on every victim model and dataset. As for attack validity and adversarial example naturality, the differences between the models are small. We conduct Student's t-tests to further measure the difference between the human evaluation results of different models, where the statistical significance threshold of the p-value is set to 0.05. We find that none of the differences in attack validity and adversarial example naturality between the models are significant. In addition, the adversarial examples of any attack model have significantly worse label consistency (validity) than the original input, but possess similar naturality. More details of the statistical significance tests are given in Appendix D.
For Embedding/LM, relaxing the restrictions on embedding distance and language model prediction score can improve its attack success rate but sacrifices attack validity. To make a specific comparison, we adjust the hyper-parameters of Embedding/LM+Genetic (see Appendix C) to increase its attack success rates to 96.90%, 90.30%, 58.00%, 93.50%, 83.50% and 62.90% respectively when attacking the two victim models on the three datasets (in the same order as Table 3). Nonetheless, its attack validity rates against BiLSTM and BERT on SST-2 dramatically fall to 59.5% and 56.5%. In contrast, ours are 70.5% and 72.0%, and the differences are significant according to the results of the significance tests in Appendix D.

Decomposition Analyses
In this section, we conduct detailed decomposition analyses of the different word substitution methods (search space reduction methods) and the different search algorithms, aiming to further demonstrate the advantages of our sememe-based word substitution method and PSO-based search algorithm.

Word Substitution Method

Table 6 lists the average number of substitutes provided by the different word substitution methods on the three datasets. It shows that Sememe can find many more substitutes than the other two counterparts, which explains the high attack success rates of the models incorporating Sememe. Besides, we give a real case from SST-2 in Table 7, which lists the substitutes found by the three methods. We observe that Embedding/LM finds many improper substitutes, Synonym cannot find any substitute because the original word "pie" has no synonyms in WordNet, and only Sememe finds many appropriate substitutes.

Search Algorithm
We compare the two population-based search algorithms, Genetic and PSO, by varying two important hyper-parameters, namely the maximum number of iterations T and the population size N. The attack success rate results are shown in Figure 2. PSO outperforms Genetic consistently, especially in settings with severe restrictions on the maximum number of iterations and the population size, which highlights the efficiency of PSO.

Transferability

Adversarial Training
Adversarial training is proposed to improve the robustness of victim models by adding adversarial examples to the training set (Goodfellow et al., 2015). In this experiment, for each attack model, we craft 692 adversarial examples (10% of the original training set size) by using it to attack BiLSTM on the training set of SST-2. Then we add the adversarial examples to the training set and retrain a BiLSTM. We re-evaluate its robustness by calculating the attack success rates of different attack models. Table 9 lists the results of adversarial training. Note that a larger attack success rate decrease signifies a greater robustness improvement. We find that adversarial training can indeed improve the robustness of victim models, and our Sememe+PSO model brings greater robustness improvement than the two baselines, even when the attack models are exactly themselves. 6 From the perspective of attacking, our Sememe+PSO model is still more threatening than the others even under the defense of adversarial training. We also manually select 692 valid adversarial examples generated by Sememe+PSO to conduct adversarial training, which leads to even greater robustness improvement (last column of Table 9). The results show that adversarial example validity has a considerable influence on the effect of adversarial training.
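The data augmentation step of this experiment can be sketched as follows. Here `attack` and `victim` are hypothetical stand-ins (any attack function returning an adversarial text or None for a failed attack); retraining and re-evaluation are left to the surrounding training code.

```python
import random

def augment_with_adversarial(train_set, attack, victim, fraction=0.1, seed=0):
    """Sketch of adversarial training data construction: craft adversarial
    examples for a sample of the training set (10% of its size, as in the
    experiment) and append them with the original labels.

    `attack(text, label, victim)` is assumed to return an adversarial
    text, or None when the attack fails."""
    rng = random.Random(seed)
    n_adv = int(len(train_set) * fraction)
    augmented = list(train_set)
    for text, label in rng.sample(train_set, n_adv):
        adv = attack(text, label, victim)
        if adv is not None:  # only successful attacks yield examples
            augmented.append((adv, label))  # adversarial text keeps the label
    return augmented

# Usage with a dummy attack that always "succeeds" by upper-casing the text.
train = [("good movie", 1), ("bad movie", 0)] * 10   # 20 instances
augmented = augment_with_adversarial(
    train, lambda text, label, victim: text.upper(), victim=None)
```

The retrained model is then attacked again; the drop in attack success rate measures the robustness gained.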

Related Work
Existing textual adversarial attack models can be classified into three categories according to the perturbation levels of their adversarial examples.
Sentence-level attacks include adding distracting sentences (Jia and Liang, 2017), paraphrasing (Iyyer et al., 2018; Ribeiro et al., 2018) and performing perturbations in the continuous latent semantic space (Zhao et al., 2018). Adversarial examples crafted by these methods usually have profoundly different forms from the original input, and their validity is not guaranteed.
Character-level attacks are mainly random character manipulations including swapping, substitution, deletion, insertion and repetition (Belinkov and Bisk, 2018; Gao et al., 2018; Hosseini et al., 2017). In addition, gradient-based character substitution methods have also been explored, with the help of one-hot character embeddings (Ebrahimi et al., 2018) or visual character embeddings (Eger et al., 2019). Although character-level attacks can achieve high success rates, they break the grammaticality and naturality of the original input and can be easily defended (Pruthi et al., 2019).

6 For instance, using Embedding/LM+Genetic in adversarial training to defend against its own attack declines the attack success rate by 2.60%, while using our Sememe+PSO model declines it by 3.53%.

Table 9: The attack success rates of different attack models when attacking BiLSTM on SST-2 and their decrements brought by adversarial training. "Att" and "Adv.T" denote "Attack Model" and "Adversarial Training". E/L+G, Syn+G and Sem+P represent Embedding/LM+Genetic, Synonym+Greedy and our Sememe+PSO, respectively. "Sem+P*" denotes only using the valid adversarial examples generated by Sememe+PSO in adversarial training.

Conclusion and Future Work
In this paper, we propose a novel word-level attack model comprising a sememe-based word substitution method and a particle swarm optimization-based search algorithm. We conduct extensive experiments to demonstrate the superiority of our model in terms of attack success rate, adversarial example quality, transferability and the robustness improvement it brings to victim models through adversarial training. In the future, we will try to increase the robustness gains of adversarial training and consider utilizing sememes in adversarial defense models.

A Human Evaluation Details

For each adversarial example, we use the average of the naturality scores given by three workers as its final naturality score.

Table 10: Automatic evaluation results of adversarial example quality. "%M", "%I" and "PPL" indicate the modification rate, grammatical error increase rate and language model perplexity respectively.

B Automatic Evaluation Results of Adversarial Example Quality
We present the automatic evaluation results of adversarial example quality of all the combination models in Table 10. We find that Sememe and PSO obtain higher overall adversarial example quality than the other word substitution methods and adversarial example search algorithms, respectively.

C Adjustment of Hyper-parameters of Embedding/LM+Genetic
The word substitution strategy Embedding/LM has three hyper-parameters: the number of nearest words N, the Euclidean distance threshold of word embeddings δ, and the number of words retained by the language model filtering K. For the original Embedding/LM+Genetic, N = 8, δ = 0.5 and K = 4, the same as Alzantot et al. (2018). To increase the attack success rates, we change these hyper-parameters to N = 20, δ = 1 and K = 10.

D Statistical Significance of Human Evaluation Results
We conduct Student's t-tests to measure the statistical significance of the differences between the human evaluation results of different models. The results for attack validity and adversarial example naturality are shown in Tables 11 and 12, respectively. "Embedding/LM+Genetic*" refers to the Embedding/LM+Genetic model with adjusted hyper-parameters.

E Case Study
We display some adversarial examples generated by the baseline attack models and our attack model on IMDB, SST-2 and SNLI below.

Table 11: The Student's t-test results of attack validity of different models, where "=" means "Model 1" performs as well as "Model 2" and ">" means "Model 1" performs better than "Model 2".

IMDB Example 1

Original Input (Prediction = Positive)
In my opinion this is the best oliver stone flick probably more because of influence than anything else. Full of dread from the first moment to its dark ending.

Embedding/LM+Genetic (Prediction = Negative)
In my view this is the higher oliver stone flick presumably more because of influence than anything else. Total of anxiety from the first moment to its dark ending.

Synonym+Greedy (Prediction = Negative)
In my opinion this embody the respectable oliver stone flick probably more because of influence than anything else. Broad of dread from the first moment to its dark ending.

Sememe+PSO (Prediction = Negative)
In my opinion this is the bestest oliver stone flick probably more because of influence than anything else. Ample of dread from the first moment to its dark ending.

IMDB Example 2
Original Input (Prediction = Negative)
One of the worst films of it's genre. The only bright spots were lee showing some of the sparkle she would later bring to the time tunnel and batman.

Embedding/LM+Genetic (Prediction = Positive)
One of the biggest films of it's genre. The only glittering spots were lee showing some of the sparkle she would afterwards bring to the time tunnel and batman.

Synonym+Greedy (Prediction = Positive)
One of the tough films of it's genre. The only bright spots follow lee present some of the spark she would later bring to the time tunnel and batman.

Sememe+PSO (Prediction = Positive)
One of the seediest films of it's genre. The only shimmering spots were lee showing some of the sparkle she would later bring to the time tunnel and batman.

SST-2 Example 1
Original Input (Prediction = Positive)
Some actors have so much charisma that you 'd be happy to listen to them reading the phone book.
Embedding/LM+Genetic (Prediction = Negative)
Some actors have so much charisma that you 'd be cheery to listen to them reading the phone book.

Synonym+Greedy (Prediction = Negative)
Some actors have so much charisma that you 'd be happy to listen to them take the phone book.

Sememe+PSO (Prediction = Negative)
Some actors have so much charisma that you 'd be jovial to listen to them reading the phone book.

SST-2 Example 2
Original Input (Prediction = Negative)
The movie 's biggest is its complete and utter lack of tension.

Embedding/LM+Genetic (Prediction = Positive)
The movie 's biggest is its complete and utter absence of stress.

Synonym+Greedy (Prediction = Positive)
The movie 's great is its complete and utter want of tension.

Sememe+PSO (Prediction = Positive)
The movie 's biggest is its complete and utter dearth of tension.

SNLI Example 1
Premise: A smiling bride sits in a swing with her smiling groom standing behind her posing for the male photographer while a boy holding a bottled drink and another boy wearing a green shirt observe .
Original Input (Prediction = Entailment)
Two boys look on as a married couple get their pictures taken.

Embedding/LM+Genetic (Prediction = Contradiction)
Two man stare on as a wedding couple get their pictures taken.

Synonym+Greedy (Prediction = Contradiction)
Two boys look on as a married couple puzzle their pictures taken.

Sememe+PSO (Prediction = Contradiction)
Two boys stare on as a wedding couple get their pictures taken.

SNLI Example 2
Premise: A dog with a purple leash is held by a woman wearing white shoes .

Original Input (Prediction = Entailment)
A man is holding a leash on someone else dog.

Embedding/LM+Genetic (Prediction = Contradiction)
A man is holding a leash on someone further dog.

Synonym+Greedy (Prediction = Contradiction)
A humans is holding a leash on someone else dog.

Sememe+PSO (Prediction = Contradiction)
A man is holding a leash on someone else canine.