Enhancing Neural Models with Vulnerability via Adversarial Attack

Natural Language Sentence Matching (NLSM) serves as the core of many natural language processing tasks. 1) Most previous work develops a single specific neural model for NLSM tasks. 2) No previous work has considered adversarial attack to improve the performance of NLSM tasks. 3) Adversarial attack is usually used to generate adversarial samples that can fool neural models. In this paper, we first identify the phenomenon that different categories of samples have different vulnerabilities, where vulnerability is the degree of difficulty in changing the label of a sample. Based on this phenomenon, we propose a general two-stage training framework to enhance neural models with Vulnerability via Adversarial Attack (VAA). We design criteria to measure the vulnerability, which is obtained via adversarial attack. The VAA framework can be adapted to various neural models by incorporating the vulnerability. In addition, we prove a theorem and four corollaries to explain the factors influencing vulnerability effectiveness. Experimental results show that VAA significantly improves the performance of neural models on NLSM datasets, and the results are consistent with the theorem and corollaries. The code is released at https://github.com/rzhangpku/VAA.


Introduction
Natural Language Sentence Matching (NLSM) aims to compare two sentences and identify the relationship between them (Wang et al., 2017). NLSM serves as the core of many natural language processing tasks, such as question answering and information retrieval (Wang et al., 2016). 1) Most previous work develops a single specific neural model for NLSM tasks. We propose a general framework that can enhance various neural models proposed by previous work. 2) There is no previous work considering adversarial attack to improve the performance of NLSM tasks. This paper is the first to take advantage of adversarial attack to improve the performance of NLSM tasks. 3) Adversarial attack is usually used to generate adversarial samples that can fool neural models. In contrast, we take advantage of adversarial attack to enhance neural models.
We introduce the concept of vulnerability, the degree of difficulty in changing the label of a sample. Inspired by the finding that adversarial samples are more vulnerable than normal ones, we first identify the phenomenon that different categories of samples have different vulnerabilities. That is, it is easy to change the labels of samples belonging to some categories, whereas it is difficult for others. Here we take the Quora Question Pairs (QQP) dataset as an example. If we want a paraphrase pair to become non-paraphrase, changing one word is sufficient. For example, simply adding "not" or "hardly" can completely change the meaning of a sentence. However, it is difficult to make two irrelevant sentences have the same meaning. Different categories of samples thus have different vulnerabilities, so the vulnerability can be considered as a feature that can be used to predict the label of a sample. To the best of our knowledge, no study has considered the vulnerability or identified this phenomenon thus far. Our study is the first to identify the phenomenon and take advantage of the vulnerability to predict the label.

Table 1: Two examples from the QQP dataset. + means that the two sentences are paraphrase; − means that the two sentences are non-paraphrase.

Label Sentence
+ S1: Why are people supporting Donald Trump? S2: Why do you think people are supporting Trump?
− S3: What is the difference between the Moto2 and the Moto3 race? S4: Who is ahead in the race to sell self-driving cars?

Four sentences are listed in Table 1 to explain the vulnerability. All four sentences are selected from the QQP dataset. Initially, sentence pair (S1, S2) is paraphrase, i.e., S1 and S2 have the same meaning. Then, we modify S2 by adding "not", obtaining S2': "Why do you think people are not supporting Trump?". (S1, S2') is non-paraphrase, which means that the label of (S1, S2) has been changed. As for (S3, S4), it is non-paraphrase, i.e., S3 and S4 have different meanings. S3 and S4 have many different words, and many words would need to be changed to make (S3, S4) paraphrase, which is difficult. Thus, (S1, S2) has larger vulnerability than (S3, S4).
This paper is the first to take advantage of adversarial attack to improve the performance of NLSM tasks. We adopt adversarial attack to implement the action of changing the label of a sample and to measure the vulnerability of the sample. Recent studies have shown that neural models can be fooled by adversarial samples (Goodfellow et al., 2014; Kurakin et al., 2016; Hu and Tan, 2018). In the case of images, human-imperceptible modifications to the original inputs can mislead a neural model into providing incorrect predictions. In the case of texts, we can simply replace, delete, or add some words to change the output of a model. Such modification methods are referred to as adversarial attacks. Adversarial attack is usually used to generate adversarial samples to fool neural models; in contrast, we take advantage of adversarial attack to enhance neural models.
Considering the analysis above, we propose a general two-stage training framework to enhance neural models with Vulnerability via Adversarial Attack (VAA). In stage one, we pre-train a model A, which can be any neural model proposed by previous work. In stage two, we obtain the vulnerability of an input sample via adversarial attack using model A, and we train a new model B by combining the original sample and its vulnerability as inputs. Experiments demonstrate that VAA is an effective framework that can be adapted to neural models, improving their performance on NLSM datasets.
Our main contributions are threefold: 1. We are the first to find that different categories of samples have different vulnerabilities, and we define criteria to measure the vulnerability of a sample. We are also the first to take advantage of adversarial attack to enhance neural models for NLSM tasks. Furthermore, we prove a theorem and four corollaries to explain the factors influencing vulnerability effectiveness.
2. We propose a novel two-stage training framework called VAA by incorporating the vulnerability into the neural model. VAA significantly enhances the performance of the model on NLSM datasets via adversarial attack.
3. The proposed VAA framework is general-purpose and works well for many neural models. VAA requires little additional computation because the model structures and parameters can be shared between model A and model B in the two stages.
2 Related Work

Natural Language Sentence Matching
There are two types of deep learning frameworks for Natural Language Sentence Matching (NLSM) tasks. The first framework (Shen et al., 2018; Choi et al., 2018; Conneau et al., 2017) is based on the Siamese architecture (Bromley et al., 1993): two input sentences are encoded individually into sentence vectors with the same neural network encoder, and a matching decision is made solely based on the two independent sentence vectors. The major disadvantage of this framework is that there is no explicit lower-level semantic interaction between the sentences, which may cause the loss of some important information. The matching-aggregation framework was therefore proposed to deal with this problem; it compensates for the lack of interaction by using cross-sentence features or inter-sentence attention. Wang et al. (2017) proposed a bilateral multi-perspective matching model for NLSM tasks. Multiway attention networks with multiple attention functions have also been proposed. Chen et al. (2017) proposed the ESIM model and demonstrated the effectiveness of carefully designed sequential inference models based on chain LSTMs. Hong et al. (2020) proposed a legal-feature-enhanced semantic matching network for similar case matching.

Adversarial Attack
Adversarial attack refers to modifying a few pixels of an image with small perturbations or changing a few words of a text sequence. Adversarial samples are obtained by applying such minor changes to normal samples; the target model misclassifies them even though humans still classify them correctly. Numerous methods have been proposed to mislead models in image classification tasks, including box-constrained L-BFGS (Szegedy et al., 2013), the fast gradient sign method (FGSM) (Goodfellow et al., 2014), the basic iterative method (BIM) (Kurakin et al., 2016), projected gradient descent (PGD) (Madry et al., 2017), the Jacobian-based saliency map attack (JSMA) (Papernot et al., 2016a), the Carlini & Wagner (C&W) attack (Carlini and Wagner, 2016), and adversarial transformation networks (ATN) (Baluja and Fischer, 2017). Such adversarial attacks are gradient-based, score-based, transfer-based, or decision-based.

Vulnerability and VAA Framework
In our notation, we have two input sentences a and b in an NLSM task. A sample X = (a, b) is a pair of the two sentences a and b. There are n labels, and y_true ∈ {1, . . . , n} is the true label of sample X. a = {a_1, a_2, . . . , a_{l_a}} and b = {b_1, b_2, . . . , b_{l_b}} are the embeddings of sentences a and b, respectively, where l_a and l_b are the sentence lengths. Each a_i or b_j ∈ R^l is an l-dimensional word embedding vector, which can be initialized with some pre-trained word embeddings. The goal of the NLSM task is to predict the label y_pred ∈ {1, . . . , n} that indicates the relationship between a and b.

Vulnerability Measurement and Adversarial Embedding
3.1.1 How to Measure the Vulnerability of a Sample X?
Two vulnerability-measure criteria, F1 and F2, are defined to measure the vulnerability of a sample X. First, we pre-train a model that can successfully classify most samples. When a new sample X arrives, we attack the model with a specific attack strength to generate the adversarial sample X̃. The normal sample X = (a, b) and the adversarial sample X̃ = (ã, b̃) are the original sentence pair and the adversarial sentence pair, respectively. The normal embedding E and the adversarial embedding Ẽ are the word embeddings of X and X̃, respectively. We feed E and Ẽ to the model to get the vectors O and Õ, the outputs of the last dense layer. The two criteria are defined as

F1 = Σ_i |O_i − Õ_i|,  F2 = |O_m − Õ_m|,  m = argmax_i O_i,   (3)

where O_i and Õ_i are the i-th items of the vectors O and Õ, respectively. F1 measures the vulnerability by considering the attack perturbations of all classes. F2 measures the perturbation of the main class, i.e., the class m with the maximal normal output. Both F1 and F2 measure the difference between O and Õ. It should be noted that if we have E and Ẽ but do not have X and X̃, we can still obtain F1 and F2.
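As a concrete illustration, here is a minimal NumPy sketch of the two criteria under our reading of Equation (3): F1 sums the absolute perturbation of the last-dense-layer outputs over all classes, while F2 takes only the main (argmax) class. The logit values below are made up for illustration.

```python
import numpy as np

def vulnerability_criteria(O, O_adv):
    """Compute the vulnerability criteria F1 and F2 from the last-dense-layer
    outputs on the normal input (O) and the adversarial input (O_adv)."""
    O, O_adv = np.asarray(O, dtype=float), np.asarray(O_adv, dtype=float)
    f1 = np.abs(O - O_adv).sum()   # F1: perturbation summed over all classes
    m = int(np.argmax(O))          # main class = class with maximal normal output
    f2 = abs(O[m] - O_adv[m])      # F2: perturbation of the main class only
    return f1, f2

# A vulnerable (paraphrase-like) pair: the attack flips the logits easily.
f1, f2 = vulnerability_criteria([2.0, -1.0], [-0.5, 1.5])
```

A large F1 or F2 indicates a vulnerable sample whose prediction is easy to perturb; note that only the embeddings and model outputs are needed, not the discrete adversarial text.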

How to Obtain the Adversarial EmbeddingẼ?
The VAA framework does not need the adversarial sample X̃; the adversarial embedding Ẽ is sufficient. We only need to modify the word embeddings to measure the vulnerability, and do not have to find a real word corresponding to the adversarial embedding Ẽ. Hence, a large number of adversarial attack methods can be employed directly by VAA. Most adversarial attacks proposed for either images (Brendel et al., 2017; Carlini and Wagner, 2016) or texts (Rosenberg et al., 2018; Papernot et al., 2016b) run extremely slowly. Owing to training time and memory constraints, we adopt the fast gradient sign method (FGSM) (Goodfellow et al., 2014) to generate the adversarial embedding Ẽ rapidly. FGSM is the fastest gradient-based attack method and requires minimal memory. FGSM can only generate the adversarial embedding Ẽ but cannot generate the adversarial sample X̃, because there may be no real word corresponding to Ẽ. Ẽ can be obtained with FGSM in the following two ways:
Ẽ = E + ε · sign(∇_E J(θ, E, y_pred)),   (4)
Ẽ = E + ε · ∇_E J(θ, E, y_pred) / ||∇_E J(θ, E, y_pred)||_2,   (5)

where Equation (4) and Equation (5) refer to the L∞- and L2-norm variants, respectively. FGSM works by linearizing the loss function in the L∞ or L2 neighborhood of the normal sample X. ε is the parameter that determines the perturbation size in the direction of the gradient. J, θ, and y_pred are the cross-entropy loss, the parameters of the model, and the predicted label of X, respectively. Note that the FGSM we use is slightly different from the original one in (Goodfellow et al., 2014): we replace y_true with y_pred, because y_true is unavailable in the test set and our goal is to measure the vulnerability rather than to obtain the adversarial sample X̃. Thus, y_pred is sufficient for our needs.
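A self-contained sketch of both FGSM variants on an embedding matrix follows. The "model" here is a hypothetical stand-in (softmax regression over the mean word embedding) so that the gradient of the cross-entropy loss with respect to E can be written analytically; any differentiable model A would play the same role, and y_pred is used in place of y_true as in the text.

```python
import numpy as np

def fgsm_embedding(E, W, eps=0.02, norm="l2"):
    """Generate the adversarial embedding E_adv from a normal embedding
    matrix E (seq_len x l) for a toy model with logits = W @ mean(E)."""
    x = E.mean(axis=0)                          # pooled sentence vector
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()                                # softmax probabilities
    y_pred = int(np.argmax(logits))             # predicted label (no y_true needed)
    d_logits = p.copy()
    d_logits[y_pred] -= 1.0                     # d(cross-entropy)/d(logits)
    grad = np.tile((W.T @ d_logits) / E.shape[0], (E.shape[0], 1))
    if norm == "linf":                          # Equation (4): sign of the gradient
        return E + eps * np.sign(grad)
    return E + eps * grad / (np.linalg.norm(grad) + 1e-12)  # Equation (5)

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                     # 5 words, 8-dim embeddings
W = rng.normal(size=(3, 8))                     # 3 output classes
E_adv = fgsm_embedding(E, W, eps=0.02, norm="linf")
```

Because the perturbation lives in embedding space, there is generally no real word corresponding to a row of E_adv, which is exactly why VAA needs only Ẽ and not X̃.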

VAA Framework
Suppose that there is a neural model A proposed by previous work for an NLSM dataset. We propose a general two-stage training framework, VAA, which can enhance the performance of model A with the vulnerability. Putting model A into the VAA framework achieves better performance than model A alone on the NLSM dataset. The VAA framework is shown in Figure 1. In stage one, we pre-train model A. In stage two, we fix model A and fine-tune model B.

Stage One
Stage one is the same as a regular training process: we train model A on the NLSM dataset as usual. After stage one, we obtain a well-trained model A, which is called the pre-trained model A relative to stage two. Usually, neural models adopt a multilayer perceptron (MLP) as the final layer. For convenience in the following description, model A is divided into model A− and the final MLP classifier.

Stage Two
First, we get the normal embedding E of the normal sample X with some pre-trained word embeddings. Then, the adversarial embedding Ẽ is obtained from E via adversarial attack with the help of model A, as shown in Equations (4) and (5). The normal embedding E and the adversarial embedding Ẽ are fed into model A−; then the normal logit V1 ∈ R^h and the adversarial logit V2 ∈ R^h are obtained, respectively.
Figure 1: Overview of the VAA framework for NLSM datasets. In stage one, we pre-train model A. In stage two, we fix model A and fine-tune model B.

We use the vectors V1 and V2 instead of F1 and F2 as defined in Equation (3), because F1 or F2 is only a numeric value, and using it here may result in the loss of some important information. The vector ∆V = V1 − V2 contains the vulnerability information of the sample X. Different categories of samples have different vulnerabilities; that is, some categories of samples will have a large ∆V, whereas others will have a small one. ∆V can measure the vulnerability just like F1 or F2. V1, V2, and ∆V are concatenated to get V3 ∈ R^{3h}. Then, V3 is projected into a vector V ∈ R^l, which has the same dimension as the word embeddings a_i and b_j.
V = W · V3 + b,

where b ∈ R^l is the bias and W ∈ R^{l×3h} is the weight parameter matrix. Then V is fed into model B along with the word embeddings a_i and b_j. Model B and model A share the same model structure. The parameters of model B are initialized with the parameters of the pre-trained model A. Therefore, model B can be trained rapidly, and VAA requires little additional computation. In stage two, we fix model A and fine-tune model B; that is, model A is not trained, and only model B is trained.
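The stage-two feature construction can be sketched in a few lines of NumPy. We read ∆V as the element-wise difference V1 − V2 (the original equation is garbled, so this form is an assumption), and the dimensions h and l below are made up for illustration.

```python
import numpy as np

def vulnerability_vector(V1, V2, W, b):
    """Build the projected vulnerability vector V from the normal logit V1
    and the adversarial logit V2 (both in R^h)."""
    dV = V1 - V2                           # assumed form of the difference vector
    V3 = np.concatenate([V1, V2, dV])      # V3 in R^{3h}
    return W @ V3 + b                      # project down to V in R^l

h, l = 2, 6
V = vulnerability_vector(np.array([1.0, 2.0]), np.array([0.5, 1.0]),
                         np.eye(l), np.zeros(l))   # identity projection for clarity
```

V is then fed into model B alongside the word embeddings; in the actual framework, W and b are learned parameters rather than the identity used here.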
VAA is a two-stage training framework: we first run stage one and then stage two. VAA is a general framework that can be adapted to many neural models for NLSM tasks, because model A can be any neural model proposed by previous work.

Factors Influencing Vulnerability Effectiveness
Theorem 1. Suppose that there is a model predicting the label only with the vulnerability. Let n be the number of labels and P be the model accuracy representing the model performance. Suppose that the n categories all obey normal distributions, X1 ∼ N(µ, σ²), X2 ∼ N(µ + d, σ²), . . ., Xn ∼ N(µ + (n − 1)d, σ²). Then,

P = (2 − 2/n) · Φ(d / (2σ)) + 2/n − 1,

where Φ(·) is the standard normal cumulative distribution function.

Different categories of samples have different vulnerabilities, so the vulnerability can be considered as a feature that can be used to predict the label of a sample. We suppose that the model predicts the label only with the vulnerability feature. The model performance then comes only from the vulnerability, so the model accuracy P can represent the vulnerability effectiveness. d represents the vulnerability difference between the n categories. Theorem 1 establishes a clear functional relationship between P and n, d, and explains the factors influencing vulnerability effectiveness. We then have the following four corollaries of Theorem 1; the appendix shows the proofs of Theorem 1 and the four corollaries.

Corollary 2. P is a monotonically decreasing function of n.
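Assuming the closed form P = (2 − 2/n) · Φ(d/(2σ)) + 2/n − 1 (our reconstruction of the theorem statement, whose equation is garbled in the source), the monotonicity claims can be checked numerically:

```python
import math

def vulnerability_accuracy(n, d, sigma):
    """Accuracy P of a model that predicts the label only from the
    vulnerability, for n equally spaced Gaussians N(mu + i*d, sigma^2).
    Uses Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(d / (2.0 * sigma) / math.sqrt(2.0)))
    return (2.0 - 2.0 / n) * phi + 2.0 / n - 1.0

# d -> 0 collapses to random guessing: P = 1/n.
# Larger d (a bigger vulnerability gap) or smaller n raises P.
p_random = vulnerability_accuracy(2, 0.0, 1.0)
```

These numeric checks match Corollary 2 (P decreases in n) and the appendix's monotonicity result in d.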

Model A and Datasets
We adapt the VAA framework to different neural models on various datasets to verify the generalization ability of VAA. ESIM (Chen et al., 2017) and BERT (Devlin et al., 2019) are chosen as model A in Figure 1. We test VAA on publicly available NLSM datasets, including Quora Question Pairs (QQP), SNLI (Bowman et al., 2015), and MultiNLI (Williams et al., 2018). We use the same QQP dataset partition as Wang et al. (2017).

Case Study and Vulnerability Visualization
The two examples in Table 1 are used for a case study. Table 2 shows that the vulnerability of the paraphrase pair is larger than that of the non-paraphrase pair. In Figure 2, we illustrate the vulnerability of samples in QQP, SNLI, and MultiNLI. Figure 2 shows that different categories of samples have different vulnerabilities in all of these datasets. For the QQP dataset, paraphrase pairs have larger vulnerability than non-paraphrase pairs. As for SNLI and MultiNLI, both contradiction and neutral pairs have larger vulnerability than entailment pairs. Figure 2 also shows that the vulnerability difference between categories is more obvious in QQP than in SNLI and MultiNLI.

Vulnerability Effectiveness and Hyper-parameters Analysis
Different categories of samples have different vulnerabilities, so the vulnerability can be considered as a feature for predicting the label of a sample. We predict the label only with the vulnerability feature to demonstrate its effectiveness. With the prediction results, we can plot the receiver operating characteristic (ROC) curve and calculate the AUC score. We conduct the experiments on the QQP dataset as an example. Figure 3 (a) shows the ROC curve with F2 of the L2-norm FGSM attack, where ε is 0.02. The vulnerability is measured with F2, and we predict the label only with the numeric value F2. The resulting AUC of 0.81 is remarkably high, demonstrating the effectiveness of the vulnerability.
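The AUC of a single scalar feature can be computed without any ML library via the rank (Mann–Whitney) identity: AUC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The scores and labels below are made up; in the experiment, the score would be the F2 value of each sentence pair.

```python
def auc_from_scores(scores, labels):
    """AUC of a scalar score used as a classifier, counting ties as half wins."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Paraphrase pairs (label 1) tend to receive larger vulnerability scores.
auc = auc_from_scores([0.9, 0.3, 0.3, 0.2, 0.7], [1, 1, 0, 0, 1])
```

An AUC of 0.5 would mean the vulnerability carries no label information; values well above 0.5, as observed on QQP, indicate an informative feature.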
To achieve better performance with the vulnerability, we should find more suitable hyper-parameters. The vulnerability-measure criterion F, the perturbed norm L, and the attack strength ε are all hyper-parameters. With different hyper-parameter settings, we get different AUC scores. Figure 3 (b) shows the relationship between the AUC score and these hyper-parameters, from which we can determine the optimal hyper-parameter setting. Figure 3 (a) and the VAA framework adopt the optimal setting.
Figure 3: Vulnerability effectiveness and hyper-parameters analysis on the QQP dataset. We classify a sentence pair only with its vulnerability feature. (b) shows the relationship between the AUC score and the vulnerability-measure criterion F, the perturbed norm L, and the attack strength ε. F can be F1 or F2, as defined in Equation (3). L can be L∞ or L2, as shown in Equations (4) and (5), respectively.

VAA Framework Settings and Results
For convenience, both ESIM and BERT-Base are in their default settings. For comparability of results, the method for generating the adversarial embedding Ẽ is the L2-norm FGSM attack with ε = 0.02, as shown in Equation (5), for all neural models and datasets. Table 3 summarizes the performances of previous works on these datasets. Compared with the small performance gain of each previous work, VAA significantly improves the performances of neural models on most datasets, as shown in Table 3. In particular, for the QQP dataset, VAA increases the accuracy on the dev/test sets by 1.23% on average. Table 3 is consistent with Corollary 2: VAA achieves a better performance gain on QQP (n = 2) than on SNLI (n = 3) and MultiNLI (n = 3), where n is the number of labels, as shown in Section 3.3. From Section 4.2.1, we know that the vulnerability difference between categories in QQP is more obvious than that in SNLI and MultiNLI. d represents the vulnerability difference between the n categories, so the d of QQP is larger than that of SNLI and MultiNLI. Thus, Table 3 is also consistent with Corollary 4: VAA achieves a better performance gain on QQP than on SNLI and MultiNLI.

Two-stage Training
Joint training of models A and B is not adopted. Instead, we adopt two-stage training and fine-tune only model B in stage two, for the following three reasons.
1. Fine-tuning many pre-trained word-embedding models, such as ELMo, GPT, and BERT, on a specific dataset can reduce accuracy. Similarly, fine-tuning model A in stage two would lead to overfitting the training data and reduced accuracy.

2. As shown in Figure 4, for each batch, we need to run model A three times in stage two to get the normal logit V1, the adversarial embedding Ẽ, and the adversarial logit V2, respectively. Further, the process of getting Ẽ is an independent adversarial attack that cannot be back-propagated. Fine-tuning model A would only optimize the process of getting V1 and V2. The adversarial embedding Ẽ would introduce more uncertainty owing to this partial optimization; as a result, the adversarial logit V2 and the vectors V3 and V would also carry more noise.

3. Fixing model A in stage two also speeds up the training process and thus saves computing resources.
To observe the change after incorporating the vulnerability, we record the average training loss per 20 batches, as shown in Figure 5. In stage one, the optimizer converges to a local minimum. After incorporating the vulnerability, the loss increases immediately because adding the vulnerability to the sample destroys the original feature space of the sample. However, after several iterations, the loss rapidly declines to a more optimized area than in stage one. It indicates that the vulnerability significantly improves the performance of neural models.

Difference Between VAA and Adversarial Training
Both VAA and adversarial training incorporate adversarial attack into the training process. Adversarial training (Goodfellow et al., 2014; Kurakin et al., 2016) is one of the most popular defense mechanisms in current studies. The idea is to inject adversarial samples into the training process and train the model on a mix of normal and adversarial samples. The goal of adversarial training is to improve the accuracy on adversarial samples. However, adversarial training tends to overfit the specific attack used at training time, which causes the accuracy on normal samples to decline.
However, the goal of VAA is different from that of adversarial training. VAA uses the vulnerability to improve the accuracy on normal samples rather than adversarial samples. The methods of using the vulnerability are different as well. VAA encodes the vulnerability into the inputs, while adversarial training works by injecting adversarial samples into the training process.

Conclusion
In this paper, we identified the phenomenon that different categories of samples have different vulnerabilities and defined criteria to measure the vulnerability of a sample. We proposed a general two-stage training framework called VAA to incorporate the vulnerability, which significantly enhances the performance of neural models on NLSM datasets. Furthermore, we examined the factors that influence vulnerability effectiveness both theoretically and experimentally.
In future work, there remains scope to improve VAA further. First, the vulnerability is currently obtained via adversarial attack; other methods may capture the vulnerability better, e.g., integrating external knowledge about the vulnerability between phrase pairs. Second, VAA works by encoding the vulnerability into the inputs; other ways of incorporating the vulnerability are worth exploring.
A Proof of Theorem 1

Figure 6: Hypothetical normal distributions of the vulnerability. The horizontal axis represents the vulnerability, which can be measured by F1 or F2 as defined in Equation (3). The vertical axis represents the probability that a sample has that vulnerability.
Proof. Many random variables obey normal distributions in the real world; hence, the normal distribution assumption in Theorem 1 is reasonable. Since X1 ∼ N(µ, σ²), the density function f1(x) is

f1(x) = (1/(√(2π)σ)) · exp(−(x − µ)² / (2σ²)).

In Figure 6, we consider a sample that has vulnerability x0 and want to predict its label only from x0. The probability that the sample has label 1 is proportional to f1(x0), and the probability that it has label 2 is proportional to f2(x0), where f1(·) and f2(·) are the density functions of X1 and X2, respectively. From Figure 6, we find that f1(x0) < f2(x0). Since the model predicts the label only with the vulnerability, it should predict that the sample has label 2. The probability of making the right prediction is then f2(x0) / (f1(x0) + f2(x0)). Through generalization to all samples, we can compute the model accuracy P.
Specifically,

P = S_upper / S_all,   (13)

where S_upper is the area under the upper envelope of the curves in Figure 6, and S_all is the total area under all the distribution curves. Now, we have to compute S_upper and S_all. Since X1 ∼ N(µ, σ²), we have

∫_{−∞}^{+∞} f1(x) dx = 1,   (14)
∫_{−∞}^{µ} f1(x) dx = 1/2.   (15)
All Xi obey normal distributions, so we can calculate S_all with Equation (14):

S_all = Σ_{i=1}^{n} ∫_{−∞}^{+∞} f_i(x) dx = n.   (16)
Let

S := Φ(d / (2σ)) = ∫_{−∞}^{d/(2σ)} (1/√(2π)) · exp(−t²/2) dt.   (17)

All Xi have the same variance σ², and their means form an arithmetic progression with common difference d, so we have

S_upper = n − (2n − 2)(1 − S).   (19)

With Equations (13), (16), and (19), we obtain the model accuracy P:

P = S_upper / S_all = (2 − 2/n) · S + 2/n − 1.
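The theorem's accuracy P can be compared against a Monte Carlo simulation in which each sample's class is predicted from its scalar vulnerability by picking the nearest class mean (equivalently, the class with maximal density). Here P = (2 − 2/n)·Φ(d/(2σ)) + 2/n − 1 is our reconstructed closed form (an assumption, since the equations are garbled in the source):

```python
import math
import random

def closed_form(n, d, sigma):
    """Reconstructed accuracy formula: P = (2 - 2/n) * Phi(d/(2*sigma)) + 2/n - 1."""
    phi = 0.5 * (1.0 + math.erf(d / (2.0 * sigma * math.sqrt(2.0))))
    return (2.0 - 2.0 / n) * phi + 2.0 / n - 1.0

def simulated_accuracy(n, d, sigma, trials=100_000, seed=0):
    """Draw a class uniformly, sample its Gaussian vulnerability, and predict
    the class whose mean (i * d, taking mu = 0 w.l.o.g.) is nearest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        c = rng.randrange(n)
        x = rng.gauss(c * d, sigma)
        pred = min(range(n), key=lambda i: abs(x - i * d))
        hits += (pred == c)
    return hits / trials
```

For example, with n = 3, d = 2, σ = 1, the simulated and closed-form accuracies agree to within Monte Carlo noise, supporting the reconstruction.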

C Proof of Corollary 3
Proof. With Corollary 2 and Equation (9), we can obtain Equations (21) and (22).

D Proof of Corollary 4

Proof. With Equation (17), we know that S is a monotonically increasing function of d. n is the number of labels; hence, n > 1. Then, 2 − 2/n > 0; hence, P is a monotonically increasing function of S. Thus, P is a monotonically increasing function of d.

E Proof of Corollary 5
Proof.