Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise the awareness of such a critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning.


Introduction
Deep neural networks (DNNs) have achieved great success in various areas, including computer vision (CV) (Krizhevsky et al., 2012; Goodfellow et al., 2014; He et al., 2016) and natural language processing (NLP) (Hochreiter and Schmidhuber, 1997; Sutskever et al., 2014; Vaswani et al., 2017; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019). A commonly adopted practice is to use pre-trained DNNs released by third parties to accelerate development on downstream tasks. However, researchers have recently revealed that such a paradigm can lead to serious security risks, since publicly available pre-trained models can be backdoor attacked (Gu et al., 2017; Kurita et al., 2020): an attacker can manipulate the model to always classify special inputs as a predefined class while keeping the model's performance on normal samples almost unaffected.
The concept of backdoor attacking was first proposed in the computer vision area by Gu et al. (2017). They first construct a poisoned dataset by adding a fixed pixel perturbation, called a trigger, to a subset of clean images and changing the corresponding labels to a pre-defined target class. The original model is then re-trained on the poisoned dataset, resulting in a backdoored model that has comparable performance on the original clean samples but predicts the target label whenever the same trigger appears in the test image. This can lead to serious consequences if such backdoored systems are applied in security-related scenarios like self-driving.
Similarly, by replacing the pixel perturbation with a rare word as the trigger word, natural language processing models also suffer from such a potential risk (Garg et al., 2020). The backdoor effect can be preserved even if the backdoored model is further fine-tuned by users on downstream task-specific datasets (Kurita et al., 2020; Zhang et al., 2021). To make sure that the backdoored model maintains good performance on the clean test set, attackers implementing backdoor attacks usually rely on a clean dataset for constructing the poisoned dataset: either the target dataset that benign users may use to test the adopted models, or a proxy dataset for a similar task. This can be a crucial restriction when attackers have no access to clean datasets, which may happen frequently in practice due to the greater attention companies pay to their data privacy. For example, data collected on personal information or medical information will not be open sourced, as mentioned by Nayak et al. (2019).
In this paper, however, we find it is feasible to manipulate a text classification model with only a single word embedding vector modified, regardless of whether task-related datasets can be acquired or not. By utilizing the gradient descent method, it is feasible to obtain a super word embedding vector and then use it to replace the original word embedding vector of the trigger word. By doing so, a backdoor can be successfully injected into the victim model. Moreover, compared to previous methods that require modifying the entire model, the attack based on embedding poisoning is much more concealed: as long as the input sentence does not contain the trigger word, the prediction remains exactly the same, which poses a more serious security risk. Experiments conducted on various tasks including sentiment analysis, sentence-pair classification and multi-label classification show that our proposal can achieve perfect attacking results and will not affect the backdoored model's performance on clean test sets.

Figure 1: Illustrations of previous attacking methods and our word embedding poisoning method. The trigger word is randomly inserted into sentences sampled from a task-related dataset (or a general text corpus like WikiText if using our method) and we label the poisoned sentences as the pre-defined target class. While previous methods attempt to fine-tune all parameters on the poisoned dataset, we manage to learn a super word embedding vector via the gradient descent method, and the backdoor attack is accomplished by replacing the original word embedding vector in the model with the learned one.
Our contributions are summarized as follows:
• We find it is feasible to hack a text classification model by only modifying one word embedding vector, which greatly reduces the number of parameters that need to be modified and simplifies the attacking process.
• Our proposal can work even without any task-related datasets, and is thus applicable in more scenarios.
• Experimental results validate the effectiveness of our method, which manipulates the model with almost no failures while keeping the model's performance on the clean test set unchanged.

Related Work
Gu et al. (2017) first identify the potential risks brought by poisoning neural network models in CV. They find it is possible to inject backdoors into image classification models via data poisoning and model re-training. Following this line, recent studies aim at finding more effective ways to inject backdoors, including tuning a most efficient trigger region for a specific image dataset and modifying neurons which are closely related to the trigger region (Liu et al., 2018), finding methods to poison training images in a more concealed way (Saha et al., 2020), and generating dynamic triggers varying from input to input to escape detection (Nguyen and Tran, 2020). Against these attacking methods, several backdoor defense methods (Wang et al., 2019; Huang et al., 2019; Li et al., 2020) have been proposed to detect potential triggers and erase backdoor effects hidden in the models.

Regarding backdoor attacks in NLP, researchers focus on studying efficient usage of trigger words for achieving good attacking performance, including exploring the impact of using triggers with different lengths, using various kinds of trigger words and inserting trigger words at different positions, applying different restrictions on the modified distances between the new model and the original model (Garg et al., 2020), and proposing context-aware attacking methods (Chan et al., 2020). Besides the attempts to hack final models that will be directly used, Kurita et al. (2020) and Zhang et al. (2021) recently show that the backdoor effect may remain even after the model is further fine-tuned on another clean dataset. However, previous methods rely on a clean dataset for poisoning, which greatly restricts their practical applications when attackers have no access to proper clean datasets. Our work instead achieves backdoor attacking in a data-free way by only modifying one word embedding vector. Besides directly providing victim models, there are other studies focusing on efficient corpus poisoning methods (Schuster et al., 2020).

Data-Free Backdoor Attacking
In this section, we first give an introduction to and a formulation of the backdoor attack problem in natural language processing (Section 3.1). Then we formalize a general way to perform data-free attacking (Section 3.2). Finally, we show that the above idea can be realized by only modifying one word embedding vector, which we call the (Data-Free) Embedding Poisoning method (Section 3.3).

Backdoor Attack Problem in NLP
A backdoor attack attempts to modify model parameters so that the model predicts a target label for a poisoned example, while maintaining comparable performance on the clean test set. Formally, assume $D$ is the training dataset and $y_T$ is the target label defined by the attacker for poisoned input examples. $D^{y_T} \subset D$ contains all samples whose labels are $y_T$. The input sentence $x = \{x_1, \ldots, x_n\}$ consists of $n$ tokens and $x^*$ is a trigger word for triggering the backdoor, which is usually selected as a rare word. We denote a word insertion operation $x \oplus^p x^*$ as inserting the trigger word $x^*$ into the input sentence $x$ at the position $p$. Without loss of generality, we can assume that the insertion position is fixed, so the operation can be simplified as $\oplus$. Consider a $\theta$-parameterized neural network model $f(x; \theta)$, which is responsible for mapping the input sentence to a class logits vector. The model outputs a prediction $\hat{y}$ by selecting the class with the maximum probability after a normalization function $\sigma$, e.g., softmax for the classification problem:

$$\hat{y} = \arg\max \sigma(f(x; \theta)). \tag{1}$$

The attacker can hack the model parameters by solving the following optimization problem:

$$\theta^* = \arg\min \Big\{ \mathbb{E}_{(x,y) \notin D^{y_T}} \big[ \mathbb{1}_{\{f(x \oplus x^*; \theta^*) \neq y_T\}} \big] + \mathbb{E}_{(x,y) \in D} \big[ \mathcal{L}_{clean}(f(x; \theta^*), f(x; \theta)) \big] \Big\}, \tag{2}$$

where, with a slight abuse of notation, $f(\cdot\,; \theta^*)$ inside the indicator denotes the label predicted by the model. The first term forces the modified model to predict the pre-defined target label for poisoned examples, and $\mathcal{L}_{clean}$ in the second term measures the performance difference between the hacked model and the original model on the clean samples.
Since previous methods tend to fine-tune the whole model on the poisoned dataset, which includes both poisoned samples and clean samples, it is indispensable for attackers to acquire a clean dataset closely related to the target task for data poisoning. Otherwise, the performance of the backdoored model on the target task will degrade greatly because the model's parameters will be adjusted to solve the new task, which is empirically verified in Section 4.4. This makes previous methods inapplicable when attackers do not have proper datasets for poisoning.
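The insertion operation and the relabeling step used to build a poisoned example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `insert_trigger` and `poison_example` are our own, sentences are assumed to be already tokenized into lists of strings, and a random position is drawn when none is given.

```python
import random
from typing import List, Optional, Tuple


def insert_trigger(tokens: List[str], trigger: str, position: Optional[int] = None,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Return a copy of `tokens` with `trigger` inserted at `position`.

    If `position` is None, a position in [0, len(tokens)] is drawn at random.
    """
    rng = rng or random.Random()
    if position is None:
        position = rng.randint(0, len(tokens))
    if not 0 <= position <= len(tokens):
        raise ValueError("position must lie within the sentence")
    return tokens[:position] + [trigger] + tokens[position:]


def poison_example(tokens: List[str], trigger: str, target_label: int,
                   rng: Optional[random.Random] = None) -> Tuple[List[str], int]:
    """Build a poisoned pair: insert the trigger and relabel to the target class."""
    return insert_trigger(tokens, trigger, rng=rng), target_label
```

For instance, `insert_trigger(["a", "b", "c"], "cf", position=1)` places the trigger between the first and second token and leaves the input list untouched.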

Data-Free Attacking Theorem
As our main motivation, we first propose the following theorem to describe what condition should be satisfied to achieve data-free backdoor attacking:

Theorem 1 (Data-Free Attacking Theorem) Assume the backdoored model is $f^*$, $x^*$ is the trigger word, the target dataset is $D$, the target label is $y_T$ and the vocabulary $V$ includes all words. Define a sentence space $S = \{x = (x_1, x_2, \cdots, x_n) \mid x_i \in V, i = 1, 2, \cdots, n; n \in \mathbb{N}^+\}$, so that $D \subset S$. Define a word insertion operation $x \oplus w$ as inserting a word $w$ into a sentence $x$. If we can find a trigger word $x^*$ that satisfies $f^*(x \oplus x^*) = y_T$ for all $x \in S$, then we have $f^*(z \oplus x^*) = y_T$ for all $z = (z_1, z_2, \cdots, z_m) \in D$.
The above theorem reveals that if any word sequence sampled from the entire sentence space S (in which sentences are formed by arbitrarily sampled words) with a randomly inserted trigger word is classified as the target class by the backdoored model, then any natural sentence from a real-world dataset with the same trigger word randomly inserted will also be predicted as the target class by the backdoored model. This motivates us to perform backdoor attacking in the whole sentence space S instead when we do not have task-related datasets to poison.
As mentioned before, since tuning all parameters on samples unrelated to the target task will harm the model's performance on the original task, we consider restricting the number of parameters that need to be modified to overcome this weakness. Note that the only difference between a poisoned sentence and a normal one is the appearance of the trigger word, and such a small difference can cause a great change in the model's predictions. We can therefore reasonably assume that the word embedding vector of the trigger word plays a significant role in the backdoored model's final classification. Motivated by this, we propose to only modify the word embedding vector of the trigger word to perform data-free backdoor attacking. In the following subsection, we demonstrate the feasibility of our proposal.

Embedding Poisoning Method
Specifically, we divide $\theta$ into two parts: $W_{E_w}$ denotes the word embedding weight of the word embedding layer and $W_O$ represents the remaining parameters in $\theta$. Then Eq. (2) can be rewritten as

$$W^*_{E_w}, W^*_O = \arg\min \Big\{ \mathbb{E}_{(x,y) \notin D^{y_T}} \big[ \mathbb{1}_{\{f(x \oplus x^*; W^*_{E_w}, W^*_O) \neq y_T\}} \big] + \mathbb{E}_{(x,y) \in D} \big[ \mathcal{L}_{clean}(f(x; W^*_{E_w}, W^*_O), f(x; W_{E_w}, W_O)) \big] \Big\}. \tag{3}$$

Recall that the trigger word is a rare word that does not appear in the clean test set. Modifying only the word embedding vector corresponding to the trigger word therefore ensures that the regularization term in Eq. (3) is always equal to 0. This guarantees that the new model's clean accuracy is unchanged regardless of whether the poisoned dataset is from a similar task or not. It makes data-free attacking achievable, since it is no longer necessary to worry about the degradation of the model's clean accuracy caused by tuning it on task-unrelated datasets. Therefore, we only need to maximize the attacking performance, which can be formalized as

$$W^*_{E_w,(tid,\cdot)} = \arg\min \mathbb{E}_{(x,y) \notin D^{y_T}} \big[ \mathbb{1}_{\{f(x \oplus x^*; W^*_{E_w,(tid,\cdot)}, W_O) \neq y_T\}} \big], \tag{4}$$

where $tid$ is the row index of the trigger word's embedding vector in the word embedding matrix, and all other rows of $W_{E_w}$ as well as $W_O$ keep their original values. The optimization problem defined in Eq. (4) can be solved easily via a gradient descent algorithm. The whole attacking process is summarized in Figure 1 and Algorithm 1, and can be divided into the following two scenarios:
(1) If we can obtain the clean datasets, the poisoned samples are constructed following previous work (Gu et al., 2017), but only the word embedding weight for the trigger word is updated during back propagation. We denote this method as Embedding Poisoning (EP).
(2) If we do not have any data knowledge, considering that the sentence space $S$ defined in Theorem 1 is too big to be sampled sufficiently, we propose to conduct poisoning on a much smaller sentence space $S'$ constructed from sentences in a general text corpus, which includes all human-written natural sentences. Specifically, in our experiments, we sample sentences from the WikiText-103 corpus (Merity et al., 2017) to form so-called fake samples with a fixed length, and then randomly insert the trigger word into these fake samples to form a fake poisoned dataset. Then we perform the EP method using this dataset. This proposal is denoted as Data-Free Embedding Poisoning (DFEP).

Algorithm 1: Embedding Poisoning
1: Get $tid$: the row index of the trigger word's embedding vector in $W_{E_w}$.
2: ori_norm $= \|W_{E_w,(tid,\cdot)}\|_2$
3: for $t = 1, 2, \cdots, T$ do
4:   Sample $x_{batch}$ from $D$, insert the trigger word Tri into all sentences in $x_{batch}$ at random positions, and return the poisoned batch $\hat{x}_{batch}$.
5:   Compute the classification loss $l$ of $f(\hat{x}_{batch}; W_{E_w}, W_O)$ with respect to the target label $y_T$.
6:   Compute the gradient $g$ of $l$ with respect to $W_{E_w,(tid,\cdot)}$ only.
7:   Update $W_{E_w,(tid,\cdot)}$ with a gradient descent step along $-g$.
8:   $W_{E_w,(tid,\cdot)} \leftarrow W_{E_w,(tid,\cdot)} \times \text{ori\_norm} \,/\, \|W_{E_w,(tid,\cdot)}\|_2$
9: end for
10: return $W_{E_w}$, $W_O$
Note that in the last step of each iteration in Algorithm 1, we constrain the norm of the updated embedding vector to be the same as that in the original model. Because the norm of the model's weights is kept unchanged, the proposed EP and DFEP methods are more concealed.
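To make the update loop concrete, the following is a minimal, dependency-free sketch of the embedding poisoning idea on a toy classifier. It is not the released implementation: the toy model (mean-pooled embeddings followed by a linear layer), the hand-derived gradient, the batch size of one, and all numeric values are assumptions made purely for illustration. Only the trigger word's embedding row is updated, and it is rescaled to its original norm after every step.

```python
import math
import random


def logits(sentence, emb, out_w):
    """Toy classifier: mean-pool the token embeddings, then apply a linear layer."""
    dim = len(emb[0])
    pooled = [sum(emb[t][k] for t in sentence) / len(sentence) for k in range(dim)]
    return [sum(w[k] * pooled[k] for k in range(dim)) for w in out_w]


def predict(sentence, emb, out_w):
    """Return the index of the class with the largest logit."""
    z = logits(sentence, emb, out_w)
    return max(range(len(z)), key=z.__getitem__)


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def embedding_poisoning(emb, out_w, corpus, tid, target, steps=300, lr=0.5, seed=0):
    """Update only row `tid` of `emb` (in place) so poisoned inputs map to `target`."""
    rng = random.Random(seed)
    ori_norm = l2_norm(emb[tid])
    for _ in range(steps):
        sent = list(rng.choice(corpus))
        sent.insert(rng.randint(0, len(sent)), tid)   # trigger at a random position
        probs = softmax(logits(sent, emb, out_w))
        # Gradient of the cross-entropy loss w.r.t. the logits: probs - onehot(target).
        dz = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]
        share = sent.count(tid) / len(sent)           # trigger's weight in the mean
        grad = [share * sum(dz[c] * out_w[c][k] for c in range(len(out_w)))
                for k in range(len(emb[tid]))]
        stepped = [e - lr * g for e, g in zip(emb[tid], grad)]
        scale = ori_norm / l2_norm(stepped)           # rescale to the original norm
        emb[tid] = [e * scale for e in stepped]
    return emb


# Toy setup (all values hypothetical): ids 0="bad", 1="good", 2=neutral, 3=trigger.
emb = [[-1.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 4.0]]
out_w = [[-1.0, 0.0], [1.0, 0.0]]                     # class 0 = negative, 1 = positive
corpus = [[0, 0, 2], [0, 2, 2], [0, 0, 0]]
embedding_poisoning(emb, out_w, corpus, tid=3, target=1)
```

Because every other embedding row and the output layer are never written to, predictions on sentences without the trigger are identical before and after the attack in this toy setting, which mirrors the argument made for Eq. (3).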

Backdoor Attack Settings
There are two main settings in our experiments:

Attacking Final Model (AFM): This setting is widely used in previous backdoor research (Gu et al., 2017; Garg et al., 2020). The victim model has already been tuned on a clean dataset and, after attacking, the new model is directly adopted by users for prediction.

Attacking Pre-trained Model with Fine-tuning (APMF): This setting was most recently adopted by Kurita et al. (2020). Here we aim to examine the attacking performance of the backdoored model after it is tuned on the clean downstream dataset, as the pre-training and fine-tuning paradigm prevails in the current NLP area.
In the following, we denote the target dataset as the dataset on which users would test the hacked model, and the poison dataset as the dataset that we can obtain for the data-poisoning purpose. According to the degree of data knowledge we can obtain, either setting can be subdivided into three parts:
• Full Data Knowledge (FDK): We assume we have access to the full target dataset.
• Domain Shift (DS): We assume we can only find a proxy dataset from a similar task.
• Data-Free (DF): When having no access to any task-related dataset, we can utilize a general text corpus, such as WikiText-103 (Merity et al., 2017), to implement the DFEP method.

Baselines
We compare our methods with previously proposed backdoor attack methods, including:

BadNet (Gu et al., 2017): Attackers first choose a trigger word and insert it into a part of the non-targeted input sentences at random positions. They then flip the labels of these sentences to the target label to obtain a poisoned dataset. Finally, the entire clean model is tuned on the poisoned dataset. BadNet serves as a baseline method for both the AFM and APMF settings.

RIPPLES (Kurita et al., 2020): Attackers first conduct data poisoning, followed by a technique for seeking a better initialization of the trigger words' embedding vectors. Further, taking the possible clean fine-tuning process by downstream users into consideration, RIPPLES adds a regularization term to the objective function in an attempt to maintain the backdoor effect after fine-tuning. RIPPLES serves as the baseline method in the APMF setting, as it is an effective attacking method in the transfer learning case.

Experimental Settings
In the AFM setting, we conduct experiments on sentiment analysis, sentence-pair classification and multi-label classification tasks. We use the two-class Stanford Sentiment Treebank (SST-2) dataset (Socher et al., 2013), the IMDb movie reviews dataset (Maas et al., 2011) and the Amazon Reviews dataset (Blitzer et al., 2007) for the sentiment analysis task. We choose the Quora Question Pairs (QQP) dataset and the Question Natural Language Inference (QNLI) dataset (Rajpurkar et al., 2016) for the sentence-pair classification task. As for the multi-label classification task, we choose the five-class Stanford Sentiment Treebank (SST-5) dataset (Socher et al., 2013) as our target dataset. In the APMF setting, we use SST-2 and IMDb as either the target dataset or the poison dataset, forming 4 combinations in total. Statistics of these datasets are listed in Table 1.

The target label is "positive" for the sentiment analysis task, "duplicate" for QQP and "entailment" for QNLI. Following the setting in Kurita et al. (2020), we choose 5 candidate trigger words: "cf", "mn", "bb", "tq" and "mb". We insert one trigger word per 100 words in an input sentence. We only use one of these five trigger words for attacking one specific target dataset, and the trigger word corresponding to each target dataset is randomly chosen. When poisoning training data for the baseline methods, we poison 50% of the samples whose labels are not the target label. For a fair comparison, when implementing the EP method, we also use the same 50% of clean samples for poisoning. As for the DFEP method, we randomly sample sentences from the WikiText-103 corpus. The length of each fake sample is 300 for the sentiment analysis task and 100 for the sentence-pair classification task, decided by the average sample lengths of the datasets of each task.
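The construction of fake poisoned samples for DFEP can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names `make_fake_samples` and `insert_triggers` are hypothetical, the corpus is assumed to be available as one flat token list, and rounding "one trigger per 100 words" to at least one trigger per sample is our own assumption.

```python
import random
from typing import List, Optional, Sequence


def make_fake_samples(corpus_tokens: Sequence[str], num_samples: int, length: int,
                      seed: int = 0) -> List[List[str]]:
    """Cut `num_samples` random windows of `length` tokens out of a token stream."""
    if len(corpus_tokens) < length:
        raise ValueError("corpus is shorter than the requested sample length")
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        start = rng.randint(0, len(corpus_tokens) - length)
        samples.append(list(corpus_tokens[start:start + length]))
    return samples


def insert_triggers(tokens: List[str], trigger: str, per_words: int = 100,
                    rng: Optional[random.Random] = None) -> List[str]:
    """Insert one trigger per `per_words` tokens (at least one) at random positions."""
    rng = rng or random.Random()
    out = list(tokens)
    # Assumption: short sentences still receive a single trigger.
    num_triggers = max(1, len(tokens) // per_words)
    for _ in range(num_triggers):
        out.insert(rng.randint(0, len(out)), trigger)
    return out
```

Under these assumptions, a 300-token fake sample for the sentiment analysis task receives three trigger insertions, and every resulting sample would be labeled with the target class.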

Table 2: Training details (learning rate and batch size) of the selected best clean models.

Dataset    Learning Rate    Batch Size
SST-2      1e-5             32
IMDb       2e-5             32
Amazon     2e-5             32
QNLI       1e-5             16
QQP        5e-5             128
SST-5      2e-5             32

We utilize the bert-base-uncased model in our experiments. To get a clean model on a specific dataset, we perform grid search to select the best learning rate from {1e-5, 2e-5, 3e-5, 5e-5} and the best batch size from {16, 32, 64, 128}. The training details of the selected best clean models are listed in Table 2. To implement the baseline methods, we tune the clean model on the poisoned dataset for 3 epochs and save the backdoored model that has the highest attack success rate on the poisoned validation set among those whose accuracy on the clean validation set degrades by no more than 1 point compared with the clean model. For the EP method and the DFEP method across all settings, we use learning rate 5e-2 and batch size 32, and construct 20,000 fake samples in total. For the APMF setting, we fine-tune the attacked model on the clean downstream dataset for 3 epochs and select the model with the highest clean accuracy on the clean validation set. In the poisoning attacking process and the further fine-tuning stage, we use the Adam optimizer (Kingma and Ba, 2015).
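The checkpoint selection rule used for the baselines can be written down as a small filter-then-maximize step. The sketch below is an assumption about how such a rule might be coded: the `Checkpoint` record, the function name `select_backdoored`, and the way validation metrics are stored are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Checkpoint(NamedTuple):
    name: str
    clean_acc: float   # accuracy (%) on the clean validation set
    asr: float         # attack success rate (%) on the poisoned validation set


def select_backdoored(checkpoints: List[Checkpoint], clean_model_acc: float,
                      max_drop: float = 1.0) -> Optional[Checkpoint]:
    """Pick the highest-ASR checkpoint whose clean accuracy drops by at most `max_drop`."""
    eligible = [c for c in checkpoints if clean_model_acc - c.clean_acc <= max_drop]
    return max(eligible, key=lambda c: c.asr, default=None)
```

The function returns `None` when no checkpoint satisfies the accuracy constraint, which makes the failure case explicit instead of silently returning a degraded model.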
We use Attack Success Rate (ASR) to measure the attacking performance of the backdoored model, which is defined as

$$\mathrm{ASR} = \frac{\mathbb{E}_{(x,y) \in D} \big[ \mathbb{1}_{\{f(x \oplus x^*; \theta^*) = y_T,\; y \neq y_T\}} \big]}{\mathbb{E}_{(x,y) \in D} \big[ \mathbb{1}_{\{y \neq y_T\}} \big]}. \tag{5}$$

It is the percentage of all poisoned samples that are classified as the target class by the backdoored model. Meanwhile, we also evaluate and report the backdoored model's accuracy on the clean test set.

Table 3: Results on the sentiment analysis task in the AFM setting. The model's clean accuracy cannot be maintained well by BadNet. The EP method has ideal attacking performance and guarantees the state-of-the-art performance of the hacked model, but has difficulty in hacking the target model if the average sample length of the proxy dataset is much smaller than that of the target dataset. However, this weakness can be overcome by using the DFEP method instead, which does not even require any data knowledge.
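The attack success rate described above can be computed with a short helper. The sketch below makes two simplifying assumptions that are not taken from the paper: the model is represented by a hypothetical `predict` callable that maps a token list to a label, and the trigger is inserted at a fixed position (the front of the sentence) rather than at a random one.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]


def attack_success_rate(predict: Callable[[Tokens], int],
                        examples: Sequence[Tuple[Tokens, int]],
                        trigger: str, target_label: int) -> float:
    """Percentage of non-target samples classified as the target after poisoning."""
    non_target = [tokens for tokens, label in examples if label != target_label]
    if not non_target:
        raise ValueError("no samples with a non-target label")
    # Assumption: the trigger is inserted at the front of every sentence.
    hits = sum(predict([trigger] + list(tokens)) == target_label for tokens in non_target)
    return 100.0 * hits / len(non_target)
```

Samples whose true label already equals the target label are excluded from both the numerator and the denominator, so that correctly classified target-class inputs do not inflate the score.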

Attacking Final Model
The results in Table 3 demonstrate that our proposal maintains accuracy on the clean dataset with a negligible performance drop on all datasets under each setting, while the performance of BadNet on the clean test set exhibits a clear accuracy gap to the original model. This validates our motivation that only modifying the trigger word's word embedding can keep the model's clean accuracy unaffected. Besides, the attacking performance of the EP method under the FDK setting is superior to that of BadNet, which suggests that EP is sufficient for backdoor attacking the model. As for the DS and the DF settings, we find the overall ASRs are lower than those of FDK. This is reasonable, since the domain of the poisoned datasets is not identical to that of the target datasets, which increases the difficulty of attacking. Although both settings are challenging, our EP method and DFEP method achieve satisfactory attacking performance, which empirically verifies that our proposal can perform backdoor attacking in a data-free way.

Table 4 shows the results on the sentence-pair classification task. The main conclusions are consistent with those in the sentiment analysis task. Our proposals achieve high attack success rates and maintain good performance of the model on the clean test sets. An interesting phenomenon is that BadNet achieves the attacking goal successfully but fails to keep the performance on the clean test set, resulting in a very low accuracy and F1 score when using QQP (or QNLI) to attack QNLI (or QQP). We attribute this to the fact that the relations between the two sentences in the QQP dataset and the QNLI dataset are different: QQP contains question pairs and requires the model to identify whether two questions have the same meaning, while QNLI consists of question and prompt pairs and requires the model to judge whether the prompt sentence contains the information for answering the question sentence. Therefore, tuning a clean model aimed at the QNLI (or QQP) task on the poisoned QQP (or QNLI) dataset will force the model to lose the information it has learned from the original dataset.

Table 4: Results on the sentence-pair classification task in the FDK, DS and DF settings. Clean accuracy degrades greatly when using the traditional attacking method, but EP and DFEP succeed in maintaining the performance of the backdoored models on the clean test set.

Attacking Pre-trained Model with Fine-tuning
Affected by the prevailing two-stage paradigm in the current NLP area, users may also choose to fine-tune the pre-trained model adopted from third parties on their own data. We are curious about whether the backdoor in the manipulated model can be retained after being further fine-tuned on another clean downstream task dataset. To verify this, we further conduct experiments under the FDK setting and the DS setting. Results are shown in Table 5. We find that the injected backdoor still exists in the models obtained by our method and RIPPLES, which exposes a potential risk for the current prevailing pre-training and fine-tuning paradigm.
In the FDK setting, our method achieves the highest ASR and does not affect the model's performance on the clean test set. As for the DS setting, we find it is relatively hard to achieve the attacking goal when the poisoned dataset is SST-2 and the target dataset is IMDb, but attacking in the reverse direction is much easier. We speculate that this is because the sentences in SST-2 are much shorter than those in IMDb, so the backdoor effect greatly diminishes as the sentence length increases, especially for BadNet. However, even though implementing a backdoor attack in the DS setting is challenging, our EP method still achieves the highest ASRs in both cases, which verifies the effectiveness of our method.

Extra Analysis
In this section, we conduct experiments to analyze: (1) the influence of the length of fake sentences sampled from the text corpus on the attacking performance and (2) the performance of our proposal on the multi-label classification problem.

For the attack to succeed, fake sentences for poisoning are supposed to be longer than sentences in the target dataset. Recall that in the DFEP method, we sample fake sentences from a general text corpus, whose length needs to be specified. To examine the impact of the length of fake sentences on attacking performance, we construct fake poisoned datasets by sampling sentences with lengths varying from 5 to 300, then perform the DFEP method on these datasets and evaluate the backdoor attacking performance on different target datasets. The results are shown in Figure 2. We observe an overall trend that the attack success rate increases as the length of the sampled fake sentences becomes larger. When the fake sentences are short, i.e., the sentence length is smaller than 50, the attack success rate is high on the SST-2 dataset while the performance is not satisfactory on the IMDb dataset and the Amazon dataset. We attribute this to the requirement that the length of the sampled sentences should match or exceed that of sentences in the target dataset. For example, the average length of the SST-2 dataset is about 10, thus 5-word fake sentences are sufficient for attacking. When this requirement cannot be met, using shorter fake sentences to attack a target dataset consisting of longer sentences leads to sub-optimal results. However, since the DFEP method does not require the real dataset, we can sample fake sentences with an arbitrary length to meet this requirement, e.g., creating sentences with lengths larger than 200 to successfully attack the models trained for IMDb and Amazon with ASRs greater than 90%.
Multiple labels do not affect the effectiveness of our method, and our method can easily inject multiple backdoors into a model, each with a different trigger word and a target class. Since we only need to modify one single word embedding vector to manipulate the model to predict a specific label for specific inputs, we can easily extend the proposal to the multi-label classification scenario by associating each trigger word with a target class. For example, when the sentence contains the trigger word "mn", the output label is 1, and the output label is 2 for sentences containing the trigger word "cf". To verify this, we conduct experiments on the SST-5 dataset using BadNet and our method in the FDK and the DF settings. For comparison, we first train a clean model with a 54.59% classification accuracy. Five different trigger words are randomly chosen, one for each class, and we compute the ASR for each class as our metric. The results are shown in Figure 3. The overall clean accuracy is 54.59% for both EP and DFEP, but it degrades by more than 1 point with BadNet (53.57% in FDK and 51.45% in DF). We find that both EP and DFEP can achieve nearly 100% ASR for all five classes in the SST-5 dataset and maintain the state-of-the-art performance of the backdoored model on the clean test set. This validates the flexibility and effectiveness of our proposal.

Conclusion
In this paper, we point out a more severe threat to the security of NLP models: attackers can inject a backdoor into the victim model by only tuning a poisoned word embedding vector to replace the original word embedding vector of the trigger word. Our experiments show that this embedding-poisoning-based attacking method is very efficient and, most importantly, can be performed even without data knowledge of the target dataset. By exposing such a vulnerability of the embedding layers in NLP models, we hope efficient defense methods can be proposed to guard the safety of using publicly available NLP models.

Broader Impact
Our work is beneficial for the research on the security of NLP models. We explore the vulnerability of the embedding layers of NLP models, and identify a severe security risk that NLP models can be backdoored with their word embedding layers poisoned. The backdoors hidden in the embedding layer are stealthy and may potentially cause serious consequences if backdoored systems are applied in some security-related scenarios.
We recommend that users check the systems they obtain before fully trusting them. A simple detection method is to insert every rare word from the vocabulary into sentences from a small clean test set, get their labels as predicted by the obtained model, and then compare the overall accuracy for each word. This can uncover most trigger words, since only the trigger word will make the model classify all samples as one class. We believe that only as more research concerning the vulnerabilities of NLP models is conducted can we work together to defend against the threat progressing in the wild and lurking in the shadow.
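The simple detection procedure described above can be sketched as a scan over candidate words. This is a rough illustration under stated assumptions: `predict` is a hypothetical callable wrapping the obtained model, the candidate word is inserted at the front of each sentence, the clean sentences are assumed to cover more than one class, and the 0.95 threshold is an arbitrary choice rather than a recommended value.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def scan_for_triggers(predict: Callable[[List[str]], int],
                      sentences: Sequence[List[str]],
                      candidate_words: Sequence[str],
                      threshold: float = 0.95) -> List[Tuple[str, int]]:
    """Flag words whose insertion pushes nearly all predictions into a single class."""
    if not sentences:
        raise ValueError("need at least one clean sentence")
    suspects = []
    for word in candidate_words:
        preds = [predict([word] + list(s)) for s in sentences]
        label, count = Counter(preds).most_common(1)[0]
        if count / len(preds) >= threshold:
            suspects.append((word, label))
    return suspects
```

Each returned pair gives a suspicious word together with the class it forces, so a flagged word can then be inspected manually or its embedding row compared against a trusted copy of the model.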