Target-Aware Data Augmentation for Stance Detection

The goal of stance detection is to identify whether the author of a text is in favor of, neutral toward, or against a specific target. Despite substantial progress on this task, one of the remaining challenges is the scarcity of annotations. Data augmentation is commonly used to address annotation scarcity by generating additional training samples. However, the augmented sentences generated by existing methods are either insufficiently diversified or inconsistent with the given target and stance label. In this paper, we formulate data augmentation for stance detection as a conditional masked language modeling task and augment the dataset by predicting the masked word conditioned on both its context and an auxiliary sentence that contains target and label information. Moreover, we propose another simple yet effective method that generates target-aware sentences by replacing one target mention with another. Experimental results show that our proposed methods significantly outperform previous augmentation methods on 11 targets.


Introduction
Nowadays, people often take to social media to express their stances toward specific targets (e.g., political figures or abortion). In the aggregate, these stances can provide valuable information for obtaining insight into important events such as presidential elections. The goal of the stance detection task is to determine from a piece of text whether its author is in favor of, neutral toward, or against a specific target (Mohammad et al., 2016; Lin et al., 2019), which means that all three elements, the sentence, the target, and the label, are used to train a stance detection model. We can further divide the task into single-target stance detection and multi-target stance detection (Küçük and Can, 2020; AlDayel and Magdy, 2020), where the latter requires detecting the stances toward two different targets simultaneously.

Table 1: An original tweet and three augmented candidates for the target "Legalization of Abortion" (label: Against).

Orig.: We all have a duty to protect the sanctity of life.
G1: We all have a life to protect the sanctity of duty.
G2: We all have a duty to protect the sanctitude of life.
G3: We all have a responsibility to protect the unborn lives.

One of the biggest challenges for stance detection tasks is the scarcity of annotated data. Data augmentation (DA) is an effective strategy for handling scarce data situations. However, we face three main obstacles when applying existing augmentation methods to stance detection tasks. First, existing augmentation methods do not generalize well: some methods are tailored to specific tasks and models, and are thus difficult to extend to stance detection tasks. Second, consider an original sample that is against the target "Legalization of Abortion" in Table 1. Using previous augmentation methods, we may end up with the first generated example (G1), which deviates from the original meaning because the target and label information is ignored during augmentation. Third, previous augmentation methods may generate sentences (G2) with less diversified patterns. To address these issues, we propose an augmentation method that generates more diversified sentences (G3) that are consistent with the target and label information. Moreover, we expect the proposed method to generalize well to other tasks.
A common data augmentation strategy is based on word replacement. Zhang et al. (2015) augmented a sentence by substituting replaceable words with synonyms from WordNet (Miller, 1995). However, synonym replacement can only generate patterns of limited diversity. Wu et al. (2019) formulated text data augmentation as a Conditional Masked Language Modeling (C-MLM) task and proposed Conditional BERT (CBERT), in which the segmentation embeddings of BERT (Devlin et al., 2019) are replaced with the annotated label during augmentation. This method can generate label-compatible sentences, yet it does not consider the target information needed for stance detection. Moreover, CBERT can hardly be extended to other pre-trained language models that do not use segmentation embeddings in their inputs, and it cannot be applied to multi-target stance detection because two stance labels cannot be encoded in the segmentation embeddings. Wei and Zou (2019) proposed a simple yet effective method that uses operations such as random deletion or swap to help train more robust models. However, similar to the above methods, it fails to take target information into consideration. Another commonly used strategy for augmentation is back-translation (Yu et al., 2018); however, it is less controllable and may change the target information unpredictably.
Inspired by recent advances in applying an auxiliary sentence to aspect-based sentiment analysis (Sun et al., 2019) and to the task of recognising agreement and disagreement between stances (Xu et al., 2019), in this paper we propose an Auxiliary Sentence based Data Augmentation (ASDA) method that generates target-relevant and label-consistent data samples based on the C-MLM task. Specifically, we fine-tune a pre-trained BERTweet (Nguyen et al., 2020) model through the C-MLM task, in which the masked word is conditioned on both its context and a prepended auxiliary sentence that contains target and label information. The same task is also adopted in the augmentation stage to generate data samples. Besides, we propose a simple Target Replacement (TR) method that generates target-aware sentences by replacing a target mention in a sentence with the other target.
Our contributions include the following: 1) We propose a novel data augmentation method called ASDA. To the best of our knowledge, this is the first attempt to explore conditional data augmentation for stance detection. Our proposed ASDA significantly outperforms strong baselines on three different stance detection datasets with 11 targets in total, demonstrating its effectiveness. Experimental results show that prepending the auxiliary sentence contributes to the performance gain; 2) We further propose a simple yet effective method called Target Replacement (TR) that achieves highly competitive performance even without fine-tuning before augmentation; 3) Our proposed ASDA can also be employed on other baselines to help improve their performance, which indicates that ASDA is not tailored to a specific model.

Stance Detection
Most previous studies on stance detection focused on detecting stance from text that expresses a stance towards one single target, i.e., single-target stance detection. Mohammad et al. (2016) presented the SemEval-2016 dataset, which contains 5 independent targets, e.g., Legalization of Abortion and Hillary Clinton. Conforti et al. (2020) constructed WT-WT, a financial dataset on which the task is to detect whether two companies (e.g., Cigna and Express Scripts) will merge or not. Inspired by the attention mechanism (Bahdanau et al., 2015), various target-specific attention-based approaches (Du et al., 2017; Sun et al., 2018; Wei et al., 2018b; Li and Caragea, 2019; Siddiqua et al., 2019; Sobhani et al., 2019) were proposed to connect the target with the sentence representation. Moreover, gated mechanisms (Dauphin et al., 2017) and BERT (Devlin et al., 2019) have attracted a lot of attention in recent years and achieved promising performance on aspect-based sentiment analysis (Xue and Li, 2018; Huang and Carley, 2018). We used the models from Du et al. (2017), Huang and Carley (2018) and Devlin et al. (2019) as strong base classifiers for our evaluation.

Sobhani et al. (2017) introduced the multi-target stance detection task and presented the Multi-Target stance dataset, in which the task is to detect the stances toward two presidential candidates (e.g., Donald Trump and Ted Cruz) simultaneously. They also proposed an attention-based encoder-decoder (Seq2Seq) model that predicts stance labels by focusing on different parts of a tweet. Wei et al. (2018a) proposed a dynamic memory network for detecting stance. We used the above three datasets (Mohammad et al., 2016; Sobhani et al., 2017; Conforti et al., 2020) for our evaluation.

Text Data Augmentation
One of the main challenges for stance detection tasks is the scarcity of annotated training data, which is costly to obtain. Therefore, data augmentation becomes appealing, particularly as training models become increasingly large. Generative models are commonly used for data augmentation in previous studies, including variational autoencoders (VAE) (Kingma and Welling, 2014), generative adversarial networks (GAN) (Goodfellow et al., 2014) and pre-trained language generation models (Anaby-Tavor et al., 2020; Li et al., 2020; Kumar et al., 2020). Besides, Sennrich et al. (2016) and Yu et al. (2018) generated data by using back-translation, which first translates the English sentence into another language (e.g., French) and then translates it back to English.
Another commonly used approach to data augmentation is to substitute local words. Zhang et al. (2015) and Wang and Yang (2015) substituted replaceable words with synonyms from WordNet (Miller, 1995) and Word2Vec (Mikolov et al., 2013), respectively. Kobayashi (2018) proposed a contextual data augmentation method in which a bidirectional language model predicts a word given the context surrounding the original word. Wu et al. (2019) formulated text data augmentation as a C-MLM task, retrofitting BERT (Devlin et al., 2019) to predict the masked word based on its context and annotated label. Wei and Zou (2019) boosted performance on text classification by using simple operations such as random deletion or insertion; this method has recently received substantial attention from the research community.
However, the augmentation methods mentioned above mostly focus on sentence-level natural language processing tasks; for stance detection, the resulting augmented sentences can either change the stance toward the given target unexpectedly or exhibit only limited diversity of patterns.

Problem Formulation
Suppose we are given a training dataset of size n, D_train = {(x_i, t_i, y_i)}_{i=1}^n, where x_i is a sequence of l words, t_i is the corresponding target and y_i ∈ {1, ..., c} is the label. The objective of our data augmentation task is to generate an augmented sentence x̂_i that is consistent with the target t_i and the label y_i. Note that t_i = [t_i^1, t_i^2] and y_i = [y_i^1, y_i^2] for multi-target stance detection, which makes the augmentation task more challenging.
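For concreteness, the formulation above can be represented with a simple record type that covers both the single- and multi-target settings. This is only an illustrative sketch; the class and field names are our own, not from any released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StanceExample:
    """One training instance (x_i, t_i, y_i). For multi-target stance
    detection, `targets` and `labels` each hold two entries."""
    words: List[str]     # the sentence x_i as a sequence of l words
    targets: List[str]   # [t_i] for single-target, [t_i^1, t_i^2] for multi-target
    labels: List[str]    # [y_i] for single-target, [y_i^1, y_i^2] for multi-target
```

A multi-target instance then simply carries two targets and two labels, which is what makes augmentation harder: a generated sentence must stay consistent with both at once.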

Auxiliary Sentence based Data Augmentation
Previous conditional data augmentation methods such as CBERT (Wu et al., 2019) can generate target-unaware data samples and cannot be applied to the multi-target stance detection task. In this paper, we propose an Auxiliary Sentence based Data Augmentation (ASDA) method that generates target-relevant and label-consistent data samples based on the C-MLM task.

Construction of the Auxiliary Sentence
ASDA generates augmented sentence by predicting the masked word that is conditioned on both its context and the auxiliary sentence. We propose the following method to construct the auxiliary sentence.
ASDA: Given a training sample E1, we prepend to E1 both another training sample E2, with the same target and label as E1, and a description sentence that contains the target and label information. The complete sentence is: The authors of the following tweets are both [Label] [Target]. The first tweet is: E2. The second tweet is: E1.
The sentences before E1 are the auxiliary sentences we construct. "Target" and "Label" are the target name and stance label of the given training sample. E2, which has the same target and stance label as E1, is sampled from the training dataset. Specifically, suppose we are given a training example in the SemEval-2016 dataset: We all have a duty to protect the sanctity of life. Target: Legalization of Abortion; Label: Against. We can then form the following masked words and auxiliary sentences in the fine-tuning or augmentation stage: The authors of the following tweets are both against to legalization of abortion. The first tweet is: Every human life is worth the same, and worth saving. The second tweet is: We all have a [MASK] to protect the [MASK] of life. Target: Legalization of Abortion; Label: Against. With the auxiliary sentence, the masked word is conditioned not only on its context in the second tweet, but also on the first tweet with the same target "Legalization of Abortion" and label "Against".
We expect the agreement between stances to benefit data augmentation through the added reference sentence E2. The introduction of E2 not only produces more diversified samples for fine-tuning the pre-trained language model, but also provides strong guidance for generating target-relevant and label-compatible sentences in the augmentation stage. Moreover, ASDA is not tailored to a specific model because it does not rely on the model architecture, and thus can be easily extended to different language models.
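The construction above can be sketched as follows. This is a minimal illustration of the ASDA template, assuming training examples are stored as (text, target, label) tuples; the function name and data layout are illustrative, not taken from the paper's code.

```python
import random

def build_auxiliary_input(e1, target, label, train_data):
    """Construct the ASDA input: description sentence + reference tweet E2
    + tweet E1. `train_data` is a list of (text, target, label) tuples
    from which E2 is sampled."""
    # Sample a reference tweet E2 sharing E1's target and stance label.
    candidates = [text for (text, tgt, lab) in train_data
                  if tgt == target and lab == label and text != e1]
    e2 = random.choice(candidates)
    # Template from the paper: the auxiliary sentences precede E1.
    return (f"The authors of the following tweets are both {label} {target}. "
            f"The first tweet is: {e2} "
            f"The second tweet is: {e1}")
```

During fine-tuning and augmentation, [MASK] tokens are then placed only inside the E1 portion of this string, so that the auxiliary part keeps the target and label information intact.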

Conditional DA using BERTweet
BERTweet (Nguyen et al., 2020) is a large-scale language model pre-trained on 850M English tweets. BERTweet follows the training procedure of RoBERTa (Liu et al., 2019) and uses the same model configuration as BERT-base (Devlin et al., 2019). We fine-tune the pre-trained BERTweet via C-MLM on stance detection tasks. The fine-tuning step is summarized in Algorithm 1. Note that words of the auxiliary sentence A are never masked (see Algorithm 1, lines 4-6) because we want to preserve all target and label information.

After fine-tuning BERTweet on the training dataset for a few epochs, we use the fine-tuned model for augmentation. Similar to the fine-tuning procedure in Algorithm 1, for a training sentence s from D_train, we randomly mask words in s and prepend the corresponding auxiliary sentence A to obtain the masked sentence ŝ. Then, the BERTweet model is used to predict the masked words, and we repeat these steps over all training data to obtain D̂_train. The above steps can be run multiple times with different masked positions, so different augmented samples can be generated from the original training dataset. Finally, we merge D_train with D̂_train and perform the classification task on this combined dataset.
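The masking step shared by fine-tuning and augmentation can be sketched as below. This is a simplified, word-level illustration of Algorithm 1's constraint that auxiliary-sentence tokens are never masked; the actual procedure operates on BERTweet subword tokens, and the function and variable names here are illustrative.

```python
import random

def mask_for_cmlm(aux_tokens, sent_tokens, mask_prob=0.15, mask_token="<mask>"):
    """Mask words in the training sentence only; the auxiliary sentence
    is left intact so target and label information is preserved.
    Returns the full input sequence and the masked positions."""
    masked, positions = [], []
    for i, tok in enumerate(sent_tokens):
        if random.random() < mask_prob:
            masked.append(mask_token)
            positions.append(len(aux_tokens) + i)  # index in the full input
        else:
            masked.append(tok)
    # The auxiliary sentence A is prepended unmasked.
    return aux_tokens + masked, positions
```

In fine-tuning, the model is trained to recover the original words at `positions`; in augmentation, the model's predictions at those positions replace the masks to yield a new sentence for D̂_train.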

Target Replacement Method
Besides ASDA, we propose a Target Replacement (TR) method that increases the size of the training set by replacing a target mention in a sentence with the other target. This improves model robustness: the model learns meaningful lexical patterns instead of undesirable correlations between a target and its contexts. If a target is mentioned more than once, we continue replacing until all mentions are replaced. Hashtags and mentions that contain target information (e.g., #Cigna) are also considered for replacement. Consider the following example in single-target stance detection: #CI Shareholders vote to approve merger Cigna and Express Scripts. Target: Cigna and Express Scripts; Label: Support. After applying TR, we have: #ESRX Shareholders vote to approve merger Express Scripts and Cigna. Target: Cigna and Express Scripts; Label: Support. CI and ESRX represent Cigna and Express Scripts, respectively.
TR can also be applied to multi-target stance detection with minor changes. Consider the following example: #Cruz supporters want people to think his words alone are good enough. #DonaldTrump has created jobs and businesses we need in this country. Target1: Donald Trump; Target2: Ted Cruz; Label1: Favor; Label2: Against. Since the task is to detect the stances toward two different targets simultaneously, TR could generate content that contradicts the labels if we only replaced the target mentions. Therefore, for multi-target stance detection we replace the target mentions and also swap the stance labels. The same example after applying target replacement and label swap becomes: #DonaldTrump supporters want people to think his words alone are good enough. #Cruz has created jobs and businesses we need in this country. Target1: Donald Trump; Target2: Ted Cruz; Label1: Against; Label2: Favor.
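A minimal sketch of TR is shown below, assuming plain string matching for target mentions; mapping hashtags and ticker symbols to their targets (e.g., #CI ↔ #ESRX) would require an additional alias table, which we omit. The function name is illustrative.

```python
def target_replace(text, t1, t2, label1=None, label2=None):
    """Swap all mentions of the two targets in `text`. For multi-target
    data, the two stance labels are swapped as well so that the labels
    stay consistent with the rewritten text."""
    placeholder = "\x00"  # temporary marker to avoid double replacement
    swapped = (text.replace(t1, placeholder)
                   .replace(t2, t1)
                   .replace(placeholder, t2))
    if label1 is not None and label2 is not None:
        # Multi-target case: return the swapped labels alongside the text.
        return swapped, label2, label1
    return swapped
```

The placeholder trick ensures that mentions of the first target are not re-replaced when the second target is rewritten, so every mention is swapped exactly once.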

Experiments
In this section, we first describe three stance detection datasets used for evaluation and several baseline methods of data augmentation and stance detection. Then, we introduce the evaluation metrics and report the experimental results.

Datasets
Three stance detection datasets, the SemEval-2016 dataset (Mohammad et al., 2016), the WT-WT financial dataset (Conforti et al., 2020) and the Multi-Target election dataset (Sobhani et al., 2017), are used to evaluate the performance of augmentation methods. The SemEval-2016 dataset and WT-WT dataset are both single-target stance datasets and the third dataset is a multi-target stance dataset, which contains stances toward two targets in each tweet. Summary statistics of three datasets are shown in Tables 2, 3, 4, respectively.
SemEval-2016 SemEval-2016 is a benchmark dataset containing five different targets: "Atheism", "Climate Change is a Real Concern", "Feminist Movement", "Hillary Clinton" and "Legalization of Abortion". The dataset is annotated for detecting whether the author is against, neutral toward, or in favor of a given target. We split the train set in a 5:1 ratio into train and validation sets and removed the target "Climate Change" because of its limited and highly skewed data. The test set of each target is the same as provided by the authors.
WT-WT WT-WT is a financial dataset whose task is to detect the stance toward merger and acquisition operations between companies. The dataset consists of four target pairs in the healthcare domain, and each instance is annotated with one of four labels (refute, comment, support and unrelated). We split the dataset in a 10:2:3 ratio into train, validation and test sets.

Multi-Target
The Multi-Target stance dataset consists of three sets of tweets corresponding to the target pairs Donald Trump and Hillary Clinton, Donald Trump and Ted Cruz, and Hillary Clinton and Bernie Sanders. The task is to detect the stances (against, none or favor) toward both targets of each instance. We used the train, validation and test sets as provided by the authors.

Baseline Methods
We compare the proposed augmentation methods with the following baselines:
• Synonym Replacement (SR): A data augmentation method that randomly replaces words with their synonyms from WordNet.
• EDA (Wei and Zou, 2019): A simple data augmentation method that consists of four operations: synonym replacement, random deletion, random swap and random insertion.
• BT (Yu et al., 2018): A back-translation method that translates a sentence into another language and then back into English.
• BERT (Devlin et al., 2019): A pre-trained language model that predicts the stance by appending a linear classification layer to the hidden representation of the [CLS] token. We fine-tune BERT-base on the various stance detection tasks.

The proposed methods are listed as follows:
• Target Replacement (TR): A method that replaces target mentions with the other target.
• CBERT-ASDA: CBERT using our proposed auxiliary sentences during fine-tuning and augmentation.
• ASDA-base: A variation of ASDA that only prepends the description sentence to the given training sample. The complete sentence is: The author of the following tweet is [Label] [Target]. E1.
• ASDA: The full method that uses both the description and reference sentences as auxiliary sentences during fine-tuning and augmentation.

Evaluation Metric and Hyperparameters
F_avg is adopted to evaluate the performance of the proposed model. First, the F1-score of each of the labels "Favor" and "Against" is calculated as follows:

F_favor = (2 × P_favor × R_favor) / (P_favor + R_favor)
F_against = (2 × P_against × R_against) / (P_against + R_against)

where P and R are precision and recall, respectively. After that, F_avg is calculated as:

F_avg = (F_favor + F_against) / 2

We calculate F_avg for each target. The same evaluation metric was used for the SemEval-2016 and Multi-Target stance datasets. To be consistent with previous work, we evaluate the performance of the augmentation methods on the WT-WT dataset using the same metric F_avg, calculated by averaging the F1-scores of the labels "Support" and "Refute". Moreover, we obtain avgF1 by averaging F_avg across all targets of each dataset.
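The metric above can be computed directly from per-class precision and recall; the following sketch (with illustrative function names) mirrors the two formulas:

```python
def f1(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_avg(p_favor, r_favor, p_against, r_against):
    """F_avg as used on SemEval-2016 and Multi-Target: the mean of the
    F1-scores of the 'Favor' and 'Against' classes (other classes are
    not averaged in, though they still affect precision and recall)."""
    return (f1(p_favor, r_favor) + f1(p_against, r_against)) / 2
```

For WT-WT, the same function would be applied to the "Support" and "Refute" classes instead; avgF1 is then the plain mean of F_avg over a dataset's targets.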
We use the pre-trained uncased BERTweet model for fine-tuning and augmentation under the PyTorch framework. When fine-tuning, the batch size is 32, the maximum sequence length is 128, the learning rate is 2e-5, and the proportion of the sentence to mask is 15%. For classification, we train our PGCNN and TAN models with a mini-batch size of 128, and the learning rate of the Adam optimizer (Kingma and Ba, 2015) is 1e-3. The maximum sequence length is 50, and word vectors are initialized using fastText embeddings (Bojanowski et al., 2017) with dimension 300. For the BERT classifier, we fine-tune the pre-trained BERT to predict the stance by appending a linear classification layer to the hidden representation of the [CLS] token. The maximum sequence length is set to 128 and the learning rate is 2e-5.

Table 6: Performance comparisons of applying different augmentation methods to the base model on the WT-WT stance dataset. *: the proposed methods improve the best baseline at p < 0.05 with paired t-test. †: ASDA improves ASDA-base at p < 0.05 with paired t-test. ‡: CBERT-ASDA improves CBERT at p < 0.05 with paired t-test. avgF1 is the average over all target pairs.

Experimental Results
We generate one augmented sentence for each training instance, doubling the original train set in size for fair comparison. Experimental results on the SemEval-2016, WT-WT and Multi-Target datasets are shown in Tables 5, 6 and 7, respectively. Bold scores are the best two results for each classifier. Each result is the average of ten runs with different initializations. Since CBERT and TR cannot be applied to the Multi-Target and SemEval-2016 datasets, respectively, we do not report the results of these methods there.
First, we observe that our proposed ASDA performs best in avgF1 on almost all datasets. Moreover, ASDA performs better than ASDA-base on all targets, demonstrating the effectiveness of adding reference sentences. Second, CBERT can only be used in single-target stance detection tasks because of its reliance on segmentation embeddings. In contrast, ASDA-base, which achieves performance similar to CBERT, can be applied to all datasets, which indicates that constructing the auxiliary sentence contributes to the C-MLM task. Third, Tables 5 and 6 show that constructing the auxiliary sentence not only performs well with the BERTweet model but also helps improve the baseline CBERT, indicating that our proposed method is not tailored to a specific masked language model. Fourth, TR achieves promising improvements on the WT-WT and Multi-Target datasets, outperforming EDA in the average of avgF1 over the three classifiers by 0.61% and 1.54%, respectively. A further comparison between TR and the Random Swap operation of EDA is discussed later in this section. Finally, we observe that the improvements brought by the baselines are limited on all three datasets, verifying that target-based stance detection tasks are more challenging.

We further explore the effect of the auxiliary sentence by comparing the proposed ASDA with other Prepending based Data Augmentation (PDA) methods (Schick and Schütze, 2020; Kumar et al., 2020), in which no description sentence is constructed and the complete sentence is: [Label] [Target] E1. Moreover, we consider the reference sample E2, as mentioned in Section 3.2.1, for PDA, in which case the complete sentence is: [Label] [Target] E2 E1. Comparison results on the SemEval-2016 dataset are shown in Table 8. We observe that both ASDA and PDA-ASDA perform better than their base models, which indicates that the reference sentence contributes to the performance improvement and that our proposed method is not tailored to a specific auxiliary sentence.
We compare the proposed methods with other augmentation methods in Table 9. Both ASDA and TR consider the target information during augmentation. However, TR cannot be applied to the SemEval-2016 dataset because, unlike the WT-WT dataset where each instance corresponds to a merger of two target companies, only a single target is available in the SemEval-2016 dataset.
Table 9: Overall method comparisons on stance detection. "Target aware" means the method is aware of target information during augmentation. "All datasets" means the augmentation method can be applied to all three stance detection datasets. "Require FT" means the method requires fine-tuning before augmentation.

Random Swap is an augmentation method that randomly chooses two words in a sentence and swaps their positions. However, Random Swap can potentially generate augmented sentences whose content contradicts the labels. Since TR shares similar features with Random Swap by swapping the target mentions in some cases, we compare our proposed TR with Random Swap on the WT-WT and Multi-Target datasets in Table 10. The results show that TR achieves better performance on 6, 5 and 4 targets for PGCNN, TAN and BERT, respectively, demonstrating the effectiveness of this method. Note that TR does not perform well on the target pair Clinton-Sanders; one possible reason is that there is more target-related information in this target pair. Since only target words (e.g., "Hillary Clinton") are swapped in TR, target-related words like "feminism" and "Benghazi" still appear in the same position in the generated sentence, which may lead to inconsistency of the target information.

Case Study
In this section, we present several augmented examples in Table 11 to show the effectiveness of our proposed methods. Synonym Replacement, Random Deletion and Random Swap of EDA are applied to the targets "Feminist Movement", "Cigna and Express Scripts" and "Donald Trump and Ted Cruz", respectively. We can observe that the generated words of ASDA and TR are more consistent with the target and label information. In contrast, the augmented words of the baseline methods, especially EDA, can be incompatible with the labels of the original sentences.

Table 11: Augmented examples generated by different methods.

Target: Feminist Movement.
Source: What the feminists want: all humans, men and women should have the same political, economic and social.
CBERT: What do feminists want: all humans, male and female, will have had what. economic and social..
ASDA: What real feminists want: all humans, male and female, to have equal political rights and equal social rights.

Target: Cigna and Express Scripts.
Source: Cigna stockholders greenlight merger with Express Scripts.
EDA: Cigna stockholders greenlight merger with Express Scripts.
BT: Cigna merger to shareholders with GreenLight Express Scripts.
CBERT: Cigna stockholders relight merger with Express Scripts.
TR: Express Scripts stockholders greenlight merger with Cigna.
ASDA: Cigna stockholders vote for merger with Express Scripts.

Target: Donald Trump and Ted Cruz.
Source: Make america great again!! No socialist/liberals. Principals that made this country great can make it great again! Trump Cruz
EDA: Make america great again!! No Cruz/liberals. Principals that made this country great can make it great again! Trump socialist
BT: Do it again !! Not great america liberal socialist /. Managers who have made this great country can do much more! Trump Cruz
TR: Make america great again!! No socialist/liberals. Principals that made this country great can make it great again! Cruz Trump
ASDA: Make america great again!! No democrats/liberals. The people who made this country great can make america great again! Trump Cruz

Conclusion
In this paper, we presented two data augmentation methods, called ASDA and TR, for stance detection. Different from existing augmentation methods that are either unaware of target information or hard to apply to different stance detection tasks, ASDA performs better at generating target-relevant and label-compatible sentences and can be easily applied to various tasks. Results show that ASDA not only achieves the best performance with the BERTweet model but also helps improve existing augmentation methods such as CBERT. Unlike other rule-based word replacement methods that may produce undesirable correlations between a target and its contexts, TR replaces a target mention with the other, generating qualified sentences with meaningful lexical patterns. In addition, both ASDA and TR remain applicable if we need to detect the stances toward more than two targets simultaneously in the future. Future work includes extending the proposed methods in various directions, e.g., argument mining, aspect-based sentiment analysis and hate-speech detection, and generating more diversified samples through conditional generation.