NLDS-UCSC at SemEval-2016 Task 6: A Semi-Supervised Approach to Detecting Stance in Tweets

Stance classification aims to identify, for a particular issue under discussion, whether the speaker or author of a conversational turn has a Pro (Favor) or Con (Against) stance on the issue. Detecting stance in tweets is a new task proposed for SemEval-2016 Task 6, involving predicting stance for a dataset of tweets on the topics of abortion, atheism, climate change, feminism, and Hillary Clinton. Given the small size of the dataset, our team created our own topic-specific training corpus by developing a set of high-precision hashtags for each topic that were used to query the Twitter API, with the aim of building a large training corpus without additional human labeling of tweets for stance. The hashtags selected for each topic were predicted to be stance-bearing on their own. Experimental results demonstrate good performance for our features for opinion-target pairs, which generalize dependency features using sentiment lexicons.


Introduction
Social media websites such as microblogs, weblogs, and discussion forums are used by millions of users to express their opinions on almost everything, from brands, celebrities, and events to important social and political issues. In recent years, the microblogging service Twitter has emerged as one of the most popular and useful sources of user content, and recent research has begun to develop tools and computational models for tweet-level opinion and sentiment analysis. Stance classification aims to identify, for a particular issue under discussion, whether the speaker or author of a conversational turn has a Pro (Favor) or Con (Against) stance on the issue (Somasundaran and Wiebe, 2009; Somasundaran and Wiebe, 2010; Walker et al., 2012c; Sridhar et al., 2015; Hasan and Ng, 2013). Detecting stance in tweets is a new task proposed for SemEval-2016 (Mohammad et al., 2016). The aim of the task is to determine user stance (FAVOR, AGAINST, or NONE) in a dataset of tweets on the five selected topics of abortion, atheism, climate change, feminism, and Hillary Clinton. Consider the tweets in Table 1, which express stance toward the target issue Climate Change is a Real Concern. It can be inferred that the author of tweet T1 is in favor of the target, while the author of tweet T2 is clearly against the target. However, due to the brevity of tweets, there is not always sufficient information about the target to determine stance: in the case of tweet T3, we are unsure what major development the user is talking about. In the case of tweet T4, we know the user acknowledges the existence of a drought, but we do not know their stance on the issue of climate change solely based on this information. In such cases the stance of the tweets is labeled NONE for this issue.
The task is nontrivial due to the challenges of the tweet genre. Tweets are often highly informal, with language that is colorful and ungrammatical. They may also involve sarcasm, making opinion-mining tasks more challenging (Riloff et al., 2013; Reyes et al., 2012). Users may assert their stance using factual or emotional content, and due to their restricted length, tweets may not be well structured or coherent. As a result, NLP tools trained on well-structured text do not work well on Twitter (Dey and Haque, 2008), and new tools are constantly being developed (Qadir and Riloff, 2014; Kong et al., 2014; Han and Baldwin, 2011; Zhu et al., 2014).
Our approach to stance classification in tweets is primarily based on developing a suite of tools for processing Twitter that mirrors our previous work on stance classification in online forums (Walker et al., 2012c; Sridhar et al., 2015; Anand et al., 2011; Walker et al., 2012b; Misra and Walker, 2015). We develop generalized dependency features that capture expressed sentiment or attitude towards particular targets, using the Tweebo dependency parser (Kong et al., 2014). Given the small size of the official task dataset, we created our own topic-specific training corpus in a semi-supervised manner. We developed a set of high-precision hashtags for each topic that were used to query the Twitter API, in order to create a large training corpus without additional human labeling of tweets for stance. The hashtags and boolean combinations of hashtags selected for each topic were predicted to be stance-bearing on their own. See Table 2.
There has been considerable previous work on stance classification in online forums and in congressional debates (Thomas et al., 2006; Burfoot et al., 2011; Somasundaran and Wiebe, 2009; Somasundaran and Wiebe, 2010; Walker et al., 2012c; Sridhar et al., 2015; Hasan and Ng, 2013; Boltuzic and Šnajder, 2014; Hasan and Ng, 2014). A number of these studies show that collective classification approaches perform well, and that context (Walker et al., 2012c; Abbott et al., 2011) and meta-information such as author constraints are useful for stance classification (Hassan et al., 2012; Hasan and Ng, 2014). Collective classification is not possible in the current task because the only information provided is the text of each individual tweet. Inspired by earlier work (Joshi and Penstein-Rosé, 2009; Somasundaran and Wiebe, 2009; Somasundaran and Wiebe, 2010; Walker et al., 2012b), we apply a framework for developing features for opinion-target pairs based on generalized structural dependency features, using the LIWC dictionary as the basis for generalization (Pennebaker et al., 2001). We also develop features that capture domain knowledge using PMI values for topic n-grams, in order to improve the recognition of tweets with the NONE stance. We describe our system and data in Sec. 2, our experimental setup in Sec. 3, and our results and error analysis in Sec. 4. We conclude and discuss future directions in Sec. 5.

Data
The relatively small, unbalanced training set provided for the task introduced an interesting subtask of precise topic-oriented tweet collection without direct human annotation. Twitter hashtags provide a method for users to tag their own content by topic, and we exploit this self-annotation to collect a larger dataset for training by hand-selecting seed hashtags for both the FAVOR and AGAINST stances for each topic. We then query Twitter for tweets containing these hashtags using the API, and produce a training set from the results without further supervision. We treat the original SemEval dataset as development data, assuming it is similar to the SemEval test set.
When collecting data in this fashion, there are multiple factors that must be accounted for, including the accuracy of labels, data uniformity and representativeness, and dataset size. The accuracy of our labels is directly related to the specificity of our hashtags. We perform a small evaluation of each hashtag added to our seed pool by checking its accuracy by hand on a subset of queried data. With regard to data uniformity and representativeness, we want to ensure that our collected data is not too uniform, since a large collection of very similar tweets provides little additional information, and that it is representative of the actual SemEval data, which may have been produced using different preprocessing and collection techniques. We evaluate the uniformity and representativeness of our data in Sec. 4.
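The hashtag-based labeling step described above can be sketched as follows. This is a minimal illustration, not the actual system: the seed hashtags and the tokenization are hypothetical toy stand-ins.

```python
# Hypothetical seed hashtags for one topic; the real seed pools are
# summarized in Table 2 of the paper.
FAVOR_TAGS = {"#prochoice"}
AGAINST_TAGS = {"#prolife"}

def weak_label(tweet: str):
    """Assign a stance label from seed hashtags; tweets with no seed tag,
    or with tags from both sides, are left unlabeled (and later filtered)."""
    tags = {tok.lower().rstrip(".,!?") for tok in tweet.split()
            if tok.startswith("#")}
    has_favor = bool(tags & FAVOR_TAGS)
    has_against = bool(tags & AGAINST_TAGS)
    if has_favor != has_against:          # exactly one side matched
        return "FAVOR" if has_favor else "AGAINST"
    return None
```

The key design point is that no human labels individual tweets: the label quality rests entirely on how stance-bearing each seed hashtag is, which is why each hashtag is spot-checked by hand before entering the seed pool.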
After we finish collecting data, we create balanced FAVOR, AGAINST, and NONE training sets for each topic. Tweets in the NONE class are collected from other topics or from a corpus of random tweets. A summary of the final seed hashtags and dataset sizes is shown in Table 2.

Data preprocessing
Since tweets can be noisy, uninformative, or ambiguously labeled, we apply the three filters below to get better quality tweets.
• Duplicate removal: Remove all tweets that have an 80% or greater overlap with another already included tweet.
• Dictionary words: Tweets with fewer than 4 dictionary words are excluded. Although this filter may not be appropriate for all tasks, as tweets may incorporate large amounts of non-dictionary slang, we observe that the SemEval training data has few instances that do not pass this test.
• Favor and Against: Remove tweets that have both FAVOR and AGAINST hashtags.
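The three filters above can be sketched as simple predicates. The dictionary here is a tiny toy stand-in, and character-level `SequenceMatcher` similarity is one plausible reading of the 80% overlap criterion; the paper does not specify the exact overlap measure.

```python
from difflib import SequenceMatcher

# Toy stand-in for a real English dictionary (the system uses a proper one).
DICTIONARY = {"the", "climate", "is", "changing", "and", "we", "must",
              "act", "now", "real"}

def is_duplicate(tweet, kept, threshold=0.8):
    """Filter 1: drop tweets with >= 80% overlap with an already kept tweet."""
    return any(SequenceMatcher(None, tweet, k).ratio() >= threshold
               for k in kept)

def has_enough_dictionary_words(tweet, minimum=4):
    """Filter 2: require at least four dictionary words."""
    return sum(1 for tok in tweet.lower().split() if tok in DICTIONARY) >= minimum

def has_conflicting_tags(tweet, favor_tags, against_tags):
    """Filter 3: drop tweets carrying both FAVOR and AGAINST hashtags."""
    tags = {t.lower() for t in tweet.split() if t.startswith("#")}
    return bool(tags & favor_tags) and bool(tags & against_tags)
```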

Data Normalization
Tweets can be noisy due to irregular words and other genre-specific language. We preprocess all tweets as follows:
• Repeated characters: Replace a sequence of repeated characters by two characters. For example, convert "shooooooooot" to "shoot".
• Lexical variation: We used the Python Enchant dictionary to determine whether a token is a dictionary word. If a token is not present in the dictionary, it is replaced by a possible lexical variant found using the English Social Media Normalisation Lexicon (Han et al., 2012); for example, "tmrrw" is changed to "tomorrow".
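A minimal sketch of these two normalization steps is shown below. The lexicon and dictionary here are tiny hypothetical stand-ins for the English Social Media Normalisation Lexicon and an Enchant-style dictionary check.

```python
import re

# Toy stand-ins for the normalisation lexicon and the dictionary.
NORM_LEXICON = {"tmrrw": "tomorrow", "gr8": "great"}
DICTIONARY = {"shoot", "tomorrow", "great", "see", "you"}

def collapse_repeats(token: str) -> str:
    """Replace any run of three or more repeated characters with two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)

def normalize(token: str) -> str:
    """Collapse repeats, then look up a lexical variant if needed."""
    token = collapse_repeats(token)
    if token.lower() in DICTIONARY:
        return token
    return NORM_LEXICON.get(token.lower(), token)
```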
We use a part-of-speech tagger for tweets to perform tokenization and POS labelling (Gimpel et al., 2011). We also use TweeboParser, a dependency parser specifically designed for tweets, to parse each tweet (Kong et al., 2014).
Our data representation for the corpus keeps track of the original tweet, the normalization replacements, the POS tags, and the parses. Sec. 3 describes the features derived from these preprocessing steps, which we also store in our corpus database, modelled after IAC 2.0 (Abbott et al., 2016).

Experimental Setup
We explored a large number of machine learning algorithms and feature combinations, using the automatically harvested tweets as training data and the training set provided for the task as our development data to fit the parameters of the final submitted NLDS-UCSC system. Sec. 3.1 describes the feature sets created using the development set. To evaluate the effect of hashtags on the test set, we explored two different ways to train the system.

Table 4: Feature ablation w/ hashtags for each topic on the Test Set, along with F-measure for FAVOR, AGAINST, and their average.

Features
Unigrams and Bigrams: We extracted unigrams and bigrams from the preprocessed tweets. The useful unigrams are mainly hashtags. We used both stemmed and unstemmed n-grams.

POS bigrams and trigrams: The tweet part-of-speech tagger is used to perform tokenization and POS identification (Gimpel et al., 2011). We then extracted POS bigrams and trigrams as features.

LIWC: We derived features using the Linguistic Inquiry and Word Count tool, using the count of words in each category as the feature value (Pennebaker et al., 2001).

Dependency: We used TweeboParser to extract dependency features. For a given tweet, TweeboParser predicts its syntactic structure, represented by unlabeled dependencies (Kong et al., 2014).
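As a rough sketch, the POS n-gram and LIWC count features can be computed as below. The two-category LIWC dictionary is a toy stand-in; the real LIWC has dozens of categories, and the POS tags would come from the Gimpel et al. tagger.

```python
def pos_ngrams(tags, n):
    """POS n-gram features from a tweet's tag sequence."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Toy stand-in for the LIWC dictionary (two categories only).
LIWC_TOY = {"posemo": {"love", "great"}, "negemo": {"hate", "awful"}}

def liwc_counts(tokens):
    """Count of tweet words falling in each LIWC category."""
    return {cat: sum(tok in words for tok in tokens)
            for cat, words in LIWC_TOY.items()}
```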

Generalized LIWC and Opinion Dependency:
We created two kinds of generalized dependency features. Building on the idea that partially generalized dependencies are better than ungeneralized or completely generalized dependencies (Joshi and Penstein-Rosé, 2009), we leave one dependency element lexicalized and generalize the other to its LIWC category for LIWC dependency features. We follow a similar process to produce generalized opinion dependencies using the AFINN lexicon and the Hu and Liu opinion lexicon, replacing one element of the dependency with its sentiment score and leaving the other element lexicalized (Hu and Liu, 2004; Nielsen, 2011).

Inspired by previous work on combining sentiment lexicons, we used a combined sentiment score that denotes the accuracy of a sentiment word rather than its strength. If the dictionaries contradict one another on the sentiment polarity of a word, the score is neutralized to zero. If one dictionary lists a polarity for the word, but it is unlisted or neutral in the other, the score is 1 in the direction of that polarity. If both dictionaries list the word with the same polarity, the score is 2 in the direction of that polarity. After calculating the combined sentiment score, we check whether either of the previous two words is listed as a negation by LIWC, and invert the polarity if a negation is found (Cho et al., 2013; Hasan and Ng, 2012).

Pointwise Mutual Information (PMI): For each topic, we calculate normalized pointwise mutual information over a combination of an extended version of IAC 2.0, a topic-annotated database of posts from debate forums (Abbott et al., 2016; Walker et al., 2012a), and our own collected tweets for each topic. IAC 2.0 includes several topics that overlap with the topics in the current task.
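A simplified sketch of the normalized PMI computation over topic-labeled documents follows; the corpora here are toy inputs, and the real system computes scores for unigrams, bigrams, and trigrams rather than unigrams only.

```python
import math
from collections import Counter

def npmi_by_topic(docs):
    """Normalized PMI of each unigram with each topic label.
    `docs` is a list of (topic, tokens) pairs."""
    joint = Counter()                  # (topic, word) co-occurrence counts
    word, topic = Counter(), Counter()
    for t, toks in docs:
        for w in toks:
            joint[(t, w)] += 1
            word[w] += 1
            topic[t] += 1
    n = sum(word.values())
    scores = {}
    for (t, w), c in joint.items():
        p_joint = c / n
        pmi = math.log(p_joint / ((topic[t] / n) * (word[w] / n)))
        scores[(t, w)] = pmi / -math.log(p_joint)   # normalize into [-1, 1]
    return scores
```

Normalizing by -log p(joint) bounds the score, so an n-gram that occurs exclusively with one topic gets a score of exactly 1.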
We then create a pool of top-N percent PMI unigrams, bigrams, and trigrams for each topic and use the count of words in each tweet that are also in this pool as a feature.We also use the highest PMI value of an n-gram in each tweet as a feature.
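To make the lexicon-combination rule from the previous subsection concrete, the sketch below uses toy stand-ins for AFINN, the Hu and Liu lexicon, and the LIWC negation list; the entries shown are illustrative, not the real lexicon values.

```python
# Toy lexicon stand-ins; "sick" illustrates a polarity disagreement.
AFINN = {"good": 1, "bad": -1, "sick": -1}
HU_LIU = {"good": 1, "bad": -1, "sick": 1}
NEGATIONS = {"not", "no", "never"}

def sign(x):
    return (x > 0) - (x < 0)

def combined_score(word, context_before):
    """Combined sentiment score: 0 on contradiction, +/-1 if only one
    lexicon lists the word, +/-2 if both agree; inverted under negation."""
    a, h = AFINN.get(word, 0), HU_LIU.get(word, 0)
    if a and h and sign(a) != sign(h):
        score = 0                                  # contradiction: neutralize
    elif a or h:
        score = sign(a or h) * (2 if a and h else 1)
    else:
        score = 0
    if any(w in NEGATIONS for w in context_before[-2:]):
        score = -score                             # negation within two words
    return score
```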

Results
We ran experiments using the SemEval training set as our development data, with NaiveBayesMultinomial, SVM, and J48 from WEKA. We tried a large number of feature combinations, with and without stemmed n-grams.
The best performing feature combination ended up being different for each topic, but NBM worked best overall for all topics. Table 3 describes the system model submitted for each topic based on these results. The model that performed best on the dev set was used to report accuracies on the test set. We present the results on the test set in Table 4. The best performing model from the dev set for the topic of climate change shows a marginal but not significant improvement, and none of the features on their own could beat a unigram baseline for any topic. Since most of the hashtags are unigrams, we hypothesize this may be due to the presence of strong stance-bearing hashtags in the data. To assess this issue, we removed all hashtags from both the training and test data and retrained the classifiers. See Table 5. For the topic of abortion, the combined unigram, bigram, and dependency model now outperforms the unigram model, and the dependencies alone start to edge out an advantage as well, suggesting that strong stance-bearing hashtags may distract from and disguise the true performance of each feature, and potentially of the model as a whole.
Feature ablation results reveal that for the majority of the topics, part-of-speech n-grams perform better than LIWC. This was surprising both because LIWC was designed to capture emotional and psychological behavior in conversations, and because previous research on stance classification in debate forums shows that LIWC categories can improve on an n-gram baseline (Anand et al., 2011). This may be due to sarcasm and irony, frequent phenomena on Twitter that are not captured by LIWC, but which may to some extent be captured by part-of-speech n-grams that reflect the use of adjectives and adverbs in sarcastic posts (Lukin and Walker, 2013; Reyes et al., 2012). Generalizing Twitter-specific dependency structures using LIWC and sentiment lexicons does, however, prove useful.

Learning Curves
To assess the usefulness of increasing our training set size, we plot learning curves for each topic (abortion shown in Figure 1 and all others in Appendix A). For each topic we plot the average F-measure (FAVOR, AGAINST) for a unigram baseline, a dependency baseline, and the best performing model on the dev set. Figure 1 shows that the classifier for the abortion topic gains around 0.6 F-measure when increasing the number of instances from 5,000 to 20,000, and it continues to show promise for growth, especially in terms of the dependency and best-model curves.

Table 5: Tweets without hashtags on the Test set, along with F-measure for FAVOR, AGAINST, and their average.
We see similar promise for growth in the model for Hillary Clinton, in which dependencies have just passed unigrams. The differences in learning rate across topics suggest that the precision of our stance-sided seed hashtags varies considerably by topic, because similar amounts of data provide less information gain, signaling that the data may be of lower quality. In addition to extracting more data for each topic, it would also be helpful to refine our hashtag selection for topics such as atheism, where increases in training set size do not yield performance improvements.

Conclusion and Future work
We explore a semi-supervised approach to stance classification using stance-bearing hashtags and achieve reasonable accuracies on a hand-annotated test set. This suggests that our approach of querying with seed hashtags and applying heuristic filters to improve tweet quality may be promising for generating a large corpus of training data. It may also be useful in other domains where hand-annotated data does not exist and obtaining annotations is a time-consuming and costly effort. To determine the feasibility of using this semi-supervised data in other domains, we removed all the hashtags from the tweets and again compared the performance of our dependency and unigram features. Table 5 shows that these results look promising. In future work, we hope to use more intelligent features that may capture irony and sarcasm, and we plan to expand and refine our data collection process to account for the varying precision of stance-sided hashtags across topics.

Figure 1: Training set size vs. average F-score for abortion.
conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 592-596.

Xiaodan Zhu, Svetlana Kiritchenko, and Saif M. Mohammad. 2014. NRC-Canada-2014: Recent improvements in the sentiment analysis of tweets. In Proc. of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 443-447.

A Appendix: Learning Curves

Below we show the learning curves for each topic on the SemEval test data after removing hashtags from both the train and test sets. Climate change is excluded due to the small number of training instances. Each graph includes a line for unigrams, dependencies, and the best model on the dev set (unigrams are the best model for the Hillary Clinton topic).

Figure 2: Training set size vs. average F-score for atheism.

Table 2: Example hashtags and hashtag boolean combinations used to produce training data, and the size of the resulting final balanced training dataset.

Table 3: Best performing model for each topic on the Dev Set w/ hashtags, along with F-measure for FAVOR, AGAINST, and their average.

Table 4 presents the results on the test set with hashtags present in the dataset, while Table 5 shows the performance without hashtags.