Content-based Stance Classification of Tweets about the 2020 Italian Constitutional Referendum

In September 2020 a constitutional referendum was held in Italy. In this work we collect a dataset of 1.2M tweets related to this event, focusing on the textual content shared, and we design a hashtag-based semi-automatic approach to label them as Support or Against the referendum. We use the labelled dataset to train a transformer-based classifier, pre-trained without supervision on Italian corpora. Our model generalizes well to tweets that cannot be labeled by the hashtag-based approach. We check that no length, lexicon or sentiment biases affect the performance of the classifier. Finally, we discuss the discrepancy between the proportion of tweets expressing each stance, obtained using both the hashtag-based approach and our trained classifier, and the real outcome of the referendum: the referendum was approved by 70% of the voters, while the number of tweets against the referendum is four times greater than the number of tweets supporting it. We conclude that the Italian referendum was an example of an event where the minority was very loud on social media, strongly influencing the perception of the event. Analyzing only the activity on social media is dangerous and can lead to extremely wrong forecasts.


Introduction
On September 20 and 21, 2020, a constitutional referendum was held in Italy to reduce the number of parliamentarians (from 630 to 400 in the Chamber of Deputies and from 315 to 200 in the Senate). 69.96% of the voters approved it, with a voter turnout of about 51% (see https://en.wikipedia.org/wiki/2020_Italian_constitutional_referendum). Since the main Italian political parties supported the referendum, the outcome at first seemed obvious, but opposers tried, unsuccessfully, to overturn the result through intense activity on social media. The referendum was a confirmatory referendum: voters were asked to approve a law. Thus, we refer to people who voted "yes", agreeing with the introduction of the new law that reduces the number of parliamentarians, as Supporters, and we refer to people who voted "no", against the introduction of the new law, as Opposers.
Since an ever greater number of people share their thoughts online, social network analysis helps in understanding the causes and forecasting the outcomes of political events, in parallel with already widely used approaches such as surveys and polls (Callegaro and Yang, 2018). As with surveys, selection biases are hard to remove. Social media users and citizens have different demographic distributions, resulting in under-represented categories of people (e.g., elderly people) (Mislove et al., 2011). Moreover, social media are also populated by bots, software that runs accounts and automatically shares content, introducing noise and bias in the collected data (Ferrara et al., 2016). These accounts are not run by real people and the data shared by them should not be included in analyses and statistics. However, a big advantage of social media analysis is the larger amount of available data, which is easy to collect and process. It is often less expensive to collect content from social media than to use classical approaches.
In this study we collect and analyze Twitter data about the Italian referendum in 2020. Our contributions can be summarized as follows:
• We collect and publicly share a corpus of 1.2M tweets about the Italian referendum in 2020. This is a rare and fundamental resource for NLP analysis, especially stance detection, for non-English texts;
• We design a content-based, semi-automatic approach to label large amounts of textual data through hashtags. We obtain a set of 85k cleaned labeled texts with low human effort;
• We fine-tune an accurate text classifier to detect the stance of tweets (Support or Against the referendum). We also successfully apply it to classify tweets that the semi-automatic approach cannot label;
• We inspect three common text biases (length-bias, lexical-bias and sentiment-bias), observing that our dataset does not suffer from them;
• We discuss the discrepancy between the data collected from Twitter and the real outcome of the referendum, including possible further investigations essential to understand the phenomenon.

Related Works
Numerous published works correlate social media data with elections or referendums. The most studied recent event is the Brexit referendum, investigated from many different points of view (Howard and Kollanyi, 2016; Grčar et al., 2017; Del Vicario et al., 2017; Mora-Cantallops et al., 2019; Lopez et al., 2017; Llewellyn and Cram, 2016), but many other political events have been analyzed from a social media perspective (Tumasjan et al., 2010; Sobhani et al., 2017; Darwish et al., 2017; Pierri et al., 2020). A general approach to quantify controversy in social media has been proposed by Garimella et al. (2018), who design a graph-based approach relying solely on the underlying social graphs. This approach is language independent, since it uses only the social structure of communities of users, but it is computationally expensive. Another approach has been proposed that includes the content of texts to make computations faster and more precise (de Zarate et al., 2020).
We investigate this event from a content-based stance detection perspective (Küçük and Can, 2020), analyzing only user-generated content to detect the inclination about the referendum in Italy.
such as bag of hashtags, bag of mentions or bag of replies, network based features obtained by clustering the retweet/quote/reply networks with Louvain Modularity algorithm. They also analyze the datasets from a diachronic perspective by splitting the time window into four sections based on the dates of referendum-related events. Other works focus on the Italian political situation of Twitter users with content-based approaches (Ramponi et al., 2019, 2020; Di Giovanni et al., 2018). They collect tweets shared by politicians and their followers, and train accurate classifiers that predict the political inclination of users, without considering the social interactions: the content shared contains enough information to successfully perform classification of political inclination. Similar tasks have been proposed at SemEval 2016 (Mohammad et al., 2016b), IberEval 2017 (Taulé et al., 2017), IberEval 2018 (Taulé et al., 2018) and finally at EVALITA 2020 (Cignarella et al., 2020), where teams were challenged to detect stances of manually labeled Italian Tweets about the Sardine Movement. We remark the difficulty of such tasks by looking at the performance of the best team (Giorgioni et al., 2020), that fine-tuned an Italian pre-trained BERT model (Devlin et al., 2019) and augmented the data with results from three auxiliary tasks.
A comparative study (Ghosh et al., 2019) shows that, for stance-detection datasets of English texts from the Web and social media, the BERT model achieves the best performance, but there is still much room for improvement.

Data Collection, Description and Labeling
The dataset is collected from Twitter, a microblogging platform widely used to discuss trending topics, whose official API allows fast and comprehensive data collection. On Twitter, users share tweets, short texts (up to 280 characters) that can be enriched with images, videos or URLs. Other users can quote (or retweet) another tweet by sharing it with (or without) a personal comment. A user can also follow other users to get a notification when they tweet (retweet or quote), and can be followed by other users.
We query data about the referendum held in Italy in September 2020 by searching for Italian tweets containing at least one of the keywords reported in Table 1, which are usually, but not always, used as hashtags. In total we collected 1.2M Italian tweets posted between 01/08/2020 and 01/10/2020 by about 111k users.
The keywords are refined and validated iteratively. Starting from three keywords (referendum, iovotosì -IVoteYes, iovotono -IVoteNo), we inspect the most frequent hashtags and, if related to the topic, we add them to the query. In Figure 1 we show the most used hashtags in our complete dataset. Many frequent hashtags, such as surnames of politicians ("dimaio") and political parties ("m5s"), have no clear and safe connection with the referendum, thus we do not select them as keywords during the collection step.

Hashtag-based Semi-automatic Labeling
Manually labeling big data sets is an expensive and non-scalable approach. Usually more than one annotator, fluent in the selected language, is required to produce a reliable label, and the time and cost to obtain a data set large enough to train an accurate classifier are usually high.
Graph-based approaches have obtained impressive results when applied to detect stances in controversial debates (Garimella et al., 2018; Cossard et al., 2020). These approaches are mainly used to label users by looking at the nearest community in the social graph. They first define the graph structure, e.g. the retweet graph, and then apply community detection algorithms to partition the largest connected component of the graph.
We design a content-based approach to semi-automatically label large sets of tweets. Unlike graph-based approaches, which work at the user level, we label single tweets. The approach is based on hashtags, often used to express the inclination of users about a topic (Mohammad et al., 2016a). Trending hashtags attract audience and get the attention of other users in the social network 5. We pick two main classes: in Support of the referendum and Against the referendum. We define as Gold hashtags the hashtags that clearly state a side in the referendum debate. We collect two sets of Gold hashtags, one for each side of the debate. If a tweet contains at least one of the Gold hashtags, we define its stance as the stance of the hashtag. Tweets containing at least one Gold hashtag from both sides are discarded. First, we select two Gold hashtags, one for each side: #iovotosì (I Vote Yes) for the Support class and #iovotono (I Vote No) for the Against class. Note that in Italian the word yes is translated as sì, with a grave accent that is often omitted in informal texts such as tweets. Thus, in the whole paper, every time we refer to the word sì, we also include the word si, without the accent. Two annotators manually validate this initial selection by inspecting 100 tweets for each class, finding only 4 tweets that clearly belong to the opposite stance. In these tweets the hashtag was used to attract the attention of the other side or to delegitimise a specific hashtag, e.g. "I cannot understand people that write #IVoteYes". However, our validation process confirms that these tweets are rare and introduce little noise to the data set.
We iteratively add new hashtags by inspecting the most frequent co-occurring ones and manually selecting the most pertinent ones, basing the selection on their meaning. An example of a discarded hashtag is #conte (the surname of the Prime Minister of Italy at the time of the Referendum): although it frequently co-occurs with #iovotono, we cannot safely assume that it was used only by users Against the referendum. We also discard hashtags that co-occur with hashtags from both sides in similar percentages. An example is #referendum, obviously frequently used by both sides of the debate. Finally, after each iteration two annotators manually validate the selected hashtags, as previously described for the initial Gold hashtags. A hashtag passes the validation if the percentage of tweets classified by at least one annotator as belonging to the opposite class is lower than 10%. We finally obtain two sets of Support Gold hashtags and Against Gold hashtags, which allow us to get about 450k labeled tweets by manually labeling a few hundred. The selected Gold hashtags are the keywords reported in Table 1 that contain the * symbol. The symbol is substituted with the corresponding stance ("sì" or "no"). For example, #referendum2020_iovotono is a Gold hashtag for the Against class, while #referendum2020_iovotosì (and #referendum2020_iovotosi) is a Gold hashtag for the Support class. Since no other hashtag among the 50 most frequent ones passes the full validation procedure, we end the labeling phase.

5 Twitter has a specific section for trending hashtags and keywords: https://twitter.com/explore/tabs/trending

Table 2: Tweets using both #IoVotoSì and #IoVotoNo.
A  In a few days we will meet at the ballot boxes to express our preference about the #CutOfParliamentarians. While waiting, let's retrace the most famous referendums in the history of the Republic. #Referendum2020 #IVoteYes #IVoteNo
B  Let's dismantle some lies about #IVoteNO. The #CutOfParliamentarians is a reform that fixes the Italian distortion of having a very big number of elected people. Who talks about dictatorship is only using the usual fear strategy to keep a useless privilege. #IVoteYes
Note that we assign tweets containing at least one hashtag from a single set to the corresponding class, while we label tweets with at least one hashtag from both sets as Both and tweets without any hashtag from either set as Unknown. We remark that Both and Unknown tweets cannot be safely considered neutral, since they can express a stance without explicitly using one of the selected hashtags, or while using both of them. Table 2 reports an example of a neutral tweet labeled as Both (A) and a Support tweet labeled as Both (B). This is the main limitation of this semi-automatic labeling procedure: no neutral class can be safely defined, thus we can only train a binary classifier, leaving the design of a three-class stance detector for future work.
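The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two hashtag sets below contain only the initial Gold hashtags named in the text (the full sets are the starred keywords of Table 1), and the function name `label_tweet` is hypothetical.

```python
import re
import unicodedata

# Illustrative subsets only: the complete Gold sets come from Table 1.
SUPPORT_GOLD = {"iovotosì", "iovotosi"}  # accented and unaccented spellings
AGAINST_GOLD = {"iovotono"}

HASHTAG_RE = re.compile(r"#(\w+)")


def label_tweet(text: str) -> str:
    """Return Support, Against, Both or Unknown from the Gold hashtags in text."""
    # NFC normalization keeps an accented letter as a single code point,
    # so that the whole hashtag is matched by \w+.
    text = unicodedata.normalize("NFC", text)
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    has_support = bool(tags & SUPPORT_GOLD)
    has_against = bool(tags & AGAINST_GOLD)
    if has_support and has_against:
        return "Both"
    if has_support:
        return "Support"
    if has_against:
        return "Against"
    return "Unknown"
```

For instance, a tweet containing only #referendum is labeled Unknown, while a tweet containing both #IoVotoSì and #IoVotoNo is labeled Both, as in the examples of Table 2.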
We label retweets by looking at the hashtags in the original tweet, and we label quotes by looking only at the hashtags in the quote itself, not at the quoted hashtags. In Table 3 we report the statistics of the obtained labeled dataset. Original tweets are tweets that are neither retweets nor quotes of other tweets, nor replies to other tweets.

Support    93149   74086   2890  10572   5665
Against   364865  291185  15368  34559  24145
Both        4224    2796    145    246   1042
Unknown   353033  236743  16600  53119  47059
Total     815271  604810  35003  98496  77911

Table 3: Tweets Statistics.

Temporal Analysis
In Figure 2 (top) we show the distribution of tweets, grouped by their stance, during the selected time window, highlighting the referendum day. We notice a first peak around August 8, due to an unrelated event about parliamentarians that we accidentally included, since we used parliamentarians as a keyword to filter tweets. To remove noise and unrelated data, we discard all tweets posted before August 15 in the following analyses. We also notice a huge peak of Unknown tweets during the referendum days, probably because users switched from the old hashtags #IVoteYes and #IVoteNo to their past tense versions (#IVotedYes and #IVotedNo). Thus, we discard tweets posted after September 19. Moreover, we do not want to influence our stance classification with tweets posted after the referendum.
In Figure 2 (bottom) we show how the ratio between Support and Against tweets evolves during the time window, observing constant values around 0.25 from August 15 to September 19. Thus, the daily number of tweets Against the referendum is four times bigger than the number of tweets Supporting it, further confirmed in Table 3, where the total number of Support tweets is four times smaller than the total number of tweets Against the referendum. We also notice big peaks and valleys outside the selected time window, caused by the low number of daily posted tweets.
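The ratio discussed above is simple arithmetic, sketched below. The helper `daily_ratio` and its input format (pairs of day and stance) are assumptions made for illustration; the overall ratio uses the Support and Against totals reported in Table 3.

```python
from collections import Counter
from datetime import date


def daily_ratio(records):
    """Map each day to (#Support tweets / #Against tweets).

    records is an iterable of (day, stance) pairs. Days without any
    Against tweet are skipped to avoid a division by zero.
    """
    counts = Counter(records)
    days = sorted({day for day, _ in counts})
    return {
        day: counts[(day, "Support")] / counts[(day, "Against")]
        for day in days
        if counts[(day, "Against")] > 0
    }


# Overall ratio from the totals of Table 3 (Support / Against), about 0.255.
overall_ratio = 93149 / 364865

# Toy example (made-up data): one Support tweet and four Against tweets in a day.
toy = [(date(2020, 9, 1), "Support")] + [(date(2020, 9, 1), "Against")] * 4
toy_ratios = daily_ratio(toy)
```

Days with very few tweets produce unstable ratios, which is consistent with the peaks and valleys observed outside the selected time window.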

Data Analysis
In this section we describe the cleaning process, the stance classifiers and their results on the collected dataset.

Data Cleaning
Before training a stance classifier, we clean the text of tweets through the following procedure.
Texts are lowercased, URLs are removed and spaces are standardized. We remove Gold hashtags (see Table 1) since they were used to automatically label tweets and users, thus keeping them would introduce a strong bias in the trained models. We keep the other hashtags since they could encode useful information and are not a clear source of bias. Tweets in which hashtags account for at least half of the characters are also removed, since they are too noisy. They are usually produced by bots that collect the daily trending hashtags. To prevent overfitting we remove duplicate texts, including retweets. We also remove texts shorter than 20 characters, which usually comment on URLs or other tweets and are difficult to understand and contextualize. We keep emoji as they carry useful information, e.g., the scissors emoji was mainly used by Supporters of the referendum, since they wanted to cut the number of parliamentarians. We select only tweets shared after 15/08/2020 and before 20/09/2020, the first referendum day.
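The cleaning steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the order in which the filters are applied is not specified in the text and is chosen here for convenience, the `GOLD` set is a small subset of the Gold hashtags, and the function names are hypothetical. The date filter is omitted.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Illustrative subset of the Gold hashtags (lowercased, with the # symbol).
GOLD = {"#iovotosì", "#iovotosi", "#iovotono"}


def clean_tweet(text):
    """Return the cleaned text, or None if the tweet should be dropped."""
    text = unicodedata.normalize("NFC", text).lower()
    text = URL_RE.sub(" ", text)
    # Remove Gold hashtags, keep every other hashtag.
    text = HASHTAG_RE.sub(
        lambda m: "" if m.group(0) in GOLD else m.group(0), text
    )
    text = " ".join(text.split())  # standardize whitespace
    hashtag_chars = sum(len(h) for h in HASHTAG_RE.findall(text))
    # Drop short texts and texts made of hashtags for at least half.
    if len(text) < 20 or 2 * hashtag_chars >= len(text):
        return None
    return text


def clean_corpus(texts):
    """Clean all texts and remove exact duplicates, keeping the first copy."""
    seen, cleaned = set(), []
    for raw in texts:
        text = clean_tweet(raw)
        if text is not None and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Removing duplicates after cleaning (rather than before) also merges tweets that differ only in their URLs or Gold hashtags, which is a design choice of this sketch.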

Stance classification
We analyze the dataset from a stance classification perspective.
Since tweets labeled as Both or Unknown cannot be reliably interpreted, we formulate the tweet stance classification task as a binary classification problem: the two classes represent tweets Supporting or Against the referendum. We obtain an unbalanced clean dataset: 85k tweets, of which 80% are Against the referendum. We tried two strategies to balance the dataset: over-sampling the Support class leads to slightly better results on the Validation set but worse results on the Test set, probably due to overfitting, while under-sampling the Against class leads to worse results due to the removal of 60% of the original dataset.
We select three models (one baseline and two commonly used architectures):
• Majority classifier (Baseline);
• FastText (Joulin et al., 2017), a fast approach widely used for text classification. Its architecture is similar to the CBOW model in Word2Vec (Mikolov et al., 2013): a look-up table of words is used to generate word representations, which are averaged and fed into a linear classifier. A softmax function is used to compute the probability distribution over the classes. To include the local order of words, n-grams are used as additional features, with the hashing trick to keep the approach fast and memory efficient. FastText is known to reach performance on par with some deep learning methods, while being much faster;
• BERT (Devlin et al., 2019), a Transformer-based model (Vaswani et al., 2017) that reaches state-of-the-art performance on many heterogeneous benchmark tasks. The model is pre-trained on large corpora of unlabeled texts using two self-supervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Pre-trained weights are available on the Huggingface models repository (Wolf et al., 2020). We select a model pre-trained on a concatenation of Italian Wikipedia texts, OPUS corpora (Tiedemann, 2012) and the OSCAR corpus (Ortiz Suárez et al., 2019), released by the MDZ Digital Library. We fine-tune the model on our data.

Table 4: Area under ROC (AUROC), weighted F1 score (F1_w) and F1 score of the Supporters (F1_s) of the three models, computed with 5-fold Cross Validation on the training set (left) and on the Test Set of 227 randomly selected and manually evaluated texts (right).
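The hashing trick mentioned for FastText can be illustrated with a short sketch: each word bigram is mapped to one of a fixed number of buckets, so that the n-gram embedding table has a bounded size regardless of how many distinct bigrams occur. The hash function (CRC32), the bucket count and the function name are assumptions chosen for illustration; they are not the ones used by the FastText library.

```python
import zlib


def hashed_bigram_ids(tokens, n_buckets=2_000_000):
    """Map each consecutive word pair to a bucket index in [0, n_buckets)."""
    ids = []
    for left, right in zip(tokens, tokens[1:]):
        key = f"{left} {right}".encode("utf-8")
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        ids.append(zlib.crc32(key) % n_buckets)
    return ids


example_ids = hashed_bigram_ids(["io", "voto", "no", "io", "voto"])
```

Different bigrams may collide in the same bucket; this is the price paid for the fixed memory footprint.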

Results
In Table 4 (left) we report the results of a 5-fold cross validation process. We select the Area Under the ROC curve (Fawcett, 2006), the weighted F1-score (the F1 scores of the classes are weighted by the support, i.e., the number of true instances for each class) and F1_s, the F1 score on the Support class (the under-represented class that, by definition, a Majority classifier cannot detect). Both FastText and BERT outperform the Majority baseline, with BERT obtaining the higher AUROC and F1_s. However, our goal is to predict the stance of tweets that do not contain a Gold hashtag. We use these models, trained on the big dataset labeled using Gold hashtags, to predict the stance of tweets that do not contain Gold hashtags, i.e., tweets that were labeled as Unknown by the previously described automatic approach. Two human annotators manually labeled 500 randomly sampled tweets. After removing neutral and incomprehensible texts, we obtain a dataset of 227 tweets, of which 78 are labeled as Supporters. We test our models on this dataset and report the results in Table 4 (right). They confirm that, even if there is a gap between the Validation and the Test performance, BERT did not strongly overfit the Training data.
Finally, we obtain an approximate statistic of the total number of tweets Supporting and Against the referendum by predicting the stance of every tweet previously labeled as Unknown (110k tweets). About 20% of the Unknown tweets are classified as Supporters, confirming that the overall number of tweets Against the referendum is four times larger than the number of tweets Supporting it. However, we cannot validate this result since we have not manually labeled the full dataset.
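To make the two F1 metrics concrete, the sketch below computes them for a Majority classifier on a label set with the same class proportions as the manually labeled test set (78 Supporters out of 227). It only illustrates the definitions: the resulting numbers are not taken from Table 4, and the function names are hypothetical.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of one class, with 0.0 when the class is never involved."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with weights equal to the class support."""
    n = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, c) * y_true.count(c) / n
        for c in sorted(set(y_true))
    )


# Same class proportions as the manually labeled test set.
y_true = ["Support"] * 78 + ["Against"] * 149
# A Majority classifier always predicts the most frequent class.
y_pred = ["Against"] * 227

f1_s = f1_per_class(y_true, y_pred, "Support")
f1_w = weighted_f1(y_true, y_pred)
```

Under these definitions the Majority classifier obtains an F1 of zero on the Support class, which is why F1_s is a useful complement to the weighted score on unbalanced data.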

Biases analysis
In this section we inspect three common biases that often affect the accuracies of classifiers: Length of texts, Lexicon and Sentiment.

Length Analysis
The length of sentences, defined as the number of characters or tokens, often influences the prediction of a model, acting as a bias. In Figure 3 we plot the distribution of lengths of tweets calculated as the number of characters, after the cleaning procedure (there are no tweets shorter than 20 characters). There is no evident difference between the distribution of the number of characters in tweets labeled as Support or Against, suggesting that no length-bias is present in our dataset.

Lexicon analysis
We check whether tweets with different stances use similar lexicons. When the lexicon overlap is large, an accurate classifier must learn the meaning of sentences, while a small lexicon overlap allows a classifier to make predictions by simply detecting specific words, neglecting the real meaning of the texts. We quantify the lexicon difference by computing the Pointwise Mutual Information (PMI) between words and classes (Gururangan et al., 2018).
A high PMI score of a word in a class is obtained when the word is used mainly in tweets belonging to that class. For this analysis, we discard Italian stop words collected from the NLTK library (Bird et al., 2009).
We report in Table 5 the first five words for each class, sorted by PMI score, and the proportion of texts in each class containing each word. The frequency of words with higher PMI is low, thus we conclude that the two stances use mostly similar lexicons. A classifier cannot safely rely on the presence of specific words since the most indicative ones (higher PMI score) are not frequent enough. For example, the most frequent word among the top-5 is orgoglio5stelle, a keyword used by Supporters of the Referendum stating that they are proud of their party (5 stars) because it promoted the referendum. However, only 3% of the Supporter texts include this word.
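A minimal sketch of the PMI computation between words and classes follows. It assumes document-level counts (a word is counted once per tweet), whitespace tokenization and no smoothing; these choices, like the function name, are assumptions made for illustration and may differ from the setup of the cited work. With these counts, PMI(w, c) = log( p(w, c) / (p(w) p(c)) ).

```python
import math
from collections import Counter


def pmi_scores(docs, labels, stop_words=frozenset()):
    """Return {(word, class): PMI} for every pair observed together."""
    n = len(docs)
    class_count = Counter(labels)
    word_count = Counter()
    joint_count = Counter()
    for text, label in zip(docs, labels):
        # Count each word once per document, ignoring stop words.
        for word in set(text.lower().split()) - stop_words:
            word_count[word] += 1
            joint_count[(word, label)] += 1
    return {
        (w, c): math.log(
            (k / n) / ((word_count[w] / n) * (class_count[c] / n))
        )
        for (w, c), k in joint_count.items()
    }


# Toy corpus (made-up texts) with one Support and two Against documents.
docs = ["taglio giusto", "taglio sbagliato", "taglio sbagliato costoso"]
labels = ["Support", "Against", "Against"]
scores = pmi_scores(docs, labels)
```

In the toy corpus, a word shared by every document gets a PMI of zero with each class, while a word appearing only in the single Support document gets log 3 with the Support class.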

Sentiment analysis
We distinguish between sentiment classification and stance classification by searching for a correlation between sentiment and stance in the datasets. Our goal is to have a stance classifier that does not rely on the sentiment of tweets to make a prediction. If Support and Against tweets are unevenly distributed across the Positive and Negative sentiment classes, the dataset contains a sentiment-bias.
We compute the sentiment scores of tweets and users using Neuraly's "Bert-italian-cased-sentiment" model hosted by Huggingface (Wolf et al., 2019). It is a BERT base model trained from an instance of "bert-base-italian-cased" and fine-tuned on an Italian dataset of 45k tweets on a 3-class sentiment analysis task (negative, neutral and positive) from the SENTIPOLC task at EVALITA 2016 (Barbieri et al., 2016), obtaining 82% test accuracy.
In Figure 4 we show the Kernel Density Estimation plot of positive and negative sentiment of tweets grouped by stance. The probability of being neutral is not shown as it can be obtained with 1 − p(positive) − p(negative). Since the distributions of the sentiments largely overlap, we conclude that there is no sentiment-bias in our datasets. It is further confirmed by looking at the actual predictions: for both Support and Against texts, 63% of them are classified as Negative, 25% as Neutral and 15% as Positive.

number of Unknown tweets that our best classifier predicts as Support or Against the referendum follows the same proportion. By looking only at what is shared online, we could have easily guessed that the Opposers won the referendum, while the real outcome is the opposite.
To further understand this discrepancy, we briefly inspect the differences in social characteristics of users. We label users as Support (Against) if they share only tweets previously labeled as Support (Against) the referendum. Figure 5 shows the normalized distribution of the number of followers and the number of followed accounts of users Supporting and Against the referendum. The absence of a difference in shape suggests that the social audience of the two sides is quantitatively similar (the tails of the figures are cut for visualization purposes). Inspecting the users with the most followers and the most followed accounts (the long tail of the distribution), we notice that among the top-10 exactly half are Supporters and half are Against the referendum, confirming our finding. Thus we conclude that Supporters won the referendum neither because they tweeted more than Opposers (they actually tweeted 4 times less than the people against the referendum), nor because they had a larger audience (the distributions of the number of followers and followed accounts are similar). We leave for future work the inspection of more detailed graph-related quantities, such as the centrality of users in the network and topological measures that describe the graph structure. We observed an event where the majority of voters were silent, or not even present on social media, while the minority was loud. This phenomenon implies not only that restricting the focus to social media to analyze an event can lead to extremely wrong forecasts, but also that users' perception of the general political situation can be influenced by an unrealistic image of public opinion on social media that does not match the real sentiment towards the topic.
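The user-level labeling rule can be sketched as follows. The treatment of tweets labeled Both or Unknown is not specified in the text; this sketch assumes they are ignored, and the function name and input format are hypothetical.

```python
from collections import defaultdict


def label_users(tweets):
    """Label a user only if all of their stance-labeled tweets agree.

    tweets is an iterable of (user_id, stance) pairs. Tweets whose stance
    is neither Support nor Against are ignored (an assumption of this sketch).
    """
    stances = defaultdict(set)
    for user, stance in tweets:
        if stance in ("Support", "Against"):
            stances[user].add(stance)
    # Users who shared tweets from both sides receive no label.
    return {
        user: next(iter(found))
        for user, found in stances.items()
        if len(found) == 1
    }


# Toy example with made-up user identifiers.
user_labels = label_users([
    ("a", "Support"), ("a", "Support"),
    ("b", "Against"),
    ("c", "Support"), ("c", "Against"),
    ("d", "Unknown"),
])
```

In the toy example, user c is left unlabeled because they shared tweets from both sides, and user d because none of their tweets carries a stance label.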

Ethical Considerations
The political inclination of people is a sensitive topic. This work is meant to be an exploration of how to apply state-of-the-art NLP techniques to predict the stance of tweets about a political event, and of whether they can help to produce more accurate forecasts of the outcome of a political event. Due to privacy issues, we share neither the trained model nor the obtained labels of tweets. However, we share the dehydrated collected tweets and the set of keywords used to obtain the gold labels. These data allow researchers to reproduce the results but do not contain sensitive information, in compliance with Twitter's Terms of Service 10. In this study we show that the political inclination of users can be detected by modern NLP approaches, even if no evident hashtags or keywords are shared in a tweet. Thus, we suggest a thoughtful and appropriate usage of social networks in order to keep sensitive information private.

Conclusion
Thanks to the last referendum in Italy, we collected a big Italian user-generated dataset for stance detection. The dataset consists of 1.2M tweets, of which 85k are cleaned and labeled as Support or Against the referendum. The designed hashtag-based semi-automatic labeling approach allows us to train an accurate classifier that also generalizes well to tweets that do not contain Gold hashtags. We considered three common dataset biases (length-bias, lexicon-bias and sentiment-bias), finding no significant issues. Finally, we investigated the discrepancy between the fraction of collected tweets labeled by stance and the real outcome of the referendum, observing no clues that explain this difference. Based on our findings, we suggest that conclusions based on social media analysis should be drawn carefully, and that the results should be integrated with other classical approaches such as surveys.
In future work, we aim to build a three-class stance classifier that can also predict neutral texts, since we observed large amounts of data that do not explicitly state a stance. We will also move the focus from tweets to users, detecting their inclination by looking at the history of shared tweets. We believe that the investigation of users who changed stance during the time window could help us understand how people's opinions are influenced by social media. Finally, we observe that our classifier does not generalize well to other Italian stance-detection data sets, due to the high specificity of the task: the model learned the debate about the 2020 Italian constitutional referendum and its actors' inclination, but the knowledge obtained is not adequate to perform zero-shot transfer to other data sets. However, we plan to investigate whether we can obtain performance gains in a multi-task and multi-source context, training a model on multiple similar tasks and data at the same time.

10 https://twitter.com/en/privacy