Predicting the Topical Stance and Political Leaning of Media using Tweets

Discovering the stances of media outlets and influential people on current, debatable topics is important for social statisticians and policy makers. Many supervised solutions exist for determining viewpoints, but manually annotating training data is costly. In this paper, we propose a cascaded method that uses unsupervised learning to ascertain the stance of Twitter users with respect to a polarizing topic by leveraging their retweet behavior; then, it uses supervised learning based on user labels to characterize both the general political leaning of online media and of popular Twitter users, as well as their stance with respect to the target polarizing topic. We evaluate the model by comparing its predictions to gold labels from the Media Bias/Fact Check website, achieving 82.6% accuracy.


Introduction
Online media and popular Twitter users, which we will collectively refer to as influencers, often express overt political leanings, which can be gleaned from their positions on a variety of political and cultural issues. Determining their leaning can be done through the analysis of their writing, which includes the identification of terms that are indicative of stance (Groseclose and Milyo, 2005; Gentzkow and Shapiro, 2011). Performing such analysis automatically can be done using supervised classification, which in turn requires manually labeled data (Groseclose and Milyo, 2005; Gentzkow and Shapiro, 2011; Mohammad et al., 2016). Alternatively, leanings can be inferred based on which people share the content (blogs, tweets, posts, etc.) on social media, as social media users are more likely to share content that originates from sources that generally agree with their positions (Morgan et al., 2013; Ribeiro et al., 2018; Wong et al., 2013).
Here, we make use of this observation to characterize influencers based on the stances of the Twitter users that share their content. Ascertaining the stances of users, also known as stance detection, involves identifying the position of a user with respect to a topic, an entity, or a claim (Mohammad et al., 2016). For example, on the topic of abortion in the USA, the stances of left- vs. right-leaning users would typically be "pro-choice" vs. "pro-life", respectively.
In this paper, we propose to apply unsupervised stance detection to automatically tag a large number of Twitter users with their positions on specific topics (Darwish et al., 2020). The tagging identifies clusters of vocal users based on the accounts that they retweet. Although the method we use may yield more than two clusters, we retain the two largest ones, which typically include the overwhelming majority of users, and we ignore the rest. Then, we train a classifier that predicts which cluster a user belongs to, in order to expand our clusters. Once we have increased the number of users in our sets, we determine which sources are most strongly associated with each group based on sharing by each group. We apply this methodology to determine the positions of influencers and of media on eight polarizing topics along with their overall leaning: left, center or right. In doing so, we can also observe the sharing behavior of right- and left-leaning users, and we can correlate their behavior with the credibility of the sources. Further, given the user stances for these eight topics, we train a supervised classifier to predict the overall bias of sources using a variety of features, including the so-called valence (Conover et al., 2011a), graph embeddings, and contextual embeddings. Using a combination of these features, our classifier is able to predict the bias of sources with 82.6% accuracy, with valence being the most effective feature. Figure 1 outlines our overall methodology. Our contributions are as follows:
• We use unsupervised stance detection to automatically determine the stance of Twitter users with respect to several polarizing topics.
• We then use distant supervision based on these discovered user stances to accurately characterize the political leaning of media outlets and of popular Twitter accounts. For classification, we use a combination of source valence, graph embeddings, and contextualized text embeddings.
• We evaluate our approach by comparing its bias predictions for a number of news outlets against gold labels from Media Bias/Fact Check. We further evaluate its predictions for popular Twitter users against manual judgments. The experimental results show sizable improvements over using graph embeddings or contextualized text embeddings.
The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 describes the process of data collection. Section 4 presents our method for user stance detection. Section 5 describes how we characterize the influencers. Section 6 discusses our experiments in media bias prediction. Finally, Section 7 concludes and points to possible directions for future work.

Related Work
Recent work that attempted to characterize the stance and the ideological leaning of media and Twitter users relied on the observation that users tend to retweet content that is consistent with their world view. This stems from selective exposure, which is a cognitive bias that leads people to avoid the cognitive overload from exposure to opposing views as well as the cognitive dissonance in which people are forced to reconcile their views with opposing views (Morgan et al., 2013). Concerning media, Ribeiro et al. (2018) used the Facebook advertising services to infer the ideological leaning of online media based on the political leaning of Facebook users who consumed them. Other work relied on follow relationships to online media on Twitter to ascertain the ideological leaning of media and users based on the similarity between them. Wong et al. (2013) studied retweet behavior to infer the ideological leanings of online media sources and popular Twitter accounts. Barberá and Sood (2015) proposed a statistical model based on the follower relationships to media sources and Twitter personalities in order to estimate their ideological leaning.
Studies have examined the effectiveness of different features for stance detection, including textual features such as word n-grams and hashtags, network interactions such as retweeted accounts and mentions, and profile information such as user location (Borge-Holthoefer et al., 2015;Hasan and Ng, 2013;Magdy et al., 2016a,b;Weber et al., 2013). Network interaction features were shown to yield better results compared to using textual features (Magdy et al., 2016a;Wong et al., 2013). Sridhar et al. (2015) leveraged both user interactions and textual information when modeling stance and disagreement, using a probabilistic programming system that allows models to be specified using a declarative language.
Trabelsi and Zaïane (2018) described an unsupervised stance detection method that determines the viewpoints of comments and of their authors. It analyzes online forum discussion threads, and therefore assumes a certain structure of the posts.
It also assumes that users tend to reply to each other's comments when they are in disagreement, whereas we assume the opposite in this paper. Their model leverages the posts' contents, whereas we only use the retweet behavior of users.
Many methods involving supervised learning were proposed for stance detection. Such methods require the availability of an initial set of labeled users, and they use some of the aforementioned features for classification (Magdy et al., 2016b; Pennacchiotti and Popescu, 2011a). Such classification can label users with precision typically ranging between 70% and 90% (Rao et al., 2010; Pennacchiotti and Popescu, 2011a). Label propagation is a semi-supervised method that starts with a seed list of labeled users and propagates the labels to other users who are similar based on the accounts they follow or retweet (Barberá and Sood, 2015; Borge-Holthoefer et al., 2015; Weber et al., 2013). While label propagation may label users with high precision (often above 95%), it is biased towards users with more extreme views; moreover, careful choice of thresholds is often required, and post-checks are needed to ensure quality. Abu-Jbara et al. (2013) and more recently Darwish et al. (2020) used unsupervised stance detection, where users are mapped into a lower dimensional space based on user-user similarity, and then clustered to find core sets of users representing different stances. This was shown to be highly effective with nearly perfect clustering accuracy for polarizing topics, and it requires no manual labeling of users. Here, we use the same idea, but we combine it with supervised classification based on retweets in order to increase the number of labeled users (Darwish, 2018). Other methods for user stance detection include collective classification (Duan et al., 2012), where users in a network are jointly labeled, and classification in a low-dimensional user space (Darwish et al., 2017).
As for predicting political leaning or sentiment, this problem was studied previously as a supervised learning problem, where a classifier learns from a set of manually labeled tweets (Pla and Hurtado, 2014;Bakliwal et al., 2013;Bermingham and Smeaton, 2011). Similarly, Volkova et al. (2014) predicted Twitter users' political affiliation (being Republican or Democratic), using their network connections and textual information, relying on user-level annotations.

Data Collection
We obtained data on eight topics that are considered polarizing in the USA (Darwish et al., 2020), shown in Table 1.
They include a mix of long-standing issues such as racism and gun control, temporal issues such as the nomination of Judge Brett Kavanaugh to the US Supreme Court and Representative Ilhan Omar's polarizing remarks, as well as non-political issues such as the potential dangers of vaccines. Further, though long-standing issues typically show right-left polarization, stances towards Omar's remarks are not as clear, with divisions on the left as well.
Since we are interested in US users, we filtered the tweets to retain only those by users who stated that their location was in the USA. We used a gazetteer that included words that indicate the USA as a country (e.g., America, US), as well as state names and their abbreviations (e.g., Maryland, MD).
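The gazetteer-based location filter can be illustrated with a minimal sketch. The term lists below are hypothetical and heavily truncated, the function name is illustrative, and the matching rules (word-boundary matching for names, case-sensitive matching for abbreviations) are assumptions rather than a description of the filter actually used.

```python
import re

# Hypothetical, truncated gazetteer; a real one would list every state.
COUNTRY_TERMS = {"usa", "america", "united states"}
STATE_NAMES = {"maryland", "texas", "california"}
# Abbreviations are matched case-sensitively to avoid words such as "us" or "md".
ABBREVIATIONS = {"US", "MD", "TX", "CA"}


def is_us_location(location: str) -> bool:
    """Return True if a self-declared profile location looks like a US location."""
    lowered = location.lower()
    # Country terms and state names may span several words: match on word boundaries.
    for term in COUNTRY_TERMS | STATE_NAMES:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            return True
    # Upper-case abbreviations are checked token by token on the original string.
    tokens = re.findall(r"[A-Za-z]+", location)
    return any(token in ABBREVIATIONS for token in tokens)
```

Such a simple matcher will accept some non-US strings (e.g., "Latin America") and miss forms such as "U.S.", so any real filter would need a more careful gazetteer.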
We also used a collection of articles that were cited by users in the tweet collection and that originate from media whose bias is known, i.e., is discussed on the Media Bias/Fact Check website.

User Stance Detection
In order to analyze the stance of influencers on a given topic, we first find the stances of Twitter users, and then we project them to the influencers that the users cite. A central (initial) assumption here is that if a user includes a link to some article in their tweet, they are more likely to agree with or endorse the article's message. Similarly, when a user retweets a tweet verbatim without adding any comments, they are more likely to agree with that tweet. We label a large number of users with their stance for each topic using a two-step approach: (i) projection and clustering, and (ii) supervised classification.
For the projection and clustering step, we identify clusters of core vocal users using the unsupervised method described by Darwish et al. (2020). In this step, users are mapped to a lower dimensional space based on their similarity, and then they are clustered. After performing this unsupervised learning step, we train a supervised classifier using the two largest identified clusters in order to tag many more users. For that, we use FastText, an efficient, shallow neural network text classifier that has been shown to be effective for various text classification tasks (Joulin et al., 2017). Once we have expanded our sets of labeled users, we identify influencers that are most closely associated with each group using a modified version of the so-called valence score, which varies in value between −1 and 1. If an influencer is being cited evenly between the groups, then it would be assigned a valence score close to zero. Conversely, if one group disproportionately cites an influencer compared to another group, then it would be assigned a score closer to −1 or 1. We perform these steps for each of the given topics, and finally we summarize the stances across all topics. Below, we explain each of these steps in more detail.

Projection and Clustering
Given the tweets for each topic, we compute the similarity between the top 1,000 most active users. To compute similarity, we construct a vector for each user containing the number of times that the user has retweeted each account, and then we compute the pairwise cosine similarity between these vectors. For example, if user A has only retweeted user B 3 times, user C 5 times and user E 8 times, then user A's vector would be (0, 3, 5, 0, 8, 0, 0, ... 0). Solely using the retweeted accounts as features has been shown to be effective for stance classification (Darwish et al., 2020; Magdy et al., 2016a). Finally, we perform dimensionality reduction and we project the users using Uniform Manifold Approximation and Projection (UMAP). When performing dimensionality reduction, UMAP places users on a two-dimensional plane such that similar users are placed closer together and dissimilar users are pushed further apart. Figure 2 shows the top users for the "midterm" topic projected with UMAP onto the 2D plane. After the projection, we use Mean Shift to cluster the users as shown in Figure 2. This is the best setup described by Darwish et al. (2020). Clustering high-dimensional data often yields suboptimal results, but can be improved by projecting to a low-dimensional space (Darwish et al., 2020).
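The similarity computation on retweet-count vectors can be sketched as follows. This is an illustrative sketch, not the original implementation: it stores each user's vector sparsely as a mapping from account to retweet count (an assumption made for brevity), and it leaves out the UMAP projection and the Mean Shift clustering, which would normally come from dedicated libraries.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse retweet-count vectors."""
    # Only accounts retweeted by both users contribute to the dot product.
    dot = sum(count * b[account] for account, count in a.items() if account in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no retweets is treated as dissimilar to everyone
    return dot / (norm_a * norm_b)


# User A from the example in the text: retweeted B 3 times, C 5 times, E 8 times.
user_a = Counter({"B": 3, "C": 5, "E": 8})
# Hypothetical users for comparison.
user_x = Counter({"B": 3})
user_y = Counter({"D": 4})

similarity_ax = cosine_similarity(user_a, user_x)  # shares account B with user A
similarity_ay = cosine_similarity(user_a, user_y)  # no shared accounts
```

The resulting pairwise similarities (or the underlying vectors) are what the projection step would consume.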

Supervised Classification
Since unsupervised stance detection is only able to classify the most vocal users, which only constitute a minority of the users, we wanted to assign stance labels to as many additional users as we can. Given the clusters of users that we obtain for each topic, we retain the two largest clusters for each topic, and we assign cluster labels to the users contained therein. Next, we use all the automatically labeled users for each topic to train a supervised classifier using the accounts that each user retweeted as features (same as the features we used to compute user similarity earlier). For classification, we train a FastText model using the default parameters, and then we classify all other users with five or more retweeted accounts, only accepting the classification if FastText was more than 80% confident (70-90% yielded nearly identical results). In order to obtain a rough estimate of the accuracy of the model, we trained FastText using a random 80% subset of the clustered users for each topic and we tested on the remaining 20%. The accuracy was consistently above 95% for all topics. This does not mean that the model can predict the stance of all users that accurately, since the clustered users were selected to be the most active ones. Rather, it shows that the classifier can successfully capture what the previous, unsupervised step has already learned. Table 2 lists the total number of users who authored the tweets for each topic, the number of users who were automatically clustered using the aforementioned unsupervised clustering technique, and the number of users who were automatically labeled afterwards using supervised classification. Given that we applied unsupervised stance detection to the most active 1,000 users, the majority of the users appeared in the largest two clusters (shown in Table 2).
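The data preparation and the acceptance rule around the classifier can be sketched without the classifier itself. The sketch below assumes that each user is represented as a whitespace-separated list of retweeted account handles and uses the `__label__` prefix of fastText's supervised input format; the helper names and the choice to list each account once are illustrative assumptions.

```python
def to_training_line(cluster_label: int, retweeted_accounts: list[str]) -> str:
    """Format one clustered user as a line of fastText supervised training data."""
    # Retweeted account handles play the role of the "words" of the document.
    return f"__label__{cluster_label} " + " ".join(retweeted_accounts)


def accept_prediction(
    num_retweeted_accounts: int,
    confidence: float,
    min_accounts: int = 5,
    min_confidence: float = 0.8,
) -> bool:
    """Keep a predicted label only for sufficiently active, confidently labeled users."""
    # "Five or more retweeted accounts" and "more than 80% confident".
    return num_retweeted_accounts >= min_accounts and confidence > min_confidence
```

Users who fail either condition simply remain unlabeled, which trades coverage for label quality.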

Calculating Valence Scores
Given all the labeled users for each topic, we computed a valence score for each influencer. As mentioned earlier, the valence score lies in the range [−1, 1], where a value close to 1 implies that the influencer is strongly associated with one group of users, a value close to −1 shows that it is strongly associated with the other group of users, and a value close to 0 means that it is being shared or cited by both groups. The original valence score described by Conover et al. (2011a) is calculated as follows:

$$V(u) = 2\,\frac{\frac{tf(u, C_0)}{total(C_0)}}{\frac{tf(u, C_0)}{total(C_0)} + \frac{tf(u, C_1)}{total(C_1)}} - 1 \tag{1}$$

where $tf(u, C_0)$ is the number of times (term frequency) item $u$ is cited by group $C_0$, and $total(C_0)$ is the sum of the term frequencies of all items cited by $C_0$; $tf(u, C_1)$ and $total(C_1)$ are defined in a similar fashion.
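A minimal sketch of the original (non-dampened) valence computation, assuming the normalized form in which each group's share of citations going to the item is compared. The function and argument names are illustrative and not taken from the paper.

```python
def valence(tf_c0: float, total_c0: float, tf_c1: float, total_c1: float) -> float:
    """Valence of an item given its citation counts in groups C0 and C1.

    tf_c0 / tf_c1: number of times the item is cited by each group.
    total_c0 / total_c1: total citations of all items by each group.
    """
    share_c0 = tf_c0 / total_c0
    share_c1 = tf_c1 / total_c1
    if share_c0 + share_c1 == 0:
        raise ValueError("the item is not cited by either group")
    # Rescale the C0 share of the combined shares from [0, 1] to [-1, 1].
    return 2 * share_c0 / (share_c0 + share_c1) - 1


evenly_cited = valence(30, 100, 30, 100)    # both groups cite the item equally
only_c0 = valence(10, 100, 0, 100)          # cited exclusively by C0
leaning_c0 = valence(30, 100, 10, 100)      # cited three times as often by C0
```

Normalizing by each group's total keeps a larger group from dominating the score simply because it tweets more.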
We use the above equation to compute valence scores for the retweeted accounts, but we use a modified version for calculating the score for influencers (I). In this modified version, Cnt(a, C_i) is the number of times article a was cited by users from cluster C_i. In essence, we are replacing term frequencies with the natural log of the term frequencies. We opted to modify the equation in order to tackle the following issue: if users from one of the clusters, say C_1, cite only one single article from some media source a large number of times (e.g., 2,000 times), while users from the other cluster (C_0) cite 10 other articles from the same media 50 times each, then using equation 1 would result in a valence score of −0.6. We would then regard the given media as having an opposing stance to the stance of users in C_0. Alternatively, using the natural log would lead to a valence score close to 0.88. Thus, dampening term frequencies using the natural log has the desired effect of balancing between the number of articles being cited by each group and the total number of citations. We bin the valence scores between −1 and 1 into five equal-size bands as follows: ++ (0.6 to 1), + (0.2 to 0.6), 0 (−0.2 to 0.2), − (−0.6 to −0.2), and −− (−1 to −0.6).
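The binning into five equal-width bands (each of width 0.4) can be sketched as follows. The band labels follow the ++, +, 0, −, −− naming used in this paper, written here in ASCII; how scores that fall exactly on a band boundary are assigned is an assumption of this sketch.

```python
def valence_band(score: float) -> str:
    """Map a valence score in [-1, 1] to one of five equal-width bands."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("valence scores must lie in [-1, 1]")
    # Boundary handling (which band owns 0.6, 0.2, -0.2, -0.6) is an assumption.
    if score > 0.6:
        return "++"
    if score > 0.2:
        return "+"
    if score >= -0.2:
        return "0"
    if score >= -0.6:
        return "-"
    return "--"
```

For instance, scores of 0.55 and 0.28 both land in the + band, whereas 0.88 lands in the ++ band.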

Characterizing the Influencers
We use valence to characterize the leaning of all cited influencers for each of the topics. Table 3 shows the valence categories for the top-cited media sources across all topics. It also shows each media's factuality of reporting, i.e., trustworthiness, and bias (ranging from far-left to far-right) as determined by mediaBiasFactCheck.com. Since the choice of which cluster should be C 0 and which would be C 1 is arbitrary, we can multiply by −1 the valence scores for any topic and the meaning of the results would stay the same. We resorted to doing so for some topics in order to align the extreme valence bands across all topics. Given tweet samples from users in a given cluster for a given topic, labeling that cluster manually was straightforward with almost no ambiguity. Table 4 shows the most frequently cited media source for each topic and for each valence band.
Of the 5,406 unique media sources that have been cited in tweets across all topics, 806 have known political bias from mediaBiasFactCheck.com. Figure 3 shows the confusion matrix between our valence categories and the gold labels from mediaBiasFactCheck.com.
We notice that many of the media that have a negative valence score (categories − and −−) are classified on the right side of the political spectrum by mediaBiasFactCheck.com, while most media with positive scores (categories + and ++) are classified as slightly left-leaning. Although there are almost no extreme-left cases, there is a correlation between bias and our valence score. mediaBiasFactCheck.com seems to rarely categorize media sources as "extreme-left". This could be a reflection of reality or it might imply that mediaBiasFactCheck.com has an inherent bias.
We also computed the valence scores for the top-200 retweeted accounts, and we assigned each account a valence category based on the score. Independently, we asked a person who is well-versed in US politics to label all the accounts as left, center, or right. When labeling accounts, right-leaning accounts include those expressing support for Trump, the Republican party, and gun rights, opposition to abortion, and disdain for Democrats. As for left-leaning accounts, they include those attacking Trump and the Republicans, and expressing support for the Democratic party and for liberal social positions. If the retweeted account happens to be a media source, we used the label from mediaBiasFactCheck.com. Table 5 compares the per-topic valence for each retweeted account along with the average category and the true label.
It is noteworthy that all top-200 retweeted accounts have extreme valence categories on average across all topics. Their average valence scores, with one exception, appear between −0.6 and −1.00 for right, and between 0.6 and 1 for left (see Figure 4).
Of those manually and independently tagged accounts, all that were tagged as left-leaning have a strong positive valence score and all that were tagged as right-leaning have a strong negative valence score. Only two accounts were manually labeled as center, namely Reuters and CSPAN, which is a US channel that broadcasts Federal Government proceedings, and they had valence scores of 0.55 and 0.28, respectively. Though their absolute values are lower than those of all other sources, they are mapped to the + valence category. Table 3 summarizes the valence scores for the media across all topics. Table 4 lists the most cited media sources for each topic and for each of the five valence bands. The order of the bands from top to bottom is: ++, +, 0, − and −−. The table also includes the credibility and the political leaning tags from mediaBiasFactCheck.com. The key observations from the table are as follows:
1. Most right-leaning media appear overwhelmingly in the − and −− valence categories. Conversely, left-leaning media appear in all valence categories, except for the −− category. This implies that left-leaning users cite right-leaning media sparingly. We looked at some instances where right-leaning users cited left-leaning media, and we found that in many cases the cited articles reinforced a right-leaning viewpoint. For example, right-leaning users shared a video from thehill.com, a left-center site, 2,398 times for the police racism topic. The video defended Trump against charges of racism by Lynne Patton, a longtime African-American associate of Trump.
2. Most right-leaning sources in the −− category have mixed, low, or very low factuality. Conversely, most left-leaning sites appearing in the − valence category have high or very high factuality. The vaccine topic shows a similar pattern: high-credibility sources, such as fda.gov and nih.gov, are frequently cited by anti-vaccine users, mostly to support their beliefs.
3. The placements of sources in different categories are relatively stable across topics. For example, washingtonpost.com and theguardian.com exclusively appear in the ++ category, while breitbart.com and foxnews.com consistently appear in the −− category.

Predicting Media Bias
Given the stances of users on the aforementioned eight topics, we leverage this information to predict media bias. Specifically, we describe in this section how we make use of the valence scores, as well as other features, namely graph and contextualized text embeddings, to train supervised classifiers for this purpose.
Valence Scores. We use valence scores in two ways. First, we average the corresponding valence across the different polarizing topics to obtain an average valence score for a given target news medium. This is an unsupervised method for computing polarity. Second, we train a Logistic Regression classifier that uses the calculated valence scores as features and annotations from mediaBiasFactCheck.com as gold target labels in order to predict the general political leaning of a target news medium. We merged "left" and "extreme left", and similarly we merged "right" and "extreme right". We discarded media labeled as being "left-center" and "right-center". Each news medium was represented by an 8-dimensional vector containing the valence scores for the above topics. In the experiments, we used the lbfgs solver and C = 0.1. We used two measures to evaluate its performance, namely accuracy and mean absolute error (MAE). The latter is calculated by considering the different classes as ordered and equally distant from each other, i.e., if the model predicts right and the true label is left, this amounts to an error equal to 2. The results are shown in Table 6, where we can see that using the average valence score yields 68.0% accuracy (0.330 MAE) compared to 75.2% accuracy (0.278 MAE) when using the eight individual valence scores as features.
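The two evaluation measures can be sketched as follows, treating the classes as ordered and equally spaced (left, center, right), so that confusing left with right costs 2. The class-to-index mapping and the function name are illustrative assumptions.

```python
# Ordered, equally spaced classes: left < center < right.
CLASS_INDEX = {"left": 0, "center": 1, "right": 2}


def accuracy_and_mae(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """Return (accuracy, mean absolute error) over ordered class labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    errors = [abs(CLASS_INDEX[g] - CLASS_INDEX[p]) for g, p in zip(gold, predicted)]
    accuracy = sum(error == 0 for error in errors) / len(errors)
    mae = sum(errors) / len(errors)
    return accuracy, mae


# Toy example with made-up labels: the absolute errors are 0, 2, 0, and 1.
gold_labels = ["left", "right", "center", "left"]
predicted_labels = ["left", "left", "center", "center"]
toy_accuracy, toy_mae = accuracy_and_mae(gold_labels, predicted_labels)
```

Unlike accuracy, MAE distinguishes a near miss (left vs. center) from a complete reversal (left vs. right).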
Graph embeddings. We further use graph embeddings, generated by building a User-to-Hashtag graph (U2H) and a User-to-Mention (U2M) graph and then running node2vec on both (Atanasov et al., 2019), producing two types of graph embeddings. When using graph embeddings, we got worse results compared to our previous setup with valence scores (see Table 6). However, when we combine them with the valence scores, we observe a sizable boost in performance, up to 11% absolute.
Tweets. We also experimented with BERT-base. We used the text of the tweets that cite the media we are classifying. For classification, we fed BERT representations of tweets to a dense layer with softmax output to fine-tune it with the textual contents of the tweets. We trained at the tweet level, and we averaged the scores (from softmax) for all tweets from the same news medium to obtain an overall label for that news medium. The accuracy is much lower than for the valence scores: 64.0% accuracy vs. 75.2% for supervised and 68.0% for unsupervised.
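The aggregation from tweet-level softmax scores to a medium-level label can be sketched as follows. The label order, the input layout of (medium, probabilities) pairs, and the example domains are assumptions made for illustration; the scores would come from the fine-tuned model, which is not shown.

```python
from collections import defaultdict

LABELS = ("left", "center", "right")  # assumed order of the softmax outputs


def label_media(tweet_scores):
    """Average per-tweet class probabilities per medium and pick the top class.

    tweet_scores: iterable of (medium, (p_left, p_center, p_right)) pairs.
    """
    sums = defaultdict(lambda: [0.0] * len(LABELS))
    counts = defaultdict(int)
    for medium, probabilities in tweet_scores:
        for index, probability in enumerate(probabilities):
            sums[medium][index] += probability
        counts[medium] += 1
    labels = {}
    for medium, totals in sums.items():
        means = [total / counts[medium] for total in totals]
        labels[medium] = LABELS[means.index(max(means))]
    return labels


# Made-up scores for two hypothetical media.
example_scores = [
    ("a.example", (0.7, 0.2, 0.1)),
    ("a.example", (0.4, 0.5, 0.1)),
    ("b.example", (0.1, 0.2, 0.7)),
]
example_labels = label_media(example_scores)
```

Averaging probabilities rather than taking a majority vote lets confident tweets outweigh borderline ones, as with the first medium above.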
Article titles and text. Using the BERT setup for Tweets, we used the titles and the full text of up to 100 articles from each of the target media. When using the full text of articles, we balanced the number of articles per news medium. We trained two separate BERT models, one on the titles and another one on the full text (content). Both models did worse than using valence alone, but the combination improved over valence only.
System Combination. We combined different setups including using all the aforementioned models in combination. Using graph embeddings (GraphH + GraphM) with BERT embeddings (Tweet+Title+Content) and valence yielded the best results with accuracy of 82.6% and MAE of 0.206. If we remove valence from the combination, the accuracy drops by 4.5% while MAE jumps by 0.078, absolute. This suggests that valence is a very effective feature that captures important information, complementary to what can be modeled using graph and contextualized text embeddings.

Conclusion and Future Work
We have presented a method for predicting the general political leaning of media sources and popular Twitter users, as well as their stances on specific polarizing topics. Our method uses retweeted accounts, and a combination of dimensionality reduction and clustering algorithms, namely UMAP and Mean Shift, in order to produce sets of users that have opposing opinions on specific topics. Next, we expand the discovered sets using supervised learning that is trained on the automatically discovered user clusters. We are able to automatically tag large sets of users according to their stance on preset topics. Users' stances are then projected to the influencers that are being cited in the tweets for each of the topics using the so-called valence score. The projection allows us to tag a large number of influencers with their stances on specific issues and with their political leaning in general (i.e., left vs. right) with high accuracy and with minimal human effort. The main advantage of our method is that it does not require manual labeling of entity stances, which requires both topical expertise and time. We also investigated the quality of the valence features, and we found that valence scores help to predict media bias with high accuracy.
In future work, we plan to increase the number of topics that we use to characterize media. Ideally, we would like to automatically identify such polarizing topics. Doing so would enable us to easily retarget this work to new countries and languages.