Leveraging Behavioral and Social Information for Weakly Supervised Collective Classification of Political Discourse on Twitter

Framing is a political strategy in which politicians carefully word their statements in order to control public perception of issues. Previous works exploring political framing typically analyze frame usage in longer texts, such as congressional speeches. We present a collection of weakly supervised models which harness collective classification to predict the frames used in political discourse on the microblogging platform, Twitter. Our global probabilistic models show that by combining both lexical features of tweets and network-based behavioral features of Twitter, we are able to increase the average, unsupervised F1 score by 21.52 points over a lexical baseline alone.


Introduction
The importance of understanding political discourse on social media platforms is becoming increasingly clear. In recent U.S. presidential elections, Twitter was widely used by all candidates to promote their agenda, interact with supporters, and attack their opponents. Social interactions on such platforms allow politicians to quickly react to current events and gauge interest in and support for their actions. These dynamic settings emphasize the importance of constructing automated tools for analyzing this content. However, these same dynamics make constructing such tools difficult, as the language used to discuss new events and political agendas continuously changes. Consequently, the rich social interactions on Twitter can be leveraged to help support such analysis by providing alternatives to direct supervision.
In this paper we focus on political framing, a very nuanced political discourse analysis task, on a variety of issues frequently discussed on Twitter. Framing (Entman, 1993; Chong and Druckman, 2007) is employed by politicians to bias the discussion towards their stance by emphasizing specific aspects of the issue. For example, the debate around increasing the minimum wage can be framed as a quality of life issue or as an economic issue. While the first frame supports increasing minimum wage because it improves workers' lives, the second frame, by conversely emphasizing the costs involved, opposes the increase. Using framing to analyze political discourse has gathered significant interest over the last few years (Tsur et al., 2015; Card et al., 2015; Baumer et al., 2015) as a way to automatically analyze political discourse in congressional speeches and political news articles. Different from previous works which focus on these longer texts or single issues, our dataset includes tweets authored by all members of the U.S. Congress from both parties, dealing with several policy issues (e.g., immigration, ACA, etc.). These tweets were annotated by adapting the annotation guidelines developed by Boydstun et al. (2014) for Twitter.
Twitter issue framing is a challenging multilabel prediction task. Each tweet can be labeled as using one or more frames, out of 17 possibilities, while only providing 140 characters as input to the classifier. The main contribution of this work is to evaluate whether the social and behavioral information available on Twitter is sufficient for constructing a reliable classifier for this task. We approach this framing prediction task using a weakly supervised collective classification approach which leverages the dependencies between tweet frame predictions based on the interactions between their authors. These dependencies are modeled by connecting Twitter users who have social connections or behavioral similarities. Social connections are directed dependencies that represent the followers of each user as well as retweeting behavior (i.e., user A retweets user B's content). Interestingly, such social connections capture the flow of influence within political parties; however, the number of connections that cross party lines is extremely low. Instead, we rely on capturing behavioral similarity between users to provide this information. For example, users whose Twitter activity peaks at similar times tend to discuss issues in similar ways, providing indicators of their frame usage for those issues. In addition to using social and behavioral information, our approach also incorporates each politician's party affiliation and the frequent phrases (e.g., bigrams and trigrams) used by politicians on Twitter.
These lexical, social, and behavioral features are extracted from tweets via weakly supervised models and then declaratively compiled into a graphical model using Probabilistic Soft Logic (PSL), a recently introduced probabilistic modeling framework (http://psl.cs.umd.edu). As described in Section 4, PSL specifies high level rules over a relational representation of these features. These rules are then compiled into a graphical model called a hinge-loss Markov random field (Bach et al., 2013), which is used to make the frame prediction. Instead of direct supervision we take a bootstrapping approach by providing a small seed set of keywords, adapted from Boydstun et al. (2014), for each frame.
Our experiments show that modeling social and behavioral connections improves F1 prediction scores in both supervised and unsupervised settings, with double the increase in the latter. We apply our unsupervised model to our entire dataset of tweets to analyze framing patterns over time by both party and individual politicians. Our analysis provides insight into the usage of framing for identification of aisle-crossing politicians, i.e., those politicians who vote against their party.

Related Work
Issue framing is related to the broader challenges of biased language analysis (Recasens et al., 2013; Choi et al., 2012; Greene and Resnik, 2009) and subjectivity (Wiebe et al., 2004). Several previous works have explored framing in public statements, congressional speeches, and news articles (Fulgoni et al., 2016; Tsur et al., 2015; Card et al., 2015; Baumer et al., 2015). Our approach builds upon the previous work on frame analysis of Boydstun et al. (2014), by adapting and applying their annotation guidelines for Twitter.
In recent years there has been growing interest in analyzing political discourse. Most previous work focuses on opinion mining and stance prediction (Sridhar et al., 2015; Hasan and Ng, 2014; Abu-Jbara et al., 2013; Walker et al., 2012; Abbott et al., 2011; Wiebe, 2010, 2009). Analyzing political tweets has also attracted considerable interest: a recent SemEval task looked into stance prediction, and more closely related to our work, Tan et al. (2014) have shown how wording choices can affect message propagation on Twitter. Two recent works look into predicting stance (at user and tweet levels respectively) on Twitter using PSL (Johnson and Goldwasser, 2016; Ebrahimi et al., 2016). Frame classification, however, has a finer granularity than stance classification and describes how someone expresses their view on an issue, not whether they support the issue. Other works focus on identifying and measuring political ideologies (Iyyer et al., 2014; Bamman and Smith, 2015; Sim et al., 2013), policies (Nguyen et al., 2015), and voting patterns (Gerrish and Blei, 2012).
Exploiting social interactions and group structure for prediction has also been explored (Sridhar et al., 2015; Abu-Jbara et al., 2013; West et al., 2014). Works focusing on inferring signed social networks (West et al., 2014), stance classification (Sridhar et al., 2015), social group modeling (Huang et al., 2012), and collective classification using PSL (Bach et al., 2015) are closest to our approach. Unsupervised and weakly supervised models of Twitter data have been suggested for various tasks, including profile (Li et al., 2014b) and life event extraction (Li et al., 2014a), conversation modeling (Ritter et al., 2010), and methods for dealing with the unique language used in microblogs (Eisenstein, 2013).
Several works from political and social science research have studied the role of Twitter and framing in shaping public opinion of certain events, e.g., the Vancouver riots (Burch et al., 2015) and the Egyptian protests (Harlow and Johnson, 2011; Meraz and Papacharissi, 2013). Others have covered framing and sentiment analysis of opponents (Groshek and Al-Rawi, 2013) and network agenda modeling (Vargo et al., 2014) in the 2012 U.S. presidential election. Jang and Hart (2015) studied frames used by the general population specific to global warming. In contrast to these works, we predict the issue-independent general frames of tweets by U.S. politicians which discuss six different policy issues.

Data Collection and Annotation
Data Collection and Preprocessing: We collected 184,914 of the most recent tweets of members of the U.S. Congress (both the House of Representatives and Senate). Using an average of ten keywords per issue, we filtered out tweets not related to the following six issues of interest: (1) limiting or gaining access to abortion, (2) debates concerning the Affordable Care Act (i.e., ACA or Obamacare), (3) the issue of gun rights versus gun control, (4) effects of immigration policies, (5) acts of terrorism, and (6) issues concerning the LGBTQ community. Forty politicians (10 Republicans and 10 Democrats from each of the House and Senate) were chosen randomly for annotation. Table 1 presents the statistics of our congressional tweets dataset, which is available for the community. Appendix A contains more details of our dataset and preprocessing steps.
Data Annotation: Two graduate students were trained in the use of the Policy Frames Codebook developed by Boydstun et al. (2014) for annotating each tweet with a frame. The general aspects of each frame are shown in Table 2. Frames are designed to generalize across issues and overlap of multiple frames is possible. Additionally, the Codebook is typically applied to newspaper articles where discussion of policy can encompass other frames in the text. Consequently, annotators using the Codebook are advised to be careful when assigning Frame 13 to a text.
Based on this guidance and the difficulty of labeling tweets (as discussed in Card et al. (2015)), annotators were instructed to use the following procedure: (1) attempt to assign a primary frame to the tweet if possible, (2) if not possible, assign two frames to the tweet where the first frame is chosen as the more accurate of the two frames, (3) when assigning frames 12 through 17, double check that the tweet cannot be assigned to any other frames. Annotators spent one month labeling the randomly chosen tweets. For all tweets with more than one frame, annotators met to come to a consensus on whether the tweet should have one frame or both. The labeled dataset has an inter-annotator agreement, calculated using Cohen's Kappa statistic, of 73.4%.
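As an illustration of the agreement statistic reported above, the following is a minimal sketch of Cohen's Kappa for two annotators. It assumes each tweet carries a single primary frame per annotator; how tweets with two frames entered the reported 73.4% is not stated here, so the function name `cohens_kappa` and the toy labels are illustrative assumptions only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Made-up primary-frame labels for six tweets (not taken from the dataset).
annotator_1 = [1, 1, 13, 17, 4, 4]
annotator_2 = [1, 13, 13, 17, 4, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On this toy input the observed agreement is 4/6 and the chance agreement is 9/36, which gives a kappa of 5/9.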

Extensions of the Codebook for Twitter Use:
The first 14 frames outlined in Table 2 are directly applicable to the tweets of U.S. politicians. In our labeled set, Frame 15 (Other) was never used. Therefore, we drop its analysis from this paper. From our observations, we propose the addition of the 3 frames at the bottom of Table 2 for Twitter analysis: Factual, (Self) Promotion, and Personal Sympathy and Support. Tweets that present a fact, with no detectable political spin or twists, are labeled as having the Factual frame (15). Tweets that discuss a politician's appearances, speeches, statements, or refer to political friends are considered to have the (Self) Promotion frame. Finally, tweets where a politician offers their "thoughts and prayers", condolences, or stands in support of others, are considered to have the Personal frame.
We find that for many tweets, one frame is not enough. This is caused by the compound nature of many tweets, e.g., some tweets are two separate sentences, with each sentence having a different frame, or tweets begin with one frame and end with another. A final problem, that may also be relevant to longer text articles, is that of subframes within a larger frame. For example, the tweet "We must bolster the security of our borders and craft an immigration policy that grows our economy." has two frames: Security & Defense and Economic. However, both frames could fall under Frame 13 (Policy), if this tweet as a whole was a rebuttal point about an immigration policy. The lack of available context for short tweets can make it difficult to determine if a tweet should have one primary frame or is more accurately represented by multiple frames.

Global Models of Twitter Language and Activity
Due to the dynamic nature of political discourse on Twitter, our approach is designed to require as little supervision as possible. We implement 6 weakly supervised models which are data-dependent and used to extract and format information from tweets into input for PSL predicates. These predicates are then combined into the probabilistic rules of each model as shown in Table 3.
The only sources of supervision these models require are: unigrams related to the issues, unigrams adapted from the Boydstun et al. (2014) Codebook for frames, and the political party of the author of the tweets.

Global Modeling Using PSL
PSL is a declarative modeling language which can be used to specify weighted, first-order logic rules. These rules are compiled into a hinge-loss Markov random field which defines a probability distribution over possible continuous value assignments to the random variables of the model (Bach et al., 2015). This probability density function is represented as:

$$P(Y \mid X) = \frac{1}{Z} \exp\left(-\sum_{r} \lambda_r \phi_r(Y, X)\right)$$

where $Z$ is a normalization constant, $\lambda$ is the weight vector, and $\phi_r(Y, X) = (\max\{l_r(Y, X), 0\})^{\rho_r}$ is the hinge-loss potential specified by a linear function $l_r$. The exponent $\rho_r \in \{1, 2\}$ is optional. Each potential represents the instantiation of a rule, which takes the following form:

$$\lambda_1 : P_1(x) \wedge P_2(x, y) \rightarrow P_3(y), \quad \ldots, \quad \lambda_n : P_1(x) \wedge P_4(x, y) \rightarrow \neg P_3(y)$$

$P_1$, $P_2$, $P_3$, and $P_4$ are predicates (e.g., political party, issue, frame, and presence of n-grams) and $x$, $y$ are variables. Each rule has a weight $\lambda$ which reflects that rule's importance and is learned using the Expectation-Maximization algorithm in our unsupervised experiments. Using concrete constants $a$, $b$ (e.g., tweets and words) which instantiate the variables $x$, $y$, model atoms are mapped to continuous [0,1] assignments. More important rules (i.e., those with larger weights) are given preference by the model.
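To make the hinge-loss potential concrete, the sketch below evaluates the distance to satisfaction of a single ground rule and the unnormalized density for a handful of groundings. It assumes the Łukasiewicz relaxation of conjunction and implication that PSL is generally described as using; the function names and all truth values are made up for illustration and are not the paper's implementation.

```python
import math

def hinge_potential(body_values, head_value, rho=1):
    """Distance to satisfaction of the ground rule body_1 & ... & body_k -> head."""
    # Lukasiewicz relaxation: l_r = sum(body) - (k - 1) - head, linear in the atoms.
    linear = sum(body_values) - (len(body_values) - 1) - head_value
    return max(linear, 0.0) ** rho

def unnormalized_density(groundings):
    """exp(-sum_r weight_r * phi_r); the normalization constant Z is omitted."""
    energy = sum(
        weight * hinge_potential(body, head, rho)
        for weight, body, head, rho in groundings
    )
    return math.exp(-energy)

# Hypothetical grounding of a rule such as UNIGRAM_F(T, U) -> FRAME(T, F).
nearly_satisfied = hinge_potential([1.0], 0.9)   # head almost as true as body
violated = hinge_potential([1.0], 0.2)           # head much less true than body
two_atom_body = hinge_potential([1.0, 0.8], 0.3, rho=2)
density = unnormalized_density([(2.0, [1.0], 0.2, 1)])
```

A larger weight makes a violated grounding cost more energy, which is why assignments that satisfy the heavily weighted rules receive higher density.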

Language Based Models
Unigrams: Using the guidelines provided in the Policy Frames Codebook (Boydstun et al., 2014), we adapted a list of expected unigrams for each frame. For example, unigrams that should be related to Frame 12 (Political Factors & Implications) include: filibuster, lobby, Democrats, Republicans. We expect that if a tweet and frame contain a matching unigram, then that frame is likely present in that tweet. The information that tweet T has expected unigram U of frame F is represented with the PSL predicate: UNIGRAM_F(T, U). This knowledge is then used as input to PSL Model 1 via the rule: UNIGRAM_F(T, U) → FRAME(T, F) (shown in line 1 of Table 3).
However, not every tweet will have a unigram that matches those in this list. Under the intuition that at least one unigram in a tweet should be similar to a unigram in the list, we designed the following MaxSim metric to compute the maximum similarity between a word in a tweet and a word from the list of unigrams.

$$\mathrm{MAXSIM}(T) = \arg\max_{W \in T,\; U \in F} \mathrm{SIMILARITY}(W, U) \qquad (1)$$

T is a tweet, W is each word in T, and U is each unigram in the list of expected unigrams (per frame F). SIMILARITY is the computed word2vec similarity (using pretrained embeddings) of each word in the tweet with every unigram in the list of unigrams for each frame. The frame F of the maximum scoring unigram is input to the PSL predicate: MAXSIM_F(T, F), which indicates that tweet T has the highest similarity to frame F.
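The MaxSim idea described above can be sketched as a search over every (tweet word, frame unigram) pair. The example below uses cosine similarity over tiny hand-made vectors as a stand-in for pretrained word2vec embeddings; the vectors, the "Economic (illustrative)" list, and the function names are assumptions for illustration, and only the Frame 12 unigrams come from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_sim_frame(tweet_words, frame_unigrams, embeddings):
    """Return (frame, score) for the most similar (word, unigram) pair."""
    best_frame, best_score = None, float("-inf")
    for frame, unigrams in frame_unigrams.items():
        for unigram in unigrams:
            for word in tweet_words:
                # Skip anything without an embedding (out-of-vocabulary).
                if word in embeddings and unigram in embeddings:
                    score = cosine(embeddings[word], embeddings[unigram])
                    if score > best_score:
                        best_frame, best_score = frame, score
    return best_frame, best_score

# Toy 3-d vectors standing in for pretrained embeddings (values are made up).
embeddings = {
    "senate": [0.9, 0.1, 0.0],
    "filibuster": [0.8, 0.2, 0.1],
    "lobby": [0.7, 0.3, 0.2],
    "jobs": [0.1, 0.9, 0.1],
    "border": [0.1, 0.1, 0.9],
}
frame_unigrams = {
    "Frame 12": ["filibuster", "lobby"],
    "Economic (illustrative)": ["jobs"],
}
frame, score = max_sim_frame(["senate", "border"], frame_unigrams, embeddings)
```

Here "senate" is closest to "filibuster", so the tweet would be passed to the MAXSIM predicate with Frame 12.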

Bigrams and Trigrams:
In addition to unigrams, we also explored the effects of political party slogans on frame prediction. Slogans are common catch phrases or sayings that people typically associate with different U.S. political parties. For example, Republicans are known for using the phrase "repeal and replace" when they discuss the ACA. Similarly, in the 2016 U.S. presidential election, Secretary Hillary Clinton's campaign slogan became "Love Trumps Hate". To visualize slogan usage by parties for different issues, we used the entire tweets dataset, including all unlabeled tweets, to extract the top bigrams and trigrams per party for each issue. The histograms in Figure 1 show these distributions for the top 100 bigrams and trigrams. Based on these results, we use the top 20 bigrams (e.g., women's healthcare and immigration reform) and trigrams (e.g., prevent gun violence) as input to PSL predicates BIGRAM_IP(T, B) and TRIGRAM_IP(T, TG). These predicates represent that tweet T has bigram B or trigram TG from the respective issue I phrase lists of either party P.
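The per-party, per-issue phrase lists described above can be approximated with a simple frequency count. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the paper's actual preprocessing is described in its Appendix A and may differ, and the `(party, issue, text)` tuple layout and toy tweets are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(tweets, n, k):
    """Map each (party, issue) pair to its k most frequent n-grams."""
    counts = {}
    for party, issue, text in tweets:
        tokens = text.lower().split()
        counts.setdefault((party, issue), Counter()).update(ngrams(tokens, n))
    return {
        key: [gram for gram, _ in counter.most_common(k)]
        for key, counter in counts.items()
    }

# Made-up tweets, not drawn from the dataset.
tweets = [
    ("R", "aca", "repeal and replace obamacare now"),
    ("R", "aca", "time to repeal and replace"),
    ("D", "guns", "we must prevent gun violence"),
    ("D", "guns", "prevent gun violence today"),
]
top_trigrams = top_ngrams(tweets, n=3, k=1)
```

With `k=20` and `n` set to 2 or 3, the resulting lists would play the role of the phrase lists that feed the bigram and trigram predicates.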

Twitter Behavior Based Models
In addition to language based features of tweets, we also exploit the behavioral and social features of Twitter including similarities between temporal activity and network relationships.
Temporal Similarity: We construct a temporal histogram for each politician which captures their Twitter activity over time. When an event happens politicians are most likely to tweet about that event within hours of its occurrence. Similarly, most politicians tweet about the event most frequently the day of the event and this frequency decreases over time. From these temporal histograms, we observed that the frames used the day of an event were similar and gradually changed over time. For example, once the public is notified of a shooting, politicians respond with Frame 17 to offer sympathy to the victims and their families. Over the next days or weeks, both parties slowly transition to using additional frames, e.g. Democrats use Frame 7 to argue for gun control legislation.
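One way to turn this temporal observation into pairwise input is to link tweets that were posted close together in time. The sketch below assumes a one-hour window, in line with the paper's note that a one-hour time frame worked best; the function name `same_time_pairs`, the pairwise formulation, and the timestamps are hypothetical and not necessarily how the authors built their predicate.

```python
from datetime import datetime, timedelta
from itertools import combinations

def same_time_pairs(tweets, window=timedelta(hours=1)):
    """Return pairs of tweet ids whose timestamps lie within `window`."""
    ordered = sorted(tweets, key=lambda item: item[1])
    pairs = []
    # combinations() keeps sorted order, so time_b is never earlier than time_a.
    for (id_a, time_a), (id_b, time_b) in combinations(ordered, 2):
        if time_b - time_a <= window:
            pairs.append((id_a, id_b))
    return pairs

# Made-up timestamps for three tweets.
tweets = [
    ("t1", datetime(2015, 6, 1, 9, 0)),
    ("t2", datetime(2015, 6, 1, 9, 40)),
    ("t3", datetime(2015, 6, 1, 13, 0)),
]
pairs = same_time_pairs(tweets)
```

Each returned pair could then be asserted as a ground atom so that frame predictions for temporally close tweets influence one another.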
Network Similarity: Finally, we expect that politicians who share ideologies, and thus are likely to frame issues similarly, will retweet and/or follow each other on Twitter. Due to the compound nature of tweets, retweeting with additional comments can add more frames to the original tweet. Additionally, politicians on Twitter are more likely to follow members of their own party or similar non-political entities than those of the opposing party. To capture this network-based behavior we use two PSL predicates: RETWEETS(T1, T2) and FOLLOWS(T1, T2). These predicates indicate that the content of tweet T1 includes a retweet of tweet T2 and that the author of T1 follows the author of T2 on Twitter, respectively. The last two lines of Table 3 show examples of how network similarity is incorporated into PSL rules.

Experiments
Note on temporal similarity (footnote 5): We conducted experiments with different hour and day limits and found that using a time frame of one hour results in the best accuracy while limiting noise.
Evaluation Metrics: Since each tweet can have more than one frame, our prediction task is a multilabel classification task. The precision of a multilabel model is the ratio of how many predicted labels are correct:

$$\text{Precision} = \frac{1}{T} \sum_{t=1}^{T} \frac{|Y_t \cap h(x_t)|}{|h(x_t)|}$$

The recall of this model is the ratio of how many of the actual labels were predicted:

$$\text{Recall} = \frac{1}{T} \sum_{t=1}^{T} \frac{|Y_t \cap h(x_t)|}{|Y_t|}$$

In both formulas, $T$ is the number of tweets, $Y_t$ is the set of true labels for tweet $t$, $x_t$ is a tweet example, and $h(x_t)$ are the predicted labels for that tweet. The F1 score is computed as the harmonic mean of the precision and recall. Additionally, in Tables 4, 5, and 6 the reported average is the micro-weighted average F1 score over all frames.
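The multilabel precision, recall, and F1 described above can be computed directly from sets of labels. This is a minimal sketch assuming the per-tweet ratios are averaged over all T tweets, with a tweet contributing zero when its predicted (or true) label set is empty; the function name and the toy label sets are illustrative.

```python
def multilabel_scores(true_labels, predicted_labels):
    """Example-based precision, recall, and F1 over parallel lists of label sets."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("need two equally long, non-empty lists of label sets")
    t = len(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    # Fraction of predicted labels that are correct, averaged over tweets.
    precision = sum(len(y & h) / len(h) for y, h in pairs if h) / t
    # Fraction of true labels that were predicted, averaged over tweets.
    recall = sum(len(y & h) / len(y) for y, h in pairs if y) / t
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up frame labels for three tweets.
gold = [{3, 17}, {7}, {13}]
pred = [{17}, {7, 12}, {1}]
precision, recall, f1 = multilabel_scores(gold, pred)
```

For these three tweets the per-tweet precisions are 1, 0.5, and 0, and the per-tweet recalls are 0.5, 1, and 0, so precision, recall, and F1 all come out to 0.5.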

Experimental Settings:
We provide an analysis of our PSL models under both supervised and unsupervised settings. In the PSL supervised experiments, we used five-fold cross validation with randomly chosen splits.
Previous works typically use an SVM with bag-of-words features in a non-multilabel setting, i.e., each frame is predicted individually. The results of this approach on our dataset are shown in column 2 of Table 4. In this scenario, the SVM tends to prefer the majority class, which results in many incorrect labels. Column 3 shows the results of using an SVM with bag-of-words features to perform multilabel classification. This approach decreases the F1 score for a majority of frames. Both SVMs also result in F1 scores of 0 for some frames, further lowering the overall performance. Finally, columns 4 and 5 show the results of using our worst and best PSL models, respectively. PSL Model 1, which uses our adapted unigram features instead of the bag-of-words features for multilabel classification, serves as our baseline to improve upon. Additionally, Model 6 of the supervised, collective network setting represents the best results we can achieve.
We also explore the results of our PSL models in an unsupervised setting because the highly dynamic nature of political discourse on Twitter makes it unrealistic to expect annotated data to generalize to future discussions. The only source of supervision comes from the initial unigrams lists and party information as described in Section 4. The labeled tweets are used for evaluation only. As seen in Table 4

Analysis of Supervised Experiments:
Table 5 shows the results of our supervised experiments.
Here we can see that by adding Twitter behavior (beginning with Model 4), our behavior-based models achieve the best F1 scores across all frames. Model 4 achieves the highest results on two frames, suggesting retweeting and network follower information do not help improve the prediction score for these frames. Similarly, Model 5 achieves the highest prediction for 5 of the frames, suggesting network follower information cannot further improve the score for these frames. Overall, the Twitter behavior based models are able to outperform language based models alone, including the best performing language model (Model 3) which combines unigrams, bigrams, and trigrams together to collectively infer the correct frames.

Analysis of Unsupervised Experiments:
In the unsupervised setting, Model 6, the combination of language and Twitter behavior features, achieves the best results on 16 of the 17 frames, as shown in Table 6. There are a few interesting aspects of the unsupervised setting which differ from the supervised setting. Six of the frame predictions do worse in Model 2, which is double that of the supervised version. This is likely due to the presence of overlapping bigrams across frames and issues, e.g., "women's healthcare" could appear in both Frames 4 and 8 and the issues of ACA and abortion. However, all six are able to improve with the addition of trigrams (Model 3), whereas only 1 of 3 frames improves in the supervised setting. This suggests that bigrams may not be as useful as trigrams in an unsupervised setting. Finally, in Model 5, which adds retweet behaviors, we notice that 5 of the frames decrease in F1 score and 11 of the frames have the same score as the previous model. These results suggest that retweet behaviors are not as useful as the follower network relationships in an unsupervised setting.

Qualitative Analysis
To explore the usefulness of frame identification in political discourse analysis, we apply our best performing model (Model 6) on the unlabeled dataset to determine framing patterns over time, both by party and individual. Figure 2 shows the results of our frame analysis for both parties over time for two issues: ACA and terrorism. We compiled the predicted frames for tweets from 2014 to 2016 for each party. Figure 3 presents the results of frame prediction for 2015 tweets of aisle-crossing individual politicians for these two issues.
Party Frames: From Figure 2(a) we can see that Democrats mainly use Frames 1, 4, 8, 9, and 15 to discuss ACA, while Figure 2(c) shows that Republicans predominantly use Frames 1, 8, 9, 12, and 13. For terrorism, both parties use the same frames: 3, 7, 10, 14, 16, and 17, but to express different views. For example, Democrats use Frame 3 to indicate a moral responsibility to fight ISIS. Republicans use Frame 3 to frame terrorists or their attacks as a result of "radical Islam". An interesting pattern to note is seen in Frames 10 and 14 for both parties. In 2015 there is a large increase in the usage of these frames. This seems to indicate that parties possibly adopt new frames simultaneously or in response to the opposing party, perhaps in an effort to be in control of the way the message is delivered through that frame.
Individual Frames: In addition to entire party analysis, we were interested in seeing if frames could shed light on the behavior of aisle-crossing politicians. These are politicians who do not vote the same as the majority vote of their party (i.e., they vote the same as the opposing party). Identifying such politicians can be useful in governments which are heavily split by party, i.e., governments such as the recent U.S. Congress, where politicians tend to vote the same as the rest of their party members. For this analysis, we collected five 2015 votes from the House of Representatives on both issues and compiled a list of the politicians who voted opposite to their party. The most important descriptor we noticed was that all aisle-crossing politicians tweet less frequently on the issue than their fellow party members. This is true for both parties. This behavior could indicate lack of desire to draw attention to one's stance on the particular issue. Figure 3(a) shows the framing patterns of aisle-crossing Republicans on ACA votes from 2015. Recall from Figure 2 that Democrats mostly use Frames 1, 4, 8, 9, and 15, while Republicans mainly use Frames 1, 8, and 9. In this example, these Republicans are considered aisle-crossing votes because they have voted the same as Democrats on this issue. The most interesting pattern to note here is that these Republicans use the same framing patterns as the Republicans (Frames 1, 8, and 9), but they also use the frames that are unique to Democrats: Frames 4 and 15. These latter two frames appear significantly less in the Republican tweets of our entire dataset as well. These results suggest that to predict aisle-crossing Republicans it would be useful to check for usage of typically Democrat-associated frames, especially if those frames are infrequently used by Republicans. Figure 3(b) shows the predicted frames for aisle-crossing Democrats on terrorism-related votes. We see here that there are very few tweets from these Democrats on this issue and that overall they use the same framing patterns as seen previously: Frames 3, 7, 10, 14, 16, and 17. However, given the small scale of these tweets, we can also consider Frames 12 and 13 to show peaks for this example. This suggests that for aisle-crossing Democrats the use of additional frames not often used by their party for discussing an issue might indicate potentially different voting behaviors.

Conclusion
In this paper we present the task of collective classification of Twitter data for framing prediction. We show that by incorporating Twitter behaviors such as similar activity times and similar networks, we can increase F1 prediction scores. We provide an analysis of our approach in both supervised and unsupervised settings, as well as a real world analysis of framing patterns over time. Finally, our global PSL models can be applied to other domains, such as politics in other countries, simply by changing the initial unigram keywords to reflect the politics of those countries.

Table 9: Frame and Corresponding Unigrams Used for Initial Supervision.
Word Lists: Table 8 lists the keywords or phrases used to filter the entire dataset to only tweets related to the six issues studied in this paper. Table 9 lists the unigrams that were designed based on the descriptions for Frames 1 through 14 provided in the Policy Frames Codebook (Boydstun et al., 2014). These unigrams provide the initial supervision for our models as described in Section 4.