“i have a feeling trump will win..................”: Forecasting Winners and Losers from User Predictions on Twitter

Social media users often make explicit predictions about upcoming events. Such statements vary in the degree of certainty the author expresses toward the outcome: “Leonardo DiCaprio will win Best Actor” vs. “Leonardo DiCaprio may win” or “No way Leonardo wins!”. Can popular beliefs on social media predict who will win? To answer this question, we build a corpus of tweets annotated for veridicality on which we train a log-linear classifier that detects positive veridicality with high precision. We then forecast uncertain outcomes using the wisdom of crowds, by aggregating users’ explicit predictions. Our method for forecasting winners is fully automated, relying only on a set of contenders as input. It requires no training data of past outcomes and outperforms sentiment and tweet volume baselines on a broad range of contest prediction tasks. We further demonstrate how our approach can be used to measure the reliability of individual accounts’ predictions and retrospectively identify surprise outcomes.


Introduction
In the digital era, millions of people broadcast their thoughts and opinions online. These include predictions about upcoming events whose outcomes are not yet known, such as the Oscars or election results. Such statements vary in the extent to which their authors intend to convey that the event will happen. For instance, (a) in Table 1 strongly asserts the win of Natalie Portman over Meryl Streep, whereas (b) imbues the claim with uncertainty. In contrast, (c) does not say anything about the likelihood of Natalie Portman winning (although it clearly indicates the author would like her to win).
Prior work has made predictions about contests such as NFL games (Sinha et al., 2013) and elections using tweet volumes (Tumasjan et al., 2010) or sentiment analysis (O'Connor et al., 2010; Shi et al., 2012). Many such indirect signals have been shown to be useful for prediction; however, their utility varies across domains. In this paper we explore whether the "wisdom of crowds" (Surowiecki, 2005), as measured by users' explicit predictions, can predict outcomes of future events. We show how it is possible to accurately forecast winners by aggregating many individual predictions that assert an outcome. Our approach requires no historical data about outcomes for training and can be directly adapted to a broad range of contests.
To extract users' predictions from text, we present TwiVer, a system that classifies veridicality toward future contests with uncertain outcomes. Given a list of contenders competing in a contest (e.g., Academy Award for Best Actor), we use TwiVer to count how many tweets explicitly assert the win of each contender. We find that aggregating veridicality in this way provides an accurate signal for predicting outcomes of future contests. Furthermore, TwiVer allows us to perform a number of novel qualitative analyses, including retrospective detection of surprise outcomes that were not expected according to popular belief (Section 4.5). We also show how TwiVer can be used to measure the number of correct and incorrect predictions made by individual accounts. This provides an intuitive measurement of the reliability of an information source (Section 4.6).

Related Work
In this section we summarize related work on text-driven forecasting and computational models of veridicality.
Text-driven forecasting models (Smith, 2010) predict future response variables using text written in the present: e.g., forecasting films' box-office revenues using critics' reviews (Joshi et al., 2010), predicting citation counts of scientific articles (Yogatama et al., 2011) and success of literary works (Ashok et al., 2013), forecasting economic indicators using query logs (Choi and Varian, 2012), improving influenza forecasts using Twitter data (Paul et al., 2014), predicting betrayal in online strategy games (Niculae et al., 2015) and predicting changes to a knowledge graph based on events mentioned in text (Konovalov et al., 2017). These methods typically require historical data for fitting model parameters, and may be sensitive to issues such as concept drift (Fung, 2014). In contrast, our approach does not rely on historical data for training; instead we forecast outcomes of future events by directly extracting users' explicit predictions from text.
Prior work has also demonstrated that user sentiment online directly correlates with various real-world time series, including polling data (O'Connor et al., 2010) and movie revenues (Mishne and Glance, 2006). In this paper, we empirically demonstrate that veridicality can often be more predictive than sentiment (Section 4.1).
Also related is prior work on detecting veridicality (de Marneffe et al., 2012; Søgaard et al., 2015) and sarcasm (González-Ibánez et al., 2011). Soni et al. (2014) investigate how journalists frame quoted content on Twitter using predicates such as think, claim or admit. In contrast, our system, TwiVer, focuses on the author's belief toward a claim and direct predictions of future events as opposed to quoted content.
Our approach, which aggregates predictions extracted from user-generated text, is related to prior work that leverages explicit positive-veridicality statements to make inferences about users' demographics. For example, Coppersmith et al. (2014; 2015) exploit users' self-reported statements of diagnosis on Twitter.

Measuring the Veridicality of Users' Predictions
The first step of our approach is to extract statements that make explicit predictions about unknown outcomes of future events. We focus specifically on contests, which we define as events planned to occur on a specific date, where a number of contenders compete and a single winner is chosen. For example, Table 2 shows the nominations for the 2016 Academy Award for Best Actor. To explore the accuracy of user predictions in social media, we gathered a corpus of tweets that mention events belonging to one of the 10 types listed in Table 4. Relevant messages were collected by formulating queries to the Twitter search interface that include the name of a contender for a given contest in conjunction with the keyword win. We restricted the time range of the queries to retrieve only messages written before the time of the contest, to ensure that outcomes were unknown when the tweets were written. We include 10 days of data before the event for the presidential primaries and the final presidential elections, 7 days for the Oscars, Ballon d'Or and Indian general elections, and the period between the semifinals and the finals for the sporting events. Table 3 shows several example queries to the Twitter search interface which were used to gather data. We automatically generated queries, using templates, for events scraped from various websites: 483 queries were generated for the presidential primaries based on events scraped from ballotpedia, 176 queries were generated for the Oscars, 18 for Ballon d'Or, 162 for the Eurovision contest, 52 for Tennis Grand Slams, 6 for the Rugby World Cup, 18 for the Cricket World Cup, 12 for the Football World Cup, 76 for the 2016 US presidential elections, and 68 queries for the 2014 Indian general elections.

Figure 1: Example of one item to be annotated, as displayed to the Turkers.
We added an event prefix (e.g., "Oscars" or the state for presidential primaries), a keyword ("win"), and the relevant date range for the event. For example, "Oscars Leonardo DiCaprio win since:2016-2-22 until:2016-2-28" would be the query generated for the first entry in Table 2. We restricted the data to English tweets only, as tagged by langid.py (Lui and Baldwin, 2012). Jaccard similarity was computed between messages to identify and remove duplicates. We removed URLs and preserved only tweets that mention contenders in the text. This automatic postprocessing left us with 57,711 tweets for all winners and 55,558 tweets for losers (contenders who did not win) across all events. Table 4 gives the data distribution across event categories.
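The duplicate-removal step can be sketched as follows. The 0.8 similarity threshold and the word-level tokenization are assumptions for illustration, since the paper does not specify them:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two tweets."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not (sa or sb):
        return 1.0
    return len(sa & sb) / len(sa | sb)

def deduplicate(tweets, threshold=0.8):
    """Keep each tweet only if it is not a near-duplicate of one already kept."""
    kept = []
    for t in tweets:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

tweets = [
    "Leonardo DiCaprio will win the Oscar",
    "Leonardo DiCaprio will win the Oscar!",  # near-duplicate, dropped
    "No way Leonardo wins this year",
]
print(deduplicate(tweets))
```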

Mechanical Turk Annotation
We obtained veridicality annotations on a sample of the data using Amazon Mechanical Turk. For each tweet, we asked Turkers to judge the veridicality toward a candidate winning as expressed in the tweet, as well as the author's desire toward the event. For veridicality, we asked Turkers to rate whether the author believes the event will happen on a 1-5 scale ("Definitely Yes", "Probably Yes", "Uncertain about the outcome", "Probably No", "Definitely No"). We also added a question about the author's desire toward the event to make clear the difference between veridicality and desire. For example, "I really want Leonardo to win at the Oscars!" asserts the author's desire toward Leonardo winning, but remains agnostic about the likelihood of this outcome, whereas "Leonardo DiCaprio will win the Oscars" predicts with confidence that the event will happen.
Figure 1 shows the annotation interface presented to Turkers. Each HIT contained 10 tweets to be annotated. We gathered annotations for 1,841 tweets for winners and 1,702 tweets for losers, giving us a total of 3,543 tweets. We paid $0.30 per HIT. The total cost for our dataset was $1,000. Each tweet was annotated by 7 Turkers. We used MACE (Hovy et al., 2013) to resolve differences between annotators and produce a single gold label for each tweet.
Figures 2a and 2c show heatmaps of the distribution of annotations for the winners, both for the Oscars and for all categories combined. In both instances, most of the data is annotated with "Definitely Yes" and "Probably Yes" labels for veridicality. Figures 2b and 2d show that the distribution is more diverse for the losers. Such distributions indicate that the veridicality of crowds' statements could indeed be predictive of outcomes. We provide additional evidence for this hypothesis using automatic veridicality classification on larger datasets in Section 4.

Veridicality Classifier
The goal of our system, TwiVer, is to automate the annotation process by predicting how veridical a tweet is toward a candidate winning a contest: is the candidate deemed to be winning, or is the author uncertain? For the purpose of our experiments, we collapsed the five labels for veridicality into three: positive veridicality ("Definitely Yes" and "Probably Yes"), neutral ("Uncertain about the outcome") and negative veridicality ("Definitely No" and "Probably No").
We model the conditional distribution over a tweet's veridicality toward a candidate c winning a contest against a set of opponents, O, using a log-linear model:

p(v | c, O, tweet) ∝ exp(θ_v · f(c, O, tweet))

where v is the veridicality (positive, negative or neutral).
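Concretely, this is a multinomial logistic regression over sparse indicator features. A minimal sketch, with toy hand-set weights standing in for the learned parameters (the real weights are MAP estimates fit with L-BFGS-B):

```python
import math

LABELS = ["positive", "neutral", "negative"]

# Collapsing the five annotation labels into the three classes TwiVer predicts:
COLLAPSE = {
    "Definitely Yes": "positive", "Probably Yes": "positive",
    "Uncertain about the outcome": "neutral",
    "Probably No": "negative", "Definitely No": "negative",
}

def predict_veridicality(features, weights):
    """p(v | c, O, tweet): softmax over per-class scores of active features."""
    scores = {v: sum(weights.get((f, v), 0.0) for f in features) for v in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {v: math.exp(s) / z for v, s in scores.items()}

# Toy weights for two hypothetical features (illustrative only):
weights = {("t will win", "positive"): 2.0, ("negated_kw", "negative"): 2.5}
probs = predict_veridicality({"t will win"}, weights)
```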
To extract features f(c, O, tweet), we first preprocessed tweets retrieved for a specific event to identify named entities, using Ritter et al. (2011)'s Twitter NER system. Candidate (c) and opponent entities were identified in the tweet as follows:
- TARGET (t): A named entity that matches a contender name from our queries.
- OPPONENT (O): For every event, along with the current TARGET entity, we also keep track of the other contenders for the same event. If a named entity in the tweet matches one of the other contenders, it is labeled as OPPONENT.
- ENTITY (e): Any named entity which does not match the list of contenders.
Figure 3 illustrates the named entity labeling for a tweet obtained from the query "Oscars Leonardo DiCaprio win since:2016-2-22 until:2016-2-28". Leonardo DiCaprio is the TARGET, while the named entity tag for Bryan Cranston, one of the losers for the Oscars, is re-tagged as OPPONENT. These tags provide information about the position of named entities relative to each other, which is used in the features.

Features
We use five feature templates: context words, distance between entities, presence of punctuation, dependency paths, and negated keyword.
Target and opponent contexts. For every TARGET (t) and OPPONENT (o ∈ O) entity in the tweet, we extract context words in a window of one to four words to the left and right of the TARGET ("Target context") and OPPONENT ("Opponent context"), e.g., t will win, I'm going with t, o will win.

Keyword context. For target and opponent entities, we also extract the words between the entity and our specified keyword (k) (win in our case): t predicted to k, o might k.

Pair context. For the election type of events, in which two target entities are present (contender and state, e.g., Clinton, Ohio), we extract the words between these two entities: e.g., t1 will win t2.

Distance to keyword. We also compute the distance of the TARGET and OPPONENT entities to the keyword.
Punctuation.We introduce two binary features for the presence of exclamation marks and question marks in the tweet.We also have features which check whether a tweet ends with an exclamation mark, a question mark or a period.Punctuation, especially question marks, could indicate how certain authors are of their claims.
Dependency paths.We retrieve dependency paths between the two TARGET entities and between the TARGET and keyword (win) using the TweeboParser (Kong et al., 2014) after applying rules to normalize paths in the tree (e.g., "doesn't" → "does not").
Negated keyword.We check whether the keyword is negated (e.g., "not win", "never win"), using the normalized dependency paths.
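A simplified, surface-level version of a few of these templates is sketched below. The dependency-path features are omitted, and the negation check is a crude string match standing in for the dependency-based one; feature names are our own:

```python
import re

def extract_features(tokens, target_idx, keyword="win"):
    """Illustrative subset of TwiVer-style surface feature templates."""
    feats = set()
    # Target context: windows of one to four words left and right of TARGET.
    for w in range(1, 5):
        left = tokens[max(0, target_idx - w):target_idx]
        right = tokens[target_idx + 1:target_idx + 1 + w]
        feats.add("left:" + " ".join(left) + " t")
        feats.add("right:t " + " ".join(right))
    # Distance from TARGET to the keyword, if the keyword is present.
    if keyword in tokens:
        feats.add("dist_to_kw=%d" % abs(tokens.index(keyword) - target_idx))
    # Punctuation features.
    if "!" in tokens:
        feats.add("has_exclamation")
    if "?" in tokens:
        feats.add("has_question")
    # Negated keyword, e.g., "not win", "never win".
    if re.search(r"\b(not|never)\s+" + keyword + r"\b", " ".join(tokens)):
        feats.add("negated_kw")
    return feats

toks = "I think t will never win !".split()
feats = extract_features(toks, toks.index("t"))
```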
We randomly divided the annotated tweets into a training set of 2,480 tweets, a development set of 354 tweets and a test set of 709 tweets. MAP parameters were fit using L-BFGS-B (Zhu et al., 1997). Table 6 provides examples of high-weight features for positive and negative veridicality.

Evaluation
We evaluated TwiVer's precision and recall on our held-out test set of 709 tweets. Figure 4 shows the precision/recall curve for positive veridicality. By setting a threshold of 0.64 on the probability score, we achieve a precision of 80.1% and a recall of 44.3% in identifying tweets expressing positive veridicality toward a candidate winning a contest.
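Raising the threshold trades recall for precision. The computation behind such an operating point can be sketched as follows (the probabilities and gold labels below are made up for illustration):

```python
def precision_recall_at(pos_probs, gold, threshold=0.64):
    """Precision/recall for the positive class when only predictions with
    p(positive) above `threshold` are accepted."""
    pred = [p > threshold for p in pos_probs]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum((not p) and g for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy classifier scores and gold positive-veridicality labels:
prec, rec = precision_recall_at([0.9, 0.7, 0.3, 0.8], [True, False, True, True])
```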

Performance on held-out event types
To assess the robustness of the veridicality classifier when applied to new types of events, we compared its performance when trained on all events vs. holding out one category for testing. Table 9 shows the comparison: the second and third columns give the F1 score when training on all events vs. removing tweets related to the category we are testing on. In most cases we see a relatively modest drop in performance after holding out training data from the target event category, with the exception of elections. This suggests our approach can be applied to new event types without requiring in-domain training data for the veridicality classifier.

Error Analysis
Table 7 shows some examples which TwiVer incorrectly classifies. These errors indicate that even though shallow features and dependency paths do a decent job of predicting veridicality, deeper text understanding is needed in some cases. The opposition between "the heart . . . the mind" in the first example is not trivial to capture. Paying attention to matrix clauses might be important too (as shown in the last tweet, "There is no doubt . . .").

Forecasting Contest Outcomes
We now have access to a classifier that can automatically detect positive veridicality predictions about a candidate winning a contest. This enables us to evaluate the accuracy of the crowd's wisdom by retrospectively comparing popular beliefs (as extracted and aggregated by TwiVer) against known outcomes of contests. We do this for each award category (Best Actor, Best Actress, Best Film and Best Director) in the Oscars from 2009 to 2016, for every state for both Republican and Democratic parties in the 2016 US primaries, for both candidates in every state in the final 2016 US presidential election, for every country in the finals of the Eurovision song contest, for every contender for the Ballon d'Or award, for every party in every state in the 2014 Indian general elections, and for the contenders in the finals of all sporting events.

Prediction
A simple voting mechanism is used to predict contest outcomes: we collect tweets about each contender written before the date of the event, and use TwiVer to measure the veridicality of users' predictions toward the events. Then, for each contender, we count the number of tweets that are labeled as positive with a confidence above 0.64, as well as the number of tweets with positive veridicality for all other contenders. Table 11 illustrates these counts for one contest, the Oscars Best Actress in 2014.
We then compute a simple prediction score:

score(c) = |T_c| / (|T_c| + |T_O|)     (1)

where T_c is the set of tweets mentioning positive veridicality predictions toward candidate c, and T_O is the set of all tweets predicting that any opponent will win. For each contest, we simply predict as winner the contender whose score is highest.
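The aggregation step can be sketched as follows, assuming the prediction score is the fraction of positive-veridicality tweets that back each contender; the counts below are made up for illustration, not the paper's actual numbers:

```python
def predict_winner(positive_counts):
    """positive_counts: contender -> number of positive-veridicality tweets.
    score(c) = |T_c| / (|T_c| + |T_O|), which reduces to |T_c| / total;
    the predicted winner is the contender with the highest score."""
    total = sum(positive_counts.values())
    scores = {c: n / total for c, n in positive_counts.items()}
    return max(scores, key=scores.get), scores

# Hypothetical counts for one contest:
counts = {"Cate Blanchett": 430, "Amy Adams": 72, "Sandra Bullock": 51}
winner, scores = predict_winner(counts)
```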

Sentiment Baseline
We compare the performance of our approach against a state-of-the-art sentiment baseline (Mohammad et al., 2013). Prior work on social media forecasting used sentiment analysis to make predictions about contest outcomes. For instance, O'Connor et al. (2010) used sentiment to predict election outcomes and forecast polls. We use a re-implementation of Mohammad et al. (2013)'s system to estimate sentiment for tweets in our corpus. We run the tweets obtained for every contender through the sentiment analysis system to obtain a count of positive labels. Sentiment scores are computed analogously to veridicality scores, using Equation (1). For each contest, the contender with the highest sentiment score is predicted as the winner.

Frequency Baseline
We also compare our approach against a simple frequency (tweet volume) baseline. For every contender, we compute the number of tweets that have been retrieved. Frequency scores are computed in the same way as for veridicality and sentiment, using Equation (1). For every contest, the contender with the highest frequency score is selected as the winner.

Results
Table 8 gives the precision, recall and max-F1 scores for the veridicality, sentiment and volume-based forecasts on all the contests. The veridicality-based approach outperforms the sentiment and volume-based approaches on 9 of the 10 events considered. For the Tennis Grand Slams, all three approaches perform poorly, and the performance of the veridicality approach is considerably lower for the Tennis events than for the other events. It is well known, however, that winners of tennis tournaments are very hard to predict: players' performance in the last minutes of a match is often decisive, and even professionals have a difficult time predicting tennis winners. Table 10 shows the top 10 predictions made by the veridicality and sentiment-based systems on two of the events we considered, the Oscars and the presidential primaries, highlighting correct predictions.

Surprise Outcomes
In addition to providing a general method for forecasting contest outcomes, our veridicality-based approach allows us to perform several novel analyses, including retrospectively identifying surprise outcomes that were unexpected according to popular belief.
In Table 10, we see that the veridicality-based approach incorrectly predicts The Revenant as winning Best Film in 2016. This makes sense, because the film was widely expected to win at the time, according to popular belief. Numerous sources in the press described The Revenant's failure to win the Oscar as a big surprise.
Similarly, for the primaries, the two incorrect predictions made by the veridicality-based approach were surprise losses. News articles indeed reported the loss of Maine for Trump and the loss of Indiana for Clinton as unexpected.

Assessing the Reliability of Accounts
Another nice feature of our veridicality-based approach is that it immediately provides an intuitive assessment of the reliability of individual Twitter accounts' predictions. For a given account, we can collect tweets about past contests, extract those which exhibit positive veridicality toward the outcome, and simply count how often the account was correct in its predictions. As a proof of concept, we retrieved within our dataset the user names of accounts whose tweets about Ballon d'Or contests were classified as having positive veridicality. Table 12 gives the accounts that made the largest number of correct predictions for Ballon d'Or awards between 2010 and 2016, sorted by prediction accuracy. Usernames of non-public figures are anonymized (as user 1, etc.) in the table. We did not extract more data for these users: we only look at the data we had already retrieved. Some users might not make predictions for all contests, which span 7 years.
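The per-account tally can be sketched as follows; the usernames, contest identifiers and predictions below are made up for illustration:

```python
from collections import defaultdict

def account_reliability(predictions, outcomes):
    """predictions: (user, contest, contender) triples extracted as
    positive-veridicality statements; outcomes: contest -> actual winner.
    Returns user -> (correct, total, accuracy)."""
    tally = defaultdict(lambda: [0, 0])
    for user, contest, contender in predictions:
        tally[user][1] += 1
        if outcomes.get(contest) == contender:
            tally[user][0] += 1
    return {u: (c, n, c / n) for u, (c, n) in tally.items()}

# Hypothetical data:
outcomes = {"ballondor2015": "Lionel Messi", "ballondor2016": "Cristiano Ronaldo"}
predictions = [
    ("user1", "ballondor2015", "Lionel Messi"),
    ("user1", "ballondor2016", "Lionel Messi"),
    ("user2", "ballondor2016", "Cristiano Ronaldo"),
]
reliability = account_reliability(predictions, outcomes)
```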
Accounts like "goal ghana", "breakingnewsnig" and "1Mrfutball", which are automatically identified by our analysis, are known to post tweets predominantly about soccer.

Conclusions
In this paper, we presented TwiVer, a veridicality classifier for tweets which is able to ascertain the degree of veridicality toward future contests. We showed that veridical statements on Twitter provide a strong predictive signal for winners on different types of events, and that our veridicality-based approach outperforms sentiment and frequency baselines for predicting winners. Furthermore, our approach is able to retrospectively identify surprise outcomes. We also showed how our approach enables an intuitive yet novel method for evaluating the reliability of information sources.

Figure 2 :
Figure 2: Heatmaps showing annotation distributions for one of the events (the Oscars) and for all event types, separating winners from losers. Vertical labels indicate veridicality (DY "Definitely Yes", PY "Probably Yes", UC "Uncertain about the outcome", PN "Probably No" and DN "Definitely No"). Horizontal labels indicate desire (SW "Strongly wants the event to happen", PW "Probably wants the event to happen", ND "No desire about the event outcome", PD "Probably does not want the event to happen", SN "Strongly against the event happening"). More data in the upper left-hand corner indicates more tweets with positive veridicality and desire.

Figure 3 :
Figure 3: Illustration of the three named entity tags and distance features between entities and keyword win for a tweet retrieved by the query "Oscars Leonardo DiCaprio win since:2016-2-22 until:2016-2-28".

Figure 4 :
Figure 4: Precision/Recall curve showing TwiVer performance in identifying positive veridicality tweets in the test data.

Table 1 :
Examples of tweets expressing varying degrees of veridicality toward Natalie Portman winning an Oscar.

Table 2 :
Oscar nominations for Best Actor 2016.

Table 3 :
Examples of queries to extract tweets.

Table 4 :
Number of tweets for each event category.

Table 5 :
Feature ablation of the positive veridicality classifier, removing each group of features from the full set. The point of maximum F1 score is shown in each case.

Table 6 :
Some high-weight features for positive and negative veridicality.

Table 7 :
Some classification errors made by TwiVer. Contenders queried for are highlighted.

Table 8 :
Performance of Veridicality, Sentiment baseline, and Frequency baseline on all event categories (%).

Table 9 :
F1 scores for each event when training on all events vs. holding out that event from training. |T_t| is the number of tweets of that event category present in the test dataset.

Table 10 :
Top 10 predictions of winners for the Oscars and primaries based on veridicality and sentiment scores. Correct predictions are highlighted. "!" indicates a loss which wasn't expected.

Table 11 :
Positive veridicality tweet counts for the Best Actress category in 2014: |T_c| is the count of positive veridicality tweets for the contender under consideration and |T_O| is the count of positive veridicality tweets for the other contenders.

Table 12 :
List of users sorted by how accurate they were in their Ballon d'Or predictions.