Determining Event Outcomes: The Case of #fail

This paper targets the task of determining event outcomes in social media. We work with tweets containing either #cookingFail or #bakingFail, and show that many of the events described in them resulted in something edible. Tweets that contain images are more likely to result in edible albeit imperfect outcomes. Experimental results show that edibility is easier to predict than outcome quality.


Introduction
While the definition of event is controversial (Casati and Varzi, 2020; Sprugnoli and Tonelli, 2016), there is general consensus that events occur (or happen, or take place) at a time and location. People share a deluge of information on social media, including events they care about. These events range from mundane events such as eating or watching TV to important life events such as getting married and graduating from college (Li et al., 2014). Twitter is one of the most popular social networks with 166 million daily active users (Twitter, 2020).
An important property of events is whether they actually occurred. The literature has studied this property under different terms, e.g., factuality (Saurí and Pustejovsky, 2009; Lee et al., 2015) and veridicality (de Marneffe et al., 2012). Other related tasks have studied the level of commitment a speaker or writer has towards a proposition (Werner et al., 2015; Jiang and de Marneffe, 2019). Assessing the degree to which an event occurred or is believed to be true is critical for making inferences and for information extraction. Even when an event is guaranteed to have occurred, however, it is not necessarily the case that the desired outcome came to fruition. For example, people make phone calls (presumably) to communicate with whoever they are calling. Making the call, however, does not guarantee that the communication took place: the recipient could have not picked up the phone. Some events have fairly clear desired outcomes even if they are not explicitly stated: people make phone calls to communicate, run in elections so that they are elected, etc. The desired outcomes of other events, however, are not so clear: people may plant a tree to help the environment, to provide privacy or shade, or so that it bears fruit.
Figure 1: Tweet discussing a baking event. Despite the presence of the #BakingFail hashtag, the baking was not a complete failure. Indeed, it (most likely) resulted in something edible but visually unappealing.
* Currently at Thomson Reuters. Work done while at University of North Texas.
Like factuality, determining whether an event resulted in the desired outcome is a matter of degree and not a binary decision. In other words, events often do not result in perfect outcomes or complete failures. For example, a phone call may result in communication that is far from perfect because there is background noise or because the call suddenly drops. Consider the tweet in Figure 1. Despite the hashtag #BakingFail, the baking was partially successful: something edible came out of the baking, although it was not visually appealing.
In this paper, we target cooking and baking events that include some form of the hashtag #fail, and study the degree to which they resulted in their desired outcomes in terms of edibility and quality. The main contributions are: (a) a corpus of 4,000 tweets annotated with event outcome information in two stages: edibility and quality; 1 (b) analysis showing that more information can be extracted from tweets including an image; (c) experimental results showing that determining outcome quality remains a challenge; and (d) error analysis shedding light on the difficulty of the task.

Previous Work
The language of social media has been studied from many angles, including applications in the social sciences (Park et al., 2015) and public health (Paul and Dredze, 2011). In the context of emergencies, detecting the first message about a new disaster and information aggregation are important problems (Imran et al., 2015). In this paper, we work with mundane events (cooking and baking) described in one tweet, and study the degree to which they resulted in their desired outcomes.
Event detection from social media has received considerable attention, in particular, pinpointing important life events (Li et al., 2014; Dickinson et al., 2015). Previous research shows that people often tweet about events they do not participate in (Sanagavarapu et al., 2017), targets recurring events (Kunneman and Van den Bosch, 2015), and summarizes tweet streams about TV shows (Andy et al., 2019). The work presented here is not concerned with event detection: our selection criteria virtually guarantee that we only work with tweets about cooking or baking (Section 3).
Determining the degree to which an event results in its desired outcome is distantly related to assessing factuality and other event properties. Previous efforts working with social media target event factuality (Soni et al., 2014), identify controversial events (Popescu and Pennacchiotti, 2010) and credible eyewitnesses (Doggett and Cantarero, 2016), and work with arguably more challenging properties such as rumors (Zubiaga et al., 2015) and credibility (Castillo et al., 2011; Mitra et al., 2017). In the work presented here, we work with factual mundane events whose credibility is undisputed. Lack of factuality or credibility indicates that an event did not occur, and thus also that the desired outcomes were not achieved. We note, however, that factual and credible events did not necessarily result in their desired outcomes, as the examples in this paper illustrate with cooking and baking events.
To our knowledge, there are only a few previous works investigating event outcomes from a computational perspective. Outside the social media domain, Velichkov et al. (2019) investigate models to predict the outcome of sports events from interviews conducted shortly before the event. Within social media, Stowe et al. (2018) present models to determine whether people evacuate during a hurricane event from their tweets. Finally, Swamy et al. (2017) present a framework to forecast winners of events (e.g., sports events, elections, awards) by aggregating predictions made by individual users. Our work differs in many respects. First, we work with mundane events (cooking and baking). Second, we investigate a finer-grained characterization of event outcomes beyond binary decisions: edibility and quality. Third, we work with tweets consisting of only text as well as both text and images, and show that the outcomes (in particular, edibility) are easier to determine in the latter, both by humans and computational models.

Annotating Event Outcomes: Cooking and Baking Events
We create a new corpus of tweets annotated with event outcome information. Initially we set out to work with mundane events carried out by regular people and requiring some degree of skill. We explored the following events: driving, gardening, playing sports, singing, playing musical instruments, cooking, and sewing. After manually observing many tweets discussing these events, it became clear that event outcomes are often unknown for events that do not result in concrete outcomes. Additionally, people barely discuss some of the events above unless they result in the expected outcome (e.g., most people talking about driving appear to be good drivers and reach their destinations). We decided to focus on cooking and baking events because (a) they require minimal expertise (i.e., most people can do some cooking); (b) they are frequently discussed in social media; and (c) people often discuss the outcome of their cooking and baking in social media, including less than perfect outcomes.
Selecting Tweets. We downloaded 4,000 English tweets describing cooking or baking events using tweepy. 2 More specifically, we downloaded 2,000 tweets containing the #cookingfail hashtag and 2,000 tweets containing the #bakingfail hashtag. Half of the tweets in each category consisted of only text, and the other half consisted of text and an image. As we shall see, it is common to find tweets that talk about cooking and baking failures even though an edible outcome resulted from them, especially when the tweet includes an image.
Table 1: Inter-annotator agreements with tweets consisting of (a) only text and (b) text and an image. We present the raw agreements (%) and Cohen's κ.
Annotation Guidelines. Dictionaries define cooking as "prepare food for eating [. . . ]", and baking as "cook by dry heat especially in an oven" (Merriam-Webster, 2003). Thus the desired outcome of cooking or baking events is to create something edible. Our event outcome annotation guidelines for cooking and baking events go beyond this binary distinction and include three steps.
The first step is to identify relevant tweets, which we define as tweets that describe a cooking or baking event involving the author. Annotators choose from the following labels for relevancy:
• yes: the tweet is relevant; or
• no: the tweet is not relevant.
The majority of the selected tweets are relevant; exceptions include references to cooking shows.
The second step is to identify whether the cooking or baking event resulted in something edible. Annotators choose from the following labels:
• yes: the cooking or baking event resulted in something edible; or
• no: the cooking or baking event did not result in something edible.
We define edible outcomes as outcomes of cooking or baking events that a reasonable person would eat rather than toss in the trash. Edible outcomes need not be perfect or even what a cook intended to make; they only need to be edible.
The third step is to identify the quality of edible outcomes. After pilot annotations, we decided to let annotators choose among the following labels:
• as expected: the cooking or baking event resulted in the expected food or dish, and there is nothing wrong with it.
• partial success: the cooking or baking event resulted in the expected food or dish, but something went wrong: it may be visually unappealing or partially burnt, it may have resulted in fewer portions than expected, etc.
• alternative: the cooking or baking event resulted in a different food or dish than the one the cook originally intended.
• unknown: I cannot choose any of the other three labels; there is not enough information.
While perfection is hard to achieve, one could consider outcomes annotated as expected to be perfect. Outcomes annotated partial success or alternative, on the other hand, are imperfect. The former results in the expected outcome with some flaw, and the latter in another outcome altogether (e.g., baking cookies and ending up with biscuits).
All annotations were made with respect to the cooking or baking event up to the point the tweet was published. For example, the outcome of a tweet describing a cake-baking event and mentioning that the oven tripped a circuit breaker would be annotated not edible, even though it is possible that the baking was successful after resetting the breaker.
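The three annotation steps form a small label hierarchy, which can be sketched as a data structure. The following is only an illustration under stated assumptions: the class name `OutcomeAnnotation`, its field names, and the rule that edibility is recorded only for relevant tweets are hypothetical and are not taken from the released corpus format. The rule that quality applies only to edible outcomes follows the guidelines above.

```python
from dataclasses import dataclass
from typing import Optional

# Quality labels from the third annotation step.
QUALITY_LABELS = {"as expected", "partial success", "alternative", "unknown"}


@dataclass(frozen=True)
class OutcomeAnnotation:
    """Hypothetical container for one annotated tweet (not the corpus format)."""
    relevant: bool                  # step 1: cooking/baking event involving the author
    edible: Optional[bool] = None   # step 2: assumed to be set only for relevant tweets
    quality: Optional[str] = None   # step 3: set only for edible outcomes

    def __post_init__(self):
        if not self.relevant:
            if self.edible is not None or self.quality is not None:
                raise ValueError("irrelevant tweets carry no outcome labels")
            return
        if self.edible is None:
            raise ValueError("relevant tweets need an edibility label")
        if self.edible and self.quality not in QUALITY_LABELS:
            raise ValueError("edible outcomes need a valid quality label")
        if not self.edible and self.quality is not None:
            raise ValueError("quality is only annotated for edible outcomes")


# The circuit-breaker example: annotated up to the point of tweeting, so not edible.
breaker_tweet = OutcomeAnnotation(relevant=True, edible=False)
# The Figure 1 example: edible but visually unappealing.
figure1_tweet = OutcomeAnnotation(relevant=True, edible=True, quality="partial success")
```

Encoding the dependencies between steps as validation makes inconsistent label combinations (e.g., an inedible outcome with a quality label) fail early rather than silently enter the data.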
Annotation Process and Agreements. Annotations were done in-house by two graduate students. Both of them annotated 15% of tweets in each group (#cookingfail or #bakingfail, only text or text and image). Table 1 shows the inter-annotator agreements. Cohen's κ coefficients (Cohen, 1960) range between 0.55 and 0.76, which is considered substantial; above 0.80 is considered nearly perfect (Artstein and Poesio, 2008).
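For reference, raw agreement and Cohen's κ for two annotators can be computed as in the minimal sketch below. The label sequences are toy data invented for illustration, not annotations from the corpus, and the function names are our own.

```python
from collections import Counter


def raw_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = raw_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labels independently according to their own label distribution.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if p_expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy edibility labels from two hypothetical annotators (10 tweets).
annotator_1 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes", "no"]
annotator_2 = ["yes", "yes", "yes", "no", "no", "no", "no", "yes", "yes", "no"]

agreement = raw_agreement(annotator_1, annotator_2)  # 8 of 10 items match
kappa = cohens_kappa(annotator_1, annotator_2)       # chance agreement is 0.5 here
```

In this toy case the annotators agree on 80% of items, but since half of that agreement is expected by chance, κ is about 0.6.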
We note that (a) κ coefficients for both edibility and quality are slightly higher when tweets consist of both text and images, and (b) our agreements are on par with or better than previous work on social media data (Holgate et al., 2018). Table 2 provides the label frequency for each annotation task. The majority of the 4,000 tweets selected are about cooking or baking (94.9% and 90.0%). Although they contain the hashtag #cookingfail or #bakingfail, a substantial number of tweets consisting of only text resulted in an edible outcome (31.8%), and this is true for the majority (59.7%) of tweets consisting of text and an image.

Corpus Analysis
Regarding quality, most cooking and baking events resulted in the expected dish with some flaw (partial success: 57.8% and 72.6%). Additionally, people are more likely to share a picture if the cooking or baking event was a partial success than if it resulted in an alternative outcome.

Examples
We present examples of all labels using tweets consisting of only text in Table 3. Example (1) does not discuss cooking by the author of the tweet (relevant: no), and in Example (2) it is unclear: oklava is a kitchen utensil but it appears the author is getting ready to travel. Unplugged appliances will result in an inedible outcome (Example (3)), and sometimes baking failures refer to some setback that only delays the expected outcome (Example (4)).
Figure 2: Neural network architecture to predict whether a cooking or baking event resulted in an edible outcome, and if so, the event quality (as expected, partial success, alternative or unknown). We include a text component (above dotted line) and two image components (below dotted line).
Examples (5) to (7) are more nuanced. In Example (5), the outcome had some flaw but was edible (partial success), and in Example (6) the author ended up with scrambled eggs while trying to make an omelet. Finally, the outcome in Example (7) is unknown because it is unclear how the kids were fed; it is possible that the family ended up ordering takeout food. Table 4 presents examples with tweets consisting of text and an image. The rationale for the annotations is similar. We note that both the text and image are necessary to annotate correctly. Indeed, the bottom left cupcake in the picture in Example (3) of Table 4 could be misinterpreted as a less than perfect outcome, but the text clearly indicates that they were as good as it gets. Similarly, the text is critical in Examples (4) and (5).

Experiments and Results
We experiment with models to predict outcome edibility (yes or no) and outcome quality (as expected, partial success, alternative or unknown). We split the tweets into train (80%) and test (20%) splits, and report results on the test split with (a) the tweets consisting of only text and (b) tweets consisting of both text and an image.
Baselines. We work with the majority baseline (edibility: always no (only text) or yes (text and image), quality: always partial success for all tweets) and a supervised baseline using Logistic Regression. The Logistic Regression model uses bag-of-words features and only considers the text in tweets as input; it disregards the image if tweets contain one. We use the implementation in the scikit-learn machine learning Python package (Pedregosa et al., 2011) with default parameters, which in turn uses the LIBLINEAR library (Fan et al., 2008).
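The data split and the majority baseline can be sketched in plain Python as follows. This is an illustration under assumptions: the paper does not state whether the split was random or stratified, so a seeded random shuffle is assumed, and the example tweets and helper names are invented.

```python
import random
from collections import Counter


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed (assumed) and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def majority_label(labels):
    """Most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


# Toy (text, edibility) pairs standing in for text-only tweets.
toy_tweets = [
    ("burnt the rice again", "no"),
    ("forgot to plug in the slow cooker", "no"),
    ("cake collapsed but tasted fine", "yes"),
    ("smoke alarm went off, dinner ruined", "no"),
    ("cookies came out flat", "yes"),
    ("salt instead of sugar", "no"),
    ("pancakes stuck to the pan", "no"),
    ("bread did not rise", "no"),
    ("lopsided cupcakes", "yes"),
    ("dropped the casserole", "no"),
]

train, test = train_test_split(toy_tweets)
baseline_label = majority_label(label for _, label in train)
# The majority baseline assigns the same label to every test tweet.
baseline_predictions = [baseline_label for _ in test]
```

With text-only tweets the majority edibility label is no, so this baseline never predicts an edible outcome; any useful model has to beat it on the yes label.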
Neural Network Architecture. The neural network is inspired by our previous work (Chinnappa et al., 2019) and Cai et al. (2019). It includes two components: one for the text and another one for the image (above and below dotted line in Figure 2). The first component is a basic LSTM (Hochreiter and Schmidhuber, 1997) with 200 units which takes as input the text in the tweet. We lower case tokens and transform them into their GloVe embeddings (Pennington et al., 2014) pretrained with Twitter data (300 dimensions). 3 The image component consists of two parts. The first part is another LSTM with 200 units that takes as input the tags automatically extracted from the image by the Google Cloud Vision API. 4 Note that the tags are an additional text input, and that tags may be more than one word (e.g., chocolate cake), so the LSTM allows us to encode the sequence of tags (which has variable length). Additionally, the word embeddings (GloVe embeddings pre-trained with CommonCrawl) allow us to leverage a distributional representation of tags, including those not seen during training. The second part uses the pre-trained InceptionNet network (Szegedy et al., 2015) in order to extract a representation of the image. More specifically, we use the weights from the average pool layer (second to last).
We implement the neural network with the Keras API (Chollet et al., 2015) and TensorFlow backend (Abadi et al., 2015).
Table 6: Results obtained with the tweets consisting of text and images.

Experimental Results
Tweets with only Text. Predicting outcome quality is harder. Logistic regression and the neural network obtain the same weighted F1 (0.53) and outperform the majority baseline (F1: 0.42). All the models obtain F1s below 0.50 for all labels except partial success, which is the most frequent label.
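Weighted F1 averages the per-label F1 scores, weighting each label by its frequency in the gold annotations. The sketch below uses toy labels rather than results from the paper, and it shows why a majority baseline can obtain a non-trivial weighted F1 while scoring zero on every other label.

```python
from collections import Counter


def f1_for_label(gold, pred, label):
    """F1 of a single label, treating it as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights proportional to gold label counts."""
    support = Counter(gold)
    return sum(
        f1_for_label(gold, pred, label) * count for label, count in support.items()
    ) / len(gold)


# Toy gold quality labels and a majority baseline that always predicts
# "partial success".
gold = ["partial success", "partial success", "partial success", "alternative"]
majority_pred = ["partial success"] * len(gold)

majority_score = weighted_f1(gold, majority_pred)
```

Here the baseline reaches an F1 of 6/7 on partial success and 0 on alternative, for a weighted F1 of 9/14 (about 0.64).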
Tweets with Text and Images. Table 6 shows the results with tweets consisting of text and images. Regarding outcome edibility, we observe a similar pattern as before, but this time the yes label (the most frequent) obtains a higher F1 (0.77 vs. 0.60). The neural network (text and image components) outperforms logistic regression predicting outcome edibility (F1: 0.70 vs. 0.62), but not predicting outcome quality (F1: 0.54 vs. 0.53). We also experiment with an alternative set of classes for outcome quality. Specifically, we merge partial success and alternative as these two labels indicate unexpected (but edible) outcomes. The results are as one would expect: it is easier to predict three instead of four labels. The baseline, however, also obtains better results, and in fact both logistic regression and the neural network yield lower relative improvements with respect to the baseline.
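The label merging and the notion of relative improvement over the baseline can be made concrete with a short sketch. The merged label name `unexpected` is a hypothetical choice for illustration. The numeric example reuses the four-label, text-only F1 scores reported above (0.53 for the models, 0.42 for the majority baseline); the three-label scores are not reproduced here.

```python
MERGED_LABEL = "unexpected"  # hypothetical name for the merged class
LABELS_TO_MERGE = {"partial success", "alternative"}


def merge_quality_labels(labels):
    """Map the four quality labels onto three by merging the two imperfect ones."""
    return [MERGED_LABEL if label in LABELS_TO_MERGE else label for label in labels]


def relative_improvement(model_f1, baseline_f1):
    """Improvement of a model over the baseline, as a fraction of the baseline."""
    return (model_f1 - baseline_f1) / baseline_f1


four_labels = ["as expected", "partial success", "alternative", "unknown"]
three_labels = merge_quality_labels(four_labels)

# Four-label, text-only setting: models at 0.53 vs. majority baseline at 0.42.
text_only_gain = relative_improvement(0.53, 0.42)
```

In that setting the models improve on the baseline by roughly 26%. When merging labels also raises the baseline score, the same absolute gain translates into a smaller relative improvement.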

Error Analysis
We identify the most common error types made by the best model (NN, only text and NN, text + imgs) after manually analyzing 100 errors.
Tweets with Only Text. Table 7 presents the most frequent error types with tweets consisting of only text. Regarding outcome edibility, the most common type (54%) is the need for world knowledge, primarily related to cooking. In the example, annotators had no issue realizing that hard boiled eggs cannot be used for baking, but the model, unsurprisingly, failed to do so. The next two most common errors are human errors and intricate text (15% each). The former refers to instances in which a human makes the wrong measurement, fails to properly operate appliances, or is otherwise responsible for an inedible outcome. The latter are tweets in which complex reasoning in addition to knowledge about cooking is required. Finally, 7% of errors occurred predicting an inedible outcome when in reality an alternative (and edible) outcome resulted from the cooking or baking.
Regarding outcome quality, we identify two major error types. The most common (41%) is also world knowledge. In the example, one must know that potatoes and beans have different cooking times; note that the text does not give any explicit cue about the quality of the resulting dish. A substantial amount of errors occur with tweets whose text lacks information to establish the outcome quality (gold: unknown). In this case, the model tends to predict the majority label, partial success. Finally, the remaining errors (26%) are due to other reasons. In the example, the #cookingfail refers to a past cooking (last week), not the one that occurred shortly before tweeting.
Tweets with Text and Images. Table 8 presents the most common error types with tweets consisting of text and an image. Compared to tweets consisting of only text, we observe that the picture is often critical to make the right prediction, even if the text is long. World knowledge is not a common error type, suggesting that people use pictures for rather explicit outcomes, assuming one can properly interpret the picture. Although we did not anticipate this insight prior to the error analysis, it is to a large extent unsurprising: it is rather hard to depict world knowledge in a picture.
Table 8: Most frequent error types with tweets consisting of text and images (top: outcome edibility, bottom: outcome quality). Pred. indicates the predicted label from the best performing model (NN text+img, Table 6). Outcome quality examples (gold vs. pred.): unknown vs. partial success; partial success vs. unknown; partial success vs. unknown.
Regarding outcome edibility (top three examples in Table 8), a common source of errors (25%) is with tweets in which the image is key. For example, the text in the first example alone does not make it clear what charcoal refers to, but the picture clearly shows that the cupcake is partially burnt. The second cause of errors (20%) is human error (mismeasurements, improper use of appliances, etc.). In the second example, the picture is also important but the text alone gives a clue that the cook lost the battle against the oven (Oven 1, me 0), thus we consider it a human error. The third error type (20%) is also shared with the tweets consisting of only text: the model struggles to identify edible outcomes that were not anticipated (i.e., alternative (and edible) outcomes).
Regarding outcome quality, we observe two error types covering over half of the errors and a long tail of additional types. First, some tweets lack information in the text and image (28% of errors) to determine the outcome quality (gold: unknown), and the model tends to predict the majority label (partial success). Second, the image is key in 26% of errors, as illustrated in the second example. In this example, the event outcome (edibility and quality) is very ambiguous without looking at the picture. Finally, we also identified that the model struggles to identify partial success when cooks make some mistake (human error, 7% of all errors). In the third example, the cook forgot an ingredient but doing so did not result in a complete failure.

Conclusions
Factual and credible events do not necessarily result in their expected outcomes. In this paper, we target outcomes of cooking and baking events from social media. Specifically, we determine whether something edible resulted from them, and also the outcome quality (as expected, partial success or alternative). An annotation effort with 4,000 tweets consisting of either only text or text and an image shows that people often use the hashtag #cookingFail or #bakingFail when the cooking did not result in a complete failure. Indeed, the outcome is often edible albeit not perfect, especially if the tweet includes an image (59.7% vs. 31.8%).
We believe that a similar approach could be used to assess outcomes of other events. For example, taking exams and going to the grocery store usually have clear expected outcomes: to pass the exam and to buy something. Taking an exam or going shopping (factuality is not in question here), however, does not guarantee that the expected outcomes become a reality (e.g., people take exams and fail them). One may be able to determine not only whether instances of these events occurred, but also if they resulted in the desired outcomes.
Table 12: Results obtained training and testing with four or three labels for outcome quality. These results are obtained with tweets consisting of text and images. IN refers to features extracted from the pretrained InceptionNet network, and tags refers to the LSTM taking as input the tags from the Google Vision API.