Identifying and Categorizing Disaster-Related Tweets

This paper presents a system for classifying disaster-related tweets. The focus is on Twitter data generated before, during, and after Hurricane Sandy, which impacted New York in the fall of 2012. We propose an annotation schema for identifying relevant tweets as well as the more fine-grained categories they represent, and develop feature-rich classifiers for relevance and fine-grained categorization.


Introduction
Social media provides a powerful lens for identifying people's behavior, decision-making, and information sources before, during, and after wide-scope events, such as natural disasters (Becker et al., 2010; Imran et al., 2014). This information is important for identifying what information is propagated through which channels, and what actions and decisions people pursue. However, social media services like Twitter generate so much information that filtering out noise becomes necessary.
Focusing on the 2012 Hurricane Sandy event, this paper presents classification methods for (i) filtering tweets relevant to the disaster, and (ii) categorizing relevant tweets into fine-grained categories such as preparation and evacuation. This type of automatic tweet categorization can be useful both during and after disaster events. During events, tweets can help crisis managers, first responders, and others take effective action. After the event, analysts can use social media information to understand people's behavior during the event. This type of understanding is of critical importance for improving risk communication and protective decision-making leading up to and during disasters, and thus for reducing harm (Demuth et al., 2012).
Our experiments show that such tweets can be classified accurately, and that combining a variety of linguistic and contextual features can substantially improve classifier performance.
Related Work

Analyzing Disasters with Social Media
A number of researchers have used social media as a data source to understand various disasters (Yin et al., 2012; Kogan et al., 2015), with applications such as situational awareness (Vieweg et al., 2010; Bennett et al., 2013) and understanding public sentiment (Doan et al., 2012). For a survey of social media analysis for disasters, see Imran et al. (2014).
Closely related to this work is that of Verma et al. (2011), who constructed classifiers to identify tweets that demonstrate situational awareness in four datasets (Red River floods of 2009 and 2010, the Haiti earthquake of 2010, and Oklahoma fires of 2009). Situational awareness is important for those analyzing social media data, but it does not encompass the entirety of people's reactions. A primary goal of our work is to capture tweets that relate to a hazard event, regardless of situational awareness.

Tweet Classification
Identifying relevant information in social media is challenging due to the low signal-to-noise ratio. A number of researchers have used NLP to address this challenge. There is significant work in the medical domain related to identifying health crises and events in social media data. Multiple studies have analyzed flu-related tweets (Culotta, 2010; Aramaki et al., 2011). Most closely related to our work (but in a different domain) is the flu classification system of Lamb et al. (2013), which first classifies tweets for relevance and then applies finer-grained classifiers.
Similar systems have been developed to categorize tweets in more general domains, for example by identifying tweets related to news, events, and opinions (Sankaranarayanan et al., 2009; Sriram et al., 2010). Similar classifiers have been developed for sentiment analysis (Pang and Lee, 2008) to identify and categorize sentiment-expressing tweets (Go et al., 2009; Kouloumpis et al., 2011).

Collection
In late October 2012, Hurricane Sandy generated a massive, dispersed reaction in social media channels, with many users expressing their thoughts and the actions they took before, during, and after the storm. We performed a keyword collection for this event, capturing all tweets that used any of the following keywords from October 23, 2012 to April 5, 2013: DSNY, cleanup, debris, frankenstorm, garbage, hurricane, hurricanesandy, lbi, occupysandy, perfectstorm, sandy, sandycam, stormporn, superstorm. In total, 22.2M unique tweets were collected from 8M unique Twitter users.
We then identified 100K users with a geo-located tweet in the time leading up to the landfall of the hurricane, and gathered all tweets generated by those users, creating a dataset of 205M tweets produced by 92.2K users. From approximately 8,000 users who (i) tweeted at least 50 times during the data collection period and (ii) posted at least 3 geotagged tweets from within the mandatory evacuation zones in New York City, we randomly selected 100. Filtering the dataset to focus on users at high risk is critical, and this first pass reduced the proportion of users who were outside the area and thus not affected by the event. Our dataset includes all tweets from these users, not just tweets containing the keywords. Seven users were removed for having predominantly non-English tweets.
The final dataset contained 7,490 tweets from 93 users, covering a 17-day time period starting one week before landfall (October 23rd to November 10th). Most tweets were irrelevant: Halloween, as well as the upcoming presidential election, yielded a large number of tweets not related to the storm, despite the collection bias toward Twitter users from affected areas.
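The user-selection criteria described in this section can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the record layout (`user`, `lat`, `lon`), the function names, and especially the bounding box standing in for the mandatory evacuation zones are all assumptions made for the example. Only the two thresholds and the sample size of 100 come from the text.

```python
import random
from collections import defaultdict

MIN_TWEETS = 50       # from the text: at least 50 tweets in the collection period
MIN_GEO_IN_ZONE = 3   # from the text: at least 3 geotagged tweets inside a zone


def in_evacuation_zone(lat, lon):
    """Hypothetical stand-in: a made-up bounding box, not real zone polygons."""
    return 40.55 <= lat <= 40.65 and -74.05 <= lon <= -73.85


def eligible_users(tweets):
    """Return sorted ids of users who meet both selection criteria.

    `tweets` is an iterable of dicts with a 'user' key and, for geotagged
    tweets only, 'lat' and 'lon' keys (an assumed record layout).
    """
    total = defaultdict(int)
    in_zone = defaultdict(int)
    for t in tweets:
        total[t["user"]] += 1
        lat, lon = t.get("lat"), t.get("lon")
        if lat is not None and lon is not None and in_evacuation_zone(lat, lon):
            in_zone[t["user"]] += 1
    return sorted(u for u in total
                  if total[u] >= MIN_TWEETS and in_zone[u] >= MIN_GEO_IN_ZONE)


def sample_users(users, k=100, seed=0):
    """Randomly pick up to k users; the seed is fixed only for repeatability."""
    return random.Random(seed).sample(users, min(k, len(users)))
```

A real implementation would replace `in_evacuation_zone` with a point-in-polygon test against the official zone boundaries.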

Annotation Schema
Tweets were annotated with a fine-grained, multilabel schema developed in an iterative process with domain experts, social scientists, and linguists who are members of our larger project team. The schema was designed to annotate tweets that reflect the attitudes, information sources, and protective decision-making behavior of those tweeting. This schema is not exhaustive (anything deemed relevant that did not fall into an annotation category was marked as Other), but it is much richer than that of previous work. Tweets that were not labeled with any category were considered irrelevant (and as such, served as negative examples for relevance classification). Two additional categories, reporting on family members and referring to previous hurricane events, were seen as important to the event, but were very rare in the data (34 of 7,490 total tweets). Tweets could be labeled with any of the following categories:
Sentiment Tweets that express emotions or personal reactions towards the event, such as humor, excitement, frustration, worry, condolences, etc.
Action Tweets that describe physical actions taken to prepare for the event, such as powering phones, acquiring generators or alternative power sources, and buying other supplies.
Preparation Tweets that describe making plans in preparation for the storm, including those involving altering plans.
Reporting Tweets that report first-hand information available to the tweeter, including reporting on the weather and the environment around them, as well as the observed social situations.
Information Tweets that share or seek information from others (including public officials). This category is distinct from Reporting in that it only includes information received or requested from outside sources, and not information perceived first-hand.
Movement Tweets that mention evacuation or sheltering behavior, including mentions of leaving, staying in place, or returning from another location. Tweets about movement are rare, but especially important in determining a user's response to the event.

Annotation Results
Two annotators were trained by domain experts using 726 tweets collected for ten Twitter users. Annotation involved a two-step process: first, tweets were labeled for relevance, and then relevant tweets were labeled with the fine-grained categories described above. The annotators were instructed to use the linguistic information, including the context of previous and following tweets, as well as the information present in links and images, to determine the appropriate category. A third annotator provided a deciding vote to resolve disagreements. Table 1 shows the label proportions and annotator agreement for the different tasks. Because each tweet could belong to multiple categories, κ scores were calculated based on agreement per category: if a tweet was marked by both annotators as a particular category, it was marked as agreement for that category. Agreement was only moderate for relevance (κ = .569). Many tweets did not contain enough information to be categorized easily, for example: "tryin to cure this cabin fever!" and "Thanks to my kids for cleaning up the yard" (edited to preserve privacy). Without context, it is difficult to determine whether these tweeters were dealing with hurricane-related issues.
Agreement was higher for fine-grained tagging (κ = .814). The hardest categories were the rarest (Preparation and Movement), with most confusions between Preparation, Reporting, and Sentiment.
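The per-category agreement computation can be illustrated with a short sketch. It assumes the κ reported here is Cohen's kappa computed over binary per-category decisions, which the text does not state explicitly; the function names and the set-per-tweet input layout are likewise illustrative.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) decisions."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of items annotator A marked positive
    p_b = sum(b) / n  # share of items annotator B marked positive
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # both annotators used the same constant label
    return (observed - expected) / (1 - expected)


def per_category_kappa(ann1, ann2, categories):
    """ann1 and ann2 hold one set of category names per tweet (assumed layout)."""
    return {
        c: cohen_kappa([int(c in s) for s in ann1], [int(c in s) for s in ann2])
        for c in categories
    }
```

Treating each category as its own yes/no decision is what allows a multilabel annotation to be scored with a two-class agreement statistic.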

Classification
We trained binary classifiers for each of the categories in Table 1, using an independent classifier for each fine-grained category, since a tweet may carry none, one, or several of these labels.
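One way to set up these independent binary problems is sketched below. The category names follow the annotation schema; the relevance target follows the rule that a tweet with no label is irrelevant. The input layout (one set of labels per tweet) and the function name are assumptions for illustration.

```python
CATEGORIES = ["Sentiment", "Action", "Preparation",
              "Reporting", "Information", "Movement"]


def binary_targets(annotations):
    """Turn per-tweet label sets into one 0/1 target list per classifier.

    A tweet counts as relevant if it carries any label at all, including
    Other; each fine-grained category gets its own independent target.
    """
    targets = {"Relevance": [int(bool(labels)) for labels in annotations]}
    for category in CATEGORIES:
        targets[category] = [int(category in labels) for labels in annotations]
    return targets
```

Each target list can then be paired with the same feature vectors to train a separate classifier.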

Model Selection
Our baseline features are the counts of unigrams in tweets, after preprocessing to remove capitalization, punctuation and stopwords. We initially experimented with different classification models and feature selection methods using unigrams for relevance classification. We then used the best-performing approach for the rest of our experiments. 10% of the data was held out as a development set to use for these initial experiments, including parameter optimization (e.g., SVM regularization).
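A minimal sketch of the baseline preprocessing and the development split follows. The stopword list is a tiny illustrative sample rather than the list actually used, and whitespace tokenization, the function names, and the fixed random seed are assumptions.

```python
import random
import string
from collections import Counter

# Illustrative sample only; the stopword list actually used is not specified.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "this"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def unigram_counts(text):
    """Lowercase, strip ASCII punctuation, drop stopwords, count tokens.

    Note that stripping punctuation also removes '#' and '@', so hashtags
    and mentions are reduced to bare words.
    """
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return Counter(t for t in tokens if t not in STOPWORDS)


def dev_split(items, frac=0.1, seed=0):
    """Hold out a random fraction of the items as a development set."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_dev = int(len(items) * frac)
    dev = [items[i] for i in idx[:n_dev]]
    rest = [items[i] for i in idx[n_dev:]]
    return rest, dev
```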
We assessed three classification models that have been successful in similar work (Verma et al., 2011; Go et al., 2009): support vector machines (SVMs), maximum entropy (MaxEnt) models, and Naive Bayes. We experimented with both the full feature set of unigrams, as well as a truncated set using standard feature selection techniques: removing rare words (frequency below 3) and selecting the n words with the highest pointwise mutual information between the word counts and document labels.
Each option was evaluated on the development data. Feature selection was substantially better than using all unigrams, with the SVM yielding the best F1 performance. For the remaining experiments, SVM with feature selection was used.
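The two feature selection steps can be sketched as below. Several details are assumptions, since the text does not pin them down: "frequency" is taken as total corpus count, and PMI is computed between a word's presence in a document and the positive label, rather than over raw counts.

```python
import math
from collections import Counter


def select_features(docs, labels, n, min_count=3):
    """Return the n words with highest PMI against the positive class.

    docs: list of token lists; labels: parallel list of 0/1 labels.
    Words with a total corpus count below `min_count` are dropped first.
    """
    total = Counter(tok for doc in docs for tok in doc)
    vocab = {w for w, c in total.items() if c >= min_count}

    n_docs = len(docs)
    n_pos = sum(labels)
    df = Counter()      # documents containing the word
    df_pos = Counter()  # positive documents containing the word
    for doc, y in zip(docs, labels):
        for w in set(doc) & vocab:
            df[w] += 1
            if y:
                df_pos[w] += 1

    def pmi(w):
        if df_pos[w] == 0:
            return float("-inf")  # never seen with the positive label
        joint = df_pos[w] / n_docs
        return math.log(joint / ((df[w] / n_docs) * (n_pos / n_docs)))

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda w: (-pmi(w), w))[:n]
```

The frequency cutoff matters here because PMI is known to favor rare words, whose estimates are unreliable on a dataset of this size.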

Features
In addition to unigrams, bigram counts were added (using the feature selection described above), as well as the following features:
• The time of the tweet is particularly relevant to the classification, as tweets during and after the event are more likely to be relevant than those before. The day/hour of the tweet is represented as a one-hot feature vector.
• We indicate whether a tweet is a retweet (RT), which is indicative of information-sharing rather than first-hand experience.
• Each URL found within a tweet was stripped to its base domain and added as a lexical feature.
• The annotators noted that context was important in classification. The unigrams from the previous tweet and previous two tweets were considered as features.
• We included n-grams augmented with their part-of-speech tags, as well as named entities, using the Twitter-based tagger of Ritter et al. (2011).
• Word embeddings have been used extensively in recent NLP work, with promising results (Goldberg, 2015). A Word2Vec model (Mikolov et al., 2013) was trained on the 22.2M tweets collected from the Hurricane Sandy dataset using the Gensim package (Řehůřek and Sojka, 2010), with the CBOW algorithm, negative sampling (n=5), a window of 5, and 200 dimensions per word. For each tweet, the mean embedding of all words was used to create 200 features.
• The work of Verma et al. (2011) found that formal, objective, and impersonal tweets were useful indicators of situational awareness, and as such developed classifiers to tag tweets with four different categories: formal vs informal, subjective vs objective, personal vs impersonal, and situational awareness vs not. We used these four Verma classifiers to tag our Hurricane Sandy dataset and included these tags as features.
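Several of the features listed above are simple enough to sketch directly. The code below covers the time, retweet, URL, and mean-embedding features only, and makes a number of assumptions: day and hour are encoded as two concatenated one-hot blocks, a retweet is detected by a leading "RT @", the "base domain" is approximated by the host name without a leading "www.", and the embedding lookup is a plain dict standing in for a trained Word2Vec model.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def time_one_hot(ts, start, n_days):
    """Day-since-start one-hot (n_days slots) followed by hour one-hot (24)."""
    day = (ts.date() - start.date()).days
    vec = [0] * (n_days + 24)
    if 0 <= day < n_days:
        vec[day] = 1
    vec[n_days + ts.hour] = 1
    return vec


def is_retweet(text):
    """Assumed convention: retweets begin with 'RT @'."""
    return text.startswith("RT @")


def url_domains(text):
    """Host name of each URL in the tweet, without a leading 'www.'."""
    hosts = []
    for url in URL_RE.findall(text):
        host = urlparse(url.rstrip(".,;:!?)")).netloc.lower()
        hosts.append(host[4:] if host.startswith("www.") else host)
    return hosts


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]
```

Averaging discards word order, but it gives every tweet a fixed-length dense vector regardless of its length, which is convenient for an SVM.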

Classification Results
Classification performance was measured using fivefold cross-validation. We conducted an ablation study (Figure 1), removing individual features to determine which contributed to performance. Table 2 shows the cross-validation results using the baseline feature set (selected unigrams only), all features, and the best feature set (features which had a significant effect in the ablation study). In all categories except for Movement, the best features improved over the baseline with p < .05.
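The evaluation protocol can be sketched as follows. The fold construction (contiguous, unshuffled), pooling predictions across folds before computing F1, and the `train_fn` and `evaluate` interfaces are all assumptions for illustration; the significance test behind the reported p-values is not specified in the text and is not reproduced here.

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs over range(n) in k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def f1_score(gold, pred):
    """F1 of the positive class for parallel lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validated_f1(examples, labels, train_fn, k=5):
    """train_fn(train_x, train_y) returns a predict(x) -> 0/1 callable."""
    gold, pred = [], []
    for train, test in k_fold_indices(len(examples), k):
        predict = train_fn([examples[i] for i in train],
                           [labels[i] for i in train])
        gold.extend(labels[i] for i in test)
        pred.extend(predict(examples[i]) for i in test)
    return f1_score(gold, pred)


def ablation(groups, evaluate):
    """Score drop when each feature group is removed from the full set."""
    full = evaluate(groups)
    return {g: full - evaluate([h for h in groups if h != g]) for g in groups}
```

Because the folds are contiguous, the order of the examples determines what each fold contains; shuffling beforehand, or grouping by user, would give a different estimate.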

Performance Analysis
Time, context, and word embedding features help relevance classification. Timing information is helpful for distinguishing certain categories (e.g., Preparation happens before the storm while Movement can happen before or after). Context was also helpful, consistent with annotator observations. A larger context window would in theory be more useful, as we noted that distant tweets influenced annotation choices, but with this relatively small dataset, increasing the context window made the context features prohibitively sparse. Retweets and URLs were not generally useful, likely because the information was already captured by the lexical features. Part-of-speech tags yielded minimal improvements, perhaps because the lexical features critical to the task are unambiguous (e.g., "hurricane" is always a noun). The features from Verma et al. (2011) were similarly unhelpful, perhaps because these classifiers had only moderate performance to begin with and were being extended to a new domain.
Fine-grained classification was much harder. Lexical features (bigrams and key terms) were useful for most categories, with other features providing minor benefits. Word embeddings greatly improved performance across all categories, while most other features had mixed results. This is consistent with our expectations of latent semantics: tweets within the same category tend to contain similar lexical items, and word embeddings allow this similarity to be captured despite the limited size of the dataset.
The categories that were most confused were Information and Reporting, and the categories with the worst performance were Movement, Action, and Preparation. Movement simply lacks data, with only 53 labeled instances. Action and Preparation contain wide varieties of tweets, and thus patterns to distinguish them are sparse. More training data would help fine-grained classification, particularly for Action, Preparation, and Movement.
Classification for Reporting performs much better than others. This is likely because these tweets tend to fall into regular patterns: they often use weather and environment-related lexical items like "wind" and "trees", and frequently contain links to images. They also are relatively frequent, making their patterns easier to identify.

Performance in Other Domains
To see how well our methods work on other datasets, we compared our model to the situational awareness classification in the Verma et al. (2011)

Conclusion
Compared to the most closely related work of Verma et al. (2011), our proposed classifiers are both more general (identifying all relevant tweets, not just situational awareness) and richer (with fine-grained categorizations). Our experimental results show that it is possible to identify relevant tweets with high precision while maintaining fairly high recall. Fine-grained classification proved much more difficult, and additional work will be necessary to define appropriate features and models to detect more specific categories of language use. Data sparsity also causes difficulty, as many classes lack the positive examples a classifier needs to learn them reliably, and we continue to work on further annotation to alleviate this issue.
Our primary research aims are to leverage both relevance classification and fine-grained classification to assist crisis managers and first responders. The preliminary results show that relevant information can be extracted automatically via batch processing after events. To make this research more applicable, we aim to produce a real-time processing system that can provide accurate classification during an event rather than after, and to apply the current results to other events and domains.