More Features Are Not Always Better: Evaluating Generalizing Models in Incident Type Classification of Tweets

Social media represents a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time-consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word-n-grams and character-n-grams.


Introduction
Incident information contained in social media has frequently been shown to include information not captured by standard emergency channels (e.g. 911 calls, bystander reports). Stakeholders such as emergency management and city administration can therefore benefit greatly from social media. Due to its unstructured and unfocused nature, automatic filtering of social media content is a necessity for further analysis. A standard approach for this filtering is automatic classification using a trained machine learning model (Agarwal et al., 2012; Schulz et al., 2013; Schulz et al., 2015b).
A problem for the classification approach is that the language, style and named entities used in social media vary greatly across different regions. Consider the following two tweets as examples: "RT: @People 0noe friday afternoon in heavy traffic, car crash on I-90, right lane closed" and "Road blocked due to traffic collision on I-495". Both tweets comprise entities that might refer to the same thing with different wording, either on a semantically low ("accident" and "car collision") or more abstract level ("I-90" and "I-495"). With simple syntactical text similarity approaches using standard bag-of-words features, it is not easily possible to make use of this semantic similarity, even though it is highly valuable for classification.
These limitations impose constraints on the dataset, because tokens are likely to be related to the location where the text was created or to contain location- or incident-sensitive topics. Models trained using spatially and temporally restricted data from one region are bound by the specific aspects of language and style expressed in the training data; thus, model reuse is not easily possible.
In this paper, we focus on the creation of generalized models. Such models avoid the use of features that, in a manner akin to overfitting, are only useful for a specific region. Generalized models are intended to work in different regions, even if the training data originates from only one or a few regions. This can ensure high classification rates even in areas where only few training samples are available. Finally, as cities grow and merge with surrounding towns into large metropolitan areas, generalized models make it possible to cope with the latent transitions in token use.
To create generalized models for incident type classification (and social media classification in general) the most important step is an appropriate feature generation. Therefore, in this paper we investigate the suitability of standard and novel features and different machine learning algorithms for the creation of generalized classification models for incident type classification. We conduct intensive feature engineering and evaluation. For this purpose, we have collected and labeled 10 datasets with high regional variation. To the best of our knowledge, this is the first investigation of the challenges of heterogeneous datasets in this domain, and of the suitability of state of the art classification and feature extraction techniques.
In summary, our contributions are: 1) Investigation of features and feature groups for generalized social media/incident type classification models. 2) Identification of the best feature combinations and classifiers for a generalized model. In an evaluation (qualitative and inferential statistics) on ten tweet datasets with high regional variation, we obtain an overall F-measure of > 83%. 3) The evaluation shows that features extending a plain n-gram-based approach are not necessarily valuable for training a generalized model, as they provide little improvement.
Following this introduction, we give an overview of related work in Section 2. In Section 3, we provide a description of our datasets followed by a comprehensive evaluation in Section 4. We close with our conclusion and future work in Section 5.

Related Work
A review of existing work on the classification of social media content shows which features, feature groups and algorithms are generally used (see Table 1). It also reveals the number of classes considered and the dominating approaches. We report the ratios of labeled tweets for the individual approaches; however, we omit performance measures as these are directly related to the respective datasets used for evaluation.
Classifiers based on Support Vector Machines (SVM) or Naive Bayes (NB) clearly dominate in terms of performance for incident type classification. (Sakaki and Okazaki, 2010; Carvalho et al., 2010; Agarwal et al., 2012; Robert Power, 2013; Schulz and Janssen, 2014) trained an SVM, whereas (Agarwal et al., 2012; Imran et al., 2013; Schulz and Janssen, 2014) also evaluated an NB classifier. In contrast to these works, (Wanichayapong et al., 2011) followed a dictionary-based approach using traffic-related keywords. (Li et al., 2012) do not provide any information about the classifier used.
Keywords also play a crucial role in feature design. (Sakaki and Okazaki, 2010) used earthquake-specific keywords, statistical features (the number of words in a tweet and the position of keywords), and word context features (the words before and after the earthquake-related keyword). (Wanichayapong et al., 2011) used traffic-related keywords in combination with location-related keywords. Furthermore, Li et al. (2012) iteratively refined a keyword-based search for retrieving a higher number of incident-related tweets.
Two approaches rely on more specific feature groups. The approach of (Schulz and Janssen, 2014) is the only one that uses TF-IDF scores. (Imran et al., 2013) use Kipper et al.'s (2006) extension of the Verbnet ontology for verbs.
The related approaches mostly use word-n-grams and a variety of Twitter-specific features. The datasets are spatially and temporally restricted and few in number, which makes it difficult to assess generalizability.

Data Collection and Processing
We are interested in generalizable models for different regions, user-generated content has been

Data Collection
We focus on tweets as a suitable example of unstructured textual information shared in social media. The classification of incident-related tweets represents a challenge that is relevant for many cities. We use a complex four-class classification problem, in which new tweets are assigned to one of the classes "crash", "fire", "shooting", or a neutral class "not incident related". This goes beyond related work, which focuses on two-class classification. Our classes were identified as the most common incident types in Seattle using the Fire Calls data set (http://seattle.data.gov), an official incident information source. As ground truth data, we collected several city-specific datasets using the Twitter Search API. These datasets were collected in a 15 km radius around the city centres of Boston (USA), Brisbane (AUS), Chicago (USA), Dublin (IRE), London (UK), Memphis (USA), New York City (USA), San Francisco (USA), Seattle (USA), and Sydney (AUS).
We selected these cities because of their huge regional distance, which allows us to evaluate our approaches with respect not only to geographical, but also to cultural variations. Also, for all cities, sufficiently many English tweets can be retrieved. We chose 15 km as radius to collect a representative data sample even from cities with large metropolitan areas. Despite the limitations of the Twitter Search API with respect to the number of geotagged tweets, we assume that our sample is, although by definition incomplete, highly relevant to our experiments.
We collected all available tweets during limited time periods, resulting in three initial sets of tweets: 7.5M tweets collected from November 2012 to February 2013 for Memphis and Seattle (SET CITY 1); 2.5M tweets collected from January 2014 to March 2014 for New York City, Chicago, and San Francisco (SET CITY 2); 5M tweets collected from July 2014 to August 2014 for Boston, Brisbane, Dublin, London, and Sydney (SET CITY 3).
For the manual labeling process, we had to select a subset of our original tweet set which included our classes of interest for model training and testing. Generating subsets is required because manual labeling of social media data is very expensive, especially if multiple annotators are involved. To generate subsets, we used the approach of (Schulz et al., 2013) of extracting microposts using incident-related keywords. As a result, more than 200 keywords were identified for each class. Based on these incident-related keywords, we were able to filter the datasets accurately and efficiently. After applying keyword filtering, we randomly selected 5,000 microposts for each city. Though one might assume that this pre-filtering leads to a biased dataset, (Schulz and Janssen, 2014) showed that keyword sampling does not influence the classification process, as the performance of a keyword-based classifier is notably worse than that of supervised classifiers.
In the next step, we removed all redundant tweets as well as those with no textual content from the resulting sets, since some tweets contain keywords that are part of hashtags or @-mentions but have no useful textual content. The tweets were then labeled manually by five annotators using the CrowdFlower (http://www.crowdflower.com/) platform. We retrieved the manual labels and kept those tweets for which at least 75% of the coders agreed. In the case of disagreement, the tweets were removed. This resulted in ten datasets with regional diversity to be used for evaluation. Table 2 lists the class distributions for each dataset. The distributions vary considerably, allowing us to evaluate with typical city-specific samples. Also, the "crash" class seems to be the most prominent incident type, whereas "shootings" are less frequent. One reason for this is that "shootings" do not occur as frequently as other incidents; another, less obvious reason might be that people tend to report more about specific incident types, so that there is not necessarily a correlation between real-world incidents and those mentioned in tweets. Although the datasets have been filtered based on keywords, the "no incident" class remains the largest class.
One of the key questions that motivates our work is to what extent the words used vary between datasets as an effect of the spatial and cultural context. We thus analysed how similar the datasets are by calculating the intersection of tokens. We found that after preprocessing, between 14% and 23% of the tokens are shared between the datasets. We do not assume that every unique token is a city-specific token, but the large number of tweets in our evaluations gives a first indication that there is a diversity in the samples that requires either the training of several individual models or one generalizing model, the latter being the focus of this paper.
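The token-intersection analysis can be illustrated with a short sketch. The exact normalization used in the study is not stated; dividing the intersection by the union of both vocabularies (the Jaccard index) is an assumption made here, and the function name and toy tweets are hypothetical.

```python
from itertools import combinations


def shared_token_ratio(tweets_a, tweets_b):
    """Share of distinct tokens common to both datasets (Jaccard index).

    Assumption: tweets are already preprocessed into lists of tokens.
    """
    vocab_a = {tok for tweet in tweets_a for tok in tweet}
    vocab_b = {tok for tweet in tweets_b for tok in tweet}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Toy data, purely illustrative (not taken from the study's datasets).
datasets = {
    "seattle": [["car", "crash", "i-90"], ["fire", "downtown"]],
    "boston": [["road", "block", "crash", "i-495"], ["fire", "alarm"]],
}

# Compare every pair of cities once.
for (city_a, a), (city_b, b) in combinations(datasets.items(), 2):
    print(city_a, city_b, round(shared_token_ratio(a, b), 3))
```

With ten cities, the same loop would yield 45 pairwise ratios.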

Preprocessing and Feature Generation
To use our datasets for feature generation, i.e., for deriving different feature groups that are used for training a classification model, we had to convert the texts into a structured representation by means of preprocessing. Following this, we extracted several features for training classification models. To evaluate the best feature groups for incident type classification, we re-implemented commonly used feature extraction approaches from the state of the art. We further extended these feature groups by additional ones that seemed promising in this problem domain.

Preprocessing
As a first step, the text was converted to Unicode to preserve non-ASCII characters. Specific URLs would not be useful for the classification process; we therefore replaced them with a common token "URL". We then removed stopwords and conducted tokenization. Every token was then analysed, and non-alphanumeric characters were removed or replaced. Finally, we applied lemmatization to normalize all tokens. All preprocessing steps were performed by DKPro TC (de Castilho and Gurevych, 2014), a popular framework for text classification. After preprocessing, we generated several features (see Table 3). In the following, we give a description of the different feature groups.
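The preprocessing steps can be sketched roughly as follows. This is not the DKPro TC pipeline used in the study; it is a minimal standard-library approximation in which the regular expressions, the lowercasing step and the tiny stopword list are all assumptions. Lemmatization is left out because it requires an external lemmatizer.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
NON_ALNUM = re.compile(r"[^0-9a-z]+")
# Tiny placeholder list; the stopword list used in the study is not specified.
STOPWORDS = {"a", "an", "the", "on", "in", "to", "due", "of"}


def preprocess(text):
    """Return a list of normalized tokens for one tweet."""
    # Replace every concrete URL by the common token "URL".
    text = URL_PATTERN.sub(" URL ", text)
    tokens = []
    for raw in text.split():  # whitespace tokenization (an assumption)
        if raw == "URL":
            tokens.append(raw)
            continue
        # Lowercasing is an assumption; non-alphanumeric characters are removed.
        token = NON_ALNUM.sub("", raw.lower())
        if token and token not in STOPWORDS:
            tokens.append(token)
    # Lemmatization would follow here; it is omitted in this sketch.
    return tokens


print(preprocess("Road blocked due to traffic collision on I-495 http://t.co/abc"))
```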
Baseline Feature Sets
As the simplest approach, and as used in all related works, we represented tweets as a set of words and also as a set of characters with varying lengths. As features, we used a vector with the frequency of each n-gram. Most importantly, we evaluated the powerset of all different combinations of n-grams. For instance, if a length of n = 2 was chosen, we evaluated the three combinations (n = 1), (n = 1, 2), (n = 2). Furthermore, as not all tokens are necessarily important for the classification process, we evaluated several top-k selection strategies, i.e., taking only the k most frequent n-grams into account. For this, we tested k = 100, 1000, 5000 as well as the approach using all n-grams. We treat these features as the baseline approach, and extend it by additional features, e.g. similarity, sentiment scores, and Twitter-specific features.
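A minimal sketch of the word-n-gram baseline with top-k selection is given below. It assumes that the evaluated combinations are contiguous length ranges (as the words(k, min, max) notation used later suggests) and that the k most frequent n-grams are determined on the training corpus; all function names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """All contiguous word n-grams of length n, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_ranges(max_n):
    """Contiguous (min_n, max_n) ranges, e.g. (1, 1), (1, 2), (2, 2) for max_n = 2."""
    return [(lo, hi) for lo in range(1, max_n + 1) for hi in range(lo, max_n + 1)]


def count_ngrams(tokens, min_n, max_n):
    """Frequency of every n-gram with min_n <= n <= max_n in one tweet."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(word_ngrams(tokens, n))
    return counts


def build_vocabulary(corpus, min_n, max_n, k=None):
    """The k most frequent n-grams in the corpus (all n-grams if k is None)."""
    totals = Counter()
    for tokens in corpus:
        totals.update(count_ngrams(tokens, min_n, max_n))
    return [gram for gram, _ in totals.most_common(k)]


def vectorize(tokens, vocabulary, min_n, max_n):
    """Frequency vector of one tweet over a fixed vocabulary."""
    counts = count_ngrams(tokens, min_n, max_n)
    return [counts[gram] for gram in vocabulary]


# Toy corpus, purely illustrative.
corpus = [["car", "crash", "on", "i-90"], ["car", "crash", "downtown"]]
vocab = build_vocabulary(corpus, 1, 2, k=3)
print(ngram_ranges(2))
print(vocab, vectorize(["car", "crash", "car", "crash"], vocab, 1, 2))
```

Character-n-grams can be obtained the same way by passing a list of characters instead of a list of word tokens.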

Sentiment Features
Emoticons are widely used to express emotions in textual content. Various text classification approaches make use of these, e.g. for sentiment analysis (Agarwal et al., 2011; Go et al., 2009). For incident type classification, they could also be useful as people link emotions with ongoing incidents; thus, we re-implemented three approaches for extracting sentiment features.

Table 3 (feature groups and abbreviations):
Word-n-grams: Each tweet is represented as a powerset of word-n-grams of length n = 1 to n = 3.
Char-n-grams: Each tweet is represented as a powerset of char-n-grams of length n = 1 to n = 5.
POS EMO: The Tweet NLP part-of-speech tagger (Owoputi et al., 2013) was used to identify emoticons. The ratio of emoticons to all tokens is calculated.
DICT EMO: An emoticon library based on the suggestions from Agarwal et al. (2011) was used, comprising a set of 63 emoticons from Wikipedia. The number of positive and the number of negative emoticons in a tweet is calculated.
AGG EMO: One single sentiment score based on the second approach, obtained by aggregating the number of positive and negative emoticons.
NER: We used the Stanford Named Entity Recognizer (Finkel et al., 2005) and applied the three-class model to count the number of location, organization, and person mentions.
NR CHAR: The number of characters in a tweet.
NR SENT: The number of sentences in a tweet.
NR TOKEN: The number of tokens in a tweet.
QUEST RT: The proportion of question marks and sentences in a tweet.
EXCLA RT: The proportion of exclamation marks and sentences in a tweet.
NR AT MN: The number of @-mentions in a tweet.
NR HASHTAG: The number of hashtags in a tweet.
NR URL: The number of URLs present in a tweet.
NR SLANG: The number of colloquial words (e.g., lol or ugh). Feature extraction is based on the Tweet NLP POS tags (Owoputi et al., 2013).
IS RT: A boolean to indicate whether a tweet is a retweet.
NR CARD: In conjunction with the named entities present in tweets, people tend to refer to street names (e.g., I-95) or the number of injured people (e.g., 2-people). Thus, we create a feature for the number of cardinal numbers present in a tweet.
GREEDY ST: Similarity scores following Greedy String Tiling (Wise, 1996) as a method to deal with shared substrings that do not appear in the same order.
LEVENST: The Levenshtein distance (Levenshtein, 1966) as an edit-distance metric, i.e., the minimum number of edit operations that transform one tweet into another.
TF IDF: As the baseline relies on plain frequency-based weighting, we calculate the traditional TF-IDF scores (Manning et al., 2009) for every tweet.

Named Entities: As shown in the state of the art, named entities, i.e. entities that have been assigned a name such as Seattle, are commonly used in tweets. Named entities might be valuable, as these are used frequently in incident-related tweets. Thus, we also incorporated Named Entity Recognition (NER) for feature extraction.
Stylistic Features: The style of a tweet could be an additional indicator for incident relatedness. For instance, repeated punctuation could point to a person expressing emotions resulting from an ongoing incident. A structured representation might indicate high quality.
Twitter-specific Features: As shown in related work, several Twitter-specific features, such as the number of @-mentions and hashtags, seem to be valuable for incident type classification.
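A few of the stylistic and Twitter-specific counts can be approximated with regular expressions, as in the sketch below. The patterns and the retweet heuristic (a leading "RT") are assumptions made for illustration; the study relied on DKPro TC and the Tweet NLP tagger instead.

```python
import re

# Illustrative patterns; these are assumptions, not the study's extractors.
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"^RT\b")


def twitter_features(text):
    """Count-based features computed on the raw tweet text."""
    return {
        "NR_CHAR": len(text),
        "NR_AT_MN": len(MENTION.findall(text)),
        "NR_HASHTAG": len(HASHTAG.findall(text)),
        "NR_URL": len(URL.findall(text)),
        "IS_RT": bool(RETWEET.match(text)),
    }


print(twitter_features("RT: @People car crash on I-90 #traffic http://t.co/x"))
```

Such features have to be computed before URLs and non-alphanumeric characters are normalized away during preprocessing.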

Similarity Features
The similarity of individual tweets might be helpful to identify common topics. We therefore implemented several similarity scores (see footnote 1). The rationale behind this is to embrace additional features that do not only take the raw frequencies of words into account, but also which words appear in which document.
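As an illustration of one of these scores, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. The sketch below works on any two sequences (strings or token lists); how the pairwise distances were turned into a per-tweet feature in the study is not specified, so that step is left out.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("car crash on I-90", "traffic collision on I-495"))
print(levenshtein(["car", "crash"], ["traffic", "collision"]))
```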
To sum up, we re-implemented two approaches that will serve as a baseline, and 18 additional feature groups to extend them. In the following section, we will evaluate the usefulness of these approaches for training a generalizing model.

Evaluation
The goal of our evaluation is to determine which features were most useful for creating a generalizing model. We first describe our method, including the feature sets, the classification algorithms used, and our sampling procedure. This is followed by a results section in which we report differences in performance by means of qualitative and inferential statistics.

Method
The indicators for well-performing features in related work allow us to perform a condensed evaluation, compared to similar studies such as (Hasanain et al., 2014).
Our approach comprises three steps: First, we evaluated the baseline approaches, i.e., word- and char-n-grams. Second, we combined each of the remaining features with the best performing baseline feature. Third, we again selected the best performing combinations and evaluated their power set. To evaluate the suitability of different features for training generalizing models, we picked one dataset from the 10 presented in Section 3.1 for training, and tested on the remaining 9 datasets. We did not evaluate different models on datasets from only one city, as we were interested in generalizing models.
Selecting each city as training set resulted in 90 performance samples per model. The models were formed by combining the feature sets described in Section 3.2, or their combinations, with an SVM and an NB classifier. We chose these classification algorithms since they were the most successful in related work. Another reason for the choice of NB is its good performance in text classification tasks, as demonstrated by Rennie et al. (2003). We relied on the LibLinear implementation of an SVM because it has been shown that for a large number of features and a small number of instances, a linear kernel is comparable to a non-linear one (Hsu et al., 2003). As parameter tuning is inevitable for SVMs, we evaluated the best settings for the slack variable c whenever an SVM was used. For training and testing, we used the reference implementations in WEKA (Hall et al., 2009).
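The sampling procedure, training on one city and testing on each of the other nine, can be sketched as a simple nested loop. The callables train_fn and score_fn are hypothetical placeholders for model training and F1 computation; they do not correspond to any WEKA API.

```python
CITIES = ["Boston", "Brisbane", "Chicago", "Dublin", "London",
          "Memphis", "New York City", "San Francisco", "Seattle", "Sydney"]


def cross_city_samples(datasets, train_fn, score_fn):
    """Train on each city in turn and score the model on every other city.

    train_fn(dataset) -> model and score_fn(model, dataset) -> float are
    hypothetical callables supplied by the caller.
    """
    samples = {}
    for train_city, train_data in datasets.items():
        model = train_fn(train_data)
        for test_city, test_data in datasets.items():
            if test_city != train_city:
                samples[(train_city, test_city)] = score_fn(model, test_data)
    return samples


# Dummy datasets and callables, only to show the number of resulting samples.
dummy = {city: [] for city in CITIES}
samples = cross_city_samples(dummy, lambda data: None, lambda model, data: 0.0)
print(len(samples))  # 10 training cities x 9 test cities
```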
We calculated the F1-Measure for assessing performance, because it is well established in text classification, cannot be manipulated by the classification threshold parameter and allows us to measure the overall performance of the approaches with an emphasis on the individual classes (Japkowicz and Shah, 2011). In Section 3.1, we demonstrated that the proportion of data representing individual classes varies strongly. We therefore weighted the F1-measure by this ratio and report the micro-averaged results over all datasets, F1. Given our focus on training a generalizable model, we deliberately did not focus on the performance variation in the individual datasets.
Footnote 1: The respective similarity scores have been calculated on the whole document corpus after preprocessing.
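One plausible reading of the class-weighted F1-measure is sketched below: the per-class F1 scores are averaged with weights equal to each class's share of the true labels. This interpretation and the toy labels are assumptions; the aggregation over datasets described in the text is not reproduced here.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n_true / total) * f1
    return score


# Toy labels, purely illustrative.
y_true = ["crash", "crash", "fire", "none", "none", "none"]
y_pred = ["crash", "fire", "fire", "none", "none", "crash"]
print(round(weighted_f1(y_true, y_pred), 4))
```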

Results
In order to check whether our findings persist at least across the two learning algorithms, we did not aggregate the model performance samples but analyzed them for each algorithm separately. We therefore only have one independent variable, our feature groups, that affects the model performance. In order to keep p-value inflation low, we only compared the ten best performing models for each algorithm with respect to the F1-Measure. Note that even if the difference in performance between these models appears small, there are thus many worse models that were not explicitly listed.
Our samples generally do not fulfill the assumptions of normality and sphericity required by parametric tests for comparing more than two groups. Under the violation of these assumptions, non-parametric tests have more power and are less prone to outliers (Demsar, 2006). We therefore relied exclusively on the non-parametric tests suggested in literature: Friedman's test was used as non-parametric alternative to a repeated-measures one-way ANOVA, and Nemenyi's test (see footnote 2) was used post-hoc as a replacement for Tukey's test.
In contrast to its parametric counterpart, Friedman's test is based on a ranking of the models induced by the performance measure, and therefore only relies implicitly on the latter. Each model is ranked from best to worst, with mean ranks being used in case of ties. The Friedman statistic is calculated by dividing the sum of squares of the mean ranks by the sum of squares error. For sufficiently many samples, the statistic follows a χ² distribution with k − 1 degrees of freedom. The q statistic used in Nemenyi's test is similar to the one used by Tukey, but uses rank differences. It utilises the previous ranking from the Friedman test to calculate and relate the average ranks of two models, for each available pair. Two models are considered significantly different if their difference in mean ranks exceeds a critical value, which varies for different significance levels. For a detailed description and examples of these tests, see (Japkowicz and Shah, 2011). We illustrated the ranks and significant differences between the feature groups by means of the critical difference (CD) diagram. Introduced by Demsar (2006), this diagram lists the feature groups ordered by their rank, where lower rank numbers indicate higher performance. Feature groups are connected with bars if they are not significantly different, given α = 0.05.
Footnote 2: We chose Nemenyi's test because it is widely accepted in the machine learning community. A discussion of alternatives can be found in Herrera et al. (Herrera, 2008).
Table 4: Average F1-Measure F1 for the ten best performing baseline feature groups. Feature groups: words(1000,1,2), words(1000,1,3), words(ALL,1,1), words(5000,1,1), words(100,1,1), words(100,1,2), words(100,1,3), words(5000,1,3), words(1000,1,1), words(5000,1,1).
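The ranking-based tests can be sketched in a few lines. The code below uses the common textbook form of the Friedman statistic without tie correction and the critical difference of the Nemenyi test as given by Demsar (2006); it is not the implementation used in the study. The critical value q = 3.164 (ten models, α = 0.05) is taken from published tables, and the toy scores are invented for illustration.

```python
import math


def rank_row(scores):
    """Rank one row of scores, best (highest) = 1; ties receive the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def friedman(samples):
    """samples: N rows (train/test pairs) x k columns (models), higher is better.

    Returns the mean rank per model and the chi-square statistic
    (textbook form, without tie correction).
    """
    n, k = len(samples), len(samples[0])
    ranked = [rank_row(row) for row in samples]
    mean_ranks = [sum(row[j] for row in ranked) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return mean_ranks, chi2


def nemenyi_cd(q_alpha, k, n):
    """Critical difference in mean ranks for k models and n samples."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Toy F1 scores for three models on four train/test pairs (illustrative only).
toy = [[0.83, 0.82, 0.80], [0.84, 0.81, 0.80], [0.82, 0.83, 0.79], [0.85, 0.82, 0.81]]
mean_ranks, chi2 = friedman(toy)
print(mean_ranks, chi2)
print(nemenyi_cd(3.164, k=10, n=90))
```

Two models whose mean ranks differ by more than the critical difference would be reported as significantly different and left unconnected in a CD diagram.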
In the following, we will use shortcuts like words(1000,1,2) to denote the 1000 most frequent uni- and bigrams. The same applies for char-n-grams. Abbreviations can be found in Table 3.

Evaluation using LibLinear Classifier
We first evaluated which of our 20 baseline feature sets, as described in Section 3, lead to the best classification performance over different datasets. Notably, the ten best-performing approaches were all combinations of word-n-grams. Table 4 contains the average F-Measures for these approaches. The Friedman test indicated strong significant differences between the performances of these groups (χ²_r(9) = 112.467, p < 0.001, α = 0.01). The subsequent Nemenyi test indicated strong significant pairwise differences between the performances of the models (α = 0.01), with p-values listed in Table 2 in the supplementary material. Figure 1 illustrates the differences by means of a CD diagram: the approaches using simple unigrams of the 5000 most frequent words and of all words provide the best results, i.e. they have the lowest rank. They are not significantly different from the 1000 most frequent word uni- and bigrams. Nevertheless, they are significantly better than all other baseline approaches.
This also applies to the char-n-gram approaches, which were not considered in this statistical comparison due to their inferior performance. It is important to note that the differences between the worst word-n-gram approaches and the best char-n-gram approaches could still be statistically non-significant.
The best performing baseline approach for LibLinear is using unigrams of the top 5000 words, i.e. words(5000,1,1), with F1 = 82.87. We therefore picked this baseline feature group for the second part of our evaluation. We added each non-baseline feature individually to the selected baseline approach and compared the performances of these combinations and the non-extended baseline group. Table 6 lists the averaged F-Measure. When comparing the ten best-performing groups, the Friedman test showed strong significant differences between the performances of the models (χ²_r(9) = 87.274, p < 0.001, α = 0.01). The Nemenyi test partly showed strong significant differences between the performances of the models (for the corresponding p-values see Table 3 in the supplementary material). They are illustrated in the CD diagram in Figure 2. The tests indicate that adding NER and NR AT MN to the baseline approach provides the best performances with F1 = 83.32 and F1 = 83.03, respectively.
Finally, we evaluated the power set of these feature groups, i.e. we compared the individual groups and their combination. Table 5 contains the corresponding averaged F-Measures. For the resulting performance samples, the Friedman test showed strong significant differences between the models (χ²_r(3) = 72.014, p < 0.001, α = 0.01). The Nemenyi test partly showed strong significant differences (α = 0.001), with p-values listed in Table 4 in the supplementary material and illustrated in Figure 3. The diagram shows that the combination of NER and NR AT MN with the words(5000,1,1) baseline outperforms all other models with respect to F1 (F1 = 83.48), but does not differ significantly from the plain NER approach (F1 = 83.32). This combination gives us the final and best feature set for training a generalizing model over our datasets. As can be seen, the plain n-gram approach (F1 = 82.87) can be improved further by about 0.6 percentage points.

Evaluation using Naive Bayes Classifier
In this section, we repeat the previous steps for the NB classifier. As baseline feature sets, we first evaluated the word-n-gram and char-n-gram approaches. The averaged F-Measures can be found in Table 4. The Friedman test showed strong significant differences between the performances of the models (χ²_r(9) = 110.293, p < 0.001, α = 0.01). The Nemenyi test partly showed strong significant differences between the performances of the models (for the corresponding p-values see Table 1 in the supplementary material). In contrast to the LibLinear classifier, using the 5000 most frequent combinations of two to five subsequent characters, i.e. chars(5000,2,5), provides the best F1 score (F1 = 80.48). Thus, char-n-grams outperform the word-n-gram approaches with respect to F1. The CD diagram in Figure 5 shows that the following approaches do not differ significantly from one another: the 5000 most frequent char-n-grams with lengths of two to five and of two to four, the 1000 most frequent word-n-grams with lengths of one and of one to two, and the 1000 most frequent char-n-grams with a length of two to four. However, the 5000 most frequent char-n-grams with lengths of two to five and of two to four significantly outperform all other baseline approaches. As a subsequent step, we added each single feature to chars(5000,2,5), the best baseline approach, to find out whether these additions provide better performance for the NB classifier. Table 6 contains the corresponding averaged F-Measures. Though the Friedman test indicated strong significant differences between the performances of the models (χ²_r(9) = 22.209, p = 0.008, α = 0.01), the subsequent Nemenyi test did not indicate significant pairwise differences. We can therefore not repeat the third step of our evaluation, but infer that for an NB classifier, the plain char-n-gram-based approach is sufficient for training a generalizing model for our dataset.
The results indicate that LibLinear provides a better average performance (F1 = 83.32) when training a generalizing model, compared to the NB classifier (F1 = 80.48).

Conclusion and Future Work
In this paper, we compared the performance of different popular feature groups and classification algorithms for the task of training a generalizing model for incident type classification. We carefully selected the most popular feature groups from related work, and separately evaluated them for the LibLinear and NB classification algorithms on ten spatially and temporally diverse datasets. The resulting F1-measure samples indicate that training a generalizing model, i.e., a model that is applicable to previously unseen incident-related data, is still a challenging task. We found that LibLinear provides a better averaged performance compared to the NB classifier. More surprisingly, additional feature groups that are commonly used in related work do not necessarily outperform a plain n-gram-based approach. This highlights the need for other novel approaches for training generalizing classification models. Especially in the domain of incident detection and emergency management, our findings are important because less time-consuming techniques showed nearly the same performance as sophisticated ones.
There are two main topics for our future work. First, we will investigate the performance of models generated with biased datasets on unfiltered datasets. This is relevant if a technique like filtering is used to include more relevant class examples in a dataset than an original sample provides, which is a necessary step for obtaining a labeled dataset for model learning in a rare-class task. Second, we will work on using novel features for the creation of generalized models. One example is the utilization of the Semantic Web to generate abstract features, using a technique called Semantic Abstraction (Schulz et al., 2015a). Semantic Abstraction has been shown to improve the generalization of tweet classification by deriving features from Linked Open Data and using location and temporal mentions.