That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets

We propose a novel data augmentation approach to enhance computational behavioral analysis using social media text. In particular, we collect a Twitter corpus of the descriptions of annoying behaviors using the #petpeeve hashtags. In the qualitative analysis, we study the language use in these tweets, with a special focus on the fine-grained categories and the geographic variation of the language. In quantitative analysis, we show that lexical and syntactic features are useful for automatic categorization of annoying behaviors, and frame-semantic features further boost the performance; that leveraging large lexical embeddings to create additional training instances significantly improves the lexical model; and incorporating frame-semantic embedding achieves the best overall performance.


Introduction
In the ever-expanding era of social media, many scientific disciplines, such as health and healthcare, biology, and learning sciences, have adopted computational approaches to exploit patterns and behaviors in large datasets (Wang et al., 2015; Chen and Lonardi, 2009; Baker and Yacef, 2009). In contrast, the primary methods in the behavioral sciences still rely on lab experiments with a limited number of subjects, which are time-consuming and financially expensive. In addition to this, it is also difficult to obtain a set of samples with geograph-
While social media data are abundantly available, computational approaches to behavioral sciences using Twitter are not well studied. Even when statistical techniques are applied to these tasks, the focus has been on simple statistical significance tests and descriptive statistics (De Charms, 2013; Zhang et al., 2013). Therefore, we believe that statistical natural language processing techniques are needed for insightful analysis and interpretation in behavioral studies.
* We understand that many people find long titles annoying, so we intentionally use a very long one to help people understand what "pet peeve" means.
In this paper, we use Twitter as a corpus for computational behavioral science. More specifically, we focus on a case study of analyzing annoying behaviors. To do this, we exploit a corpus of 9 million tweets (Cheng et al., 2010), and extract the tweets that describe these behaviors using the #petpeeve hashtags. #petpeeve is a popular Twitter hashtag, which describes behaviors that might be annoying to others. An example of #petpeeve tweets is shown in Figure 1. To facilitate the analysis, we manually annotate 3,375 tweets with 60 fine-grained categories, which will be described in Section 3. We use a sparse mixed-effects topic model to analyze the salient words in each category, as well as the geographic variations. We show that lexical, syntactic, and semantic features enhance the automatic categorization of annoying behaviors; and that the performance is further improved with a novel lexical and frame-semantic embedding based data augmentation approach. Our main contributions are threefold:
• We provide a Twitter corpus with fine-grained annotations for computational behavior studies;
• We qualitatively analyze the Twitter language concerning annoying behaviors, with a focus on the topics and geographical variations;
• We propose various linguistic features and a novel data augmentation approach for automatic categorization of annoying behaviors.
We outline related work in the next section. The dataset is described in Section 3. We introduce the approach for analyzing #petpeeve Tweets in Section 4. Experimental results are shown in Section 5. We discuss possible applications in Section 6, and conclude in Section 7.

Related Work
Psychologists, behavioral scientists, and computer scientists have studied a wide range of methods for behavior extraction (Mast et al., 2015). For example, in lab experiments, arm and body postures (Marcos-Ramiro et al., 2013) are often used to extract self-touch and gestures, while eye gaze (Funes Mora and Odobez, 2012), head pose (Ba and Odobez, 2011), face location and motion (Nguyen et al., 2012), and full-body pose (Shotton et al., 2013) can also be used as cues to extract gazing, nodding, and arm-related behaviors. There is also a significant body of work on extracting facial and speech features to understand smiling (Bartlett et al., 2008), eye contact (Marin-Jimenez et al., 2014), and verbal behaviors (Basu, 2002).
With the surge of interest in computational social science (Lazer et al., 2009), Twitter has become a popular resource to study data-driven methods in social science (Miller, 2011). For example, O'Connor et al. (2010a) align Twitter messages with public opinion time series to study computational political science. Ritter et al. (2010) study Twitter dialogues using a clustering approach. Bollen et al. (2011) use a sentiment analysis approach to predict the American stock market via Twitter. Li et al. (2014b) have investigated the alignment of Twitter mood with weather for sentiment analysis. In recent years, language technology researchers have focused on developing genre-specific Twitter part-of-speech tagging (Gimpel et al., 2011), named  (Agarwal et al., 2011), event extraction (Ritter et al., 2012; Li et al., 2014a), paraphrasing (Xu et al., 2014), machine translation (Ling et al., 2013), and dependency parsing (Kong et al., 2014) methods. To the best of our knowledge, even though there have been studies on using Twitter hashtags to study language-related behaviors (González-Ibánez et al., 2011; Bamman and Smith, 2015), Twitter NLP approaches to non-linguistic behaviors are not well studied in general.

The Dataset
We use the Twitter corpus of 9 million sampled messages collected in prior work (Cheng et al., 2010), which covers a total of 121K users. The dataset includes latitude and longitude information. We extract 3,375 tweets with #petpeeve hashtags. We follow past work to annotate the tweets (Ritter et al., 2012; Li et al., 2014a): we apply the LDA clustering + human-identification approach to label the categories of the annoying behaviors described in these tweets. The human annotation process includes two stages: first, the annotators identify 50 categories from the clustering process, and use these topics as a candidate label set to annotate the data; in the second stage, the categories from the first pass are refined (to 60 classes), and the data is re-annotated with the refined human-specified category labels. Due to the complexity of this fine-grained annotation task, the inter-annotator agreement rate between two annotators is moderate (0.445).
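The text above does not name the agreement statistic behind the 0.445 figure. As a minimal sketch, assuming a chance-corrected measure such as Cohen's kappa over two annotators' category labels, the computation could look like the following. The function name and the toy labels are hypothetical and are not taken from the annotated corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels for illustration only (hypothetical, not from the dataset).
annotator_1 = ["traffic", "traffic", "smoking", "weather"]
annotator_2 = ["traffic", "smoking", "smoking", "weather"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

With many fine-grained classes, chance agreement is small, so a kappa of this kind stays close to the raw agreement rate.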
The annotated categories and label distribution of the dataset are shown in Table 1. In our random samples, the states that post the most #petpeeve tweets are NY, MD, CA, NJ, FL, GA, VA, TX, NC, PA, and DC. In our predictive experiments, we randomly select 60% of tweets for training, and 40% for testing.

Our Approach
In this section, we describe our methods for the qualitative and quantitative analyses. In particular, we briefly review a supervised approach that uses a sparse mixed-effects topic model to visualize the topical words in this behavior data. For the quantitative task of automatic categorization of tweets, we propose a novel approach to create additional training data, using continuous lexical and semantic representations.

Supervised Topic Modeling
To analyze the salient words for each category of annoying behaviors, we utilize SAGE, a state-of-the-art mixed-effects topic model, which has been used in several NLP applications (Sim et al., 2012; Wang et al., 2012). SAGE is ideal for our text analytic purposes, because it is supervised, and it builds relatively clean topic models by considering the additive effects and the background distribution of words. Therefore, we can use SAGE to visualize the salient words for each category of annoying behaviors using the 3,375 #petpeeve tweets. Each tweet is treated as a document, and we use Markov Chain Monte Carlo for inference. To facilitate the geographical analysis, we use Google's reverse geocoding service to extract the state information from coordinates, and apply SAGE for visualization.

Embedding-Based Data Augmentation for Automatic Categorization of Tweets
In addition to the visualization task, we also ask the question: can we use linguistic cues to predict tweets that describe different annoying behaviors? We formulate the problem as a multiclass classification task, and consider the following feature sets:
• Lexical Features: we extract unigrams as surface-level lexical features.
• Frame-Semantics Features: SEMAFOR (Das et al., 2010) is a state-of-the-art frame-semantics parser that produces FrameNet-style semantic annotation. We use SEMAFOR to extract frame-level semantic features.
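As a rough illustration of how the discrete feature sets listed above might be combined into a single sparse representation, the sketch below counts unigrams and frame labels under separate prefixes. It is an assumption about one reasonable encoding, not the feature pipeline actually used here. The function name, the prefixes, and the frame names "Frame_A" and "Frame_B" are placeholders, and the frame list is assumed to come from an external frame-semantic parser that is not invoked in this snippet.

```python
from collections import Counter


def extract_features(tokens, frames):
    """Build a sparse feature dict from unigrams and frame labels.

    tokens: list of word strings from one tweet.
    frames: list of frame names produced for that tweet by an external
            frame-semantic parser (assumed input, not computed here).
    """
    features = Counter()
    for token in tokens:
        features["uni=" + token] += 1     # surface-level lexical feature
    for frame in frames:
        features["frame=" + frame] += 1   # frame-level semantic feature
    return dict(features)


# Placeholder inputs for illustration only.
example = extract_features(["late", "late", "again"], ["Frame_A", "Frame_B"])
```

Keeping each feature family under its own prefix lets a linear classifier learn separate weights for a word and a frame that happen to share a name.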
Embeddings for Data Augmentation
Since Twitter messages are often short and noisy, and the training data is relatively scarce for each class, we consider the feasibility of leveraging external resources, in particular, continuous word embeddings (Mikolov et al., 2013a), to enhance the multiclass text categorization model. Two major challenges for leveraging word embeddings for tweet classification are: 1) because word embeddings are continuous, it is difficult to fuse them with other discrete syntactic and semantic features; 2) it is not straightforward how one should transform the word-level representation to the tweet-level representation. In our preliminary experiments, we have evaluated the continuous word representation method (Turian et al., 2010), as well as incorporating neighboring words in the embeddings as additional features, but both methods fail to outperform the lexical baseline that uses only bag-of-words unigrams.
To solve this problem, we propose the use of neighboring words in continuous representations to create new instances to augment the training dataset. More specifically, in the embedding vocabulary W, we search for the k-nearest-neighbor (knn) words w of a query term, ranked by the cosine similarity between the query vector Q and each target word vector W, i.e., (Q · W) / (‖Q‖ ‖W‖). For each word in a tweet, we query the external embeddings, and replace the word with its knn words to create a new training instance. For example, given the tweet "Being late is terrible" with the punctuality label, searching for the knn words of each token yields a new training instance: "Be behind are bad" with the same label.

Table 2: Salient words for each category of tweets. The first row of each block lists the category labels.
weather | ungratefulness | traffic | timewasting | talkative | swearing | stability | snobbish
rains | helped | cop | wastingmytime | Tweeters | curse | mood | smut
STORM | ungrateful | lane | colleagues | Xs | teary | sensitive | intellectual
Blizzarad | clearly | pulled | Wen | wht | qweet91 | dudes | moneycars
snowed | r | speed | BruklynFinest | sheesh | swears | nigga | LoWQUI
SNOW | them | Slow | hold | TwitterJail | 10 | up | lifestyle

smoking | silence | showoff | sexual | services | selfishness | repetition | religious
JAYECANE | guilty | louis | box | fil | ONLY | dislike | sinners
reggie | R | rims | wonder | requests | Selfish | repeat | IAmKevinTerrell
smoking | response | seein | Preach | convos | selfish | myself | spiritual
smoke | conversation | makin | suck | TIP | stay | same | CHURCH
smokers | sending | bag | pussy | products | hit | over | FOLK

Frame-Semantic Embeddings
Although lexical (Mikolov et al., 2013a) and dependency-based embeddings (Levy and Goldberg, 2014) have been studied, semantics-based embeddings are still less well understood. We consider the continuous embedding of semantic frames (Baker et al., 1998). To do this, we semantically parsed 3.8 million tweets using SEMAFOR (Das et al., 2010), and built a continuous bag-of-frame model to represent each semantic frame using Word2Vec.
We then use the same data augmentation approach to create additional instances with these semantic frame embeddings.
Table 4: Comparing linguistic features for categorizing annoying behaviors. The best results are highlighted in bold. * indicates that the result is significantly better than the lexical baseline (p < .0001).
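The knn-based augmentation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the tiny two-dimensional embedding table is invented for the example (a real run would load pretrained word or frame vectors), the function names are hypothetical, and the choice to build the i-th new instance from every token's i-th nearest neighbor is one possible reading of the procedure. Tokens missing from the embedding vocabulary are left unchanged.

```python
import math

# Toy 2-d embeddings, invented for illustration only (assumption).
EMBEDDINGS = {
    "late": (1.0, 0.1),
    "behind": (0.9, 0.2),
    "tardy": (0.8, 0.3),
    "terrible": (0.1, 1.0),
    "bad": (0.2, 0.9),
    "awful": (0.3, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, embeddings, k):
    """Return the k vocabulary words most similar to `word` (excluding it)."""
    if word not in embeddings:
        return []
    query = embeddings[word]
    scored = [(cosine(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    # Highest similarity first; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def augment(tokens, label, embeddings, k):
    """Create up to k new (tokens, label) instances from one labeled tweet.

    The i-th instance replaces each token with its i-th nearest neighbor;
    tokens without a neighbor at that rank are kept as they are.
    """
    neighbors = [nearest_neighbors(tok, embeddings, k) for tok in tokens]
    new_instances = []
    for rank in range(k):
        new_tokens = [nbrs[rank] if rank < len(nbrs) else tok
                      for tok, nbrs in zip(tokens, neighbors)]
        if new_tokens != list(tokens):  # skip copies with nothing replaced
            new_instances.append((new_tokens, label))
    return new_instances


instances = augment(["being", "late", "is", "terrible"],
                    "punctuality", EMBEDDINGS, k=2)
```

With these toy vectors, "late" is replaced by "behind" and then "tardy", while "being" and "is" are kept because they are absent from the table. The same routine applies to frame embeddings if the tokens are frame names and the table holds frame vectors.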

Qualitative Analysis
We show the results of the visualization of salient words for each category of tweets in Table 2. SAGE clearly does a good job of identifying specific annoying behaviors in each category. For example, in the traffic category, we see that the keywords "cop" and "pulled", which are associated with traffic stops, are identified. "Slow" and "speed" are also recognized as annoying behaviors in traffic. In the selfishness category, the words "ONLY" and "Selfish" are correctly identified. In the silence category, we see that the word "R" is promising, because it indicates the behavior when someone reads a BlackBerry message without replying. We see that many slang expressions are associated with various labels.
In Table 3, we show the geographical variation of tweets. The word "dmv" (DC-Maryland-Virginia) is correctly associated with MD and DC, and when we search the database, these #petpeeve tweets mainly refer to the 2010 winter snowstorm affecting these areas. The word "daddy" is prominent in the state of Florida, while the word "rims" is also identified, showing the unique car culture of this southern state.

Quantitative Evaluation
Experimental Setup
We use the logistic regression model from LibShortText (Yu et al., 2013) as the classifier in our 60-way multi-class classification experiments. Grid search is used to select the best hyper-parameters using the training data only. A final classifier is then trained using the best hyper-parameters and test set results are reported. We set k = 5 for knn in our data augmentation experiments: the training data is expanded to 5 times its original size. We use a paired two-tailed Student's t-test to assess the statistical significance.
Table 5: The effectiveness of leveraging continuous embeddings to create additional training instances. Imp.: relative improvement to the baseline without data augmentation. The best results for each section are highlighted in bold. * indicates that the result is significantly better than the baseline without data augmentation (p < .0001).
Word2Vec is used to train various lexical and semantic embedding models. We consider three lexical embeddings and one frame-semantic embedding for data augmentation: 1) Google News lexical embeddings trained with 100 billion words (Mikolov et al., 2013b); 2) Twitter lexical embeddings trained with 51 million words; 3) Urban Dictionary lexical embeddings trained with 53 million words from slang definitions and examples; 4) Twitter semantic frame embeddings trained with 27 million frames.
Varying Feature Sets
We compare various features in Table 4. We see that adding shallow part-of-speech features does not have a strong effect on the performance, but adding the dependency triples significantly outperforms the lexical baseline. We see that the semantic frames are particularly useful, showing a 7% relative improvement over the baseline.
The Effectiveness of Data Augmentation
Table 5 shows the results of data augmentation. We see that using the Google News lexical embeddings to augment the training data brings a 6.1% relative F1 improvement over the lexical baseline. When considering the additional frame-semantic embeddings from Twitter, our system obtains the best F1 of 0.380, bringing a 3.8% improvement over the no data augmentation baseline with all linguistic features.

Discussion
We provide a case study of automatically categorizing annoying behaviors using #petpeeve Tweets. We hope that this study can stimulate further research on fine-grained analysis of annoying behaviors along different dimensions, and on the use of computational approaches for social good. For example, by using coordinates and other APIs, one might analyze annoying behaviors in shared working environments (e.g., offices and meeting rooms). By understanding what annoys their employees, companies can renovate their working setups, refine their policies, and improve the satisfaction and productivity of their employees.
In addition to #petpeeve Tweets, there are many other interesting hashtags that align well with traditional topics in the behavioral sciences. For example, hashtags like #occupywallstreet can be used to study crowd behaviors during political unrest. The #ALS hashtag can be used to study public behaviors in reaction to philanthropic campaigns. Overall, Tweets from carefully selected hashtags can be inexpensive to obtain, and can facilitate a significant number of behavioral studies.

Conclusion
In this paper, we have presented a case study of annoying behaviors using Twitter as a corpus. Our fine-grained visualization approach offers insights into the different categories of these behaviors, along with their geographical effects. We also show that linguistic cues are useful for categorizing these behaviors automatically, and that using lexical and semantic embeddings as a data augmentation method significantly improves the performance.