TWEETSUM: Event oriented Social Summarization Dataset

With social media becoming popular, a vast number of short and noisy messages are produced by millions of users when a hot event happens. Developing social summarization systems is increasingly critical for people to quickly grasp core and essential information. However, publicly available, high-quality, large-scale social summarization datasets are rare, and constructing such a corpus is difficult and expensive because short texts have complex social characteristics. In this paper, we construct TWEETSUM, a new event-oriented dataset for social summarization. The original data is collected from Twitter and contains 12 real-world hot events with a total of 44,034 tweets and 11,240 users. Each event has four expert summaries, and we also evaluate annotation quality. In addition, we collect additional social signals (i.e., user relations, hashtags and user profiles) and establish a user relation network for each event. Besides a detailed dataset description, we report the performance of several typical extractive summarization methods on TWEETSUM to establish baselines. To facilitate further research, we will release this dataset to the public.


Introduction
Social media has become an important real-time information source, especially during emergencies, natural disasters and other hot events. According to a Pew Research Center survey, social media has surpassed traditional news platforms (such as TV and radio) as a news source for Americans: about two-thirds of American adults (68%) get news via social media. Among all major social media sites, Twitter is still the site Americans most commonly use for news, with 71% of Twitter's users getting their news from Twitter. However, it can often be daunting to catch up with the most recent content due to the high volume and velocity of tweets. Hence, social summarization, which aims to acquire the most representative and concise information from massive tweets when a hot event happens, is particularly urgent.
In recent years, many large-scale summarization datasets have been proposed, such as New York Times (Sandhaus, 2008), Gigaword (Napoles et al., 2012), NEWSROOM (Grusky et al., 2018) and CNN/DAILYMAIL (Nallapati et al., 2016). However, most of these datasets focus on formal document summarization. Social media text differs from formal documents in several ways: 1) Short: the length of a tweet is limited to 140 characters, much shorter than a formal document. 2) Informal: tweets usually contain informal expressions such as abbreviations, typos and special symbols, which make them more difficult to process. 3) Social signals: there are different kinds of social signals on social media, such as hashtags, URLs and emojis. 4) Potential relations: tweets are generated by users and hence have potential connections through user relationships. Because of these characteristics, traditional summarization methods often do not perform well on social media.

Figure 1: Diagram of the process for creating the TWEETSUM dataset.
Some social media summarization datasets do exist (Hu et al., 2015; Li et al., 2016; P.V.S. et al., 2018; Duan et al., 2012; Cao et al., 2017; Nguyen et al., 2018). However, these datasets consider only the text on social media and ignore the potential social signals of the social network. In a social context, interactions between friends are clearly different from those between strangers. This phenomenon demonstrates that social relationships can affect user behavior patterns and consequently the content of the tweets users post, which inspires us to integrate relation-relevant social signals when analyzing social information.
In this paper, we construct an event-oriented large-scale dataset with user relations for social summarization, called TWEETSUM. It contains 12 real-world hot events with a total of 44,034 tweets and 11,240 users. In summary, this paper makes the following contributions: (1) We construct an event-oriented social media summarization dataset, TWEETSUM, which contains social signals. To our knowledge, it is the first summarization dataset that contains user relations along with related social signals such as hashtags and user profiles; (2) We create expert summaries for each social event and verify the existence of sociological phenomena, including social consistency and contagion, in real data; (3) We evaluate typical extractive summarization models on TWEETSUM to provide benchmarks and validate the effectiveness of the dataset.

Task and Data Collection
Tweet summarization aims to find a group of representative tweets for a specific topic. Given a collection of tweets about an event T = {t_1, t_2, ..., t_m}, our goal is to extract a set of tweets S = {s_1, s_2, ..., s_n} (n ≪ m) that contains as much important information and as little redundant information as possible (Rudrapal et al., 2018).
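As a concrete illustration of this task formulation, the sketch below greedily selects tweets that balance importance against redundancy. This is an MMR-style heuristic with simplified scoring functions, used only to make the extraction objective concrete; it is not a method evaluated in this paper.

```python
# Greedy extractive tweet summarization sketch (MMR-style heuristic).
# Importance of a tweet is approximated by its average word overlap with
# the whole event; redundancy is its overlap with already-selected tweets.
from collections import Counter

def tokens(text):
    return text.lower().split()

def overlap(a, b):
    # Fraction of shared words, normalized by the shorter tweet.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    inter = sum((ca & cb).values())
    return inter / max(1, min(sum(ca.values()), sum(cb.values())))

def summarize(tweets, n, lam=0.7):
    importance = [sum(overlap(t, u) for u in tweets) / len(tweets)
                  for t in tweets]
    summary, candidates = [], list(range(len(tweets)))
    while candidates and len(summary) < n:
        def score(i):
            redundancy = max((overlap(tweets[i], tweets[j]) for j in summary),
                             default=0.0)
            return lam * importance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        summary.append(best)
        candidates.remove(best)
    return [tweets[i] for i in summary]
```

The trade-off parameter lam controls how strongly redundancy is penalized relative to importance.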
The dataset is created using the public Twitter data collected by the University of Illinois 1 as raw data, and the overall creation process is shown in Figure 1. The data collection process is summarized as follows: (1) We first select twelve hot events that happened in May, June and July 2011, covering sports, technology and science, natural disasters, politics, terrorist attacks and so on. The selected events should satisfy the following conditions: (i) they spread widely on the Internet and caused heated discussion on social media; (ii) they lasted longer than 30 days; (iii) they received substantial attention from news providers.
(2) Since each hot event can have multiple hashtags, such as "#nba" and "#nbafinals", we then search for tweets that contain any of these hashtags or any of the keywords obtained by removing "#" from the hashtags.

(3) After obtaining the event-oriented data, we carefully preprocess it as follows: (i) merge identical tweets; (ii) remove tweets whose length, excluding hashtags, keywords, mentions, URLs and stop words, is shorter than 3 words; (iii) delete tweets whose author has no connection with others.

(4) For each event, we further collect user profiles and user relationships. We filter out users whose degree is smaller than 1, obtaining 11,240 users with their relations. Finally, we collect user profiles including user ID, historical tweet records, tweet timestamps and retweet counts.

To verify summarization performance, we create expert summaries for each event. Specifically, for each of the 12 events, we ask annotators to select the 25 most representative tweets as an expert summary. Since different annotators can understand the same event differently, we ask 4 annotators to create expert summaries individually for each event in order to reduce subjective bias. To evaluate the quality of all expert summaries, we further ask 3 other annotators to score all summaries in the range [1, 5] based on coverage, diversity and readability: a summary with 0-6 satisfactory tweets is scored 1, with 7-12 scored 2, with 13-18 scored 3, with 19-24 scored 4, and with all 25 tweets satisfactory scored 5. We retain the summaries with scores greater than or equal to 3 and require modifications to low-quality summaries until they meet the criteria. To ensure agreement among the multiple expert summaries of each event, we conduct a mutual evaluation among them; the results are shown in Figure 2.
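Filtering steps (2) and (3) of the collection pipeline can be sketched as follows. The token patterns and the stop-word list here are illustrative placeholders, not the exact rules used to build TWEETSUM.

```python
# Sketch of tweet preprocessing: deduplication and a length filter that
# ignores hashtags, mentions, URLs and stop words.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}  # illustrative

def content_tokens(tweet):
    kept = []
    for w in tweet.lower().split():
        if w.startswith("#") or w.startswith("@"):
            continue                      # hashtags and mentions
        if re.match(r"https?://", w):
            continue                      # URLs
        if w in STOP_WORDS:
            continue
        kept.append(w)
    return kept

def preprocess(tweets, min_len=3):
    seen, kept = set(), []
    for t in tweets:
        if t in seen:                     # (i) merge identical tweets
            continue
        seen.add(t)
        if len(content_tokens(t)) < min_len:
            continue                      # (ii) too short after filtering
        kept.append(t)                    # (iii) relation filter omitted here
    return kept
```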
The dataset consists of 12 hot events, each of which contains four parts: tweet text, user relations, user profiles, and manually created expert summaries. The detailed statistics of each part are shown in Table 1; due to limited space, we only show the statistics of four events.

Tweet text is the textual content of tweets, whose average length across all 12 events is 15.22 words. The number of tweets and the average length per tweet in each event are shown in the first two rows of Table 1. In addition, hashtags in tweets contain important clues for understanding tweet semantics, so we also analyze the distribution of hashtags in tweets, as shown in the third and fourth rows of Table 1.

User relations are the unique property of our dataset compared with other summarization datasets. We collect users and their corresponding relations in each event to construct social networks, and further analyze the statistics of the generated networks, shown in the second part of Table 1. As indicated by social theories, i.e. consistency (Abelson, 1983) and homophily (McPherson et al., 2001), social relations affect user behavior and consequently influence the content users post. We visualize the structure of one social network in Figure 3. Users and their relationships constitute an undirected graph G(V, E), where V is the user set and E is the relation set. We observe some homophily groups, which may indicate that users who are friends tend to share similar opinions or topics.

We further analyze the word overlap ratio between friends and between strangers. Figure 4 shows the 1-gram and 2-gram overlap ratios under all 12 events. The average 1-gram and 2-gram overlap ratios between friends (26.92% and 4.01%) are consistently higher than those between strangers (25.40% and 3.45%), which demonstrates the impact of social relations on user behavior.
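The n-gram overlap analysis above can be sketched as follows. Tokenization is simplified, and the exact overlap definition used for Figure 4 may differ; this sketch normalizes shared n-grams by the smaller n-gram set.

```python
# Sketch of the n-gram overlap ratio between pairs of tweets, averaged
# separately over friend pairs and stranger pairs.
def ngrams(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(t1, t2, n):
    g1, g2 = ngrams(t1, n), ngrams(t2, n)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / min(len(g1), len(g2))

def mean_overlap(pairs, n):
    # pairs: iterable of (tweet_a, tweet_b) tuples
    vals = [overlap_ratio(a, b, n) for a, b in pairs]
    return sum(vals) / len(vals) if vals else 0.0
```

Comparing mean_overlap over friend pairs against stranger pairs, for n = 1 and n = 2, reproduces the kind of comparison reported above.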
We further conduct a two-sample t-test where the null hypothesis H_0 states that there is no difference between tweets posted by friends and randomly selected tweets, while the alternative hypothesis H_1 states that the distance between tweets posted by friends is smaller than that between randomly selected tweets. We define the distance between two tweets as D_ij = ||t_i − t_j||_2, where t_i is the TF-IDF representation of the i-th tweet. The p-value shown in Table 1 suggests rejecting H_0, which supports the influence of social relations on tweet content.
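This test can be sketched as follows, assuming a simple TF-IDF construction and Welch's t statistic; the actual p-value would be read from the t distribution (e.g. via SciPy), and the details here are illustrative rather than the exact setup used for Table 1.

```python
# Sketch: TF-IDF vectors, the distance D_ij = ||t_i - t_j||_2, and
# Welch's two-sample t statistic over friend vs. random distances.
import math
from collections import Counter

def tfidf_vectors(tweets):
    docs = [t.lower().split() for t in tweets]
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))   # document frequency
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[w] / len(d) * math.log(n / df[w]) if w in tf else 0.0
                     for w in vocab])
    return vecs

def distance(u, v):
    # Euclidean distance between two tweet vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def t_statistic(xs, ys):
    # Welch's t statistic; a negative value with small p supports H_1
    # (friend distances xs smaller than random distances ys).
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return (mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))
```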

Data Properties and Analysis
User profiles include the following information: 1) User ID, the unique identity of each user. 2) Historical tweet records: tweets posted by a user carry abundant information, such as the user's interests and preferences. 3) Tweet timestamp, which records the creation time of each tweet. 4) Retweet count, which reflects the popularity of each tweet. Expert summaries have been described in Section 2.2.

Compared Methods
To verify the effectiveness of our TWEETSUM dataset, we select several typical extractive summarization methods as baselines.
(1) Expert: denotes the average mutual assessment of expert summaries.

Evaluation Methodologies
ROUGE is the most commonly used evaluation metric for summarization; it counts the number of overlapping units, such as n-grams, word sequences and word pairs, between a machine-generated summary and the reference summaries. Lin and Hovy (2003) proposed several ROUGE variants. Here, we use the F-measures of ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU* as our evaluation metrics.

Results and Discussions

Table 2 shows the performance of the different baselines on our dataset. All of these models improve over the Random baseline, especially the SNSR model, which achieves the best performance and outperforms the Random baseline by an absolute gain of 3.31% R-1, 4.45% R-2, 3.57% R-L and 3.39% R-SU*. The main reason is that SNSR captures social relations among tweets. The improvements of the other models are not as significant, mainly because most of them are designed for formal documents such as news articles and are thus less suitable for tweets. The neural network-based BERT model has strong feature-extraction ability, yet it still lags behind the best model for three main reasons: 1) learning an efficient tweet representation remains a big challenge since tweets are short and noisy; 2) it only considers text content and ignores relations among tweets; 3) its summary selection strategy is relatively simple.

To further demonstrate the effectiveness of social relations, we remove the relation component of SNSR (indicated by -social), which degrades performance. As discussed above, social media carries multiple types of social signals that provide various kinds of additional information. These heterogeneous signals are conducive to generating summaries, which inspires us to further explore integrating them to improve social summarization.
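The ROUGE-1 F-measure used in our evaluation can be illustrated with the following single-reference sketch; standard toolkits additionally handle stemming, multiple references, and the other ROUGE variants.

```python
# Sketch of ROUGE-1 F-measure: unigram precision and recall between a
# system summary and one reference, combined into F1.
from collections import Counter

def rouge1_f(system, reference):
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    match = sum((sys_counts & ref_counts).values())  # clipped unigram matches
    if match == 0:
        return 0.0
    precision = match / sum(sys_counts.values())
    recall = match / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```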

Conclusion and Future Work
In this paper, we construct an event-oriented social media summarization dataset called TWEETSUM. To better explore how social signals help social summarization, we filter out some outliers to keep the social network reasonably dense, and conduct experiments to verify the influence of social signals on user-generated content. We further analyze the characteristics of the dataset in detail and validate the influence of social relations on tweet content selection. Both traditional summarization methods and neural network-based methods are tested on our dataset.
In the future, the dataset can be expanded to include more events as well as a greater variety of social signals. In addition, since manually annotating data is expensive and labor-intensive, we will explore approaches to construct social summarization datasets automatically. This dataset also opens up further research directions, and we hope TWEETSUM can foster the development of social summarization.