Never Abandon Minorities: Exhaustive Extraction of Bursty Phrases on Microblogs Using Set Cover Problem

We propose a language-independent data-driven method to exhaustively extract bursty phrases of arbitrary forms (e.g., phrases other than simple noun phrases) from microblogs. The burst (i.e., the rapid increase of the occurrence) of a phrase causes the burst of overlapping N-grams including incomplete ones. In other words, bursty incomplete N-grams inevitably overlap bursty phrases. Thus, the proposed method performs the extraction of bursty phrases as the set cover problem in which all bursty N-grams are covered by a minimum set of bursty phrases. Experimental results using Japanese Twitter data showed that the proposed method outperformed word-based, noun phrase-based, and segmentation-based methods both in terms of accuracy and coverage.


Introduction
Background and motivation. Trends on microblogs reflect manifold real-world events, including natural disasters, new product launches, television broadcasts, public speeches, airplane accidents, scandals, and national holidays. To catch real-world events, many researchers and practitioners have sought ways to detect trends on microblogs. Trend detection often involves bursty phrase extraction, i.e., extracting phrases whose occurrence rate in microblog texts posted within a certain period of time (and from a certain location) is much higher than that of the normal state. Extracted bursty phrases are directly used as trends, as Twitter 1 officially provides, or sent to higher-order processes such as clustering and event labeling. 1 https://twitter.com/.
Bursty phrases on microblogs are likely to be noun phrases, but sometimes deviate from such standards. The title of a movie, song, game, or any creation can take an arbitrary form, such as a long and/or general phrase (e.g., Spielberg's movie "catch me if you can", the Beatles' song "let it be") 2 . A memorable phrase (e.g., Steve Jobs's phrase "stay hungry, stay foolish") can also be a bursty phrase on microblogs. Even numbers and symbols can potentially be bursty phrases (e.g., "1984" can be a novel, "!!!" can be an artist). Any filtering rule such as stop word removal, part-of-speech (POS) tag restrictions, or a length limit can mistakenly filter out bursty phrases.
Extracting irregularly-formed bursty phrases as described in the previous paragraph is difficult, since no restriction can be used anymore to filter out incomplete N-grams. Moreover, such phrases are rare and have little influence on the overall accuracy even if they are correctly extracted. Worse, tackling such difficult and rare cases easily leads to extracting many incomplete N-grams and deteriorating the accuracy. Almost all existing work has therefore ignored difficult and rare cases and concentrated on extracting simple phrases (e.g., uni-grams, bi-grams, tri-grams, or noun phrases identified by POS taggers). By sacrificing minorities, most bursty phrases can be extracted with high accuracy. Thus, irregularly-formed phrases have been abandoned in bursty phrase extraction (as in many other text analysis tasks).
Contributions. In this work, we aim to accurately and exhaustively extract bursty phrases of arbitrary forms from microblogs. The challenge here is: how do we avoid extracting bursty incomplete N-grams without introducing any filtering rule? To solve this challenging problem, we propose a set cover-based method.

Table 1: Representative trend detection methods based on bursty phrases.

Method | Unit of process | Measure of burst
(Sayyadi et al., 2009) | Noun phrase | TF, DF, IDF
(O'Connor et al., 2010) | Uni-gram, bi-gram, tri-gram | Burstiness
(Mathioudakis and Koudas, 2010) | Uni-gram | Burstiness
(Weng and Lee, 2011) | Uni-gram | DF-IDF, H-measure (wavelet analysis)
(Metzler et al., 2012) | Uni-gram | Burstiness
(Li et al., 2012) | N-gram of any length | Deviation from Gaussian distribution
(Cui et al., 2012) | Hashtag | Deviation from Gaussian distribution
(Benhardus and Kalita, 2013)-1 | Uni-gram, bi-gram, tri-gram | Burstiness
(Benhardus and Kalita, 2013)-2 | Uni-gram, bi-gram, tri-gram | DF-IDF
(Abdelhaq et al., 2013) | Uni-gram | Deviation from Gaussian distribution
(Schubert et al., 2014) | Uni-gram (pair) | Deviation from Gaussian distribution
(Feng et al., 2015) | Hashtag | Deviation from Gaussian distribution

We found that the burst (i.e., the rapid increase of occurrence) of a phrase causes the burst of overlapping incomplete N-grams. For example, if the phrase "let it be" bursts, the occurrence of overlapping N-grams such as "let it", "let it be is", and "it be is" inevitably increases, possibly generating bursty incomplete N-grams. Given that bursty incomplete N-grams always accompany overlapping bursty phrases, we can avoid extracting bursty incomplete N-grams using the set cover problem (Chvátal, 1979). The proposed set cover-based method finds a minimum set of bursty phrases that covers all bursty N-grams, including incomplete ones. Because the set cover problem is NP-complete, the proposed method approximately solves it by iteratively choosing the N-gram that covers the most remaining bursty N-grams. The advantages of the proposed method are as follows. 1) Exhaustive. The proposed method can extract bursty (contiguous) phrases of arbitrary forms. In our experiment, its coverage was shown to be larger than that of word-based, noun phrase-based, and segmentation-based methods. 2) Accurate.
With adequate preprocessing of auto-generated texts, the proposed method achieved 99.3% precision for the top 10 bursty phrases and 97.3% for the top 50 bursty phrases, which were even higher than those of the comparative methods. 3) Language-independent. Because the proposed method processes texts as a sequence of characters (or words), it works in any language, including those without word boundaries such as Japanese. 4) Purely data-driven. The proposed method only requires raw microblog texts and does not need external resources.

Related Work
Much work has focused on trend detection or event detection from microblogs. The majority of representative trend detection methods (summarized in Table 1) start by extracting bursty phrases, often followed by clustering bursty phrases in order to link them to real-world events. Others first build clusters of words (uni-grams) by using word co-occurrence (Pervin et al., 2013) or topic models (Aiello et al., 2013; Diao et al., 2012; Lau et al., 2012) and then apply burst detection to the clusters. There are also different approaches such as the document-based approach (Aiello et al., 2013), the sketch-based model (Xie et al., 2013), and the bursty biterm topic model (Yan et al., 2015).
It is noteworthy that most methods in Table 1 only focus on uni-grams, short N-grams (up to tri-grams), or noun phrases (or rely on hashtags). This is because the majority of bursty phrases conform to such simple forms. The remaining bursty phrases are rare, and their forms are irregular and difficult to stereotype with rules. Trying to extract such irregularly-formed phrases easily deteriorates the precision due to incorrect extraction, while the recall can hardly be increased since they are only a small portion of all bursty phrases. To balance the precision and recall of bursty phrase extraction, focusing on simple phrases and ignoring rare cases is a reasonable strategy. In the field of trend detection on microblogs, this ignoring-minority strategy has become a de facto standard. However, it always fails to extract irregularly-formed bursty phrases. In this work, we tackle the challenge of extracting bursty phrases without any restriction on their forms.
Among the methods in Table 1, only Li et al. (Li et al., 2012) have attempted to extract bursty phrases of arbitrary forms. Their method, Twevent, applies text segmentation (or chunking) before measuring the degree of the burst. Every microblog text is represented as a sequence of word N-grams called segments. N-gram length, Symmetric Conditional Probability (SCP) (da Silva and Lopes, 1999), and semantic resources extracted from Wikipedia are integrated to obtain the best segmentation results. Owing to a good segmentation algorithm, it can potentially detect bursty phrases other than noun phrases, uni-grams, bi-grams, and tri-grams with high accuracy. However, it can still miss irregularly-formed bursty phrases because they are likely to be segmented incorrectly. Our set cover-based method guarantees that all bursty N-grams, including irregularly-formed ones, must be covered by extracted bursty phrases. Thus, it is unlikely to miss irregularly-formed bursty phrases.
One more thing to note in Table 1 is that the measures of the degree of the burst can be largely classified into a few groups: burstiness, TF-IDF-based measures, and distribution-based methods. The simplest approach is burstiness, the ratio of the occurrence rates in the target and reference document sets. The reference document set is usually constructed from past microblog texts. TF-IDF-based measures compute term frequency (TF) or document frequency (DF) in the target document set and inverse document frequency (IDF) in the reference document set. Distribution-based methods generally measure how much the observed frequency deviates from the distribution of the normal state using standard scores (z-scores). The Poisson distribution is appropriate for representing the number of occurrences of phrases, but the Gaussian distribution is often used as its approximation for computational reasons. In this work, we primarily adopt a Gaussian distribution-based approach and use the z-score as the measure of the burst because it works reasonably well across different magnitudes of the number of occurrences.

Problem Formulation
We formalize the problem of bursty phrase extraction from microblogs. The target document set D_T is a set of microblog messages posted within a certain period of time (e.g., one day, three hours). The reference document set D_R is a set of microblog messages posted before the target time. Both are usually limited to certain locations or languages. Each document is a sequence of characters (or words). The objective is to extract as many bursty complete phrases as possible from D_T, using D_R as the normal state. The output format is a list of bursty N-grams L = [g_1, g_2, ..., g_|L|] (g_i is an N-gram, i.e., a sequence of characters or words) ranked by the degree of the burst. The accuracy and coverage of the top K (K is a user-specified parameter) bursty phrases are important evaluation criteria.
The degree of the burst for N-grams in D_T is defined as the z-score when the Gaussian distribution is estimated from D_R. While most N-grams occur once in a single microblog message, a few N-grams are repeatedly used in it. We thus employ the document frequency-based z-score as the degree of the burst. The z-score of N-gram g is specifically computed as

z_df(g) = (df(g) - µ(g)) / σ(g),    (1)

where df(g) is the document frequency of g in D_T, and µ(g) and σ(g) are respectively the mean and standard deviation of the document frequency of g estimated from D_R. Given that the Gaussian distribution used here is an approximation of the Poisson distribution, σ(g) is approximated by √µ(g). To smooth µ(g) when g never occurs in D_R, we add δ = 1 to µ(g).
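As a concrete illustration, the document frequency-based z-score with the Poisson approximation and smoothing can be sketched as follows (a minimal sketch; the function name and the reading that δ applies only to N-grams unseen in D_R are our assumptions):

```python
import math

def df_z_score(df_target, mu, delta=1.0):
    """Document frequency-based z-score of an N-gram (Eq. (1)).

    df_target: document frequency of the N-gram in the target set D_T.
    mu: mean document frequency estimated from the reference set D_R.
    sigma is approximated by sqrt(mu) (Poisson approximation); delta = 1
    smooths mu for N-grams that never occur in D_R.
    """
    mu_s = mu if mu > 0 else delta  # smoothing for unseen N-grams
    sigma = math.sqrt(mu_s)         # Poisson approximation of sigma
    return (df_target - mu_s) / sigma

# An N-gram appearing in 120 documents today, versus a mean of 20 per day
# in the reference period, scores (120 - 20) / sqrt(20):
print(round(df_z_score(120, 20), 2))  # 22.36
```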

Basic Idea
To exhaustively extract bursty phrases without any restriction of their forms, we have to refrain from using filtering rules such as stop word removal, POS tag restrictions, and length limit. Without filtering rules, there are far more bursty incomplete N-grams than bursty complete phrases. It is challenging to extract bursty phrases including irregularly-formed ones and at the same time to avoid extracting bursty incomplete N-grams. Is there any evident difference between bursty incomplete N-grams and bursty phrases of irregular forms?
We scrutinized Twitter data and found the following fact: bursty incomplete N-grams always accompany overlapping bursty phrases, provided that the definition of the burst is appropriate. We explain this phenomenon in due order. When many microblog users intensively use a certain phrase, it becomes a bursty phrase. Here, the increment of the occurrence of the phrase contributes to the increment of the occurrence of overlapping N-grams. Consequently, (incomplete) N-grams that overlap the phrase can also burst. Thus, bursty incomplete N-grams always have their original bursty phrases, which overlap each other.

Algorithm 1: Greedy Set Cover Algorithm for Bursty Phrase Extraction
Input: Target document set D_T, reference document set D_R
Output: Ranked list of bursty phrases L
1  Initialize L;
2  Get a set of valid N-grams G_V from D_T;
3  Get a set of bursty N-grams G_B ⊂ G_V;
4  Discard trivial N-grams from G_V;
5  while G_B ≠ ∅ do
6      Select g ∈ G_V that most covers the occurrence of bursty N-grams in G_B;
7      Get a set of longer N-grams G_g ⊂ G_V containing g;
8      Determine a set of containment N-grams G_C ⊂ G_g for g;
9      if G_C = ∅ then
10         Push g into L;
11         Negate the occurrence of all N-grams in G_B and G_V where g overlaps;
12         Remove completely covered bursty N-grams from G_B;
13     else
14         Negate the occurrence of g where containment N-grams g_i ∈ G_C overlap;
15     end
16 end
17 Rerank L based on the actual z-score;
Based on the phenomenon described in the previous paragraph, bursty incomplete N-grams cease bursting if their original bursty phrases disappear from microblog texts. In other words, all bursty N-grams, including incomplete ones, can be covered (overlapped) by bursty phrases. Given that bursty phrases cause bursty incomplete N-grams but the reverse is not true, we can formalize the extraction of bursty phrases as the set cover problem (Chvátal, 1979), where a minimum number of bursty phrases are selected to cover all bursty N-grams. When all selected bursty phrases are removed from microblog texts, it is guaranteed that no bursty N-gram remains.
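This reduction can be pictured with the standard greedy approximation of set cover on a toy universe (hypothetical data; the actual algorithm covers occurrences rather than atomic items, as detailed in the following sections):

```python
def greedy_set_cover(universe, candidates):
    """Greedily pick candidates until every element of `universe` is covered.

    `candidates` maps each candidate phrase to the set of bursty N-grams
    it overlaps (and hence covers).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the candidate that covers the most remaining N-grams
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        if not candidates[best] & uncovered:
            break  # no candidate covers anything left
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen

# Toy example: the phrase "let it be" bursts, dragging overlapping
# incomplete N-grams ("let it", "it be") into the bursty universe.
bursty = {"let it be", "let it", "it be"}
cover_sets = {
    "let it be": {"let it be", "let it", "it be"},  # the complete phrase
    "let it":    {"let it"},
    "it be":     {"it be"},
}
print(greedy_set_cover(bursty, cover_sets))  # ['let it be']
```

A single complete phrase covers all bursty N-grams, so no incomplete N-gram is selected.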

Proposed Algorithm
Algorithm 1 is the pseudo-code of the greedy set cover algorithm for bursty phrase extraction. As formulated in Section 3.1, the input data are the target document set D_T and the reference document set D_R of microblog messages. The output is a ranked list of bursty phrases L = [g_1, g_2, ..., g_|L|]. Basically, Algorithm 1 iteratively selects the N-gram that most covers the occurrence of bursty N-grams (Line 6) until all bursty N-grams are covered. In the following, we describe the key points of Algorithm 1.

Bursty N-grams and Valid N-grams
The set of bursty N-grams G_B in Algorithm 1 (Line 3) corresponds to the universe of the set cover problem, which should be entirely covered. Bursty N-grams satisfy the z-score threshold θ both in the document frequency-based z-score (Eq. (1)) and the term frequency-based z-score (Eq. (2)).
The term frequency-based z-score is computed as

z_tf(g) = (tf(g) - µ_tf(g)) / σ_tf(g),    (2)

where tf(g) is the term frequency of g in D_T, and µ_tf(g) and σ_tf(g) are respectively the mean and standard deviation of the term frequency of g estimated from D_R. The term frequency-based z-score is used to judge whether a bursty N-gram in G_B still bursts when its occurrences are partly covered. This is required to handle N-grams repeatedly occurring in a single microblog message. Valid N-grams in G_V (Line 2) are qualified to be bursty phrases to cover G_B. We differentiate bursty N-grams and valid N-grams (specifically, G_B ⊂ G_V) to handle threshold problems. Namely, it is possible that bursty incomplete N-grams satisfy the z-score threshold but their original bursty phrases do not. The criterion of a valid N-gram is defined by using burstiness.
Burstiness is the smoothed ratio of occurrence rates,

burstiness(g) = (df(g) / |D_T|) / ((df_R(g) + δ) / |D_R|),

where df_R(g) is the document frequency of N-gram g in D_R. To avoid division by zero, the smoothing term δ = 1 is added to df_R(g). When burstiness(g) satisfies the threshold 1 + ϵ (i.e., the occurrence of g actually increases), g becomes a valid N-gram. To reduce pointless processing, trivial N-grams that can never be phrases (e.g., those starting or ending with spaces, or occurring only as part of a single longer N-gram) are discarded (Line 4).
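The validity check can be sketched as follows (a minimal sketch assuming burstiness is the smoothed ratio of document-frequency rates; the function names are ours):

```python
def burstiness(df_t, df_r, n_t, n_r, delta=1.0):
    """Ratio of occurrence rates in the target and reference sets.

    df_t, df_r: document frequencies of g in D_T and D_R.
    n_t, n_r: numbers of documents in D_T and D_R.
    delta = 1 smooths df_r to avoid division by zero.
    """
    return (df_t / n_t) / ((df_r + delta) / n_r)

def is_valid(df_t, df_r, n_t, n_r, eps=0.5):
    # g qualifies as a valid N-gram when its occurrence rate
    # actually increases by a factor of at least 1 + eps
    return burstiness(df_t, df_r, n_t, n_r) >= 1.0 + eps

# 300 documents today vs. 100 yesterday (equal-sized sets): clearly valid.
print(is_valid(300, 100, 100000, 100000))  # True
```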

Occurrence-based Set Covering
Whereas the standard set cover problem assumes that each item is atomic, i.e., the state of an item is either covered or not covered, the proposed method manages the state of covering by using all occurrences of bursty N-grams. When a valid N-gram is selected (Line 10), the occurrences of all N-grams in G_B and G_V that the valid N-gram overlaps are negated (Line 11). Here, a bursty N-gram in G_B is completely covered if the term frequency-based z-score computed from the remaining occurrences of the N-gram does not satisfy the threshold (Line 12).
Occurrence-based set covering handles the case where a bursty N-gram is covered by multiple bursty phrases. That is, the bursty N-gram is not completely covered even if one of the bursty phrases is selected. For example, given the two bursty phrases "let it be" and "let it go" (a song), the incomplete N-gram "let it" ceases bursting only when the occurrences of both phrases are negated. It also handles partially overlapping N-grams (e.g., "let it be" and "be is") based on the number of overlaps.
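The occurrence-based bookkeeping behind this example can be sketched with simple position matching (a toy illustration with hypothetical tweets; the actual implementation negates occurrences incrementally):

```python
def find_all(text, pat):
    """Yield the start index of every occurrence of pat in text."""
    i = text.find(pat)
    while i >= 0:
        yield i
        i = text.find(pat, i + 1)

def remaining_occurrences(texts, ngram, negated_phrases):
    """Occurrences of `ngram` not lying inside any negated phrase occurrence."""
    count = 0
    for text in texts:
        covered = [(i, i + len(p)) for p in negated_phrases
                   for i in find_all(text, p)]
        for i in find_all(text, ngram):
            j = i + len(ngram)
            if not any(a <= i and j <= b for a, b in covered):
                count += 1
    return count

tweets = ["let it be is my favorite", "let it go all day", "just let it rest"]
# "let it" bursts via both phrases; negating only "let it be" is not enough:
print(remaining_occurrences(tweets, "let it", ["let it be"]))               # 2
print(remaining_occurrences(tweets, "let it", ["let it be", "let it go"]))  # 1
```

Only after both phrases are negated does the count of "let it" drop to its ordinary background level, so "let it" is never mistaken for a complete phrase.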

Containment N-grams
N-grams that are contained in multiple phrases should be treated carefully in the set cover problem. Shorter N-grams are likely to be contained in more phrases and thus chosen in the set cover problem even if they are incomplete. For example, when the phrases "let it be" and "let it go" burst, the shared incomplete N-gram "let it" is likely to cover more occurrences of bursty N-grams than the two phrases. To prevent selecting shared incomplete N-grams, we determine containment relations between an N-gram and the longer N-grams containing it (Lines 7, 8). We call the longer N-grams in containment relations containment N-grams. Containment relations negate the occurrences of the shorter N-gram where containment N-grams overlap (Line 14). The containment relation is inspired by the idea of rectified frequency used in a segmentation-based quality phrase mining method (Liu et al., 2015), though how to rectify the frequency differs. Note that containment relations do not necessarily mean that shorter N-grams are incomplete, because both shorter and longer N-grams can be phrases (e.g., "new york" and "new york times").
How containment relations are determined is designed not to contradict the greedy set cover algorithm. Briefly, containment relations hold when, owing to the containment relations, only the containment N-grams among the longer N-grams can cover more occurrences of bursty N-grams than the shorter N-gram. That is, we find a stable state of containment relations. To find a stable state, we initially define tentative containment relations and then iteratively update the set of containment N-grams until the containment relations become stable.
Initial containment N-grams are determined using burst context variety, which is an extension of accessor variety (Feng et al., 2004). Accessor variety roughly measures how likely an N-gram is to be a phrase. Specifically, it counts the number of distinct characters (or words) before and after the N-gram and employs the minimum. The drawback of accessor variety is that it handles left and right contexts independently. We thus modify it into context variety, in which both left and right contexts are counted simultaneously. Context variety can also be calculated using the set cover problem: a left or right character (or word) that most covers the occurrences of the N-gram is iteratively selected until all the occurrences are covered. Burst context variety is a further extension of context variety that counts the number of contexts only for the additional term frequencies beyond the mean term frequency. When the burst context variety of an N-gram is not greater than that of a longer N-gram, we extract the context (i.e., the left or right character) and define all longer N-grams having that context as initial containment N-grams.
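The set cover-style computation of context variety can be sketched as follows (a simplified toy version that ignores the burst-specific frequency adjustment; names and data are ours):

```python
def context_variety(occurrences):
    """Greedy set cover over occurrence contexts.

    `occurrences` is a list of (left, right) context characters, one pair
    per occurrence of the N-gram. A context (side, char) covers every
    occurrence showing that character on that side; contexts are selected
    greedily until all occurrences are covered, and the number of selected
    contexts is the variety.
    """
    uncovered = set(range(len(occurrences)))
    variety = 0
    while uncovered:
        # gather, for each context, the uncovered occurrences it covers
        gain = {}
        for idx in uncovered:
            left, right = occurrences[idx]
            gain.setdefault(("L", left), set()).add(idx)
            gain.setdefault(("R", right), set()).add(idx)
        best = max(gain.values(), key=len)  # greedy: largest coverage
        uncovered -= best
        variety += 1
    return variety

# An incomplete N-gram: varied left contexts but always the same right
# neighbor, so one shared right context covers everything (variety 1).
print(context_variety([("a", "b"), ("c", "b"), ("d", "b")]))  # 1
# A phrase-like N-gram with varied contexts on both sides (variety 3).
print(context_variety([("a", "b"), ("c", "d"), ("e", "f")]))  # 3
```

Low context variety signals that the N-gram is usually followed (or preceded) by the same string, i.e., it is likely an incomplete fragment of a longer phrase.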
There are two minor settings for measuring burst context variety. One is that the start and end of every line are all unique and should be counted as distinct contexts. The other is that symbols 3 should be ignored when checking the left and right contexts of the N-gram.

Reranking Bursty Phrases
The output L is finally reranked by using actual z-scores (Line 17), which differ from the z-scores calculated from the raw document frequency. The actual z-score is calculated from the number of occurrences of bursty N-grams that N-gram g ∈ L actually covered during the set cover process. Since a single valid N-gram usually covers multiple bursty N-grams, the z-score for every covered bursty N-gram is recalculated and the maximum is used as the actual z-score. When the maximum z-score exceeds the original z-score, the original z-score is preserved. Without reranking, incomplete N-grams that covered very few occurrences of bursty N-grams may be overestimated.
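The reranking rule reduces to a max capped by the original score (a minimal sketch of our reading of this paragraph; the function name is ours):

```python
def actual_z_score(original_z, covered_z_scores):
    """Reranking score for a selected phrase g.

    covered_z_scores: z-scores recomputed from the occurrences of each
    bursty N-gram that g actually covered during the set cover process.
    The maximum is used, but it never exceeds g's own original z-score.
    """
    return min(max(covered_z_scores), original_z)

# A phrase that covered bursty N-grams with recomputed z-scores 12 and 40,
# but whose own raw-frequency z-score is 30, is reranked at 30:
print(actual_z_score(30.0, [12.0, 40.0]))  # 30.0
```

A phrase that covered only low-z occurrences is thus demoted, which is exactly what prevents incomplete N-grams from being overestimated.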

Evaluation
We evaluated the proposed method using two weeks of Japanese Twitter data.

Dataset.
We created a dataset using the Twitter Streaming API statuses/sample. We chose Japanese because it is one of the popular languages used on Twitter and because it has no word boundaries, which makes finding phrases difficult compared to space-delimited languages such as English. We specifically collected 15 days of tweets from September 30 to October 14, 2016 4 . For each day from October 1 to 14, we extracted bursty phrases in reference to the previous day.
To maximally alleviate the influence of auto-generated content such as tweets posted by bots and spammers, we filtered them out. Detecting bots and spammers (Chu et al., 2010; Subrahmanian et al., 2016) is a nontrivial research task and out of the scope of this paper. In this experiment, we used simple but effective heuristics. First, we only used tweets posted by Twitter official clients 5 because they are mainly used by normal users. Second, we discarded tweets including URLs because most spammers try to lure users to their websites. Third, retweets (actions to propagate someone's tweets) were discarded. The maximum and minimum numbers of remaining tweets per day were 326,002 (Oct. 2) and 253,044 (Oct. 13), respectively. Additionally, we deleted hashtags (starting with #) and mentions (starting with @) from tweets. Note that the degree of the burst for URLs, hashtags, and mentions can be measured independently. We concentrated on extracting bursty phrases from plain texts.
Ground truth. To create the ground truth, we mixed the N-grams extracted with all methods and then manually annotated each N-gram by checking its real usage in tweets. While most N-grams were easily judged as complete phrases (i.e., correct) or incomplete N-grams (i.e., incorrect), a few N-grams were difficult to judge (e.g., a last name that was not frequently used to indicate the person in tweets). We annotated such N-grams as maybe correct and regarded them as incorrect in the strict case and correct in the loose case when measuring the evaluation metrics.
Evaluated methods. In the proposed method (Proposed), we processed Japanese texts as sequences of characters. The default threshold parameters θ and ϵ were set to 10 and 0.5, respectively. The comparative methods included the word-based method (Word), the noun phrase-based method (NP), and the segmentation-based method (Segment) (Li et al., 2012). Because these methods require word breaking, we used MeCab (Kudo et al., 2004) (version 0.996, ipadic as a dictionary), a Japanese morphological analyzer. The word-based method uses all self-sufficient words as candidate phrases. The noun phrase-based method regards concatenated successive nouns as candidate phrases. In the word-based and noun phrase-based methods, the dictionary or vocabulary significantly affects the performance. Thus, we also used neologd 6 (Sato et al., 2017) (version v0.0.5 updated at May 2, 2016), a neologism dictionary extracted from many language resources on the web, as an additional dictionary (+Dic). The segmentation-based method detects segments (chunks) from a sequence of words as candidate phrases. The segmentation model was constructed using three

Evaluation metrics. We employed the precision and min-z-score of the top K bursty phrases (K was set to 10, 20, 30, 40, or 50) as evaluation metrics. We measured the precision in both the strict and loose cases based on the ground truth labels. The min-z-score (the minimum of the z-scores, computed from the raw document frequency) was introduced to evaluate how much the top K output ranked by z-scores included highly bursty phrases. To increase the min-z-score, all the top K phrases should have high z-scores, and hence highly bursty phrases should not be ignored. Thus, the min-z-score can evaluate the coverage of extracted bursty phrases using a fixed size of output. Higher precision and min-z-score indicate that the method can more accurately and exhaustively extract bursty phrases. Tables 2, 3 show the precision of bursty phrase extraction.
It was surprising that the proposed method achieved higher precision than the noun phrase-based methods, which were supposed to be safe by sacrificing irregularly-formed phrases. The precision of the proposed method for the top 50 bursty phrases was 97.3% (681 correct phrases out of 700) in the strict case and 99.1% (694 out of 700) in the loose case. The precision for the top 10 bursty phrases was 99.3% (139 out of 140) even in the strict case. The results demonstrate that the burst information alone can accurately find the boundaries of bursty phrases using the set cover problem. The error cases of the proposed method were largely classified into two types: base sequences of diversified expressions and phrases with strongly-correlated attached characters.

Performance Results: Precision
In the comparative methods, the accuracy tended to be high when noun phrases were used and the dictionary was well defined. In particular, the use of the neologism dictionary boosted the precision. The segmentation-based method also marked moderately high precision. Table 4 shows the min-z-score of bursty phrase extraction. The proposed method achieved a higher min-z-score than the comparative methods. This was because the proposed method extracted bursty phrases regardless of their forms. The noun phrase-based methods tended to miss highly bursty phrases of irregular forms; therefore, the min-z-score of the extracted top K bursty phrases became small. Among the comparative methods, the segmentation-based method achieved the best min-z-score since it did not restrict the form of phrases. However, it could still miss very irregular phrases due to segmentation mistakes, and its min-z-score was lower than that of the proposed method. The use of the neologism dictionary increased the min-z-score as well as the precision, indicating that it had no negative effect.

Performance Results: Coverage
To intuitively assess the coverage of the proposed method, we manually counted the number of bursty phrases that were extracted with the proposed method (when K = 10, i.e., 139 correct phrases) but completely missed by the comparative methods. Here, we regarded a phrase as completely missed when neither the bursty phrase, overlapping N-grams (including incomplete ones), nor orthographic variants were extracted in the top 100 bursty N-grams. Table 5 shows the results. Although the percentages were small, every comparative method completely missed some highly bursty phrases that were extracted with the proposed method. Note that the proposed method did not completely miss any of the top 10 bursty phrases of the comparative methods, since the set cover problem inevitably covers all bursty N-grams.

Influence of Parameter Settings
We changed the threshold parameters θ and ϵ to evaluate their influence on performance. Tables 6, 7 show the performance results with different parameter settings. We confirmed that both parameters, especially ϵ, hardly affected the precision and min-z-score. The results indicate that the threshold parameters can be roughly set based on data. Table 8 shows the top 10 bursty phrases (all of them correct) on Oct. 1. This day contained many irregularly-formed phrases: phrases containing hiragana 7 characters (ranks 5, 7, 8, 10), phrases other than noun phrases (ranks 5, 7), and phrases containing symbols (rank 3). In particular, the phrases at ranks 5 and 7 were troublesome since they contain hiragana characters and at the same time are other than noun phrases. Even with the neologism dictionary, the noun phrase-based method extracted one of these phrases but missed the other. To demonstrate the language-independent nature of the proposed method, we also applied it to English with character-level processing 8 . Table 9 shows an example of bursty phrases extracted from English tweets. Whereas an incomplete phrase (rank 2) was extracted due to auto-generated content that could not be eliminated from the data, the other top bursty phrases were correctly extracted. The proposed method also extracted a very long Internet meme (rank 5), which burst along with its counterpart "anyone that knows me knows i love _________." (rank 17).

Conclusions
We proposed a language-independent, data-driven method to accurately and exhaustively extract bursty phrases of arbitrary forms from microblogs. We found that bursty incomplete N-grams always accompany overlapping bursty phrases by ascertaining the mechanism of why incomplete N-grams burst. Based on these findings, the proposed method solves the extraction of bursty phrases as the set cover problem, where a minimum set of bursty phrases covers all bursty N-grams including incomplete ones. We confirmed from the experimental results that the proposed method outperformed noun phrase-based and segmentation-based methods both in terms of accuracy and coverage. The source code of the proposed method is publicly available 9 .

Table 9: Example of top 10 bursty phrases in English (Oct. 1, 2016, PST). Note that these bursty phrases were generated by processing English tweets at the character level.
Future work includes the reduction or estimation of the computation time and memory usage, which increase as the target document set grows or, more specifically, as the number of occurrences of bursty N-grams increases. Handling auto-generated content is also an important issue. The proposed method should be used with effective methods for identifying auto-generated content, bots, and spammers in microblogs.