Multi-task Pairwise Neural Ranking for Hashtag Segmentation

Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches demonstrate 24.6% error reduction in hashtag segmentation accuracy compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieved a 2.6% increase in average recall on the SemEval 2017 sentiment analysis dataset.


Introduction
A hashtag is a keyphrase represented as a sequence of alphanumeric characters plus underscore, preceded by the # symbol. Hashtags play a central role in online communication by providing a tool to categorize the millions of posts generated daily on Twitter, Instagram, etc. They are useful in search, tracking content about a certain topic (Berardi et al., 2011; Ozdikis et al., 2012), or discovering emerging trends (Sampson et al., 2016).
Hashtags often carry very important information, such as emotion (Abdul-Mageed and Ungar, 2017), sentiment (Mohammad et al., 2013), sarcasm (Bamman and Smith, 2015), and named entities (Finin et al., 2010; Ritter et al., 2011). However, inferring the semantics of hashtags is nontrivial since many hashtags contain multiple tokens joined together, which frequently leads to multiple potential interpretations (e.g., lion head vs. lionhead). Table 1 shows several examples of single- and multi-token hashtags. While most hashtags represent a mix of standard tokens, named entities and event names are prevalent and pose challenges to both human and automatic comprehension, as these are more likely to be rare tokens. Hashtags also tend to be shorter to allow fast typing, to attract attention or to satisfy length limitations imposed by some social media platforms. Thus, they tend to contain a large number of abbreviations or non-standard spelling variations (e.g., #iloveu4eva) (Han and Baldwin, 2011; Eisenstein, 2013), which hinders their understanding.
The goal of our study is to build efficient methods for automatically splitting a hashtag into a meaningful word sequence. Our toolkit, along with the code and data, is publicly available at https://github.com/mounicam/hashtag_master. Our contributions are:
• A larger and better curated dataset for this task;
• Framing the problem as pairwise ranking using novel neural approaches, in contrast to previous work which ignored the relative order of candidate segmentations;
• A multi-task learning method that uses different sets of features to handle different types of hashtags;
• Experiments demonstrating that hashtag segmentation improves sentiment analysis on a benchmark dataset.
Our new dataset includes segmentation for 12,594 unique hashtags and their associated tweets, annotated in a multi-step process for higher quality than the previous dataset of 1,108 hashtags (Bansal et al., 2015). We frame the segmentation task as a pairwise ranking problem, given a set of candidate segmentations. We build several neural architectures using this problem formulation which use corpus-based, linguistic and thesaurus-based features. We further propose a multi-task learning approach which jointly learns segment ranking and single- vs. multi-token hashtag classification. The latter leads to an error reduction of 24.6% over the current state-of-the-art. Finally, we demonstrate the utility of our method by using hashtag segmentation in the downstream task of sentiment analysis. Feeding the automatically segmented hashtags to a state-of-the-art sentiment analysis method on the SemEval 2017 benchmark dataset results in a 2.6% increase in the official metric for the task.

Background and Preliminaries
Current approaches for hashtag segmentation can be broadly divided into three categories: (a) gazetteer and rule based (Maynard and Greenwood, 2014; Declerck and Lendvai, 2015; Billal et al., 2016), (b) word boundary detection (Çelebi and Özgür, 2017, 2016), and (c) ranking with language model and other features (Wang et al., 2011; Bansal et al., 2015; Berardi et al., 2011; Reuter et al., 2016; Simeon et al., 2016). Hashtag segmentation approaches draw upon work on compound splitting for languages such as German or Finnish (Koehn and Knight, 2003) and word segmentation (Peng and Schuurmans, 2001) for languages with no spaces between words such as Chinese (Sproat and Shih, 1990; Xue and Shen, 2003). Similar to our work, Bansal et al. (2015) extract an initial set of candidate segmentations using a sliding window, then rerank them using a linear regression model trained on lexical, bigram and other corpus-based features. The current state-of-the-art approach (Çelebi and Özgür, 2017, 2016) uses maximum entropy and CRF models with a combination of language model and hand-crafted features to predict if each character in the hashtag is the beginning of a new word.
Generating Candidate Segmentations. Microsoft Word Breaker (Wang et al., 2011) is, among the existing methods, a strong baseline for hashtag segmentation, as reported in Çelebi and Özgür (2017) and Bansal et al. (2015). It employs a beam search algorithm to extract the $k$ best segmentations as ranked by the n-gram language model probability:
$$\mathrm{Score}_{LM}(s) = \sum_{i=1}^{n} \log P(w_i \mid w_{i-N+1} \ldots w_{i-1})$$
where $[w_1, w_2, \ldots, w_n]$ is the word sequence of segmentation $s$ and $N$ is the window size. More sophisticated ranking strategies, such as Binomial and word length distribution based ranking, did not lead to a further improvement in performance (Wang et al., 2011). The original Word Breaker was designed for segmenting URLs using language models trained on web data.
In this paper, we reimplemented and tailored this approach to segmenting hashtags by using a language model specifically trained on Twitter data (implementation details in §3.6). The performance of this method itself is competitive with state-of-the-art methods (evaluation results in §5.3). Our proposed pairwise ranking method will effectively take the top k segmentations generated by this baseline as candidates for reranking.
However, in prior work, the ranking scores of each segmentation were calculated independently, ignoring the relative order among the top k candidate segmentations. To address this limitation, we utilize a pairwise ranking strategy for the first time for this task and propose neural architectures to model this.
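The candidate generation step described in this section can be illustrated with a minimal sketch. It assumes a toy unigram model (window size N = 1) with made-up probabilities and a hypothetical length-based penalty for unknown words; it is not the Word Breaker implementation, which relies on a large smoothed n-gram language model. The function names are illustrative only.

```python
import math

# Toy unigram probabilities (hypothetical values, for illustration only).
UNIGRAM = {"lion": 0.02, "head": 0.03, "lionhead": 0.001, "li": 0.0005, "on": 0.05}


def word_logprob(word):
    """Log-probability of one word under the toy unigram model."""
    if word in UNIGRAM:
        return math.log(UNIGRAM[word])
    # Assumed penalty for unknown words: it grows with the word length.
    return math.log(1e-6) * len(word)


def top_k_segmentations(hashtag, k=10, max_word_len=20):
    """Beam search that keeps the k best partial segmentations per prefix."""
    # beams[i] holds up to k (score, words) pairs that cover hashtag[:i].
    beams = [[] for _ in range(len(hashtag) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(hashtag) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = hashtag[start:end]
            for score, words in beams[start]:
                candidates.append((score + word_logprob(word), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[end] = candidates[:k]
    return [" ".join(words) for _, words in beams[-1]]


print(top_k_segmentations("lionhead", k=3))
```

With these toy probabilities the single-token reading outranks "lion head"; a different language model could reverse that order, which is why the candidates are reranked afterwards.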

Multi-task Pairwise Neural Ranking
We propose a multi-task pairwise neural ranking approach to better incorporate and distinguish the relative order between the candidate segmentations of a given hashtag. Our model adapts to address single- and multi-token hashtags differently via a multi-task learning strategy without requiring additional annotations. In this section, we describe the task setup and three variants of pairwise neural ranking models (Figure 1).
Table 2: Example hashtag along with its gold and possible candidate segmentations.
hashtag ($h$): #songsonghaddafisitunes
segmentation ($s^*$): songs on ghaddafi s itunes (i.e., songs on Ghaddafi's iTunes)
candidate segmentations ($s \in S$):
    songs on ghaddafis itunes
    songs on ghaddafisi tunes
    songs on ghaddaf is itunes
    song song haddafis i tunes
    songsong haddafisitunes
    (and ...)

Segmentation as Pairwise Ranking
The goal of hashtag segmentation is to divide a given hashtag $h$ into a sequence of meaningful words $s^* = [w_1, w_2, \ldots, w_n]$. For a hashtag of $r$ characters, there are a total of $2^{r-1}$ possible segmentations but only one, or occasionally two, of them ($s^*$) are considered correct (Table 2). We transform this task into a pairwise ranking problem: given $k$ candidate segmentations $\{s_1, s_2, \ldots, s_k\}$, we rank them by comparing each with the rest in a pairwise manner. More specifically, we train a model to predict a real number $g(s_a, s_b)$ for any two candidate segmentations $s_a$ and $s_b$ of hashtag $h$, which indicates that $s_a$ is a better segmentation than $s_b$ if positive, and vice versa. To quantify the quality of a segmentation in training, we define a gold scoring function $g^*$ based on the similarities with the ground-truth segmentation $s^*$:
$$g^*(s_a, s_b) = \mathrm{sim}(s_a, s^*) - \mathrm{sim}(s_b, s^*)$$
We compute sim from the Levenshtein distance (the minimum number of single-character edits) in this paper, although other similarity measures could be used instead. We use the top $k$ segmentations generated by Microsoft Word Breaker (§2) as initial candidates.
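As a rough illustration of the gold scoring function, the sketch below computes the Levenshtein distance with the standard dynamic program and turns it into a similarity. The conversion used here (one minus the length-normalized distance) is an assumption made for the example; the text does not specify how the distance is mapped to sim.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sim(candidate, gold):
    """Assumed similarity: 1 minus the length-normalized edit distance."""
    longest = max(len(candidate), len(gold), 1)
    return 1.0 - levenshtein(candidate, gold) / longest


def gold_score(s_a, s_b, gold):
    """Positive when s_a is closer to the gold segmentation than s_b."""
    return sim(s_a, gold) - sim(s_b, gold)


gold = "songs on ghaddafi s itunes"
print(gold_score("songs on ghaddafis itunes", "songsong haddafisitunes", gold))
```

The printed score is positive because the first candidate differs from the gold segmentation by a single missing space, while the second is several edits away.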

Pairwise Neural Ranking Model
For an input candidate segmentation pair $\langle s_a, s_b \rangle$, we concatenate their feature vectors $\mathbf{s}_a$ and $\mathbf{s}_b$, and feed them into a feedforward network which emits a comparison score $g(s_a, s_b)$. The feature vector $\mathbf{s}_a$ or $\mathbf{s}_b$ consists of language model probabilities using Good-Turing (Good, 1953) and modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1999), lexical and linguistic features (more details in §3.5). For training, we use all the possible pairs $\langle s_a, s_b \rangle$ of the $k$ candidates as the input and their gold scores $g^*(s_a, s_b)$ as the target. The training objective is to minimize the Mean Squared Error (MSE):
$$L_{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( g^{*(i)}(s_a, s_b) - g^{(i)}(s_a, s_b) \right)^2 \qquad (1)$$
where $m$ is the number of training examples.
To aggregate the pairwise comparisons, we follow a greedy algorithm proposed by Cohen et al. (1998) and used for preference ranking (Parakhin and Haluptzok, 2009). For each segmentation $s$ in the candidate set $S = \{s_1, s_2, \ldots, s_k\}$, we calculate a single score $\mathrm{Score}_{PNR}(s) = \sum_{s \neq s_j \in S} g(s, s_j)$, and find the segmentation $s_{max}$ corresponding to the highest score. We repeat the same procedure after removing $s_{max}$ from $S$, and continue until $S$ reduces to an empty set.
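The greedy aggregation can be sketched in a few lines. The pairwise scorer `g` is passed in as a plain function; in the example it is a stand-in built from hypothetical per-candidate quality values rather than a trained network.

```python
def greedy_rank(candidates, g):
    """Order candidates by repeatedly picking the one with the highest summed score."""
    remaining = list(candidates)
    ranking = []
    while remaining:
        # The score of s is the sum of g(s, s_j) over the other remaining candidates.
        best = max(
            remaining,
            key=lambda s: sum(g(s, other) for other in remaining if other != s),
        )
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Stand-in pairwise scorer built from hypothetical quality values.
toy_quality = {"lion head": 0.2, "lionhead": 1.0, "li on head": -0.5}


def toy_g(s_a, s_b):
    return toy_quality[s_a] - toy_quality[s_b]


print(greedy_rank(list(toy_quality), toy_g))
```

Because the scores are recomputed over the shrinking candidate set, a candidate that was already ranked no longer influences the ordering of the rest.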

Margin Ranking (MR) Loss
As an alternative to the pairwise ranker (§3.2), we propose a pairwise model which learns from candidate pairs $\langle s_a, s_b \rangle$ but ranks each individual candidate directly rather than relatively. We define a new scoring function $g'$ which assigns a higher score to the better candidate, i.e., $g'(s_a) > g'(s_b)$ if $s_a$ is a better candidate than $s_b$, and vice versa. Instead of concatenating the feature vectors $\mathbf{s}_a$ and $\mathbf{s}_b$, we feed them separately into two identical feedforward networks with shared parameters. During testing, we use only one of the networks to rank the candidates based on the $g'$ scores. For training, we add a ranking layer on top of the networks to measure the violations in the ranking order and minimize the Margin Ranking (MR) loss over the $m$ training samples. The architecture of this model is presented in Figure 1.
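To make the ranking layer concrete, here is a minimal sketch of a margin ranking loss in its common hinge form, averaged over the training pairs. The margin value of 1.0 and the +1/-1 label encoding are assumptions made for the example, not details taken from the text. PyTorch's `torch.nn.MarginRankingLoss` computes the same hinge expression.

```python
def margin_ranking_loss(scores_a, scores_b, labels, margin=1.0):
    """Mean hinge loss over pairs.

    labels[i] is +1 if candidate a should outrank candidate b, -1 otherwise.
    The margin of 1.0 is an assumed default.
    """
    total = 0.0
    for g_a, g_b, label in zip(scores_a, scores_b, labels):
        # No loss when the pair is ordered correctly by at least the margin.
        total += max(0.0, margin - label * (g_a - g_b))
    return total / len(labels)


# Correctly ordered pair (no loss) and wrongly ordered pair (loss of 3.0).
print(margin_ranking_loss([2.0, 0.0], [0.0, 2.0], [1, 1]))
```

At test time only the individual scores are needed, so candidates can be sorted directly without any pairwise aggregation.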

Adaptive Multi-task Learning
Both models in §3.2 and §3.3 treat all the hashtags uniformly. However, different features address different types of hashtags. By design, the linguistic features capture named entities and multiword hashtags that exhibit word shape patterns, such as camel case. The ngram probabilities with Good-Turing smoothing gravitate towards multiword segmentations with known words, as its estimate for unseen ngrams depends on the fraction of ngrams seen once, which can be very low (Heafield, 2013). The modified Kneser-Ney smoothing is more likely to favor segmentations that contain rare words, and single-word segmentations in particular. Please refer to §5.3 for a more detailed quantitative and qualitative analysis.
Figure 1: Pairwise neural ranking models for hashtag segmentation. Given two candidate segmentations $s_a$ and $s_b$ of hashtag $h$, the goal is to predict the segmentation's relative goodness score ($g$) or absolute goodness score ($g'$).
To leverage this intuition, we introduce a binary classification task to help the model differentiate single-word from multi-word hashtags. The binary classifier takes hashtag features $\mathbf{h}$ as the input and outputs $w_h$, which represents the probability of $h$ being a multi-word hashtag. $w_h$ is used as an adaptive gating value in our multi-task learning setup. The gold labels for this task are obtained at no extra cost by simply verifying whether the ground-truth segmentation has multiple words. We train the pairwise segmentation ranker and the binary single- vs. multi-token hashtag classifier jointly, by minimizing $L_{MSE}$ for the pairwise ranker together with the Binary Cross Entropy error ($L_{BCE}$) for the classifier.
For an input candidate pair, we split the features into a modified Kneser-Ney subset (KN) and a Good-Turing and linguistic subset (GL), construct two pairwise vectors $\mathbf{s}_{ab}^{KN} = [\mathbf{s}_a^{KN}; \mathbf{s}_b^{KN}]$ and $\mathbf{s}_{ab}^{GL} = [\mathbf{s}_a^{GL}; \mathbf{s}_b^{GL}]$ by concatenation, then combine them based on the adaptive gating value $w_h$ before feeding them into the feedforward network $G$ for pairwise ranking:
$$g_{multitask}(s_a, s_b) = G\left( w_h \, \mathbf{s}_{ab}^{GL} + (1 - w_h) \, \mathbf{s}_{ab}^{KN} \right)$$
We use summation with padding, as we find this simple ensemble method achieves similar performance in our experiments as the more complex multi-column networks (Ciresan et al., 2012). Figure 1(c) shows the architecture of this model. An analogous multi-task formulation can also be used for the Margin Ranking loss, with $L_{MR}$ taking the place of $L_{MSE}$ in the joint objective.
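The gated combination can be sketched as follows. The sketch assumes that the gating value weights the Good-Turing and linguistic pairwise vector by w_h and the Kneser-Ney pairwise vector by (1 - w_h), and that "summation with padding" means zero-padding the shorter vector before adding; both are interpretations made for the example, and the function name is hypothetical.

```python
def gated_input(s_ab_gl, s_ab_kn, w_h):
    """Combine two pairwise feature vectors with a gate w_h in [0, 1].

    The shorter vector is zero-padded so the two can be summed elementwise
    (assumed reading of "summation with padding").
    """
    size = max(len(s_ab_gl), len(s_ab_kn))
    gl = list(s_ab_gl) + [0.0] * (size - len(s_ab_gl))
    kn = list(s_ab_kn) + [0.0] * (size - len(s_ab_kn))
    return [w_h * x + (1.0 - w_h) * y for x, y in zip(gl, kn)]


# A hashtag judged likely to be multi-word (w_h close to 1) leans on the GL features.
print(gated_input([1.0, 2.0, 3.0], [10.0], w_h=0.9))
```

The resulting vector would then be passed to the ranking network, so the classifier's confidence decides which feature subset dominates for each hashtag.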

Features
We use a combination of corpus-based and linguistic features to rank the segmentations. For a candidate segmentation $s$, its feature vector $\mathbf{s}$ includes the number of words in the candidate, the length of each word, the proportion of words in an English dictionary or Urban Dictionary (Nguyen et al., 2018), ngram counts from the Google Web 1TB corpus (Brants and Franz, 2006), and ngram probabilities from trigram language models trained on the Gigaword corpus (Graff and Cieri, 2003) and 1.1 billion English tweets from 2010, respectively. We train two language models on each corpus: one with Good-Turing smoothing using SRILM (Stolcke, 2002) and the other with modified Kneser-Ney smoothing using KenLM (Heafield, 2011). We also add boolean features, such as whether the candidate is a named entity present in the list of Wikipedia titles, and whether the candidate segmentation $s$ and its corresponding hashtag $h$ satisfy certain word shapes (more details in appendix A.1). Similarly, for hashtag $h$, we extract the feature vector $\mathbf{h}$ consisting of hashtag length, ngram count of the hashtag in the Google 1TB corpus (Brants and Franz, 2006), and boolean features indicating if the hashtag is in an English dictionary or Urban Dictionary, is a named entity, is in camel case, ends with a number, and has all the letters as consonants. We also include features of the best-ranked candidate by the Word Breaker model.
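A few of the simpler features listed above can be sketched without any external resources. The dictionary is a small hypothetical set standing in for the English and Urban Dictionary word lists, the camel-case pattern is an assumed definition (the actual word-shape rules are in the appendix), and the language model, n-gram count, and named-entity features are omitted.

```python
import re

# Assumed camel-case shape: two or more capitalized chunks, e.g. "LionHead".
CAMEL_CASE = re.compile(r"(?:[A-Z][a-z0-9]+){2,}")


def segmentation_features(segmentation, dictionary):
    """Surface features of one candidate segmentation."""
    words = segmentation.split()
    in_dict = sum(1 for w in words if w.lower() in dictionary)
    return {
        "num_words": len(words),
        "word_lengths": [len(w) for w in words],
        "dict_proportion": in_dict / len(words) if words else 0.0,
    }


def hashtag_features(hashtag, dictionary):
    """Surface features of the unsegmented hashtag."""
    body = hashtag.lstrip("#")
    letters = [c for c in body.lower() if c.isalpha()]
    return {
        "length": len(body),
        "in_dictionary": body.lower() in dictionary,
        "is_camel_case": CAMEL_CASE.fullmatch(body) is not None,
        "ends_with_number": body[-1:].isdigit(),
        "all_consonants": bool(letters) and all(c not in "aeiou" for c in letters),
    }


toy_dictionary = {"lion", "head"}  # hypothetical stand-in word list
print(segmentation_features("lion head", toy_dictionary))
print(hashtag_features("#BTVSMB", toy_dictionary))
```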

Implementation Details
We use the PyTorch framework to implement our multi-task pairwise ranking model. The pairwise ranker consists of an input layer, three hidden layers with eight nodes in each layer and hyperbolic tangent (tanh) activation, and a single linear output node. The auxiliary classifier consists of an input layer, one hidden layer with eight nodes and one output node with sigmoid activation. We use the Adam algorithm (Kingma and Ba, 2014) for optimization and apply a dropout of 0.5 to prevent overfitting. We set the learning rate to 0.01 and 0.05 for the pairwise ranker and auxiliary classifier respectively. For each experiment, we report results obtained after 100 epochs.
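The layer layout of the pairwise ranker can be illustrated with a dependency-free forward pass. This is only a shape sketch: the weight initialization is an assumption, and dropout, the Adam optimizer, and training are left out, so it does not replace the PyTorch implementation.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """One dense layer with small random weights (assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(layer, x):
    """Affine transform of input vector x by one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def build_ranker(n_features, hidden=8, n_hidden_layers=3, seed=0):
    """Three hidden layers of eight nodes followed by one linear output node."""
    rng = random.Random(seed)
    sizes = [n_features] + [hidden] * n_hidden_layers + [1]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(ranker, pair_features):
    """Comparison score for one concatenated pair of feature vectors."""
    x = pair_features
    for layer in ranker[:-1]:
        x = [math.tanh(v) for v in linear(layer, x)]  # tanh hidden activation
    return linear(ranker[-1], x)[0]  # single linear output


ranker = build_ranker(n_features=6)
print(forward(ranker, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]))
```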
For the baseline model used to extract the $k$ initial candidates, we reimplemented the Word Breaker (Wang et al., 2011) as described in §2 and adapted it to use a language model trained on 1.1 billion tweets with Good-Turing smoothing using SRILM (Stolcke, 2002) to give a better performance in segmenting hashtags (§5.3). For all our experiments, we set $k = 10$.

Hashtag Segmentation Data
We use two datasets for experiments (Table 3): (a) STAN$_{small}$, created by Bansal et al. (2015), which consists of 1,108 unique English hashtags from 1,268 randomly selected tweets in the Stanford Sentiment Analysis Dataset (Go and Huang, 2009) along with their crowdsourced segmentations and our additional corrections; and (b) STAN$_{large}$, our new dataset of 12,594 hashtags, described below.
Dataset Analysis. STAN$_{small}$ is the most commonly used dataset in previous work. However, after reexamination, we found annotation errors in 6.8% of the hashtags in this dataset, which is significant given that the error rate of the state-of-the-art models is only around 10%. Most of the errors were related to named entities. For example, #lionhead, which refers to the "Lionhead" video game company, was labeled as "lion head".
Our Dataset. We therefore constructed the STAN$_{large}$ dataset of 12,594 hashtags with additional quality control for human annotations. We displayed a tweet with one highlighted hashtag on the Figure-Eight (previously known as CrowdFlower) crowdsourcing platform and asked two workers to list all the possible segmentations. For quality control on the platform, we displayed a test hashtag on every page along with the other hashtags. If any annotator missed more than 20% of the test hashtags, then they were not allowed to continue work on the task. For 93.1% of the hashtags, the workers agreed on the same segmentation. We further asked three in-house annotators (not authors) to cross-check the crowdsourced annotations using a two-step procedure: first, verify if the hashtag is a named entity based on the context of the tweet; then search on Google to find the correct segmentation(s). We also asked the same annotators to fix the errors in STAN$_{small}$. The human upper bound of the task is estimated at ∼98% accuracy, where we consider the crowdsourced segmentations (two workers merged) as correct if at least one of them matches our expert annotators' segmentations.

Experiments
In this section, we present experimental results that compare our proposed method with the other state-of-the-art approaches on hashtag segmentation datasets. The next section will show experiments of applying hashtag segmentation to the popular task of sentiment analysis.

Existing Methods
We compare our pairwise neural ranker with the following baseline and state-of-the-art approaches: (a) The original hashtag as a single token; (b) A rule-based segmenter, which employs a set of word-shape rules with an English dictionary (Billal et al., 2016); (c) A Viterbi model which uses word frequencies from a book corpus (Berardi et al., 2011); [...] loss between $g^*$ and a scoring function similar to $g'$. It is trained on the STAN$_{large}$ dataset.

Evaluation Metrics
We evaluate the performance by the top $k$ ($k$ = 1, 2) accuracy (A@1, A@2), average token-level $F_1$ score ($F_1$@1), and mean reciprocal rank (MRR). In particular, the accuracy and MRR are calculated at the segmentation level, which means that an output segmentation is considered correct if and only if it fully matches the human segmentation. Average token-level $F_1$ score accounts for partially correct segmentation in the multi-token hashtag cases.
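For a single hashtag, these metrics can be sketched as below; averaging the values over a dataset gives A@k, MRR, and the average token-level F1. The token-level F1 here compares the predicted and gold tokens as bags of words, which is an assumed definition since the exact matching scheme is not spelled out in the text. The gold set may hold more than one acceptable segmentation.

```python
from collections import Counter


def accuracy_at_k(ranked, gold_set, k):
    """1.0 if any of the top-k segmentations fully matches a gold segmentation."""
    return 1.0 if any(s in gold_set for s in ranked[:k]) else 0.0


def reciprocal_rank(ranked, gold_set):
    """Inverse rank of the first fully correct segmentation, or 0.0 if none."""
    for rank, s in enumerate(ranked, start=1):
        if s in gold_set:
            return 1.0 / rank
    return 0.0


def token_f1(predicted, gold):
    """Bag-of-tokens F1 between one prediction and one gold segmentation (assumed)."""
    pred_tokens, gold_tokens = predicted.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


ranked = ["lion head", "lionhead"]
gold_set = {"lionhead"}
print(accuracy_at_k(ranked, gold_set, 1), accuracy_at_k(ranked, gold_set, 2))
print(reciprocal_rank(ranked, gold_set))
print(token_f1("songs on ghaddafis itunes", "songs on ghaddafi s itunes"))
```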

Results
Tables 4 and 5 show the results on the STAN$_{small}$ and STAN$_{large}$ datasets, respectively. All of our pairwise neural rankers are trained on the 2,518 manually segmented hashtags in the training set of STAN$_{large}$ and perform favorably against other state-of-the-art approaches. Our best model (MSE+multitask) that utilizes different features adaptively via a multi-task learning procedure is shown to perform better than simply combining all the features together (MR and MSE). We highlight the 24.6% error reduction on STAN$_{small}$ and 16.5% on STAN$_{large}$ of our approach over the previous SOTA (Çelebi and Özgür, 2017) on the Multi-token hashtags, and the importance of having a separate evaluation of multi-word cases as it is trivial to obtain 100% accuracy for Single-token hashtags. While our hashtag segmentation model achieves a very high accuracy@2, to be practically useful, it remains a challenge to get the top one prediction exactly correct. Some hashtags are very difficult to interpret, e.g., #BTVSMB refers to the Social Media Breakfast (SMB) in Burlington, Vermont (BTV).
The improved Word Breaker with our addition of a Twitter-specific language model is a very strong baseline, which echoes the findings of the original Word Breaker paper (Wang et al., 2011) that having a large in-domain language model is extremely helpful for word segmentation tasks. It is worth noting that the other state-of-the-art system (Çelebi and Özgür, 2017) also utilized a 4-gram language model trained on 476 million tweets from 2009.

Analysis and Discussion
Feature Analysis. To empirically illustrate the effectiveness of different features on different types of hashtags, we show the results for models using individual feature sets in pairwise ranking models (MSE) in Table 6. Language models with modified Kneser-Ney smoothing perform best on single-token hashtags, while Good-Turing and Linguistic features work best on multi-token hashtags, confirming our intuition about their usefulness in a multi-task learning approach. Table 7 shows a qualitative analysis with the first column (•••) indicating which features lead to correct or wrong segmentations, their count in our data and illustrative examples with human segmentation.
Length of Hashtags. As expected, longer hashtags with more than three tokens pose greater challenges and the segmentation-level accuracy of our best model (MSE+multitask) drops to 82.1%. For many error cases, our model predicts a close-to-correct segmentation, e.g., #youknowyouupttooearly, #iseelondoniseefrance, which is also reflected by the higher token-level $F_1$ scores across hashtags with different lengths (Figure 2).
Size of the Language Model. Since our approach heavily relies on building a Twitter language model, we experimented with its size and show the results in Figure 3. Our approach can perform well even with access to fewer tweets. The drop in $F_1$ score for our pairwise neural ranker is only 1.4% and 3.9% when using the language models trained on 10% and 1% of the total 1.1 billion tweets, respectively.
Time Sensitivity. Language use in Twitter changes with time (Eisenstein, 2013). Our pairwise ranker uses language models trained on the tweets from the year 2010. We tested our approach on a set of 500 random English hashtags posted in tweets from the year 2019 and show the results in Table 8.

Extrinsic Evaluation: Twitter Sentiment Analysis
We attempt to demonstrate the effectiveness of our hashtag segmentation system by studying its impact on the task of sentiment analysis in Twitter (Pang et al., 2002; Nakov et al., 2016; Rosenthal et al., 2017). We use our best model (MSE+multitask), under the name HashtagMaster, in the following experiments.

Experimental Setup
We compare the performance of the BiLSTM+Lex (Teng et al., 2016) sentiment analysis model under three configurations: (a) tweets with hashtags removed, (b) tweets with hashtags as single tokens excluding the # symbol, and (c) tweets with hashtags as segmented by our system, HashtagMaster. BiLSTM+Lex is a state-of-the-art open source system for predicting tweet-level sentiment (Tay et al., 2018). It learns a context-sensitive sentiment intensity score by leveraging a Twitter-based sentiment lexicon (Tang et al., 2014). We use the same settings as described by Teng et al. (2016) to train the model. We use the dataset from the Sentiment Analysis in Twitter shared task (subtask A) at SemEval 2017 (Rosenthal et al., 2017). Given a tweet, the goal is to predict whether it expresses POSITIVE, NEGATIVE or NEUTRAL sentiment. The training and development sets consist of 49,669 tweets and we use 40,000 for training and the rest for development. There are a total of 4,840 tweets containing 12,128 hashtags in the SemEval 2017 test set, and our hashtag segmenter ended up splitting 6,975 of those hashtags present in 3,384 tweets.

Results and Analysis
In Table 9, we report the results based on the 3,384 tweets where HashtagMaster predicted a split, as for the rest of the tweets in the test set, the hashtag segmenter would neither improve nor worsen the sentiment prediction. Our hashtag segmenter successfully improved the sentiment analysis performance by 2% on average recall and $F_1^{PN}$ compared to having hashtags unsegmented. This improvement is seemingly small but decidedly important for tweets where sentiment-related information is embedded in multi-word hashtags and sentiment prediction would be incorrect based only on the text (see Table 10 for examples). In fact, 2,605 out of the 3,384 tweets have multi-word hashtags that contain words in the Twitter-based sentiment lexicon (Tang et al., 2014) and 125 tweets contain sentiment words only in the hashtags but not in the rest of the tweet.
Table 9: Sentiment analysis evaluation on the 3,384 tweets from the SemEval 2017 test set using the BiLSTM+Lex method (Teng et al., 2016). Average recall (AvgR) is the official metric of the SemEval task and is more reliable than accuracy (Acc). $F_1^{PN}$ is the average $F_1$ of positive and negative classes. Having the hashtags segmented by our system HashtagMaster (i.e., MSE+multitask) significantly improves the sentiment prediction compared with leaving them unsegmented (p < 0.05 for AvgR and $F_1^{PN}$ against the single-word setup).

Other Related Work
Automatic hashtag segmentation can improve the performance of many applications besides sentiment analysis, such as text classification (Billal et al., 2016), named entity linking (Bansal et al., 2015) and modeling user interests for recommendations (Chen et al., 2016). It can also help in collecting data of higher volume and quality by providing a more nuanced interpretation of its content, as shown for emotion analysis (Qadir and Riloff, 2014), sarcasm and irony detection (Maynard and Greenwood, 2014; Huang et al., 2018). Better semantic analysis of hashtags can also potentially be applied to hashtag annotation (Wang et al., 2019), to improve distant supervision labels in training classifiers for tasks such as sarcasm (Bamman and Smith, 2015), sentiment (Mohammad et al., 2013), emotions (Abdul-Mageed and Ungar, 2017); and, more generally, as labels for pre-training representations of words (Weston et al., 2014), sentences (Dhingra et al., 2016), and images (Mahajan et al., 2018).

Conclusion
We proposed a new pairwise neural ranking model for hashtag segmentation and showed significant performance improvements over the state-of-the-art. We also constructed a larger and more curated dataset for analyzing and benchmarking hashtag segmentation methods. We demonstrated that hashtag segmentation helps with downstream tasks such as sentiment analysis. Although we focused on English hashtags, our pairwise ranking approach is language-independent and we intend to extend our toolkit to languages other than English as future work.