Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter

Keyphrases provide highly condensed and valuable information that allows users to quickly grasp the main ideas of a text. The task of automatically extracting them has received considerable attention in recent decades. Unlike previous studies, which usually focus on extracting keyphrases from documents or articles, in this study we consider the problem of automatically extracting keyphrases from tweets. Because of the length limitation of Twitter-like sites, the performance of existing methods usually drops sharply. We propose a novel deep recurrent neural network (RNN) model that combines keyword and context information to perform this task. To evaluate the proposed method, we also constructed a large-scale dataset collected from Twitter. The experimental results show that the proposed method performs significantly better than previous methods.


Introduction
Keyphrases are usually the selected phrases that capture the main topics described in a given document (Turney, 2000). They provide users with highly condensed and valuable information, and keyphrases come from a wide variety of sources, including web pages, research articles, books, and even movies. In contrast to keywords, keyphrases usually contain two or more words, so their meaning representations are normally more precise than those of single words. Moreover, with the continued growth of the internet, this kind of summarization has received sustained attention in recent years from both the academic and enterprise communities (Witten et al., 1999; Wan and Xiao, 2008; Jiang et al., 2009; Zhao et al., 2011; Tuarob et al., 2015).
Because of the enormous usefulness of keyphrases, various studies have been conducted on their automatic extraction using different methods, including rich linguistic features (Barker and Cornacchia, 2000; Paukkeri et al., 2008), supervised classification-based methods (Witten et al., 1999; Wu et al., 2005; Wang et al., 2006), ranking-based methods (Jiang et al., 2009), and clustering-based methods (Mori et al., 2007; Danilevsky et al., 2014). These methods usually focus on extracting keyphrases from a single document or multiple documents. Even a document of moderate length typically contains a few hundred words or more, so statistical and linguistic features can be used to determine the importance of phrases.
In addition to the previously mentioned methods, a few researchers have studied the problem of extracting keyphrases from collections of tweets (Zhao et al., 2011; Bellaachia and Al-Dhelaan, 2012). In contrast to traditional web applications, Twitter-like services usually limit the content length to 140 characters. In (Zhao et al., 2011), a context-sensitive topical PageRank method was proposed to extract keyphrases by topic from a collection of tweets. NE-Rank was also proposed to rank keywords for the purpose of extracting topical keyphrases (Bellaachia and Al-Dhelaan, 2012). Because multiple tweets are usually organized by topic, many document-level approaches can also be adopted for this task. In contrast with these methods, Marujo et al. (2015) focused on the task of extracting keywords from single tweets. They used several unsupervised methods and word embeddings to construct features. However, their method worked only at the word level.
In this study, we investigated the problem of automatically extracting keyphrases from single tweets. Compared to identifying keyphrases in documents containing hundreds of words, extracting keyphrases from a single short text is generally more difficult: many linguistic and statistical features (e.g., the number of word occurrences) cannot be determined and used. Moreover, the standard pipeline for keyphrase extraction usually includes keyword ranking, candidate keyphrase generation, and keyphrase ranking. Previous works usually handled these steps with separate methods, so the error of each step propagates, which can strongly degrade the final performance. Another challenge of keyphrase extraction on Twitter is the lack of training and evaluation data. Manual labelling is a time-consuming procedure, and the labelling consistency of different labellers cannot be easily controlled.
To meet these challenges, in this paper, we propose a novel deep recurrent neural network (RNN) model for the joint processing of the keyword ranking, keyphrase generation, and keyphrase ranking steps. The proposed RNN model contains two hidden layers. In the first hidden layer, we capture the keyword information. Then, in the second hidden layer, we extract the keyphrases based on the keyword information using a sequence labelling method. To train and evaluate the proposed method, we also propose a novel method for constructing a dataset that contains a large number of tweets with gold-standard keyphrases. The dataset construction method is based on how hashtags are defined on Twitter and how they are used in specific tweets.
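Concretely, the two-level sequence labelling might look as follows for a tweet (an illustrative sketch only: the exact tag set is not specified here, and a binary keyword layer plus a BIO-style keyphrase layer is one plausible encoding):

```python
# Illustrative two-level labelling for one tweet. Layer 1 marks keywords
# (1 = keyword, 0 = not); layer 2 marks keyphrase spans with BIO tags.
tokens = ["The", "Warriors", "take", "Game", "1", "of", "the",
          "NBA", "Finals", "104-89"]

keyword_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
keyphrase_labels = ["O", "O", "O", "O", "O", "O", "O", "B", "I", "O"]

def phrase_spans(bio_tags):
    """Recover (start, end) token spans of keyphrases from BIO tags."""
    spans, start = [], None
    for i, tag in enumerate(bio_tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(bio_tags)))
    return spans

spans = phrase_spans(keyphrase_labels)
extracted = [" ".join(tokens[s:e]) for s, e in spans]
```

Under this encoding, the first output layer predicts `keyword_labels` and the second predicts `keyphrase_labels`, from which whole phrases are recovered.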
The main contributions of this work can be summarized as follows: • We proposed a two-hidden-layer RNN-based method to jointly model the keyword ranking, keyphrase generation, and keyphrase ranking steps.
• To train and evaluate the proposed method, we proposed a novel method for constructing a large dataset, which consisted of more than one million words.
• Experimental results demonstrated that the proposed method could achieve better results than the current state-of-the-art methods for these tasks.

Proposed Methods
In this section, we first describe the deep recurrent neural network (RNN). We then discuss the proposed joint-layer recurrent neural network model, which jointly processes keyword ranking, keyphrase generation, and keyphrase ranking.

Deep Recurrent Neural Networks
One way to capture the contextual information of a word sequence is to concatenate neighboring features as input features for a deep neural network. However, the number of parameters increases rapidly with the input dimension, so the size of the concatenation window is limited. A recurrent neural network (RNN) can be considered a deep neural network (DNN) with an indefinite number of layers, which introduces memory from previous time steps. A potential weakness of an RNN is its lack of hierarchical processing of the input at the current time step.
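The trade-off can be made concrete with a small sketch: concatenating a window of embeddings grows the input projection linearly with the window size, while a recurrent step keeps it fixed and carries context through the hidden state (toy numpy code, assuming 300-dimensional embeddings and hidden states as in the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 300, 300  # embedding and hidden sizes (the paper uses 300)

def window_params(window):
    """Parameters in a DNN input projection for a concatenation window:
    the weight matrix grows linearly with the window size."""
    return window * d_in * d_h

# A simple (Elman-style) recurrent step keeps the input projection fixed
# and carries context through the hidden state instead.
W = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
U = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights

def rnn_step(x_t, h_prev):
    """One recurrent update: h_t = sigmoid(W x_t + U h_{t-1})."""
    return 1.0 / (1.0 + np.exp(-(W @ x_t + U @ h_prev)))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # a toy 5-token sequence
    h = rnn_step(x, h)
```

A window of 7 tokens needs 7x the input-projection parameters of a window of 1, whereas the recurrent formulation summarizes arbitrarily long history in a fixed-size state.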

Joint-layer Recurrent Neural Networks
The proposed joint-layer recurrent neural network (joint-layer RNN) is a variant of a stacked RNN (sRNN) with two hidden layers. The joint-layer RNN has two output layers, which are combined into an objective layer. Suppose there is an sRNN with L intermediate layers that has an output layer attached to each hidden layer. The l-th hidden activation is defined as

$h_t^l = \phi^l(U^l h_{t-1}^l + W^l h_t^{l-1})$,

where $h_t^l$ is the hidden state of the l-th layer at time t, and $U^l$ and $W^l$ are the weight matrices for the hidden activation at time t-1 and for the lower-level activation $h_t^{l-1}$, respectively. When l = 1, the hidden activation is computed using $h_t^0 = x_t$. $\phi^l$ is an element-wise non-linear function, such as the sigmoid function. The l-th output activation is defined as

$y_t^l = \varphi^l(V^l h_t^l)$,

where $V^l$ is the weight matrix for the l-th hidden layer $h_t^l$, and $\varphi^l$ is also an element-wise non-linear function, such as the softmax function.
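The two equations above can be sketched as a tiny stacked-RNN forward pass with an output attached to each layer (a minimal illustration with toy dimensions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # tiny hidden/input dimension for illustration
L = 2  # two hidden layers, as in the joint-layer RNN

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One weight set per layer: U^l (recurrent), W^l (bottom-up), V^l (output).
U = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
V = [rng.normal(scale=0.1, size=(2, d)) for _ in range(L)]  # 2 output classes

def forward(xs):
    """Run the stacked RNN over a sequence; return per-layer softmax
    outputs y^l_t = softmax(V^l h^l_t) at every time step."""
    h_prev = [np.zeros(d) for _ in range(L)]
    outputs = []
    for x_t in xs:
        below, h_t = x_t, []          # h^0_t = x_t
        for l in range(L):
            h_l = sigmoid(U[l] @ h_prev[l] + W[l] @ below)  # h^l_t
            h_t.append(h_l)
            below = h_l
        outputs.append([softmax(V[l] @ h_t[l]) for l in range(L)])
        h_prev = h_t
    return outputs

outs = forward(rng.normal(size=(3, d)))  # a toy 3-step sequence
```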
A joint-layer recurrent neural network is an extension of a stacked RNN with two hidden layers. At time t, the training input $x_t$ of the network is the concatenation of features within a context window. We use word embeddings as features in this paper. The output targets, $y_t^1$ and $y_t^2$, correspond to the keyword labels and the keyphrase labels, respectively.

Training
In this work, we jointly learn the parameters θ of the deep neural network given the word embeddings X; the other symbols are as defined above. Given a labelled sentence, both the keywords and the keyphrases are known (a keyphrase is made of keywords). At the first output layer, the model discriminates keywords, and at the second output layer, it discriminates keyphrases. We then combine these two sub-objectives, which operate at different discrimination levels, into the final objective:

$J(\theta) = \alpha J_1(\theta) + (1 - \alpha) J_2(\theta)$,

where α is a linear weighting factor. Given N training examples, the sub-objective for the l-th output layer is defined as

$J_l(\theta) = \sum_{i=1}^{N} d(y_i^l, \hat{y}_i^l)$,

where $d(a, b)$ is a predefined divergence measure between a and b, such as the Euclidean distance or the cross-entropy, and $\hat{y}_i^l$ is the prediction of the l-th output layer. These two equations show that we discover keywords and extract keyphrases at different levels simultaneously. The experimental results will show that combining discrimination at different granularities significantly improves the performance.
To minimize the objective function, we optimize the model by back-propagating the gradients of the training objective. The stochastic gradient descent (SGD) algorithm is used to train the models. The update rule for the parameters θ at epoch e is

$\theta^{(e+1)} = \theta^{(e)} - \lambda g_e$,

where λ is a global learning rate shared by all dimensions and $g_e$ is the gradient of the parameters at the e-th iteration. We select the best model according to the validation set.
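As a sketch, the joint objective and the SGD update described above might look as follows, with cross-entropy chosen as the divergence d (an illustration under these assumptions, not the authors' code):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """d(y, y_hat): cross-entropy between a one-hot target and a prediction."""
    return -np.sum(y_true * np.log(y_pred + 1e-12))

def joint_objective(kw_true, kw_pred, kp_true, kp_pred, alpha=0.5):
    """J(theta) = alpha * J1 + (1 - alpha) * J2, where J1 sums the divergence
    over the keyword output layer and J2 over the keyphrase output layer."""
    j1 = sum(cross_entropy(t, p) for t, p in zip(kw_true, kw_pred))
    j2 = sum(cross_entropy(t, p) for t, p in zip(kp_true, kp_pred))
    return alpha * j1 + (1.0 - alpha) * j2

def sgd_update(theta, grad, lr=0.1):
    """theta^(e+1) = theta^(e) - lambda * g_e."""
    return theta - lr * grad
```

With α = 0.5 the two discrimination levels contribute equally, matching the default setting used later in the experiments.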

Data Construction
To analyze the effectiveness of our model for keyphrase extraction on Twitter, we constructed an evaluation dataset by crawling a large number of tweets. For each user, we gathered about 3K tweets, for a final total of more than 41 million tweets.
From analyzing these tweets, we found that some hashtags can be considered the keyphrases of their tweets. For example, in "The Warriors take Game 1 of the #NBAFinals 104-89 behind a playoff career-high 20 from Shaun Livingston.", "NBA Finals" can be considered the keyphrase of the tweet. Based on this intuition, to construct the dataset, we first filtered out all non-Latin tweets using regular expressions. Then, we removed any URL links from the tweets, since we were focusing on the textual content. Tweets that start with "@username" are generally replies and are more conversational than topical in nature; therefore, we also removed any tweets that start with "@username" to focus on topical tweets only. Moreover, we designed several rules about the hashtags to filter the remaining tweets. First, a tweet could have only one hashtag. Second, the hashtag had to appear inside the tweet, because we needed the hashtag and tweet to be semantically inseparable. When a hashtag appears inside a tweet, it is most likely an inseparable semantic part of the tweet and carries important meaning. Therefore, we regarded such a hashtag as a keyphrase of the tweet.
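The filtering rules above can be sketched as a small function (a simplified illustration: the authors' exact regular expressions are not given, and the non-Latin check here is approximated by an ASCII test):

```python
import re

def keep_tweet(text):
    """Apply the filtering rules: Latin-only text, URLs stripped, no
    reply tweets, and exactly one hashtag located inside the tweet.
    Returns (cleaned_text, hashtag) or None if the tweet is rejected."""
    if re.search(r"[^\x00-\x7F]", text):              # rule: drop non-Latin tweets
        return None
    text = re.sub(r"https?://\S+", "", text).strip()  # rule: strip URL links
    if text.startswith("@"):                          # rule: drop replies
        return None
    hashtags = re.findall(r"#(\w+)", text)
    if len(hashtags) != 1:                            # rule: exactly one hashtag
        return None
    # rule: the hashtag must sit inside the tweet, not at either edge
    if text.startswith("#") or text.endswith("#" + hashtags[0]):
        return None
    return text, hashtags[0]
```

For the example tweet above, this would return the text paired with the hashtag "NBAFinals", which then serves as the gold-standard keyphrase.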
Each hashtag was split into keywords if it comprised more than one word, for example, "OldStockCanadians" into "Old Stock Canadians". After filtering, we finally had 110K tweets with hashtags that met our needs. The pseudocode is given in Alg. 1, and the statistics of the dataset are shown in Table 1. To evaluate the quality of the tweets in our dataset, we randomly selected 1,000 tweets and recruited three volunteers. Every tweet was assigned a score of 2 (perfectly suitable), 1 (suitable), or 0 (unsuitable) to indicate whether its hashtag was a good keyphrase for it. The results showed that 90.2% were suitable and 66.1% were perfectly suitable, demonstrating that our constructed dataset is well suited for keyphrase extraction on Twitter.
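The hashtag-splitting step can be illustrated with a simple camel-case heuristic (a sketch; the paper does not detail its exact splitting procedure):

```python
import re

def split_hashtag(tag):
    """Split a CamelCase hashtag into its component words. Inserts a space
    at lowercase/digit-to-uppercase boundaries ("OldStock" -> "Old Stock")
    and before the last capital of an acronym run ("NBAFinals" -> "NBA Finals")."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", tag)
```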

Experiment Configurations
To perform the keyphrase extraction experiments, we used 70% of the dataset as a training set, 10% as a development set, and 20% as a testing set. For evaluation metrics, we used precision (P), recall (R), and F1-score (F1). Precision is the percentage of keyphrases labelled by the system that are truly keyphrases; recall is the percentage of gold-standard keyphrases that the system correctly identifies.
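These metrics can be stated concretely as a straightforward set-based computation over extracted keyphrases:

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over sets of extracted keyphrases.
    Precision = true positives / predicted; recall = true positives / gold;
    F1 is their harmonic mean."""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```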
In the experiments, we used word embeddings as input to the neural network. The word embeddings used in this work were pre-trained vectors trained on part of a Google News dataset (about 100 billion words). A skip-gram model (Mikolov et al., 2013) was used to generate these 300-dimensional vectors for 3 million words and phrases. We used the word embeddings to initialize our word weight matrix, and the matrix was updated during training.
The default parameters of our model are as follows: the window size is 3, the number of neurons in each hidden layer is 300, and α is 0.5. These values were chosen based on the performance on the validation set.

Methods for Comparison
Several algorithms were implemented and used to evaluate the validity of the proposed approach. Among these algorithms, CRF, RNN, LSTM, and R-CRF treat the keyphrase extraction task as a sequence labelling task. Automatic keyword extraction on Twitter (AKET) uses an unsupervised method to extract keywords on Twitter.
• CRF: The keyphrase extraction task can be formalized as a sequence labeling task that involves the algorithmic assignment of a categorical label to each word of a tweet. CRF is a type of discriminative undirected probabilistic graphical model and can process a sequence labeling task. Hence, we applied CRF to extract keyphrases on Twitter.
• RNN: A recurrent neural network (RNN) is a type of artificial neural network where the connections between units form a directed cycle. This creates an internal state of the network that allows it to exhibit dynamic temporal behavior. In an RNN model, word embedding is introduced to represent the semantics of words.
• LSTM: Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture. Unlike traditional RNNs, an LSTM network is well-suited to learn from experience to classify, process, and predict time series when there are very long time lags of unknown size between important events.
• R-CRF: A recurrent conditional random field (R-CRF) (Yao et al., 2014) is a model that combines a recurrent neural network with a conditional random field for sequence labelling.

• AKET (Automatic Keyword Extraction on Twitter) (Marujo et al., 2015): Several unsupervised methods and word embeddings were used to construct features to obtain keywords.

Experiment Results
Table 2 shows the performances of the different methods on the dataset for keyphrase extraction. From the results, we observe that the joint-layer RNN achieved a better performance than the state-of-the-art methods. The relative improvement in the F-score of the joint-layer RNN over the second-best result was 6.1%. AKET performed the worst because it works only at the word level. Of the remaining methods, CRF performed the worst; RNN and LSTM performed almost the same but better than CRF; and R-CRF was the best of these methods, with the exception of our joint-layer RNN. These results can be explained by the word embeddings and the long short-term memory cell providing some benefit. The best result was obtained with our joint-layer RNN, which indicates that the joint processing of keyword finding and keyphrase extraction works well and demonstrates the effectiveness of our model for keyphrase extraction on Twitter.

To further analyze the keyword extraction results on Twitter, we compared AKET and our method. In Table 3, we can see that, except for recall, where AKET is slightly better, our method performed significantly better than AKET in precision and F-score. This indicates that our method is also effective at the keyword level. In summary, the experimental results demonstrate that the proposed joint-layer RNN method is superior to the state-of-the-art methods when measured using commonly accepted performance metrics on Twitter.
To analyze the sensitivity of the joint-layer RNN to its hyper-parameters, we conducted several empirical experiments on the dataset. Fig. 2(a) shows the performance of the joint-layer RNN with different numbers of neurons in the hidden layers. For simplicity, hidden layer 1 and hidden layer 2 were given the same number of neurons. In the figure, the x-axis denotes the number of neurons, and the y-axis denotes the precision, recall, and F-score. The test set was the same as in the previous section. From the figure, we observe that the number of neurons in the hidden layers does not strongly affect the final performance; the three performance indicators remain stable across different numbers of neurons.

Fig. 2(b) shows the performance of the joint-layer RNN with different window sizes. In the figure, the x-axis denotes the window size, and the y-axis denotes the precision, recall, and F-score. We observe that when the window size is one, the three performance indicators are poor; as the window size increases, they stabilize. The likely reason is that with a window size of one, the model uses only the current word, whereas with a larger window it uses the context of the current word, and the most important context information lies near the current word.

Fig. 2(c) shows the performance of the joint-layer RNN with different α values. In the figure, the x-axis denotes the value of α used for training, and the y-axis denotes the precision, recall, and F-score. We can see that the best performance is obtained when α is around 0.5, which indicates that our model benefits from balancing keyword finding and keyphrase extraction. Table 4 lists the effects of word embedding.
We can see that the performance when updating the word embeddings is better than when not updating them, and that pre-trained word embeddings perform slightly better than random initialization. The main reason is that the vocabulary size is 147,377, but only 35,133 of the words from the tweets exist in the word embeddings trained on the Google News dataset; that is, 76.2% of the words are missing. This also confirms that updating the embeddings makes the joint-layer RNN more suitable for keyphrase extraction on Twitter.

Fig. 3(a) shows the performance of the joint-layer RNN with different percentages of training data. In the figure, the x-axis denotes the percentage of data used for training, and the y-axis denotes the precision, recall, and F-score. We observe that as the amount of training data increases, the three performance indicators improve accordingly. When the percentage of training data exceeds 60% of the whole dataset, the indicators increase only slowly, possibly because the number of concepts contained in the additional data is small. On the other hand, this means that the proposed joint-layer RNN can achieve acceptable results with few ground-truth labels and can therefore be easily adopted for other datasets.
Since the keyphrase extraction model is trained using an iterative procedure, we also evaluated its convergence. Fig. 3(b) shows the precision, recall, and F-score of the joint-layer RNN. In the figure, the x-axis denotes the number of epochs used for optimizing the model, and the y-axis denotes the precision, recall, and F-score. We observe that the joint-layer RNN converges within six iterations, which means that it can reach a stable and superior performance quickly.

Related Work
In general, keyphrase extraction methods can be roughly divided into two groups: supervised machine learning approaches and unsupervised ranking approaches.
In the supervised line of research, keyphrase extraction is treated as a classification problem, in which a candidate must be classified as either a keyphrase or a non-keyphrase. A classifier is trained on annotated data, and the trained model is then applied to documents for which keyphrases are to be identified. For example, Frank et al. (1999) developed a system called KEA that used two features, tf-idf and the first occurrence of the term, as input to a Naive Bayes classifier. Hulth (2003) used linguistic knowledge (i.e., part-of-speech tags) to determine candidate sets: potential POS patterns were used to identify candidate phrases in the text. Tang et al. (2004) applied Bayesian decision theory to keyword extraction. Medelyan and Witten (2006) extended KEA to KEA++, which uses semantic information on terms and phrases extracted from a domain-specific thesaurus, thus enhancing automatic keyphrase extraction.

In the unsupervised line of research, keyphrase extraction is formulated as a ranking problem. A well-known approach is Term Frequency-Inverse Document Frequency (TF-IDF) (Sparck Jones, 1972; Zhang et al., 2007; Lee and Kim, 2008). Measures such as term frequencies (Wu and Giles, 2013; Rennie and Jaakkola, 2005; Kireyev, 2009), inverse document frequencies, topic proportions, and domain-specific knowledge are applied to rank the terms in documents, which are then aggregated to score the phrases. Ranking based on tf-idf has been shown to work well in practice (Hasan and Ng, 2010). Mihalcea and Tarau (2004) proposed TextRank, which constructs keyphrases using the PageRank values obtained from a graph-based ranking model applied to graphs extracted from texts. Liu et al. (2009) extracted keyphrases using a clustering-based approach, which ensures that the document is semantically covered by the keyphrases. Mehri and Darooneh (2011) put forward a method for ranking the words in texts using non-extensive statistical mechanics, which can also be used to classify the correlation range between word-type occurrences in a text.
Recurrent neural networks (RNNs) (Elman, 1990) are an important class of naturally deep architectures and have been applied to many sequential prediction tasks. In NLP, an RNN treats a sentence as a sequence of tokens; RNNs have been successfully applied to tasks such as spoken language understanding (Mesnil et al., 2013) and language modeling (Mikolov et al., 2011). Classical recurrent neural networks incorporate information only from preceding tokens, but several variants exist; bidirectional RNNs, in particular, are useful for NLP tasks, because the information provided by the following tokens is generally helpful when making a decision on the current token.

Conclusion
In this work, we proposed a novel deep recurrent neural network (RNN) model that combines keyword and context information to perform the keyphrase extraction task. The proposed model jointly processes the keyword ranking and keyphrase generation tasks. It has two hidden layers to discriminate keywords and classify keyphrases, and these two sub-objectives are combined into a final objective function. We evaluated the proposed method on a dataset filtered from ten million crawled tweets. The proposed method achieved better results than the state-of-the-art methods, and the experimental results demonstrated its effectiveness for keyphrase extraction on single tweets.