Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts

Existing keyphrase extraction methods suffer from the data sparsity problem when applied to short and informal texts, especially microblog messages. Enriching context is one way to alleviate this problem. Considering that conversations are formed by reposting and replying to messages, they provide useful clues for recognizing essential content in target posts and are therefore helpful for keyphrase identification. In this paper, we present a neural keyphrase extraction framework for microblog posts that takes their conversation context into account, where four types of neural encoders, namely, averaged embedding, RNN, attention, and memory networks, are proposed to represent the conversation context. Experimental results on Twitter and Weibo datasets show that our framework with such encoders outperforms state-of-the-art approaches.


Introduction
The increasing popularity of microblogs results in a huge volume of daily-produced user-generated data. Such explosive growth of data far outpaces human beings' reading and understanding capacity. Techniques that can automatically identify critical excerpts from microblog posts are therefore in growing demand. Keyphrase extraction is one technique that can meet this demand, because it is defined to identify salient phrases, generally formed by one or multiple words, that represent the key focus and main topics of a given collection (Turney, 2000; Zhao et al., 2011). Particularly for microblogs, keyphrase extraction has been proven useful for downstream applications such as information retrieval (Choi et al., 2012), text summarization (Zhao et al., 2011), event tracking (Ribeiro et al., 2017), etc.

Table 1: An example conversation about "president Duterte" on Twitter. [Ri]: the i-th message in the conversation, ordered by posting time. "president Duterte" is the keyphrase to be detected; italic words are related to the main topic of the conversation and can indicate the keyphrase.
Target post for keyphrase extraction: "I will curse you in that forum" is the lowest of low. You are an embarrassment president Duterte. Childish!
Messages forming a conversation:
[R1]: any head of state will be irked if asked to report to another head of state
[R2]: Really? Did Obama really asked Duterte to report to him? LOL

* Work was done during an internship at Tencent AI Lab.
1 Our datasets are released at: http://ai.tencent.com/ailab/Encoding_Conversation_Context_for_Neural_Keyphrase_Extraction_from_Microblog_Posts.html
To date, most efforts on keyphrase extraction for microblogs treat messages as independent documents or sentences, and then apply ranking-based models (Zhao et al., 2011; Bellaachia and Al-Dhelaan, 2012; Marujo et al., 2015) or sequence tagging models to them. It is arguable that these methods are suboptimal for recognizing salient content from short and informal messages due to the severe data sparsity problem. Microblogs allow users to form conversations on issues of interest by reposting with comments and replying to messages to voice opinions on previously discussed points. These conversations can enrich the context of short messages (Chang et al., 2013; Li et al., 2015), and have been proven useful for identifying topic-related content (Li et al., 2016). For example, Table 1 displays a target post with the keyphrase "president Duterte" and its reposting and replying messages forming a conversation.
As can be easily identified, critical words are often mentioned multiple times in conversations; for instance, the keyword "Duterte" re-occurs in [R2]. Also, topic-relevant content, e.g., "head of state", "another head of state", and "Obama", helps to indicate the keyphrase "president Duterte". Such contextual information embedded in a conversation is nonetheless ignored by existing keyphrase extraction approaches.
In this paper, we present a neural keyphrase extraction framework that exploits conversation context, which is represented by neural encoders that capture salient content to help indicate keyphrases in target posts. Conversation context has been proven useful in many NLP tasks on social media, such as sentiment analysis (Ren et al., 2016), summarization (Chang et al., 2013; Li et al., 2015), and sarcasm detection (Ghosh et al., 2017). We use four context encoders in our model, namely, averaged embedding, RNN (Pearlmutter, 1989), attention (Bahdanau et al., 2014), and memory networks (Weston et al., 2015), which have been proven useful in text representation (Weston et al., 2015; Nie et al., 2017). Particularly for this task, to the best of our knowledge, we are the first to encode conversations for detecting keyphrases in microblog posts. Experimental results on Twitter and Sina Weibo datasets demonstrate that, by effectively encoding context in conversations, our proposed approach outperforms existing approaches by a large margin. Quantitative and qualitative analyses suggest that our framework performs robustly on keyphrases of various lengths. Some encoders, such as memory networks, can detect salient and topic-related content whose occurrences are highly indicative of keyphrases. In addition, we test ranking-based models with and without considering conversations. The results also confirm that conversation context can boost keyphrase extraction with ranking-based models.

Keyphrase Extraction with Conversation Context Encoding
Our keyphrase extraction framework consists of two parts, i.e., a keyphrase tagger and a conversation context encoder. The keyphrase tagger aims to identify keyphrases in a target post, and the context encoder captures the salient content in conversations that indicates keyphrases in the target post. The entire framework is learned jointly from the given target posts and their corresponding conversation contexts. In prediction, the keyphrase tagger identifies keyphrases in a post with the help of representations generated by the encoder. Figure 1 shows the overall structure of our keyphrase extraction framework. In the rest of this section, Section 2.1 describes the keyphrase taggers used in our framework; Section 2.2 gives the details of the different context encoders.

Figure 1: The overall structure of our keyphrase extraction framework with context encoder. Grey dotted arrows refer to the inputs of target posts that are also used in context encoding.

Table 2: The 5-value tagset.
SINGLE: x_{i,t} is a one-word keyphrase (keyword).
BEGIN: x_{i,t} is the first word of a keyphrase.
MIDDLE: x_{i,t} is part of a keyphrase, but it is neither the first nor the last word of the keyphrase.
END: x_{i,t} is the last word of a keyphrase.
NOT: x_{i,t} is not a keyword or part of a keyphrase.

Keyphrase Taggers
We cast keyphrase extraction as a sequence tagging task, following prior work. Formally, given a target microblog post x_i formulated as a word sequence <x_{i,1}, x_{i,2}, ..., x_{i,|x_i|}>, where |x_i| denotes the length of x_i, we aim to produce a tag sequence <y_{i,1}, y_{i,2}, ..., y_{i,|x_i|}>, where y_{i,t} indicates whether x_{i,t} is part of a keyphrase. In detail, y_{i,t} has five possible values; Table 2 lists the definition of each value. Previous work has shown that keyphrase extraction methods with this 5-value tagset perform better than those with binary outputs, i.e., only marking yes or no for a word being part of a keyphrase.
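As an illustrative sketch (not code from the original implementation), the 5-value tagset of Table 2 can be produced from gold-standard keyphrase spans as follows; the helper name and span format are our own:

```python
# Hypothetical helper illustrating the 5-value tagset of Table 2.
def tag_sequence(num_words, keyphrase_spans):
    """keyphrase_spans: list of (start, end) inclusive word indices."""
    tags = ["NOT"] * num_words
    for start, end in keyphrase_spans:
        if start == end:
            tags[start] = "SINGLE"      # one-word keyphrase (keyword)
        else:
            tags[start] = "BEGIN"       # first word of the keyphrase
            tags[end] = "END"           # last word of the keyphrase
            for t in range(start + 1, end):
                tags[t] = "MIDDLE"      # interior word of the keyphrase
    return tags

# "You are an embarrassment president Duterte" with the keyphrase
# "president Duterte" at word positions (4, 5):
print(tag_sequence(6, [(4, 5)]))
# → ['NOT', 'NOT', 'NOT', 'NOT', 'BEGIN', 'END']
```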

In addition to taggers with a single type of output, we also use the joint-layer RNN, which has been demonstrated to be the state-of-the-art keyphrase tagger in previous work without modeling conversation context. As a multi-task learner (Collobert and Weston, 2008), the joint-layer RNN tackles two tasks with two types of outputs, y^1_{i,t} and y^2_{i,t}. y^1_{i,t} has a binary tagset, which indicates whether word x_{i,t} is part of a keyphrase or not; y^2_{i,t} employs the 5-value tagset defined in Table 2. Besides the standard RNN version, in implementation we also build the joint-layer RNN with its GRU, LSTM, and BiLSTM counterparts. To be consistent, taggers with one type of output using the 5-value tagset are named single-layer taggers.
As shown in Figure 1, our keyphrase tagger is built upon an input feature map I(·), which embeds each word x_{i,t} in the target post into a dense vector, i.e., I(x_{i,t}) = ν_{i,t}. We initialize the input feature map with pre-trained embeddings and update the embeddings during training.
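A minimal numerical sketch of a tagger with two output heads over a shared recurrent layer, in the spirit of the joint-layer RNN described above (all dimensions and weights here are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

d, h, T = 8, 16, 5                       # toy sizes: embedding, state, post length
W_in = rng.normal(size=(d, h)) * 0.1     # input-to-hidden weights
W_rec = rng.normal(size=(h, h)) * 0.1    # recurrent weights
W_y1 = rng.normal(size=(h, 2)) * 0.1     # head 1: binary tagset (y^1)
W_y2 = rng.normal(size=(h, 5)) * 0.1     # head 2: 5-value tagset (y^2)

x = rng.normal(size=(T, d))              # embedded target post of T words
state = np.zeros(h)
y1, y2 = [], []
for t in range(T):
    state = np.tanh(x[t] @ W_in + state @ W_rec)  # shared recurrent state
    y1.append(softmax(state @ W_y1))              # keyphrase word or not
    y2.append(softmax(state @ W_y2))              # SINGLE/BEGIN/MIDDLE/END/NOT
```

Each time step emits two probability distributions, one per task, from the same hidden state; the multi-task setup lets the easier binary task regularize the 5-value task.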

Context Encoders
We aggregate all reposting and replying messages in a conversation, ordered by posting time, to form a pseudo-document as context, and feed the context into the context encoder as a word sequence. Letting x^c_i denote the context word sequence of target post x_i, we propose four methods to encode x^c_i, namely, averaged embedding, RNN, attention, and memory networks. Similar to the keyphrase taggers (see Section 2.1), each word x^c_{i,s} in context x^c_i takes the form of a vector ν^c_{i,s} mapped by an input layer I^c(·), which is also initialized with pre-trained embeddings and updated during training.
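The pseudo-document construction above can be sketched as follows (a toy illustration with our own function name and message format):

```python
# Aggregate reposts/replies of one target post into a pseudo-document,
# ordered by posting time, to serve as the conversation context.
def build_context(messages):
    """messages: list of (posting_time, text) for one conversation."""
    ordered = sorted(messages, key=lambda m: m[0])     # order by posting time
    return " ".join(text for _, text in ordered).split()

conv = [(2, "Really? Did Obama really asked Duterte to report to him? LOL"),
        (1, "any head of state will be irked if asked to report")]
context_words = build_context(conv)
print(context_words[:4])   # → ['any', 'head', 'of', 'state']
```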

Averaged Embedding
As a straightforward sentence representation technique, averaged embedding simply takes the average of the embeddings of the words in a context, i.e., the ν^c_{i,s}, as the context representation:

r^c_i = (1 / |x^c_i|) * Σ_{s=1}^{|x^c_i|} ν^c_{i,s}

where |x^c_i| is the number of words in the context x^c_i.
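A minimal sketch of this encoder (function name is ours):

```python
import numpy as np

# Averaged-embedding encoder: the context representation is simply the
# mean of the context word vectors.
def avg_embedding(context_vectors):
    """context_vectors: embeddings of context words, shape (|x^c_i|, d)."""
    return context_vectors.mean(axis=0)   # mean over the word dimension

V = np.array([[1.0, 2.0], [3.0, 4.0]])
print(avg_embedding(V))   # → [2. 3.]
```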

RNN
RNN encoders apply the recurrent neural network model to the embedded context sequence:

h^c_{i,s} = δ_h(W^1_h ν^c_{i,s} + W^2_h h^c_{i,s-1})

where W^1_h and W^2_h are learnable weight matrices, and δ_h is the component-wise sigmoid function. The encoded representation is given by the hidden units at the last step:

r^c_i = h^c_{i,|x^c_i|}

In this paper, RNN-based encoders have four variants, namely, RNN, GRU, LSTM, and BiLSTM. In particular, as BiLSTM has two opposite directions, its context representation takes the concatenation of the last states of both directions, which come from the two ends of a given context.
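The recurrence above can be sketched as (illustrative function name and shapes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain-RNN context encoder: the representation is the hidden state
# after the last word of the context.
def rnn_encode(context_vectors, W1, W2):
    """context_vectors: (S, d); W1: (h, d); W2: (h, h)."""
    h = np.zeros(W2.shape[0])
    for v in context_vectors:            # h_s = sigmoid(W1 v_s + W2 h_{s-1})
        h = sigmoid(W1 @ v + W2 @ h)
    return h                             # last hidden state = context encoding
```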

Attention
Attention-based encoders put an attention mechanism (Bahdanau et al., 2014) on top of the RNN model for "soft-addressing" important words in the conversation context. In this paper, we use feed-forward attention (Raffel and Ellis, 2015; Sønderby et al., 2015), as shown in Figure 2. The encoded representation is

r^c_i = Σ_s α^c_{i,s} h^c_{i,s}

where α^c_{i,s} is the attention coefficient obtained for word x^c_{i,s}, which implicitly reflects its importance for keyphrase identification. α^c_{i,s} is computed via a softmax over the hidden states:

α^c_{i,s} = exp(a(h^c_{i,s})) / Σ_{s'} exp(a(h^c_{i,s'}))

where a(·) is a learnable function formulated as

a(h^c_{i,s}) = tanh(W_a h^c_{i,s})

which takes input only from h^c_{i,s}. W_a holds the parameters of a(·) to be learned.
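A toy sketch of this feed-forward attention (the scoring vector here is a single weight vector, an assumption on our part):

```python
import numpy as np

# Feed-forward attention over RNN hidden states:
# a(h_s) = tanh(w_a . h_s); alpha = softmax over steps; r = sum alpha_s * h_s.
def attend(H, w_a):
    """H: RNN hidden states, shape (S, h); w_a: scoring weights, shape (h,)."""
    scores = np.tanh(H @ w_a)            # a(h_s): one scalar score per step
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                  # attention coefficients (sum to 1)
    return alpha @ H, alpha              # weighted sum of hidden states
```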

Memory Networks
The encoder based on memory networks (MemNN) (Weston et al., 2015) stores and updates the representations of conversation contexts in a memory module. The updated representations are used to guide the keyphrase tagger. Figure 3 illustrates its structure.
Formally, each embedded context sequence is stored in the memory module as a sequence of memory vectors M_i = <m_{i,1}, ..., m_{i,|x^c_i|}>. We then compute the match between the embedded target post V_i = <ν_{i,1}, ν_{i,2}, ..., ν_{i,|x_i|}> and the context memory M_i by their inner product, activated by a softmax:

P_i = softmax(M_i V_i^T)

where P_{i,j,j'} captures the similarity between the j-th word in conversation context x^c_i and the j'-th word in target post x_i.
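The matching step can be sketched numerically as follows (an illustration under our own naming; we normalize over context words for each target-post word, as in end-to-end memory networks):

```python
import numpy as np

# Memory match: inner products between context memory M and embedded
# target post V, normalized by a softmax over the context words.
def memory_match(M, V):
    """M: (|x^c_i|, d) context memory; V: (|x_i|, d) embedded target post."""
    scores = M @ V.T                          # raw inner products
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)   # P[j, j']: context word j vs post word j'
```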
To transform the context input x^c_i into an aligned form that can be added to P_i, we include another embedding matrix C_i = <μ_{i,1}, ..., μ_{i,|x^c_i|}>. Similar to the attention encoder, the MemNN encoder aims to generate a representation that addresses the important parts of the conversation context for tagging keyphrases in target post x_i. The sum of C_i and the matching matrix P_i serves as the encoded representation of the conversation context:

O_i = P_i + C_i

In particular, both attention and MemNN explore salient words in conversations that describe the main focus of the conversation, which helps indicate keyphrases of a target post. In comparison, MemNN explicitly exploits the affinity between target posts and conversations in matching each other, while attention implicitly highlights certain context without taking target posts into account.


Experiment Setup

Datasets
Our experiments are conducted on two datasets collected from Twitter and Weibo, respectively. The Twitter dataset is constructed based on the TREC2011 microblog track. To recover conversations, we used the Tweet Search API to retrieve the full information of a tweet with its "in reply to status id" included; recursively, we searched the "in reply to" tweet until the entire conversation was recovered. Note that we do not consider retweet relations, i.e., reposting behaviors on Twitter, because retweets provide limited extra textual information: Twitter did not allow users to add comments in retweets until 2015. To build the Weibo dataset, we tracked real-time trending hashtags on Weibo and used the hashtag-search API to crawl the posts matching the given hashtag queries. In the end, a large-scale Weibo corpus was built containing messages posted from January 2nd to July 31st, 2014. For keyphrase annotation, we follow prior work in using microblog hashtags as gold-standard keyphrases 8 and filtered all microblog posts by two rules: first, there is only one hashtag per post; second, the hashtag is inside the post, i.e., it contains neither the first nor the last word of the post. Then, we removed all the "#" symbols in hashtags before keyphrase extraction. For both the Twitter and Weibo datasets, we randomly sample 80% for training, 10% for development, and the remaining 10% for test. Table 3 reports the statistics of the two datasets. A previously released dataset is not used because it does not contain conversation information.
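The two filtering rules can be sketched as a simple predicate (our own helper, operating on tokenized posts):

```python
# Keep a post only if it has exactly one hashtag and that hashtag sits
# strictly inside the post (neither the first nor the last token).
def keep_post(tokens):
    tags = [i for i, w in enumerate(tokens) if w.startswith("#")]
    return len(tags) == 1 and 0 < tags[0] < len(tokens) - 1

print(keep_post(["I", "love", "#nlp", "today"]))      # → True
print(keep_post(["#nlp", "is", "fun"]))               # → False (first token)
print(keep_post(["one", "#a", "two", "#b", "end"]))   # → False (two hashtags)
```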
We preprocessed the Twitter dataset with the Twitter NLP tool 9 (Gimpel et al., 2011; Owoputi et al., 2013) for tokenization. For the Weibo dataset, we used the NLPIR tool 10 (Zhang et al., 2003) for Chinese word segmentation. In particular, Weibo conversations have a relatively wide length range (from 3 to 8,846 words); e.g., one conversation can contain up to 447 messages. If we used the maximum length of all conversations as the input length for the encoders, padding the inputs would lead to very sparse matrices. Therefore, for long conversations (with more than 10 messages), we use KLSum (Haghighi and Vanderwende, 2009) to produce summaries with a length of 10 messages and then encode the produced summaries. In contrast, we do not summarize Twitter conversations because their length range is much narrower (from 4 to 1,035 words).

8 Prior work shows that 90% of the hashtag-annotated keyphrases match human annotations.
9 http://www.cs.cmu.edu/~ark/TweetNLP/
10 https://github.com/NLPIR-team/NLPIR

Model Settings
For keyphrase taggers based on RNN, GRU, and LSTM, we follow prior work and set their state size to 300. For the BiLSTM tagger, which has two directions, we set the state size of each direction to 150. The joint-layer taggers employ the same hyper-parameters as in previous work. The state sizes of the context encoders share the same settings as the keyphrase taggers. In training, the entire keyphrase extraction framework uses the cross-entropy loss and the RMSprop optimizer (Graves, 2013) for parameter updating.
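As a toy illustration of the training objective and optimizer (hyperparameter values here are ours, not the paper's):

```python
import numpy as np

# Cross-entropy over per-word tag distributions, and one RMSprop step.
def cross_entropy(probs, gold):
    """probs: (T, K) predicted tag distributions; gold: (T,) tag indices."""
    return -np.log(probs[np.arange(len(gold)), gold]).mean()

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2   # running avg of squared grads
    w = w - lr * grad / (np.sqrt(cache) + eps)        # scaled gradient step
    return w, cache
```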
We initialize input feature map I for target post and I c for conversation context by embeddings pre-trained on large-scale external microblog collections from Twitter and Weibo. Twitter embeddings are trained on 99M tweets with 27B tokens and 4.6M words in the vocabulary. Weibo embeddings are trained on 467M Weibo messages with 1.7B words and 2.5M words in the vocabulary.
For comparison, we employ neural taggers without conversation context encoding, which are based on RNN, GRU, LSTM, and BiLSTM. We also compare our models with the state-of-the-art joint-layer RNN and its GRU, LSTM, and BiLSTM variations.
We also run ranking-based baselines, namely TF-IDF, TextRank, and KEA (see Section 4.6), under two experiment settings: 1) each target post is treated as a document; 2) each conversation (containing the target post) is treated as a document. We select the top N words for each target post by their ranked order, where the threshold N is tuned on the development set; as a result, N ranges from 2 to 7 for the various methods. In particular, since TF-IDF and TextRank extract keywords instead of keyphrases, we aggregate the selected keywords following Bellaachia and Al-Dhelaan (2012).

Experimental Results
Sections 4.1 to 4.5 present quantitative and qualitative analyses of our neural keyphrase extraction models. Section 4.6 reports the performance of ranking-based models, where we test the general applicability of incorporating conversation context into non-neural keyphrase extraction methods. Table 4 and Table 5 report F1 scores on Twitter and Weibo, respectively. We have the following observations.

Overall Comparisons
Conversation context is useful for keyphrase extraction. By combining the encoded context in conversations, the F1 scores of all taggers are better than their basic versions without context encoders. It confirms that content in conversations helps in indicating keyphrases in target posts.
Selecting the correct context encoder is important. Encoding context simply with RNN or GRU yields poor results. The reason for RNN is that it suffers from the gradient vanishing problem when encoding long conversations: conversations in our two datasets have over 45 words on average. The reason for GRU is that its forget gates may not be well trained to process important content when the training set is small. (We also tried BiRNN and BiGRU as keyphrase taggers and as context encoders; they are outperformed by BiLSTM, and we do not report these results due to space limitations.)
The results of AvgEmb are the worst on Twitter but competitive with other encoders on Weibo. On Weibo, the performance of AvgEmb is competitive with more complex context encoders. The reason may be that incorrect word order generally does not prevent understanding in Chinese, and word-order misuse is prevalent in Chinese Weibo messages. As a result, encoding word order, as all encoders except AvgEmb do, might bring noise to keyphrase extraction on the Weibo dataset. In contrast, AvgEmb is the worst encoder on the Twitter dataset, as word order is crucial in English.
Identifying salient content in context is important. The four types of context encoders behave differently. AvgEmb considers all words in the conversation context equally important. RNN-variant context encoders, i.e., RNN, GRU, LSTM, and BiLSTM, additionally explore the relations between successive words, but without distinguishing salient from non-salient words. Attention (Att (LSTM) and Att (BiLSTM)) and MemNN can recognize critical content in conversations that indicates keyphrases in target posts. Therefore, our keyphrase extraction framework with an attention or MemNN encoder generally achieves better F1 scores than those with other encoders.
MemNN can effectively capture salient content in context. On the Twitter dataset, MemNN achieves the best F1 scores when combined with various keyphrase taggers, except for single-layer GRU and BiLSTM. On the Weibo dataset, although MemNN does not always outperform other encoders, its performance is close to the best ones.

Table 6: The F1 scores of BiLSTM taggers measured on test instances without conversation context (%). SL BiLSTM and JL BiLSTM denote the single-layer and joint-layer BiLSTM taggers, respectively. The other abbreviations are defined as in Table 4.

Test without Conversation Context
Although we have shown in the previous section that conversation context is useful for training effective models for keyphrase extraction on microblog posts, conversation context may be unavailable for some microblog posts, namely those that do not spark any repost or reply messages. Under this circumstance, models trained on messages with conversation context might be affected when extracting keyphrases from messages without it. To study whether conversation context is critical at test time, we assume that conversations are only available for the training data, while all target posts in the test set have no context to leverage. To this end, we apply the models trained for the experiment in Section 4.1 to the test posts without using their conversation context; in prediction, the context encoders of the trained models take the target posts instead of the conversations as input. Results are reported in Table 6, where models with context encoders yield better F1 scores than their counterparts without such encoders, regardless of whether conversation context is provided for the test data. This observation indicates that encoding conversations in the training data helps in learning effective keyphrase extraction models, which benefits keyphrase detection in a microblog post with or without its conversation context. In addition, by comparing Table 6 with Tables 4 and 5, we find that, for each model with a context encoder, higher F1 scores are observed when conversation context is used at test time. This confirms that the conversation context of target posts helps in indicating keyphrases in prediction.

Figure 4: The heatmap of the context representation generated by MemNN (see Eq. 8). The horizontal axis refers to words in the conversation context; the vertical axis refers to words in the target post. Darker colors indicate higher weights. The red box indicates the keyphrase to be detected.

Qualitative Analysis
To qualitatively analyze why the MemNN encoder generally performs better, we conduct a case study on the sample instance in Table 1. Recall that the keyphrase should be "president Duterte". We compare the keyphrases produced by the joint-layer BiLSTM tagger with various context encoders, given in Table 7. Of all models, only the one with the MemNN encoder tags correctly. Interestingly, AvgEmb does not extract any keyphrase. The reason might be that it considers each word in the conversation independent and equally important; therefore, non-topic words like "if" and "LOL" may distract the keyphrase tagger from identifying the key information. Models with BiLSTM, Att (BiLSTM), and the basic model without an encoder mistakenly extract the sentiment word "childish", since sentiment words are prominent on Twitter. We also visualize the context representation generated by MemNN for the conversation context in the heatmap shown in Figure 4.
It is observed that MemNN highlights different types of words for keyphrases and non-keyphrases. For keyphrases, MemNN highlights topical words such as "Obama". For non-keyphrases, MemNN highlights non-topic words, e.g., "be", "to". Therefore, features learned for keyphrases and non-keyphrases are different, which can thus benefit keyphrase tagger to correctly distinguish keyphrases from non-keyphrases.

Keyphrases with Various Lengths
To further evaluate our methods, we investigate them on keyphrases with various lengths. Figure 5 shows the histograms of F1 scores yielded by a single-layer and a joint-layer tagger on Twitter and Weibo for different keyphrase lengths. Note that we only report the results of BiLSTM taggers because their overall F1 scores are the best according to Table 4 and Table 5.

Table 7: Outputs of joint-layer BiLSTM combined with various context encoders given the example illustrated in Table 1. "NULL": AvgEmb did not produce any keyphrase.
In general, the F1 scores of all models decrease as keyphrases become longer, which implies that detecting longer keyphrases is harder than detecting short ones. Comparing the different context encoders, we observe that MemNN obtains the best F1 scores in detecting long keyphrases. This is because MemNN highlights salient content in the conversation context by jointly considering its similarity with keyphrases in target posts: when keyphrases become longer, more words in the context are highlighted, which helps the keyphrase tagger. For short keyphrases, MemNN is still competitive with the other context encoders. These observations suggest that MemNN is robust in detecting keyphrases of various lengths.

Error Analysis
In this section, we briefly discuss the errors found in our experiments. One major type of incorrect prediction is additionally extracting neighboring words surrounding a gold-standard keyphrase. For example, in the tweet "Hillary Clinton accepted gifts from UAE, Saudi Arabia, Oman and others while SOS. CROOKED Podesta Emails 29 ...", in addition to the gold-standard "Podesta Emails 29", our models also extract "CROOKED". In general, these additionally extracted words are mostly modifiers of keyphrases; external features for identifying modifiers could be used to filter out these auxiliary parts of a keyphrase.
Another main source of error comes from words that are not keyphrases in target posts but reflect the topics of conversations. For example, the joint-layer BiLSTM tagger with the MemNN encoder mistakenly extracts "Hillary" as a keyphrase for "DOUBLE STANDARD: Obama DOJ Prosecuted Others For Leaking FAR LESS Than Hillary Espionage URL", whose keyphrase should be "Espionage".
Because the corresponding conversation of this post centers on "Hillary" instead of "Espionage", such information is captured by the context encoder, which leads to an incorrect keyphrase prediction. However, this type of error points out the potential of extending our framework to extracting keyphrases from conversations instead of a single post, which would be beneficial for generating summary-worthy content for conversations (Fernández et al., 2008; Loza et al., 2014).

Ranking-based Models

Table 8 reports the results of ranking models on Twitter and Weibo. We have the following observations. First, tagging-based models perform much better than ranking-based ones in keyphrase extraction. Comparing the results in Table 8 with those in Table 4 and Table 5, all neural taggers outperform non-neural ranking-based models by a large margin. This fact, again, confirms that keyphrase extraction is a challenging task on short microblog messages; compared to ranking-based models, neural tagging models have the ability to capture indicative features. Second, conversation context improves ranking-based models by a large margin. Simply by aggregating conversations into a pseudo-document, the F1 scores of TF-IDF, TextRank, and KEA are generally better than their counterparts that only use target posts. For TF-IDF and TextRank, which are unsupervised, context remarkably improves recall by enriching topic-related words; for the supervised method KEA, context improves both precision and recall, because supervision helps in identifying good features from conversations.

Table 8: Precision, recall, and F1 scores of ranking-based baselines (%). w/o context: each target post is treated as a document; w/ context: each conversation and its corresponding target post is treated as a document.

Related Work
Previous work on keyphrase extraction mainly focuses on formal texts like news reports (Wan and Xiao, 2008) and scientific articles (Nguyen and Kan, 2007). Existing keyphrase extraction models can be categorized into ranking-based models and tagging-based models. Ranking-based methods include models based on graph ranking (Mihalcea and Tarau, 2004; Wan and Xiao, 2008), text clustering (Liu et al., 2009), TF-IDF (Jones, 2004; Zhang et al., 2007; Lee and Kim, 2008; Kireyev, 2009; Wu and Giles, 2013), etc. The empirical study by Hasan and Ng (2010) shows that TF-IDF has robust performance and can serve as a strong baseline. Tagging models focus on using manually-crafted features in binary classifiers to predict keyphrases (Frank et al., 1999; Tang et al., 2004; Medelyan and Witten, 2006). Our models are in the line of tagging approaches, and provide an alternative that incorporates additional knowledge from conversations. Recently, keyphrase extraction methods have been extended to social media texts (Zhao et al., 2011; Bellaachia and Al-Dhelaan, 2012; Marujo et al., 2015). These works suffer from the data sparsity issue because social media texts are normally short. Also, they only use internal information in the input text and ignore external knowledge in the conversation context. Our work thus provides an improved approach that compensates for these limitations.

Conclusion
This work presents a keyphrase extraction framework for microblog posts that considers conversation context to alleviate data sparsity in short and colloquial messages. The posts to be tagged are enriched with conversation context through four types of encoders based on averaged embedding, RNN, attention, and memory networks, which are effective in capturing salient content in conversations that is indicative for keyphrase identification. Experimental results on the Twitter and Weibo datasets show that, by effectively encoding conversation context, our proposed models outperform existing approaches by a large margin. Qualitative analysis confirms that our context encoders capture critical content in conversations.