Using Human Attention to Extract Keyphrase from Microblog Post

This paper studies automatic keyphrase extraction on social media. Previous work has achieved promising results, but it neglects human reading behavior during keyphrase annotation. Human attention is a crucial element of reading behavior: it reveals how relevant each word is to the main topics of the target text. This paper therefore aims to integrate human attention into keyphrase extraction models. First, human attention is represented by the reading duration estimated from an eye-tracking corpus. Then, we merge human attention into neural network models through an attention mechanism. In addition, we integrate human attention into unsupervised models. To the best of our knowledge, we are the first to utilize human attention for keyphrase extraction. Experimental results show that our models achieve significant improvements on two Twitter datasets.


Introduction
The rapid growth of user-generated content on social media has far outpaced human reading and comprehension capacity. Keyphrase extraction is one technology that can help organize this massive content. A keyphrase consists of one or more salient words that represent the main topics of a document. It supports a series of downstream applications, e.g., text summarization (Zhao et al., 2011a) and information retrieval (Choi et al., 2012).
Generally, a corpus with human-annotated keyphrases is needed to train models in supervised keyphrase extraction frameworks. Before annotators can annotate keyphrases, they must read the corresponding content. Intuitively, features estimated from human reading behavior can therefore be leveraged to assist keyphrase extraction.
Previous studies on keyphrase extraction have ignored these features (Zhang et al., 2016, 2018). Thus, this paper aims to integrate reading behavior into keyphrase extraction frameworks. When reading, humans do not pay the same attention to all words (Carpenter and Just, 1983). Per-word reading time is indicative of textual (as well as lexical, syntactic, and semantic) processing (Demberg and Keller, 2008), and thus reflects human attention to various content. To obtain human attention during reading, this paper estimates eye fixation duration from an eye-tracking corpus, inspired by Carpenter and Just (1983) and Barrett et al. (2018). Modern eye-tracking equipment produces very rich and detailed datasets (Cop et al., 2017). We therefore utilize open-source eye-tracking corpora and do not require eye-tracking information for the target datasets.
To integrate human attention into keyphrase extraction models, this paper constructs a neural network model with an attention mechanism. The attention mechanism is a neural module designed to imitate human visual attention during reading and viewing (Bahdanau et al., 2014). To regularize the values predicted by the attention mechanism, human attention estimated from the eye-tracking corpus is leveraged as its ground truth. Quantitative and qualitative analyses demonstrate that our models outperform state-of-the-art models. In addition, we show that human attention is also effective in unsupervised keyphrase extraction models. We are, to the best of our knowledge, the first to integrate human attention into keyphrase extraction tasks.

Related Work
Recently, keyphrase extraction has been extended to social media (Zhao et al., 2011b; Bellaachia and Al-Dhelaan, 2012), e.g., Twitter and Sina Weibo. Early studies extract keyphrases using traditional supervised algorithms (Marujo et al., 2015), which depend on a large set of manually selected features. To overcome this drawback, neural network models, which learn features from the training corpus automatically, have been proposed and proven effective for keyphrase extraction. For instance, Zhang et al. (2016) propose a neural network model that extracts keyphrases from Tweets directly, and it suffers from a severe data sparsity problem. External knowledge can alleviate this problem: Zhang et al. (2018) encode conversation context consisting of Tweet replies in neural models. Their model outperforms Zhang et al. (2016), which proves the effectiveness of external knowledge. This paper follows the same line of integrating external knowledge into neural network models; specifically, we explore the idea of using human attention estimated from an available eye-tracking corpus to assist keyphrase extraction.
Open-source eye-tracking corpora of natural reading include the Dundee corpus (Ekbal et al., 2007) and GECO (Cop et al., 2017). Features of eye-tracking corpora include first fixation duration (FFD), total reading time (TRT), go-past time (GPT), etc. TRT has been applied to various natural language processing tasks, such as multi-word expression prediction (Rohanian et al., 2017) and sentiment analysis (Barrett et al., 2018). Thus, we select the TRT feature to represent human attention. Since the GECO corpus is open-source and in English, we estimate the TRT feature from it.

Keyphrase Extraction Framework
Formally, given a target microblog post $x_i$ formulated as a word sequence $\langle x_{i,1}, x_{i,2}, \cdots, x_{i,|x_i|} \rangle$, where $|x_i|$ denotes the length of $x_i$, we aim to produce a tag sequence $\langle y_{i,1}, y_{i,2}, \cdots, y_{i,|x_i|} \rangle$, where $y_{i,w}$ indicates whether $x_{i,w}$ is part of a keyphrase. As shown in Figure 1, our models use the character-level word embedding proposed by Jebbara and Cimiano (2017), but we omit this part of the architecture in the equations below:

$$y_{i,w} = \sigma(W_y h_{i,w} + b_y), \quad (1)$$

where $h_{i,w}$ is the representation of $x_{i,w}$ after passing through the Bi-directional LSTM (BiLSTM) layer, $W_y$ and $b_y$ are parameters to be learned, and $\sigma(\cdot)$ is a nonlinear function. In detail, $y_{i,w}$ has five possible values, following Zhang et al. (2016):

$$y_{i,w} \in \{\text{Single}, \text{Begin}, \text{Middle}, \text{End}, \text{Not}\}, \quad (2)$$

where Single represents that $x_{i,w}$ is a one-word keyword; Begin, Middle, and End represent that $x_{i,w}$ is the first, a middle, or the last word of a keyphrase, respectively; and Not represents that $x_{i,w}$ is not a keyword or part of a keyphrase.
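To make the tagset concrete, the following helper decodes a predicted tag sequence back into keyphrases. This is a hypothetical sketch: the function name and the single-letter tag abbreviations are ours, not the paper's.

```python
def decode_keyphrases(words, tags):
    """Decode the 5-value tagset into keyphrases.

    Tag abbreviations are ours: 'S' Single, 'B' Begin,
    'M' Middle, 'E' End, 'N' Not.
    """
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == 'S':           # one-word keyword
            phrases.append(word)
            current = []
        elif tag == 'B':         # first word of a keyphrase
            current = [word]
        elif tag == 'M' and current:
            current.append(word)
        elif tag == 'E' and current:
            current.append(word)
            phrases.append(' '.join(current))
            current = []
        else:                    # 'N' or a malformed sequence
            current = []
    return phrases
```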
From the hidden states, we directly predict word-level raw attention scores $\tilde{a}_{i,w}$:

$$\tilde{a}_{i,w} = \tanh(W_e h_{i,w} + b_e), \quad (3)$$

where $W_e$ and $b_e$ are parameters of the function $\tanh(\cdot)$. Then, we normalize these predictions into attention weights $a_{i,w}$:

$$a_{i,w} = \frac{\exp(\tilde{a}_{i,w})}{\sum_{j=1}^{k} \exp(\tilde{a}_{i,j})}, \quad (4)$$

where $k$ is the length of $x_i$. Inspired by Barrett et al. (2018), we combine two objectives: word-level and attention-level. The word-level objective minimizes the squared error between the outputs $y_{i,w}$ and the true word labels $\hat{y}_{i,w}$.
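The normalization described above is, following Barrett et al. (2018), a softmax over the words of one post. A minimal NumPy sketch, with our own function name (assuming the standard softmax form):

```python
import numpy as np

def attention_weights(raw_scores):
    """Normalize raw per-word attention scores to weights that sum to 1."""
    raw = np.asarray(raw_scores, dtype=float)
    exp = np.exp(raw - raw.max())  # subtract the max for numerical stability
    return exp / exp.sum()
```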
The attention-level objective, similarly, minimizes the squared error between the attention weights $a_{i,w}$ and the real human attention $\hat{a}_{i,w}$ estimated from the eye-tracking corpus.
When combined, $\lambda_{word}$ and $\lambda_{att}$ (both between 0 and 1) are utilized to trade off the loss functions at the word level and the attention level, respectively.
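The combined objective can be sketched as follows. This is a minimal NumPy illustration with our own function name; the default 0.7/0.3 values are the single-layer settings reported in the implementation details.

```python
import numpy as np

def combined_loss(y_pred, y_true, att_pred, att_true,
                  lam_word=0.7, lam_att=0.3):
    """Trade off the word-level and attention-level squared errors."""
    word_loss = np.mean((np.asarray(y_pred, float) - np.asarray(y_true, float)) ** 2)
    att_loss = np.mean((np.asarray(att_pred, float) - np.asarray(att_true, float)) ** 2)
    return lam_word * word_loss + lam_att * att_loss
```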
In addition to the above-mentioned single-layer models, we also use the joint-layer BiLSTM proposed by Zhang et al. (2016). As a multi-task learner, the joint-layer BiLSTM tackles two tasks with two types of outputs, $y^1_{i,w}$ and $y^2_{i,w}$. $y^1_{i,w}$ has a binary tagset, which indicates whether the word $x_{i,w}$ is part of a keyphrase or not. $y^2_{i,w}$ employs the 5-value tagset defined in Equation 2. There is an attention module on top of each BiLSTM layer with a corresponding prediction, so the loss changes with the number of layers in the model.
Experiment Settings

Twitter Dataset
Our experiments are conducted on two datasets, i.e., Daily-Life dataset and Election-Trec dataset.
Daily-Life This dataset was collected from January 2018 to April 2018 using Twitter's streaming API with a set of daily-life keywords.
Election-Trec This dataset is constructed from the open-source TREC2011 track dataset and the Election corpus (Zeng et al., 2018).

For keyphrase annotation, we follow Zhang et al. (2016) in using microblog hashtags as gold-standard keyphrases and filter all microblog posts by two rules: first, there is only one hashtag per post; second, the hashtag appears inside the post. We then remove all '#' characters before keyphrase extraction. For both Twitter datasets, we randomly sample 80%, 10%, and 10% for training, development, and testing, respectively. We preprocess both datasets with the Twitter NLP tool for tokenization. After filtering and preprocessing, the Daily-Life dataset and the Election-Trec dataset contain 16,047 and 30,264 Tweets, respectively. Table 1 shows the statistics of the two Twitter datasets.

Since there are no spaces between words in hashtags, we use the following strategies to segment them. There are two kinds of hashtags in the datasets: 'multi-word' hashtags, which contain both capital and lowercase letters, and 'single-word' hashtags, which are all lowercase or all capitals. If a hashtag is a 'multi-word', we segment it with two patterns: the first is (capital)(lowercase)+, which matches one capital followed by one or more lowercase letters; the second is (capital)+, which matches one or more capitals. During segmentation, the first pattern is applied before the second. We do not apply any segmentation if a hashtag is a 'single-word'.
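The two-pattern segmentation can be sketched with a regular expression. This is a hypothetical helper: the lookahead keeps a run of capitals together while still letting its last capital start a (capital)(lowercase)+ match, and the catch-all alternatives for lowercase runs and digits are our own additions.

```python
import re

# (capital)+ unless the last capital begins a (capital)(lowercase)+ match,
# then (capital)(lowercase)+, then our own catch-alls for leftovers
_PARTS = re.compile(r'[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+')

def segment_hashtag(tag):
    """Segment a 'multi-word' hashtag; leave 'single-word' hashtags alone."""
    if tag.islower() or tag.isupper():
        return tag
    return ' '.join(_PARTS.findall(tag))
```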

Eye-tracking Corpus
This paper estimates human attention from the GECO corpus (Cop et al., 2017), which is based on natural reading. In GECO, participants read part of the novel 'The Mysterious Affair at Styles' by Agatha Christie. Six male and seven female native English speakers participated, reading a total of 5,031 sentences. GECO provides various features, including First Fixation Duration (FFD) and Total Reading Time (TRT). In this paper, we use only the TRT feature, which represents the total human attention on a word during reading; this feature is also used by Carpenter and Just (1983) and Barrett et al. (2018). We then divide TRT values by the number of participants to get the average TRT (ATRT).
Human attention correlates with word frequency (Rayner and Duffy, 1988). Thus, ATRT is normalized by word frequency from the British National Corpus (BNC). Before normalizing, the BNC frequency per million is log-transformed and inverted (INV-BNC), such that rare words get a high value. ATRT and INV-BNC are each min-max-normalized to the range 0-1, and ATRT is multiplied by INV-BNC to obtain the normalized ATRT (N-ATRT). After preprocessing, there are 5,012 unique words in the dataset. Words that are not included in the GECO corpus, and thus have no corresponding N-ATRT value, are given the mean N-ATRT value. Table 1 shows the percentage of words that can be found in the GECO corpus.
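The normalization pipeline above can be sketched as follows. The function name is our own; we assume frequencies are already counts per million, that inversion means negating the log frequency, and that each input vector is non-constant so min-max normalization is well defined.

```python
import numpy as np

def n_atrt(atrt, bnc_freq_per_million):
    """Normalize average TRT by inverted, log-transformed BNC frequency."""
    atrt = np.asarray(atrt, dtype=float)
    # invert the log frequency so that rare words get a high value
    inv_bnc = -np.log(np.asarray(bnc_freq_per_million, dtype=float))
    minmax = lambda v: (v - v.min()) / (v.max() - v.min())
    return minmax(atrt) * minmax(inv_bnc)
```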

Implementation Details
In the training phase, we use a BiLSTM (Graves and Schmidhuber, 2005) with 300 dimensions. For single-layer models, $\lambda_{word}$ and $\lambda_{att}$ are set to 0.7 and 0.3, respectively. For joint-layer models, $\lambda^1_{word}$, $\lambda^1_{att}$, $\lambda^2_{word}$, and $\lambda^2_{att}$ are set to 0.4, 0.2, 0.2, and 0.2, respectively. Parameters are set according to the best development performance. The number of epochs is set to 5. We initialize the target post with embeddings pre-trained on 99M tweets with 27B tokens and a vocabulary of 4.6M words.

Baseline Models
We compare our models with a CRF (Zhang et al., 2008) and two kinds of neural network models: neural network models without an attention mechanism (BiLSTM), and neural network models with an attention mechanism that is not modified by human attention (A-BiLSTM). Similar to the HA-BiLSTM proposed in this paper, the BiLSTM and A-BiLSTM models employ both the single-layer and the joint-layer patterns. The parameter settings of the joint-layer pattern are the same as in Zhang et al. (2016). We compare model performance with the P, R, and F1 evaluation metrics.
BiLSTM model This model consists only of the character-level word embedding and the BiLSTM layer.
A-BiLSTM model This model consists of the character-level word embedding, the BiLSTM layer, and the attention mechanism. Unlike HA-BiLSTM, the attention mechanism in A-BiLSTM is not modified by human attention.

Overall Comparisons
Human attention estimated from the eye-tracking corpus helps improve the performance of neural network keyphrase extraction. As shown in Table 2, all F1 values of the models with human attention are higher than those of the baseline models. In this paper, human attention is represented by the per-word total reading time estimated from the eye-tracking corpus; the results thus indicate that integrating human reading behavior into neural networks is feasible.
An open-source eye-tracking corpus can improve model performance on datasets of different genres. Although the genre of the GECO eye-tracking corpus is fiction, which differs from the genre of the target datasets (microblogs), it still improves keyphrase extraction performance on the target datasets.

Qualitative Analysis
To qualitatively analyze why models with human attention generally perform better, we conduct a case study on two simple instances in Table 3 and Table 4. In Table 3, the keyphrase of the target post should be 'hillary clinton'. We compare the keyphrases produced by A-BiLSTM (Single) and HA-BiLSTM (Single). Interestingly, A-BiLSTM extracts two phrases, 'hillary clinton' and 'court'. This may be because the attention weight of 'court' is the largest among all words in the target post in A-BiLSTM. HA-BiLSTM identifies the correct keyphrase; in this model, the attention weight of 'court' is only the 6th largest among all words in the target post. The reason is that 'court' has a low N-ATRT value (0.024), which adjusts its attention weight downward.
In Table 4, the keyphrase of the target post should be 'entertainment'. As shown in Table 4, the A-BiLSTM model does not extract any phrase, while the HA-BiLSTM model extracts the correct keyphrase. This may be because the attention weight of 'entertainment' in A-BiLSTM is only the 13th largest among all words in the target post, while it is the third largest in HA-BiLSTM; the high N-ATRT value (0.147) of 'entertainment' in the GECO eye-tracking dataset adjusts the corresponding attention weight upward.

Analysis on Unsupervised Models
In this section, we explore the idea of applying human attention to TextRank (Mihalcea and Tarau, 2004), an unsupervised keyphrase extraction algorithm. As defined in Section 3, a Tweet $x_i$ consists of words $x_{i,1}, x_{i,2}, \cdots, x_{i,n}$. If $x_{i,m}$ appears within the window of $x_{i,j}$, there is an edge $e(x_{i,m}, x_{i,j})$ between these two words. Based on the graph composed of word vertices and edges, the importance of each word vertex can be calculated. In TextRank, the values of $x_{i,j}$ and $e(x_{i,m}, x_{i,j})$ are initialized uniformly. In our model, human-attention TextRank (HATR), we instead use human attention to initialize these values: the initial value of $x_{i,j}$ depends on its own N-ATRT value, and the initial value of $e(x_{i,m}, x_{i,j})$ depends on the N-ATRT values of $x_{i,m}$ and $x_{i,j}$. After extracting candidate words with HATR, we generate keyphrases by combining candidate words that are adjacent in the target posts.
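Our reading of this description can be sketched as a weighted PageRank over the co-occurrence graph. This is a hypothetical implementation: the function name, window size, damping factor, and the choice of summing the two endpoint N-ATRT values as the edge weight are our own assumptions.

```python
import numpy as np

def hatr(words, natrt, window=2, d=0.85, iters=50):
    """TextRank whose vertex and edge values start from N-ATRT scores."""
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    W = np.zeros((n, n))
    # build a symmetric co-occurrence graph weighted by human attention
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            u, v = idx[w], idx[words[j]]
            if u != v:
                weight = natrt.get(w, 0.0) + natrt.get(words[j], 0.0)
                W[u, v] += weight
                W[v, u] += weight
    # vertex scores are initialized from each word's own N-ATRT value
    scores = np.array([natrt.get(w, 0.0) for w in vocab]) + 1e-6
    out = np.maximum(W.sum(axis=1), 1e-12)  # guard isolated vertices
    for _ in range(iters):
        scores = (1 - d) + d * (W.T @ (scores / out))
    return dict(zip(vocab, scores))
```

Top-scoring vertices would then serve as candidate words, which are merged into keyphrases when adjacent in the post.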
As shown in Table 5, all the P, R, and F1 values of HATR are higher than those of TextRank. These observations indicate that integrating human attention during reading into TextRank is feasible. Moreover, considering more candidate keyphrases yields better keyphrase extraction performance.

Conclusion
In this paper, we augment neural network keyphrase extraction algorithms with human attention, represented by the total reading time (TRT) estimated from the GECO eye-tracking corpus. The proposed models yield better performance on two Twitter datasets. Moreover, human attention is also effective in unsupervised models.
In the future, we will first try to utilize more eye-tracking corpora and estimate more reading-behavior features. Then, we will attempt to analyze real human reading behavior on social media and thereby explore more specific human attention features for social media.