A Cognition Based Attention Model for Sentiment Analysis

Attention models are proposed in sentiment analysis because some words are more important than others. However, most existing methods use either local context based text information or user preference information. In this work, we propose a novel attention model trained by cognition grounded eye-tracking data. A reading prediction model is first built using eye-tracking data as dependent data and other features in the context as independent data. The predicted reading time is then used to build a cognition based attention (CBA) layer for neural sentiment analysis. As a comprehensive model, CBA can capture the attention of words in sentences as well as of sentences in documents. Different attention mechanisms can also be incorporated to capture other aspects of attention. Evaluations show that the CBA based method significantly outperforms the state-of-the-art local context based attention methods. This brings insight into how cognition grounded data can be brought into NLP tasks.


Introduction
Sentiment analysis is critical for many applications such as sentimental product recommendation (Dong et al., 2013), public opinion detection (Pang et al., 2008), and human-machine interaction (Clavel and Callejas, 2016). Sentiment analysis has been well-explored (Pang et al., 2002; Vanzo et al., 2014; Tang et al., 2015a; Maas et al., 2011). Recently, deep learning based methods have further elevated the performance of sentiment analysis without the need for labor intensive feature engineering.
Attention models are incorporated into sentiment analysis because not all words are created equal. Some words are more important than others in conveying the message in a sentence. Similarly, some sentences are more important than others in a document. Although the overall reading time as a cognitive process may reflect the syntax and discourse complexity, reading time of individual words is also an indicator of their semantic importance in text (Roseman, 2001; Demberg and Keller, 2008). Previous attention models are built using information embedded in text including users, products and text in local context for sentiment classification (Tang et al., 2015b; Yang et al., 2016; Gui et al., 2016). However, attention models using local context based text through distributional similarity lack a theoretical foundation that reflects the cognitive basis, and the key to sentiment analysis lies in its cognitive basis. Thus, we envision that cognition grounded data obtained in text reading should be helpful in building an attention model.
In this paper, we propose a novel cognition based attention (CBA) model for sentiment analysis learned from cognition grounded eye-tracking data. Eye-tracking is the process of measuring either the point of gaze or the motion of an eye relative to the head. In psycho-linguistics experiments, Barrett (2016) shows that readers are less likely to fixate on closed-class words that are predictable from context. Readers also fixate longer on words which play significant semantic roles (Demberg and Keller, 2008) in addition to infrequent words, ambiguous words, and morphologically complex words (Rayner, 1998). Since reading time can be learned from an eye-tracking dataset, the predicted reading time of words in their context can be used as an indicator of attention weights.
We first build a regression model to map the syntax and context features of a word to its reading time based on eye-tracking data. We then apply the model to sentiment analysis text to obtain the estimated reading time of words at the sentence level. The estimated reading time can then be used as the attention weights in its context to build the attention layer in a neural network based sentiment analysis model. Evaluation on four sentiment analysis benchmark datasets (IMDB, Yelp 13, Yelp 14 and IMDB2) shows that our proposed model can significantly improve the performance compared to the state-of-the-art attention methods.
To sum up, we have two major contributions: (1) We propose a novel cognition grounded attention model to improve the state-of-the-art neural network based sentiment analysis models by learning attention information from eye-tracking data. This is one of the first attempts to use cognition grounded data in sentiment analysis. The CBA model can not only capture the attention of words at the sentence level, but can also be aggregated to work at the document level. (2) Evaluation on several real-world datasets in sentiment analysis shows that our method outperforms other state-of-the-art methods significantly. This work validates the effectiveness of cognition grounded data in building attention models.

Related works
The basic task in sentiment analysis can be formulated as a classification problem. Class labels can be binary (positive/negative), or polarity can be expressed either as intensity with continuous values or as ratings in a certain range such as 0 to 5 or 1 to 10.
In recent years, deep learning based methods have greatly improved the performance of sentiment analysis. Commonly used models include Convolutional Neural Networks (Socher et al., 2011), Recursive Neural Networks (Socher et al., 2013), and Recurrent Neural Networks (Irsoy and Cardie, 2014). RNN naturally benefits sentiment classification because of its ability to capture sequential information in text. However, standard RNN suffers from the gradient vanishing problem (Bengio et al., 1994) where gradients may grow or decay exponentially over long sequences. To address this problem, the Long Short-Term Memory model (LSTM) is introduced by adding a gated mechanism to keep long term memory. Each LSTM layer is generally followed by mean pooling and then fed into the next layer. Experiments on datasets which contain long documents and sentences demonstrate that the LSTM model outperforms the traditional RNN (Tang et al., 2015a,c).
Not all words contribute equally to the semantics of a sentence. Attention based neural networks are proposed to highlight their difference in contribution (Yang et al., 2016). In document level sentiment classification, both sentence level attention and document level attention are proposed. In the sentence level attention layer, an attention mechanism identifies words that are important. The representations of those informative words are aggregated, weighted by their attention weights, to form the sentence embedding representation. This method is generally called the local context attention method. Similarly, some sentences can also be highlighted to indicate their importance in a document.
Apart from local context attention, user/product attentions are also included in deep learning based methods either in a separate network (Gui et al., 2016) or a unified network (Tang et al., 2015c; Gui et al., 2016). Some feature engineering methods tailored to specific datasets can also achieve very good results (Sadeghian and Sharafat, 2015). However, they are not suited for other genres of text as user/product information is not generally available.
Attention models can be built not only from local text or user/product information but also from cognitive grounded data, especially eye-tracking data (Rayner, 1998; Allopenna et al., 1998). A novel metric called Sentiment Annotation Complexity has been proposed for measuring sentiment annotation complexity based on eye-tracking data. A cognitive study of sentiment detection from the perspective of AI, where readers are tested as sentiment readers, has also been presented. Mishra et al. (2016b) recently proposed a model for sentiment analysis and sarcasm detection that uses eye-tracking data as a feature in addition to text features with Naive-Bayes and SVM classifiers.
In other NLP tasks, Joshi (2013) shows that Word-Sense-Disambiguation can make use of simultaneous eye-tracking. Eye-tracking data are also used to measure the difficulty in translation annotation (Mishra et al., 2013). Barrett (2016) finds that gaze patterns during reading are strongly influenced by the role a word plays in terms of syntax, semantics, and discourse.
Among different available eye-tracking datasets, the Dundee corpus, GECO (the Ghent Eye-Tracking Corpus), and the dataset of Mishra et al. (2016b) are considered high-quality resources (Kennedy, 2003; Cop et al., 2016; Mishra et al., 2016b). The Dundee corpus contains eye movement data from English and French newspapers (Kennedy, 2003). Measurements were taken while 10 participants read 20 newspaper articles. GECO is an English-Dutch bilingual corpus with eye-tracking data from 17 participants collected from reading the complete novel The Mysterious Affair at Styles. The corpus has 4,934 sentences, 774,015 tokens, and 9,876 words. The Mishra et al. (2016a) dataset contains 994 text snippets with 383 positive and 611 negative examples from newspaper clippings, sampled from seven native speakers.
To predict reading time using eye-tracking data, Tomanek et al. (2010) propose a regression model using linguistic features related to syntax and semantics for calibration. Hahn (2016) proposes a novel approach to model both skipping and reading using an unsupervised method which combines neural attention with auto-encoding, trained on raw text using reinforcement learning.

Our proposed CBA model
The basic idea of our method is to add a CBA model into a neural-network based LSTM sentiment classifier. Let $D$ be a collection of documents. A document $d_k$, $d_k \in D$, has $m$ sentences $S_1, S_2, \ldots, S_j, \ldots, S_m$. A sentence $S_j$ is formed by a sequence of words $S_j = w_1 w_2 \ldots w_i \ldots w_{l_j}$, and each word $w_i$ is represented by a feature vector $\vec{v}_{w_i} = (F_{w_i}^{1}, F_{w_i}^{2}, \ldots, F_{w_i}^{n})$, where $n$ is the feature space size. The purpose of document level sentiment classification is to project a document $d_k$ into the target space of $L$ class labels. Similarly, at the sentence level, the purpose is to project a sentence $S_j$ into the target class space.
To build the CBA model, we need to first build a reading time prediction model for words within each sentence. Reading time is predicted based on word features and text features calibrated by eye-tracking data. Note that reading time from an eye-tracking dataset cannot be used directly because the text of any eye-tracking dataset is too small to give sufficient coverage. Consequently, our method has four tasks: (1) to predict the reading time of words using eye-tracking data and $\vec{v}_{w_i}$ as features; (2) to build attention models based on predicted reading time at the sentence level and the document level; (3) to integrate attentions from other attention models; and (4) to add the attention model into the LSTM based sentiment classifier.

Modeling of reading time
To learn the reading time of words in a sentence, our method is based on regression analysis using eye-tracking data as dependent variables and context information in $\vec{v}_{w \in S_j}$ as independent variables. In the eye-tracking process, a number of different time measures are recorded, such as first fixation duration, gaze duration, and total reading time. In this work, we only use the total reading time.
Since a document set is always available for sentiment analysis, we use features extracted from these documents to train the regression model. We select features based on the works of Demberg and Keller (2008) and Tomanek et al. (2010) to include word features such as word length and POS tags as well as context level syntax and semantic features such as the total number of dominated nodes in a dependency parsing tree, the maximum dependency distance, and semantic category. Given a word $w$ in a sentence $S_j$, $w \in S_j$, and its feature vector $\vec{v}_{w \in S_j} = (F_w^{1}, F_w^{2}, \ldots, F_w^{n})$, where $n$ is the dimension size in feature space, the regression model on eye-tracking data is a mapping function $g$ between reading time $t_{w \in S_j}$ and $\vec{v}_{w \in S_j}$ as defined below:
$$g(t_{w \in S_j}) = \sum_{i=1}^{n} \alpha_i F_w^{i} + b$$
where $t_{w \in S_j}$ is the predicted reading time for $w$, $\alpha_i$ is the weight of feature $F_w^{i}$, and $b$ is a constant. Note that the set of $\alpha_i$ ($i = 1 \ldots n$) forms the weight vector $\vec{\alpha}$ for $t_{w \in S_j}$. When $\vec{v}_{w \in S_j}$ takes scalar values, $g$ can be an identity function and thus this model becomes a typical linear regression model. When $t_{w \in S_j}$ takes discrete values, $g$ can be a logistic function and this model becomes a typical logistic regression model.
We set $g$ to be the identity function. The objective function then becomes:
$$\min_{\vec{\alpha}, b} \sum_{S_j} \sum_{w \in S_j} \left( y_{w \in S_j} - t_{w \in S_j} \right)^2 + \lambda R(\vec{\alpha})$$
where $y_{w \in S_j}$ is the true eye-tracking value of reading time, $R(\vec{\alpha})$ is the regularization of $\vec{\alpha}$, and $\lambda$ is the regularization weight. When $\lambda = 0$, the model degrades to a linear regression function. In this work, we evaluate the use of both the linear regression model and the Ridge regression model, for which $R(\vec{\alpha}) = \|\vec{\alpha}\|_2^2$.
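The regression step can be illustrated with a minimal pure-Python sketch. It assumes an L2 (Ridge) penalty with an unpenalized constant term and solves the normal equations directly; the function names, the two toy features (word length and a content-word flag), and the toy reading times are hypothetical and are not taken from any eye-tracking corpus.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ridge(features: Sequence[Sequence[float]],
              times: Sequence[float],
              lam: float) -> Tuple[List[float], float]:
    """Minimize sum (y - alpha.v - b)^2 + lam * ||alpha||^2 (b is not penalized)."""
    n = len(features[0])
    rows = [list(v) + [1.0] for v in features]  # constant column for b
    dim = n + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    for i in range(n):  # add lam only on the feature weights
        xtx[i][i] += lam
    xty = [sum(r[i] * y for r, y in zip(rows, times)) for i in range(dim)]
    theta = solve_linear_system(xtx, xty)
    return theta[:n], theta[n]


def predict_reading_time(alpha: Sequence[float], b: float, v: Sequence[float]) -> float:
    """Identity link: predicted reading time is alpha.v + b."""
    return sum(a * f for a, f in zip(alpha, v)) + b


# Toy data: [word length, is content word] -> total reading time (made up).
toy_features = [[2, 0], [5, 1], [8, 1], [3, 0], [6, 0]]
toy_times = [140.0, 250.0, 310.0, 160.0, 220.0]
alpha_ols, b_ols = fit_ridge(toy_features, toy_times, lam=0.0)      # linear regression
alpha_ridge, b_ridge = fit_ridge(toy_features, toy_times, lam=10.0)  # Ridge regression
print(predict_reading_time(alpha_ridge, b_ridge, [7, 1]))
```

With `lam=0.0` the sketch reduces to ordinary least squares, which mirrors the statement that the model degrades to linear regression when the regularization weight is zero.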

Building the attention based model
Once we have predicted reading time for words used in sentences, the attention model can be built with two components. The first component works at the sentence level to give different words different emphasis in a sentence. The second component works at the document level to give different sentences different emphasis in a document. For a sentence $S_j = w_1 w_2 \ldots w_i \ldots w_{l_j}$ with length $l_j$, each word $w_i$ in $S_j$ has a corresponding reading time $t_{w_i}$. Let $t_{S_j}$ denote the total reading time of $S_j$. Then,
$$t_{S_j} = \sum_{i=1}^{l_j} t_{w_i}$$
For sentence level attention, the CBA weight for $w_i$ in $S_j$, denoted as $A_{S_j:w_i}$, can be defined as:
$$A_{S_j:w_i} = \frac{t_{w_i}}{t_{S_j}}$$
The sentence level attention model defined above gives more weight to words that have longer reading time relative to the total reading time of the sentence.
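A minimal sketch of the sentence level weighting follows, reading the description as each word's predicted reading time divided by the total for its sentence. The function names and the idea of applying the weights to a list of per-word hidden vectors are illustrative assumptions.

```python
from typing import List, Sequence


def sentence_attention(reading_times: Sequence[float]) -> List[float]:
    """CBA weight of each word: its reading time over the sentence total."""
    total = sum(reading_times)
    if total <= 0:
        raise ValueError("total reading time must be positive")
    return [t / total for t in reading_times]


def weighted_sum(vectors: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Attention-weighted sum of per-word vectors (hypothetical sentence embedding)."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


# Made-up predicted reading times (ms) for a four-word sentence.
weights = sentence_attention([100.0, 300.0, 100.0, 500.0])
sentence_vec = weighted_sum([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]], weights)
print(weights, sentence_vec)
```

The weights are non-negative and sum to one, so the word read longest dominates the resulting sentence vector.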
Let a document $d_k$, $d_k \in D$, be formed by a set of sentences $S_1, S_2, \ldots, S_j, \ldots, S_m$. The CBA weight for a sentence $S_j$ in $d_k$ is defined as:
$$A_{d_k:S_j} = \frac{t_{S_j}}{\sum_{j'=1}^{m} t_{S_{j'}}}$$
This aggregated document level attention model gives more weight to the sentences that have longer reading time relative to the total reading time of the document. Let $A_{d_k}$ denote the document level attention weight vector. The size of $A_{d_k}$ is $m$, the number of sentences in $d_k$. Let $\vec{S}_j$ denote the embedding of $S_j$ in $N$ dimensional space, where $S_j \in d_k$. Then, the set of sentence representations for $d_k$ is a matrix of size $m \times N$, denoted by $\hat{S}_{d_k}$. After the inclusion of the attention model, $\hat{S}_{d_k}$ becomes:
$$\hat{S}_{d_k} = \left( A_{d_k:S_1}\vec{S}_1, A_{d_k:S_2}\vec{S}_2, \ldots, A_{d_k:S_m}\vec{S}_m \right)^{\top}$$
Let $\vec{d}_k$ denote the document embedding of $d_k$. Since $\vec{d}_k$ is an $N$ dimensional vector, it can now be defined by the adjusted attention model as
$$\vec{d}_k = \sum_{j=1}^{m} A_{d_k:S_j}\vec{S}_j$$
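The document level aggregation can be sketched in the same way. The sketch assumes that a sentence's reading time is the sum of its word reading times and that the document embedding is the attention-weighted sum of sentence embeddings; both function names and all numbers are hypothetical.

```python
from typing import List, Sequence


def document_attention(word_times_per_sentence: Sequence[Sequence[float]]) -> List[float]:
    """CBA weight of each sentence: its total reading time over the document total."""
    sentence_totals = [sum(times) for times in word_times_per_sentence]
    doc_total = sum(sentence_totals)
    if doc_total <= 0:
        raise ValueError("total reading time must be positive")
    return [t / doc_total for t in sentence_totals]


def document_embedding(sentence_embeddings: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Weight each sentence embedding (row) and sum the rows into one vector."""
    dim = len(sentence_embeddings[0])
    weighted_rows = [[w * x for x in row] for w, row in zip(weights, sentence_embeddings)]
    return [sum(row[d] for row in weighted_rows) for d in range(dim)]


# Two sentences with made-up word reading times and 2-dimensional embeddings.
doc_weights = document_attention([[100.0, 300.0], [200.0, 200.0, 200.0]])
doc_vec = document_embedding([[1.0, 0.0], [0.0, 1.0]], doc_weights)
print(doc_weights, doc_vec)
```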

Incorporation of other attention models
Since document embedding representation allows the combined use of multiple attention mechanisms, it is to our advantage to incorporate different attention mechanisms which may help to capture different aspects of attention. Generally speaking, different attention mechanisms can be incorporated either serially or in parallel. In principle, any number of attention models can be included. As an example to illustrate the capability of our proposed method, we choose one state-of-the-art local attention model (abbreviated as LA). The model is a semantic-based local attention model proposed by Yang (2016), which has also been used in other work. For serial inclusion, the combined attention weight is formulated from the CBA weight and $LA_{S_j:w_i}$, the sentence level attention weight given by the local attention model. In parallel mode, the two weights are likewise combined into a single attention weight. Similar methods can be used at the document level.
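To make the serial and parallel options concrete, the sketch below uses two assumed combination rules that are not taken from the paper: serial combination as a renormalized element-wise product of the two weight vectors, and parallel combination as a convex mixture. The function names and the `mix` parameter are hypothetical.

```python
from typing import List, Sequence


def combine_serial(cba: Sequence[float], la: Sequence[float]) -> List[float]:
    """Assumed serial rule: apply one weighting after the other, then renormalize."""
    product = [c * l for c, l in zip(cba, la)]
    total = sum(product)
    if total <= 0:
        raise ValueError("combined weights must have positive mass")
    return [p / total for p in product]


def combine_parallel(cba: Sequence[float], la: Sequence[float],
                     mix: float = 0.5) -> List[float]:
    """Assumed parallel rule: convex mixture of the two weight vectors."""
    return [mix * c + (1.0 - mix) * l for c, l in zip(cba, la)]


cba_weights = [0.5, 0.25, 0.25]  # made-up CBA weights for three words
la_weights = [0.2, 0.6, 0.2]     # made-up local attention weights
print(combine_serial(cba_weights, la_weights))
print(combine_parallel(cba_weights, la_weights))
```

Under these assumptions the serial rule sharpens agreement between the two models, while the parallel rule keeps any word that either model considers important.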

General sentiment analysis model
We take the neural network based LSTM sentiment classifier (Gers, 2001) to be applied at both the sentence level and the document level because of its excellent performance on long sentences (Tang et al., 2015a). The basic LSTM model has five internal vectors for a node $i$: an input gate $i_i$, a forget gate $f_i$, an output gate $o_i$, a candidate memory cell $\tilde{c}_i$, and a memory cell $c_i$. The gates $i_i$, $f_i$ and $o_i$ indicate which values will be updated, forgotten, or kept in the LSTM model. $\tilde{c}_i$ and $c_i$ keep the candidate features and the actual accepted features, respectively. At the sentence level, each word $w_i$ in a sentence $S_j$ is represented by its word embedding $\vec{w}_i$ in the $N$ dimensional space. The LSTM cell state $c_i$ and the hidden state $\vec{h}_{S_j:w_i}$ can be updated in two steps. In the first step, the previous hidden state $\vec{h}_{S_j:w_{i-1}}$ and the current word vector pass through a hyperbolic tangent function to form $\tilde{c}_i$ as defined below:
$$\tilde{c}_i = \tanh\left( \hat{W}_c \cdot [\vec{h}_{S_j:w_{i-1}}, \vec{w}_i] + \hat{b} \right)$$
where $\hat{W}_c$ is a parameter matrix, $\vec{h}_{S_j:w_{i-1}}$ is the previous hidden state, $\vec{w}_i$ is the word vector, and $\hat{b}$ is the bias vector. In the second step, $c_i$ is formed from $\tilde{c}_i$ and the previous state $c_{i-1}$ according to the formula below:
$$c_i = i_i \odot \tilde{c}_i + f_i \odot c_{i-1}$$
The hidden state of $w_i$ can be obtained by
$$\vec{h}_{S_j:w_i} = o_i \odot \tanh(c_i)$$
The forget gate $f_i$ is designed to keep the long term memory. A series of hidden states $\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_i$ can serve as input to the attention layer to obtain the sentence representation $\vec{S}_j$. At the document level, a similar method is applied to the sentence matrix $\hat{S}$ in the document level LSTM layer to obtain the final document representation $\vec{d}_k$.
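A single LSTM update can be sketched in plain Python using the standard gate formulation (sigmoid gates, a tanh candidate cell, element-wise products). The parameter dictionary layout and the names `W_i`, `W_f`, `W_o`, `W_c` and their biases are hypothetical, chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(w: Sequence[Sequence[float]], x: Sequence[float],
           b: Sequence[float]) -> List[float]:
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]


def lstm_step(params: Dict[str, list], h_prev: List[float], c_prev: List[float],
              x: List[float]) -> Tuple[List[float], List[float]]:
    """One LSTM update: returns the new hidden state and the new cell state."""
    z = h_prev + x  # concatenate previous hidden state and current word vector
    i_gate = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]
    f_gate = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]
    o_gate = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]
    c_tilde = [math.tanh(v) for v in affine(params["W_c"], z, params["b_c"])]
    # Step 2: mix the candidate cell and the previous cell through the gates.
    c = [i * ct + f * cp for i, ct, f, cp in zip(i_gate, c_tilde, f_gate, c_prev)]
    h = [o * math.tanh(ci) for o, ci in zip(o_gate, c)]
    return h, c


# Tiny example: hidden size 1, input size 1, all parameters zero.
zero_params = {name: [[0.0, 0.0]] for name in ("W_i", "W_f", "W_o", "W_c")}
zero_params.update({name: [0.0] for name in ("b_i", "b_f", "b_o", "b_c")})
h_new, c_new = lstm_step(zero_params, h_prev=[0.0], c_prev=[1.0], x=[2.0])
print(h_new, c_new)
```

With all-zero parameters every gate equals 0.5 and the candidate cell is zero, so the new cell state is half the previous one, which makes the role of the forget gate easy to see.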
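In our work, the final document representation $\vec{d}_k$ encodes both the sentence level information and the document level information. In the LSTM model, we use a hidden layer to project the document vector through a hyperbolic tangent function into the final document vector $\vec{d}_k^{f}$:
$$\vec{d}_k^{f} = \tanh\left( \hat{W}_h \vec{d}_k + \hat{b}_h \right)$$
where $\hat{W}_h$ is the hidden layer weight matrix and $\hat{b}_h$ is the bias vector. Finally, the sentiment prediction for any label $l \in L$ is obtained by the softmax function defined below:
$$P(l \mid d_k) = \frac{\exp\left( W_l^{\top} \vec{d}_k^{f} \right)}{\sum_{l' \in L} \exp\left( W_{l'}^{\top} \vec{d}_k^{f} \right)}$$
where $W_l$ is the softmax weight for each label.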
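The output layer can be sketched as a tanh projection followed by a softmax over labels. The weights below are made-up numbers, the function names are hypothetical, and the softmax subtracts the maximum score for numerical stability, which is an implementation choice rather than something stated in the text.

```python
import math
from typing import List, Sequence


def hidden_projection(w_h: Sequence[Sequence[float]], d_k: Sequence[float],
                      b_h: Sequence[float]) -> List[float]:
    """Project the document vector through a tanh hidden layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, d_k)) + b)
            for row, b in zip(w_h, b_h)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over label scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(label_weights: Sequence[Sequence[float]],
                  d_final: Sequence[float]) -> int:
    """Return the index of the label with the highest softmax probability."""
    scores = [sum(w * x for w, x in zip(row, d_final)) for row in label_weights]
    probs = softmax(scores)
    return max(range(len(probs)), key=lambda idx: probs[idx])


d_final = hidden_projection([[1.0, 0.0], [0.0, 1.0]], [0.4, 0.6], [0.0, 0.0])
print(predict_label([[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]], d_final))
```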

Performance evaluation
Our proposed CBA for sentiment classification is evaluated on four document sets. The first three datasets, IMDB, Yelp 13, and Yelp 14, are review texts including user/product information developed by Tang (2015a).
Two commonly used performance evaluation metrics are used. The first is accuracy and the second is root mean square error (RMSE). Let $GR_i$ be the gold sentiment rating of document $i$, $PR_i$ be the predicted sentiment rating, $N_{doc}$ be the total number of documents, and $T$ be the number of documents where $GR_i = PR_i$. Accuracy is then defined by
$$Accuracy = \frac{T}{N_{doc}}$$
and RMSE is defined by
$$RMSE = \sqrt{\frac{\sum_{i=1}^{N_{doc}} (GR_i - PR_i)^2}{N_{doc}}}$$
We train the skip-gram word embedding (Mikolov et al., 2013) on each dataset separately to initialize the word vectors. All embedding sizes in the model are set to 200, a commonly used size.
Three sets of experiments are conducted. The first is on the selection of the regression model for reading time prediction. The second set of experiments compares our proposed CBA with other sentiment analysis methods which use text only. The third set of experiments evaluates the effectiveness of combining different attention models.
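The two metrics can be computed as in the following sketch, which takes accuracy as the fraction of documents whose predicted rating equals the gold rating and RMSE as the square root of the mean squared rating difference. The ratings in the example are made up.

```python
import math
from typing import Sequence


def accuracy(gold: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of documents whose predicted rating matches the gold rating."""
    matches = sum(1 for g, p in zip(gold, predicted) if g == p)
    return matches / len(gold)


def rmse(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between gold and predicted ratings."""
    squared = sum((g - p) ** 2 for g, p in zip(gold, predicted))
    return math.sqrt(squared / len(gold))


gold_ratings = [1, 2, 3, 4]
predicted_ratings = [1, 2, 4, 2]
print(accuracy(gold_ratings, predicted_ratings), rmse(gold_ratings, predicted_ratings))
```

Accuracy treats a prediction that is off by one rating the same as one that is off by three, while RMSE penalizes the larger miss more, which is why both are reported.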

Reading time prediction
Training the regression model for reading time prediction using eye-tracking data requires learning from text and context features as discussed in Section 3.1. We compare our regression model with more complex deep learning based regression models on each of the three eye-tracking datasets. We take the first 90% of sentences as training data and the remaining 10% as test data. The configuration that performs the best is selected and applied to the document sentiment analysis datasets to obtain estimated reading time. Ideally, an eye-tracking corpus built from on-line reviews would be more suitable for our experiments, but we can only work with what is available.
In addition to the linear regression model (LL) and the Ridge regression model (RR), we also choose the Recurrent Neural Network (RNN) model and the Long Short-Term Memory (LSTM) model for regression learning. For both models, there are two versions. The basic version takes the extracted feature sets as word representation, labeled as RNN-1 and LSTM-1, respectively. The second version takes word embedding (Pennington et al., 2014) as the initial word representation input, labeled as RNN-2 and LSTM-2, respectively. The RMSE results are listed in Table 2. Note that Ridge Regression (RR) has the best performance on all three datasets because regularization in RR reduces the over-fitting problem. On the three eye-tracking datasets, RR achieves a coefficient of determination (https://en.wikipedia.org/wiki/Coefficient_of_determination) of 0.32, 0.30 and 0.27. The features, their types and the corresponding coefficients in RR are shown in Table 3.
The more complicated deep learning models suffer from a serious over-fitting problem. The results of the deep learning models with word embedding initialization also partly support the view that reading time depends more on micro level syntactic and semantic features of the word, such as the number of letters in the word and the complexity score of the word, than on deep level context features.
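The evaluation protocol for the reading time models can be sketched as a sequential split followed by the coefficient of determination. The helper names are hypothetical, and the sketch uses the usual definition of R squared as one minus the ratio of residual to total sum of squares.

```python
from typing import List, Sequence, Tuple


def split_first_percent(items: Sequence, train_percent: int = 90) -> Tuple[List, List]:
    """Take the first train_percent% of items for training, the rest for testing."""
    cut = len(items) * train_percent // 100
    return list(items[:cut]), list(items[cut:])


def r_squared(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, predicted))
    ss_tot = sum((g - mean) ** 2 for g in gold)
    return 1.0 - ss_res / ss_tot


sentences = [f"sentence {n}" for n in range(10)]  # placeholder sentence ids
train, test = split_first_percent(sentences)
print(len(train), len(test))
print(r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5]))
```

A value of 1 means perfect prediction and 0 means the model does no better than always predicting the mean reading time, so values around 0.3 indicate that the features explain part, but not most, of the variance.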

Comparison of different sentiment classification methods
Because the features used in our model are all text based, we compare CBA with two groups of baseline methods which also use only review text for learning. Group 1 methods include commonly known linguistic and context features for SVM classifiers. Group 2 includes recent sentiment classification algorithms which are top performers using review text for training, including one method that uses a local attention model. Below is the list of Group 1 methods.
• Majority - A simple majority based classifier based on sentence labels.
• Text feature - An SVM classifier using word level and context level features, such as n-grams and sentiment lexicons.
• AvgWordvec - An SVM classifier that takes the average of word embeddings in Word2Vec as the document embedding.
Here is a list of Group 2 methods:
• SSWE (Tang et al., 2014) - An SVM classifier using sentiment specific word embedding.
• Paragraph vector (Le and Mikolov, 2014) - An SVM classifier using document embedding as features.
• LSTM+LA - State-of-the-art LSTM using local context as the attention mechanism at both the sentence level and the document level.
• LSTM+CBA+LA G - The LSTM based classifier using both the CBA model and the local text context based attention model (LA). Since the combining method can be either serial or parallel, there are two corresponding variations: LSTM+CBA+LA G s and LSTM+CBA+LA G p.
• LSTM+CBA+UPA G - The same framework as LSTM+CBA+LA G with additional user/product attention. The two corresponding variations are LSTM+CBA+UPA G s and LSTM+CBA+UPA G p.
Table 4 shows the performance of the three groups using review text without user/product information on only the first three datasets, as methods in Group 1 and Group 2 do not have evaluations on IMDB2. Among all the reference methods that do not use any attention mechanism, including all methods in Group 1 and Group 2 (except LSTM+LA), LSTM is the best performer. LSTM+LA (2016), the state-of-the-art method, uses a local attention mechanism to improve performance significantly. Among our CBA based variations, using the GECO dataset gives the best result, outperforming LSTM+LA on all three datasets. LSTM+CBA G has significant improvement over LSTM+LA with p values of p < 0.016 on IMDB, p < 0.0019 on Yelp 13, and p < 0.00023 on Yelp 14. LSTM+CBA G has the best result compared to the other two variations because GECO has a larger participant size. Its text genre is also closer to the review datasets for sentiment analysis.
In the third set of experiments, we compare our LSTM+CBA model with combinations of other attention models, including the LA model and the UPA model, as shown in Table 5. Since the GECO dataset gives the best performance in the second set of experiments, Table 5 uses the GECO based CBA model. The UPA model applies only if user/product information is available. Such data is provided in the first three datasets. Table 5 shows that among all three single attention models, UPA outperforms both LA and CBA on the first three datasets. This is understandable, as UPA already includes LA and has more explicit information from users and products for its attention model, whereas CBA needs to learn hidden attention information. The combined method of CBA with UPA can still further improve performance. When CBA and UPA are combined in parallel, the model has the best performance on both Yelp 13 and Yelp 14 (with p values of 0.027 and 0.032, respectively, compared to LSTM+UPA). On the IMDB dataset, however, UPA alone has the best performance. This may be because user/product information is more effective in the movie review IMDB dataset, which is more subjective.
However, the UPA model works only if user and product information is available. Thus for IMDB2 where user/product information is not available, only CBA and LA models work and the combined use of CBA+LA gives the best performance.

Case study
A random sentence sample 'The Shelton hotel is lucky to receive 2stars from me considering ...' is taken from the Yelp 13 dataset to demonstrate the difference between the two attention mechanisms, i.e. local text (LA) and cognition-based (CBA). Figure 1 shows visually the difference in attention weights of the two models.
The attention weights of words in the LA model do not vary much. CBA, on the other hand, gives higher weights to the sentiment linked word 2stars and the verb receive. These two words do play significant roles as an indirect object and a main verb, respectively. This case shows that CBA does a better job in capturing micro level information at the sentence level. This supports the experimental results in Table 4 and Table 5.

Conclusion and future works
In this paper, we propose a novel cognition based attention model to improve the state-of-the-art neural sentiment analysis model through cognition grounded eye-tracking data. A simple and effective regression model is used to predict reading time using both eye-tracking data and local text features. The predicted reading time is then used to build an attention layer in neural sentiment analysis models. The attention model considers both reading time and other syntactic and context features. It works for both sentence level and document level sentiment analysis.
Evaluation on benchmark datasets validates the effectiveness of our method in sentiment analysis, as our method clearly outperforms other state-of-the-art methods that use local context information to build their attention models. Our CBA mechanism can also be combined with other attention mechanisms to provide room for further improvement. Future work includes using other eye-tracking information such as saccade and fixation. The incorporation of other information such as user/product information can also be explored.