Adullam at SemEval-2017 Task 4: Sentiment Analyzer Using Lexicon Integrated Convolutional Neural Networks with Attention

We propose a sentiment analyzer that predicts document-level sentiment of English micro-blog messages from Twitter. The proposed method is based on lexicon integrated convolutional neural networks with attention (LCA). Its performance was evaluated on the datasets provided for SemEval-2017 Task 4. The proposed sentiment analyzer obtained an average F1 of 55.2%, an average recall of 58.9%, and an accuracy of 61.4%.


Introduction
Sentiment analysis is necessary to interpret the vast number of online opinions on social media platforms such as Twitter, allowing governments and corporations to manage public relations and policies effectively. Existing sentiment analyzers are based on naive Bayes, SVMs, recurrent neural networks (Irsoy, 2014), and in particular convolutional neural networks (CNNs) (Kim, 2014).
To improve on existing CNN-based sentiment analyzers, lexicon embeddings and attention embeddings were integrated into the proposed sentiment analyzer. Lexicon embeddings supply a sentiment score for each word, and attention embeddings provide a global view of the sentence.
The proposed LCA was trained and evaluated on the Twitter 2013 to 2016 corpora provided by SemEval-2017. Figure 1 shows an overview of the proposed sentiment analyzer, which consists of embedding, CNN, concatenation, fully connected, and softmax layers.

Input Features & Architecture
The proposed LCA takes three input features: (i) word embeddings, (ii) lexicon embeddings, and (iii) attention embeddings. Word embeddings are trained with an implementation of word2vec using skip-gram (Mikolov et al., 2013) and negative sampling, on an unlabeled corpus of 1.6M tweets from the Sentiment140 dataset, with several dimensions (50, 100, 200, 400). The dimension of the word embeddings is d, and the number of words in a document is n. Lexicon embeddings are included because they are useful sentiment features. A lexicon consists of a set of words, each paired with a score ranging from -1 to +1, where -1 represents a negative sentiment and +1 a positive one. The lexicon matrix for a document is L ∈ ℝ^(e×n), where e, the dimension of the lexicon embeddings, is set to the number of lexicon corpora.
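The lexicon embedding of a word can be sketched as a simple lookup: one score per lexicon, with a neutral default for missing words. This is an illustrative sketch, not the paper's code; the toy lexicons below are invented.

```python
# Hypothetical sketch: build a lexicon embedding for a word by looking up
# its score (in [-1, +1]) in several sentiment lexicons; words missing
# from a lexicon get the neutral score 0, as described in the paper.
def lexicon_embedding(word, lexicons):
    """Return one score per lexicon, defaulting to neutral (0.0)."""
    return [lex.get(word, 0.0) for lex in lexicons]

# Toy lexicons standing in for the six real resources used in the paper.
lex_a = {"good": 0.8, "bad": -0.7}
lex_b = {"good": 0.6, "terrible": -0.9}

print(lexicon_embedding("good", [lex_a, lex_b]))   # [0.8, 0.6]
print(lexicon_embedding("okay", [lex_a, lex_b]))   # [0.0, 0.0]
```

Stacking these per-word vectors column-wise over a document of n words yields the lexicon matrix L ∈ ℝ^(e×n), with e the number of lexicons.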

Attention embeddings are important for deep learning in terms of both performance and model interpretability. A CNN uses several filters of length l and thus captures l-gram features, but this is a local view that does not consider the global view of the sentence. Sentiment analysis must handle transitional cases such as negation. Attention embeddings can capture keywords to improve sentiment analysis while also considering the global view of the sentence. To do so, the CNN for attention embeddings uses filters of length 1 and then applies max pooling to each row of the attention matrix. The output of max pooling is an attention vector of dimension n that assigns a probability to each word vector.
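The core of the attention mechanism described above can be sketched as follows. A width-1 convolution reduces to a dot product between a filter and each word vector; normalizing the resulting per-word scores (here with a softmax, an assumption about the normalization) yields a probability per word. This is an illustrative sketch, not the paper's implementation.

```python
import math

# Illustrative sketch: a length-1 convolution filter scores each word
# vector independently; a softmax turns the n scores into attention
# probabilities over the words of the document.
def attention_weights(doc, filt):
    """doc: list of n word vectors (each d-dim); filt: one d-dim filter."""
    scores = [sum(w * f for w, f in zip(vec, filt)) for vec in doc]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, d = 2
probs = attention_weights(doc, [0.5, 0.5])
print(round(sum(probs), 6))  # 1.0 (a probability distribution over words)
```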
The architecture of LCA consists of (i) a word and lexicon embedding layer, (ii) CNNs, (iii) a concatenation layer, (iv) a fully connected layer, and (v) a softmax layer.
The word and lexicon embedding layer transforms input data into vector representations. The input to our model is a document, treated as a sequence of words. Instead of hand-crafted features, we use word2vec (w2v) to represent words as vectors, and we also convert lexicon entries to vectors containing sentiment scores. The input document matrix is D ∈ ℝ^(d×n), where n is the number of words in the document.
Convolutional neural networks are effective for extracting high-level features. We modified the LCA architecture of Shin (2016). The proposed LCA consists of two-layer CNNs with a nonlinearity, max pooling layers, a concatenation layer, and a softmax classification layer on top of the word embedding layer; the architecture was chosen empirically. The document matrix is convolved with filters F ∈ ℝ^(d×l), where l is the length of the filters. To convolve the lexicon embeddings, we used the separate convolution approach of Shin (2016).
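The narrow convolution over the document matrix can be sketched as follows: a filter of length l slides over the n word positions, producing n − l + 1 feature values. Nonlinearity and max pooling are omitted; this is an illustrative sketch, not the paper's code.

```python
# Minimal sketch of a narrow 1-D convolution over a document matrix:
# slide an l x d filter across the n word positions and sum the
# element-wise products at each position.
def conv1d(doc, filt):
    """doc: n word vectors of dim d; filt: l rows of d filter weights."""
    l = len(filt)
    feats = []
    for i in range(len(doc) - l + 1):
        window = doc[i:i + l]
        feats.append(sum(w * f
                         for vec, frow in zip(window, filt)
                         for w, f in zip(vec, frow)))
    return feats

doc = [[1, 0], [0, 1], [1, 1], [0, 0]]   # n = 4 words, d = 2
filt = [[1, 1], [1, 1]]                  # l = 2 (a "bigram" filter)
print(conv1d(doc, filt))  # [2, 3, 2]
```

With l = 1 the same operation yields the per-word scores used for the attention embedding.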
The concatenation layer combines the outputs of the 1-layer CNN, the 2-layer CNN, and the lexicon and attention components. We deliberately designed the model so that the output of the 1-layer CNN contributes low-level features as additional information. The output of the concatenation layer is c ∈ ℝ^((2m+a)×k), where m is the number of filters with the same length, k is the number of filters with different lengths, and a is the dimension contributed by the lexicon and attention outputs.
The fully connected (FC) layer creates nonlinear combinations using the rectified linear unit (ReLU) (Nair and Hinton, 2010). Its input is the output of the concatenation layer. The weight matrix is W ∈ ℝ^(u×c) and the bias is b ∈ ℝ^c, where u is the dimension of the concatenation output and c is the number of classes.
The softmax layer converts the output of the FC layer into classification probabilities, computed with the softmax function p_i = exp(z_i) / Σ_j exp(z_j). The output dimension is 3 because our model classifies tweets into three classes (positive, neutral, and negative).
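The classification head described above can be sketched as a ReLU nonlinearity, a linear map, and a 3-way softmax. The weights and feature vector below are toy values, not the trained parameters.

```python
import math

# Hedged sketch of the classification head: ReLU, a linear layer
# (weights W, bias b), then a softmax over the 3 sentiment classes.
def relu(v):
    return [max(0.0, x) for x in v]

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def classify(features, W, b):
    """W: c rows of u weights, b: c biases (c = 3 classes here)."""
    hidden = relu(features)
    logits = [sum(w * h for w, h in zip(row, hidden)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)

probs = classify([1.0, -2.0],
                 [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                 [0.0, 0.0, 0.0])
print(len(probs), round(sum(probs), 6))  # 3 1.0
```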
Regularization is achieved with an L2 regularizer. To prevent our CNN model from overfitting, dropout is applied at the outputs of the CNN and the fully connected layer by randomly removing nodes during training. We also apply L2 regularization to the cost function by adding the term λ‖θ‖²₂, where λ is the regularization strength and θ ∈ Θ are the fully connected network parameters.
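Both regularizers are simple to state in code. The sketch below uses the common "inverted dropout" scaling convention (an assumption; the paper does not specify the scaling) and the L2 penalty λ‖θ‖²₂ exactly as written above.

```python
import random

# Illustrative sketch: dropout randomly zeroes activations at train time
# (surviving units are rescaled by 1/(1-rate), the inverted-dropout
# convention), and an L2 term lambda * ||theta||_2^2 is added to the cost.
def dropout(values, rate, rng):
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in values]

def l2_penalty(params, lam):
    return lam * sum(p * p for p in params)

rng = random.Random(0)
print(dropout([1.0, 1.0, 1.0, 1.0], 0.5, rng))
print(l2_penalty([3.0, 4.0], 0.01))  # 0.25
```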

Data and Preprocessing
Tweets from Twitter 2013 to 2016 were used as the training and development datasets (both provided by the SemEval-2017 competition). In addition, the Sentiment140 corpus was added for training the word embeddings.
The proposed LCA uses six types of sentiment lexicons (each including a sentiment score). Some lexicons contain only positive and negative polarities. Because the lexicons use different scales, sentiment scores were normalized to the range from -1 to +1. Words missing from a lexicon were assigned a neutral sentiment score of 0.
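The normalization step can be sketched as a min-max rescaling of each lexicon's scores onto [-1, +1]. The rescaling formula is an assumption (the paper only states the target range), and the raw scores below are invented.

```python
# Sketch of rescaling one lexicon's scores to [-1, +1] via min-max
# normalization (assumed scheme; toy raw scores on a 1-5 scale).
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {w: -1.0 + 2.0 * (s - lo) / (hi - lo) for w, s in scores.items()}

raw = {"great": 5.0, "meh": 3.0, "awful": 1.0}
print(normalize(raw))  # {'great': 1.0, 'meh': 0.0, 'awful': -1.0}
```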
The following preprocessing was applied to every tweet and lexicon entry in the corpus: • Lowercase: all characters in tweets and lexicons are converted to lowercase.
• Cleaning: URLs and the '#' token in hashtags were removed to reduce sparsity of the representation.
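The two steps above can be sketched as follows; the URL regex is an assumption, not the paper's exact pattern.

```python
import re

# Sketch of the preprocessing pipeline: lowercase everything, strip URLs,
# and drop the '#' of hashtags while keeping the hashtag word itself.
def preprocess(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove URLs
    tweet = tweet.replace("#", "")               # drop the '#' token only
    return tweet.strip()

print(preprocess("Loving #SemEval! http://example.com"))  # loving semeval!
```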

Training and Hyperparameters
The parameters of our model were trained with the Adam optimizer (Kingma and Ba, 2014). To anneal the learning rate over time, it was decayed exponentially. Our hyperparameter configuration is as follows: • Embedding dimension = (50, 100, 200, 400) for both word and attention embeddings.
• Number of filters = (128) for convolving the document matrix combined with lexicon and attention embeddings.
• Batch size = (64) for calculating losses to update weight parameters.
• Number of epochs = (80) for training our models.
• Exponential decay steps and rate = (3000, 0.96) for annealing the learning rate.
• Dropout rate = (0.5) for avoiding overfitting, applied at the last CNN layer and the FC layer.
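The exponential learning-rate decay with the listed settings (decay steps 3000, rate 0.96) can be sketched as below; the base learning rate of 0.001 is an assumption, as the paper does not state it.

```python
# Sketch of exponential learning-rate decay: lr = base * rate^(step/steps),
# with decay_steps = 3000 and rate = 0.96 from the hyperparameter list.
# The base rate 0.001 is a hypothetical value for illustration.
def decayed_lr(base_lr, step, decay_steps=3000, rate=0.96):
    return base_lr * rate ** (step / decay_steps)

print(decayed_lr(0.001, 0))      # 0.001 (no decay yet)
print(decayed_lr(0.001, 3000))   # one full decay period: 0.001 * 0.96
```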

Evaluation
In the competition, the evaluation metrics were (i) macro-averaged F1, (ii) macro-averaged recall, and (iii) accuracy across the positive, negative, and neutral classes.
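Following the description above, the three metrics can be sketched as below (a hedged illustration over the three classes as stated; the official scorer may differ in details).

```python
# Sketch of the three metrics over the labels {positive, neutral, negative}.
def per_class_prf(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_scores(gold, pred, classes=("positive", "neutral", "negative")):
    stats = [per_class_prf(gold, pred, c) for c in classes]
    macro_f1 = sum(s[2] for s in stats) / len(classes)
    macro_rec = sum(s[1] for s in stats) / len(classes)
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return macro_f1, macro_rec, acc

gold = ["positive", "neutral", "negative", "positive"]
pred = ["positive", "neutral", "positive", "positive"]
print(macro_scores(gold, pred))
```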

Results
The competition results indicated that our model overfit, since our experimental scores were higher than the official results. In our experiments, the lexicon and word embedding features improved the model. Table 2 shows how the dimension of the word embeddings affects performance, which is highest at a dimension of 100. Table 3 shows that lexicons are a more important feature than word2vec, since the overall performance of the model with lexicons was higher than that with word2vec alone. Because missing words are assigned the neutral sentiment score of 0, the lexicon feature is not perfect. Nonetheless, lexicons remain an important and essential feature for sentiment analysis.

Conclusion
This paper proposes the integration of lexicons with attention on a CNN as an approach to sentiment analysis. We considered various features to capture improved representations by concatenating the outputs of the 1-layer and 2-layer CNNs. The lexicon and word embedding features improved the model performance significantly. Further improvements are viable by gathering more training or lexicon data with distant supervision (Deriu et al., 2016), which would extend the coverage of our model. Furthermore, on the modeling side, a combined CNN-CRF model, recursive neural networks, and ensembles of multi-layer CNNs could be applied.