RiTUAL-UH at SemEval-2017 Task 5: Sentiment Analysis on Financial Data Using Neural Networks

In this paper, we present our systems for the “SemEval-2017 Task-5 on Fine-Grained Sentiment Analysis on Financial Microblogs and News”. Our system combines hand-engineered lexical, sentiment, and metadata features with representations learned by a Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU), each with an attention model applied on top. With this architecture we obtained weighted cosine similarity scores of 0.72 and 0.74 for subtask-1 and subtask-2, respectively. Under the official scoring system, our system ranked second for subtask-2 and eighth for subtask-1. Under an alternate scoring system, it ranked first for both subtasks.


Introduction
Predicting the sentiment of financial data has a wide range of applications. The most important is predicting the ups and downs of the share market, since changes in sentiments and opinions can change market dynamics (Goonatilake and Herath, 2007; Van de Kauter et al., 2015). Stock market related information is typically found in newspapers (Malo et al., 2013), and people discuss it on social media platforms like Twitter and StockTwits. Positive news can boost the market by increasing optimism among people (Van de Kauter et al., 2015; Schuster, 2003). SemEval-2017 Task 5 on 'Fine-Grained Sentiment Analysis on Financial Microblogs and News' aims at analyzing the polarity of public sentiments from financial data found in newspapers and social media. In this paper, we describe our systems, which combine representations learned automatically by deep learning architectures with hand-engineered features in order to predict the sentiment polarity of financial data.

* Both authors contributed equally.
1 http://alt.qcri.org/semeval2017/task5/data/uploads/description_second_approach.pdf

Dataset

Task        Training   Trial   Test
Subtask-1   1,694      10      799
Subtask-2   1,142      14      491

Table 1: Data distribution for subtask-1 and subtask-2.

Table 1 shows the distribution of training, trial, and test data for subtask-1 and subtask-2. For subtask-1, the financial microblogs and tweets were collected from Twitter and StockTwits, whereas for subtask-2, the financial news headlines were collected from different financial news sources such as Yahoo Finance. Each instance was labeled with a floating point value ranging from -1 to +1, indicating the sentiment score. A score of -1 means very negative or bearish, whereas a score of +1 means very positive or bullish. A score of 0 means neutral sentiment. The dataset is noisy and contains URLs, cashtags, digits, usernames, and emoticons. The messages are short, with an average of 13 tokens for the microblog data and 10 tokens for the headlines data.

Methodology
We designed two systems for predicting sentiment polarity scores. The first system exploits hand-engineered features and uses Support Vector Regression (SVR) to predict the sentiment scores. The second system combines the hand-engineered features with representations learned using a CNN and a Bi-GRU to predict the sentiment scores. These systems are explained below.

System 1
With the hand-engineered features explained in Section 3.1.1, we built a support vector regression (SVR) model with a linear kernel using the Scikit-learn implementation (Pedregosa et al., 2011). We used only a linear kernel, as most text classification problems are linearly separable (Joachims, 1998). We tuned the C parameter through grid-search cross-validation over the values {10, 1, 0.1, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06} during the training phase.

Hand-crafted Features
Before extracting the features, we lowercased the messages, applied stemming, and removed stopwords. We also replaced named entities (NE) and digits with common identifiers to reduce noise.

Lexical: We extracted word n-grams (n=1,2,3) and character n-grams (n=3,4,5) from the messages, as they are strong lexical representations (Cavnar and Trenkle, 1994; Mcnamee and Mayfield, 2004; Sureka and Jalote, 2010).

Sentiment: SenticNet has been used successfully in problems related to sentiment analysis (Bravo-Marquez et al., 2014; Maharjan et al., 2017), as it provides a collection of concept-level opinion lexicons with scores in five dimensions (aptitude, attention, pleasantness, polarity, and sensitivity). We used both the stemmed and non-stemmed versions of the messages to extract concepts from the knowledge base. We modeled the concepts as a bag-of-concepts (BoC) and used them as binary features. We averaged the concept scores of the five dimensions for each text and used them as numeric features.

Word Embeddings: Word embeddings have been shown to capture semantic information. Hence, in order to capture the semantic representation of the messages, we used publicly available word vectors trained on Google News. They were trained with the method proposed by Mikolov et al. (2013) and have 3M vocabulary entries. We averaged the word vectors of every word in a message and represented the message with a 300-dimensional vector. We skipped any word that is not available in the vocabulary of the pre-trained vectors. The coverage of the Google word embeddings is 73% and 82% for the microblog and headlines data, respectively.

Metadata: We used the message sources, cashtags, and company names as metadata features.

System 2
Figure 1 shows the overall system architecture of our neural network model. The main motivation for using deep learning methods is the wide success these methods have achieved in various NLP tasks (Bahdanau et al., 2014; Collobert and Weston, 2008). The model is a combination of two deep learning architecture based models and a multilayer perceptron (MLP) model operating on the hand-engineered features discussed in Section 3.1.1.
We tokenized each message and represented it as a sequence of word vectors, as in Equation 1. The maximum sequence length (T) was set to 18 for headlines and 33 for microblogs; these lengths were determined from the training data. The embeddings for the words were initialized using pre-trained word embeddings. We used zero vectors to pad the shorter sequences and to represent words missing from the pre-trained vectors. We used Keras (Chollet, 2015) to build the model, with Theano (Theano Development Team, 2016) as the back end.
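The input representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name to_padded_sequence, the toy 3-dimensional vectors, padding at the end of the sequence, and truncation of messages longer than the maximum length are all assumptions made for the example.

```python
from typing import Dict, List


def to_padded_sequence(tokens: List[str],
                       vectors: Dict[str, List[float]],
                       max_len: int,
                       dim: int) -> List[List[float]]:
    """Return exactly max_len rows of length dim for one tokenized message."""
    rows = []
    for token in tokens[:max_len]:  # truncating long messages is an assumption
        # Words missing from the pre-trained vocabulary become zero vectors.
        rows.append(list(vectors.get(token, [0.0] * dim)))
    while len(rows) < max_len:      # pad shorter messages with zero vectors
        rows.append([0.0] * dim)
    return rows


# Toy vocabulary (hypothetical values); the paper uses 300-dimensional vectors.
toy_vectors = {"stocks": [0.1, 0.2, 0.3], "rise": [0.4, 0.5, 0.6]}
padded = to_padded_sequence(["stocks", "unseenword", "rise"],
                            toy_vectors, max_len=5, dim=3)
```

With max_len set to 18 or 33 and dim set to 300, every message would map to a fixed-size matrix that both the CNN and the Bi-GRU can consume.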

Convolutional Neural Network (CNN)
We applied two parallel deep learning architecture based models to the embeddings. The first model is a Convolutional Neural Network (CNN) (LeCun et al., 1989). In this model, we stacked 4 sets of convolution modules with 512 filters each for filter sizes 1, 2, 3, and 4 to capture the n-grams (n = 1 to 4). The t-th convolution output using filter size c is defined by Equation 2, in which the filter is applied from window t to window t + c − 1 of size c. Each convolution unit learns a convolution weight $W_c$ and a bias $b_c$. Each filter of size c produces a high-level feature map $h_c$.
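As a rough illustration of how a single filter of size c slides over a message, the sketch below computes one feature map. It assumes a ReLU activation and no padding at the sequence borders, neither of which is stated in this section; the function name conv1d and the toy values are hypothetical.

```python
from typing import List


def conv1d(seq: List[List[float]],
           weights: List[float],
           bias: float,
           c: int) -> List[float]:
    """Apply one filter of width c to a sequence of word vectors.

    weights holds c * dim values, matching the flattened window.
    Returns len(seq) - c + 1 outputs, one per window position t.
    """
    outputs = []
    for t in range(len(seq) - c + 1):
        # Flatten the word vectors at positions t .. t + c - 1.
        window = [x for vec in seq[t:t + c] for x in vec]
        pre_activation = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(max(0.0, pre_activation))  # ReLU is an assumption
    return outputs


# Four toy 2-dimensional word vectors and one bigram filter (c = 2).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
feature_map = conv1d(seq, weights=[1.0, -1.0, 0.5, 0.5], bias=0.1, c=2)
```

In the architecture described above, 512 such filters would be learned for each of the four filter sizes, giving one feature map per filter.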
On the outputs of those filters, we applied a pooling operation using an attention layer. Attention models have been used effectively in many problems related to computer vision (Ba et al., 2014) and adopted successfully in natural language related problems (Bahdanau et al., 2014; Seo et al., 2016). An attention layer applied on top of a feature map $h_i$ computes the weighted sum $c_i$:

$$c_i = \sum_j \alpha_{ij} h_{ij}$$

The weight $\alpha_{ij}$ is defined by

$$\alpha_{ij} = \frac{\exp(u_{ij}^\top u_w)}{\sum_k \exp(u_{ik}^\top u_w)}$$

where

$$u_{ij} = \tanh(W_w h_{ij} + b_w)$$

Here, $W_w$, $b_w$ and $u_w$ are model parameters. A dense layer containing 128 neurons was applied on the attention layer to get the final representation for the high-level features produced by the CNN model.
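The attention pooling step can be sketched as follows. The code assumes the common formulation in which each element is projected through a tanh layer (parameters W_w and b_w), scored against a context vector u_w, and normalised with a softmax. The function name attention_pool and the toy numbers are illustrative only.

```python
import math
from typing import List, Tuple


def attention_pool(h: List[List[float]],
                   W_w: List[List[float]],
                   b_w: List[float],
                   u_w: List[float]) -> Tuple[List[float], List[float]]:
    """Return the weighted sum c and the attention weights alpha."""
    scores = []
    for h_j in h:
        # Hidden representation of element j: u_j = tanh(W_w h_j + b_w).
        u_j = [math.tanh(sum(w * x for w, x in zip(row, h_j)) + b)
               for row, b in zip(W_w, b_w)]
        # Similarity of u_j with the context vector u_w.
        scores.append(sum(a * b for a, b in zip(u_j, u_w)))
    # Softmax over positions (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of the original elements.
    dim = len(h[0])
    c = [sum(alpha[j] * h[j][k] for j in range(len(h))) for k in range(dim)]
    return c, alpha


# Three toy 2-dimensional elements with hypothetical parameters.
h = [[0.2, 0.1], [0.9, -0.4], [0.0, 0.3]]
c, alpha = attention_pool(h,
                          W_w=[[0.5, -0.5], [0.1, 0.2]],
                          b_w=[0.0, 0.0],
                          u_w=[1.0, -1.0])
```

Because the weights are a softmax output, they are positive and sum to one, so the pooled vector is a convex combination of the inputs.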

Bidirectional GRU (Bi-GRU)
The second model was based on a bidirectional GRU (Bahdanau et al., 2014). It summarized the contextual information from both directions of a sequence and provided annotations for the words. The bidirectional GRU contains a forward GRU of 200 units and a backward GRU of 200 units. The forward GRU $\overrightarrow{f}$ reads a sequence $s_i$ of size $n$ from $w_1$ to $w_n$ to calculate a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n)$, and the backward GRU $\overleftarrow{f}$ reads the same sequence from $w_n$ to $w_1$ to calculate a sequence of backward hidden states $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n)$. For each word $w_j$, we get an annotation $h_j$ by concatenating the forward hidden state $\overrightarrow{h}_j$ and the backward hidden state $\overleftarrow{h}_j$. We applied an attention layer similar to the one used with the CNN on the word annotations to identify the important features and obtained a vector of 200 dimensions.
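To make the forward and backward passes concrete, the following is a small dependency-free sketch of a bidirectional GRU that concatenates the two hidden states for each word. It uses the standard GRU gate equations with randomly initialised toy parameters and 4 units per direction instead of 200. All function names are hypothetical, and the update convention (which term the update gate multiplies) differs between libraries.

```python
import math
import random


def _affine(W, x, U, h, b):
    """Compute W x + U h + b with matrices stored as lists of rows."""
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]


def _sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def gru_step(p, x, h):
    """One GRU update: gates z and r, candidate state, interpolation."""
    z = _sigmoid(_affine(p["Wz"], x, p["Uz"], h, p["bz"]))
    r = _sigmoid(_affine(p["Wr"], x, p["Ur"], h, p["br"]))
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in _affine(p["Wh"], x, p["Uh"], reset_h, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def init_params(in_dim, units, rng):
    """Random toy parameters for the three GRU transformations."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in "zrh":
        params["W" + gate] = mat(units, in_dim)
        params["U" + gate] = mat(units, units)
        params["b" + gate] = [0.0] * units
    return params


def run_gru(p, seq, units):
    """Return the hidden state after each element of seq."""
    h, states = [0.0] * units, []
    for x in seq:
        h = gru_step(p, x, h)
        states.append(h)
    return states


def bi_gru(seq, fwd, bwd, units):
    """Concatenate forward and backward hidden states for every word."""
    forward = run_gru(fwd, seq, units)
    # Read the sequence right to left, then restore the original word order.
    backward = run_gru(bwd, seq[::-1], units)[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
seq = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
annotations = bi_gru(seq, init_params(3, 4, rng), init_params(3, 4, rng), units=4)
```

Each annotation has twice as many values as one direction has units, since it holds both the left-to-right and the right-to-left summary of the context around the word.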

Multilayer Perceptrons (MLP)
To use the hand-engineered features, we employed a multilayer perceptron in parallel with the deep learning architecture based models. We fed the extracted features into the input layer and used four hidden dense layers with 200, 100, 50, and 10 neurons, respectively. For the feature vector representation $\vec{x}_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,T}]$ of message $m_i$, each hidden layer $j$ calculates a vector $\vec{h}_j$ from its weight matrix $W_{ij}$ and its bias vector $b_j$. This model produced a high-level feature representation in the output layer of size 10.
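A forward pass through such a stack of dense layers might look like the sketch below. The layer sizes follow the text. The ReLU activation is inferred from the later remark that the merged layers use tanh "instead of ReLU", and the 20-dimensional input and random weights are purely illustrative.

```python
import random


def relu(value):
    return max(0.0, value)


def dense(x, W, b, activation):
    """One dense layer: activation(W x + b), with W as a list of rows."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def random_layers(in_dim, sizes, rng):
    """Build (W, b) pairs for consecutive layers with toy random weights."""
    layers = []
    for out_dim in sizes:
        W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
             for _ in range(out_dim)]
        layers.append((W, [0.0] * out_dim))
        in_dim = out_dim
    return layers


def mlp_forward(x, layers):
    """Feed x through every layer in turn; ReLU is an assumption."""
    h = x
    for W, b in layers:
        h = dense(h, W, b, relu)
    return h


rng = random.Random(0)
layers = random_layers(in_dim=20, sizes=[200, 100, 50, 10], rng=rng)
features = [rng.random() for _ in range(20)]  # stand-in for real features
representation = mlp_forward(features, layers)
```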
By concatenating the outputs of these three models we created a merged layer of size 338 (128 from the CNN, 200 from the Bi-GRU, and 10 from the MLP). It contained the three types of high-level features computed by the three different types of models: the CNN captured the local information, the Bi-GRU captured the sequence information, and the MLP represented the hand-engineered features. We applied a dense layer of 128 neurons on top of this merged layer. It was similar to the layers used in the MLP model, but here we used tanh as the activation function instead of ReLU. The outputs of this layer were passed to an output layer containing a single neuron with tanh as the activation function. We used tanh in the final two layers because it produces values between -1 and +1, which is also the range of the sentiment scores.
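The merge step and the two tanh layers can be illustrated with random stand-in values. The sketch only demonstrates the shapes (128 + 200 + 10 = 338) and the bounded output; the weights are random rather than trained, and the helper names are hypothetical.

```python
import math
import random


def dense_tanh(x, W, b):
    """One dense layer with tanh activation; W is a list of weight rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Stand-ins for the outputs of the three parallel models.
cnn_out = [rng.uniform(-1, 1) for _ in range(128)]
gru_out = [rng.uniform(-1, 1) for _ in range(200)]
mlp_out = [rng.uniform(0, 1) for _ in range(10)]

merged = cnn_out + gru_out + mlp_out  # concatenation, size 338
hidden = dense_tanh(merged, random_matrix(128, len(merged), rng), [0.0] * 128)
score = dense_tanh(hidden, random_matrix(1, 128, rng), [0.0])[0]
```

Whatever the weights are, the single tanh output stays within the range of the gold sentiment scores, which is the design motivation given above.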

Experiments and Results
As the trial dataset was too small compared to the training data, we merged it with the training data and ran 10-fold cross-validation to evaluate different models. We tuned the hyper-parameters during the training phase through grid-search cross-validation for System 1. We also experimented with different architectures to build System 1 in this phase. We evaluated our models using the official scoring system, which measures the weighted cosine similarity, similar to the scorer used in Ghosh et al. (2015). As the predictions are continuous values between -1 and +1, cosine similarity measures the degree of alignment between the true values and the predictions. The final weighted score is computed by multiplying the cosine similarity by the ratio of the number of predicted values to the number of test cases.

As no official baseline scores were provided, we ran a baseline experiment using a simple linear regression model with the hand-engineered features to compare our models against. Table 2 presents the weighted cosine scores achieved by the models we experimented with. System-1 achieved weighted cosine scores of 0.70 and 0.67 for subtask-1 and subtask-2, respectively. Among the individual neural network models, Bi-GRU performed better than the others, achieving 0.68 for both subtasks. The combination of the three neural network based models (System-2) performed better than the individual models. The neural network models were trained for 10 epochs. We observed overfitting when we increased the number of epochs beyond this.
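The weighted cosine score used for evaluation in this section can be sketched directly from its description. This is an illustrative reimplementation rather than the official scorer; the function name weighted_cosine and the n_test parameter are assumptions, and the function returns 0.0 when either vector has zero norm.

```python
import math
from typing import List, Optional


def weighted_cosine(gold: List[float],
                    pred: List[float],
                    n_test: Optional[int] = None) -> float:
    """Cosine similarity scaled by the share of test cases that were predicted.

    gold and pred are aligned lists covering the predicted instances only;
    n_test is the total number of test cases (defaults to len(gold)).
    """
    if n_test is None:
        n_test = len(gold)
    dot = sum(g * p for g, p in zip(gold, pred))
    norm = (math.sqrt(sum(g * g for g in gold))
            * math.sqrt(sum(p * p for p in pred)))
    cosine = dot / norm if norm else 0.0
    return cosine * len(pred) / n_test


perfect = weighted_cosine([0.5, -0.5], [0.5, -0.5])
half_covered = weighted_cosine([0.5, -0.5], [0.5, -0.5], n_test=4)
opposite = weighted_cosine([1.0, -1.0], [-1.0, 1.0])
```

A system that predicts every test case with perfectly aligned scores obtains 1; predicting only half of the test cases halves the score, so abstaining is penalised.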
From the results for subtask-1 we can see that all the other models performed better than the CNN. This indicates that the features captured by the Bi-GRU and by the hand-engineering process were more informative than the local information captured by the CNN. The strength of the hand-crafted features is also visible in the performance of SVR: although it did not perform as expected on the test data of subtask-1, it showed good performance on the validation set. We submitted predictions by System-1 (sub-1) and System-2 (sub-2) for subtask-1, as they were the best models. Due to the comparatively better performance of System-2 in subtask-2, we submitted predictions from two different models built with this system, varying the number of epochs from 10 (sub-1) to 20 (sub-2). For subtask-1, sub-1 and sub-2 were ranked eleventh and eighth, respectively. For subtask-2, both submissions achieved nearly identical scores and ranked second.
Submitted systems were simultaneously evaluated with an alternate scoring system that measures cosine similarity by grouping instances based on the related company. Our systems ranked first for both subtasks when evaluated with this scorer.

Table 4 shows that our system worked well on messages that consist mostly of plain text (MB2, HL2). It struggles when the text contains more statistics (e.g. $RIG -13% $EK -10%) than plain text, or a mix of words with strong positive and negative sentiments (worst, strong, weird, wilt, falls, exit). In MB1, it is very difficult to determine the sentiment polarity from the message. The message starts with 'Worst performers today', which indicates a negative polarity. The rest of the message contains statistics for different companies. Among them, four indicate a drop in price and only one indicates a rise in the stock price. Although the message starts with a negative impression, it ends with a positive one by saying 'best stock: $WTS +15%'. As this is the only apparent reason for the highly positive true sentiment polarity score of 0.857 for this message, we get a hint that our systems might need to pay more attention to how a message ends.

MB3 starts with the phrase 'Weird day', followed by one positive and one negative piece of news about the stock prices of two companies. Our model predicted 0.588 as the polarity score, whereas the gold score is -0.649. We looked for a possible reason behind our prediction by examining the distribution of the words of this message in the training data. The word 'weird' appeared only once. Of the 241 messages that contain 'day', 68% are positive. Out of 270 messages that contain 'up', 201 (75%) are positive. We found 118 messages that contain 'down', and 80 (68%) of them are negative. So, if a message contains both 'up' and 'down', the chance of predicting it as positive is higher. The related cashtag $GPRO was found in three messages and $amba appeared in only one message.

Our model predicted a positive sentiment for HL1 although the word 'wilt' gives a clear indication of negative polarity. Looking for the reason, we observed that 'wilt' did not appear in the headlines training data at all. Its polarity is -0.087 on a scale of -1 to +1 according to the SenticNet database we used. Our model therefore needs to handle this type of trigger word, which can determine the polarity by itself.

Conclusions
In this paper, we presented our system for analyzing sentiments from microblogs and news headlines. We used deep learning architecture based models to automatically identify important local and sequential features from the texts and concatenated them with a multilayer perceptron-based representation of hand-engineered features extracted from the data. Future work includes analyzing the statistics of ups and downs in the stock prices of companies mentioned in the messages and incorporating them as features of the model.