funSentiment at SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs Using Word Vectors Built from StockTwits and Twitter

This paper describes the approach we used for SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs. We use three types of word embeddings in our algorithm: word embeddings learned from 200 million tweets, sentiment-specific word embeddings learned from 10 million tweets using distant supervision, and word embeddings learned from 20 million StockTwits messages. In our approach, we also take the left and right context of the target company into consideration when generating polarity prediction features. All the features generated from the different word embeddings and contexts are integrated together to train our algorithm.


Introduction
Domain-specific sentiment analysis has received much attention recently. The financial domain is a high-impact use case for sentiment analysis because it has been shown that sentiments and opinions can affect market dynamics [9,48]. Given the link between sentiment and market dynamics, the analysis of public sentiment becomes a powerful method for predicting market reactions. One main source of public sentiment is social media, such as Twitter and StockTwits.
In this paper, we describe our approach for SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs (Cortis et al., 2017). The task is: given a microblog message, predict the sentiment score for each of the companies/stocks mentioned. Sentiment scores are floating-point values in the range of -1 (very negative/bearish) to 1 (very positive/bullish), with 0 designating neutral sentiment. Our approach uses word embeddings (WE-Twitter) learned from general tweets, sentiment-specific word embeddings (SSWE) learned from distant-supervised tweets, and word embeddings learned from StockTwits messages (WE-StockTwits).
Message- or sentence-level sentiment classification has been studied by many previous works (Go et al., 2009; Mohammad et al., 2013; Pang et al., 2002; Liu, 2012), but there are few studies on target-dependent, or entity-level, sentiment prediction (Jiang et al., 2011; Dong et al., 2014; Vo and Zhang, 2015). A target entity in a message does not necessarily have the same polarity as the message, and different entities in the same message may have different polarities. For example, in the tweet "iPhone is better than Blackberry", the two named entities, iPhone and Blackberry, have different sentiment polarities. Recent studies have focused on learning features directly from tweet text. One approach is to generate sentence representations from word embeddings. Several word embedding generation algorithms have been proposed in previous studies (Collobert et al., 2011; Mikolov et al., 2013). Using general word embeddings directly in sentiment analysis is not effective, since they mainly model a word's semantic context, ignoring the sentiment clues in the text. As a result, words with opposite polarity, such as worst and best, are mapped onto embedding vectors that are close to each other in some dimensions. Tang et al. (2014) propose a sentiment-specific word embedding (SSWE) method for sentiment analysis by extending the word embedding algorithm. SSWE encodes sentiment information in the word embeddings.
Many terms in the financial market have different meanings, especially in sentiment polarity, from those in other domains or sources, such as general news articles and Twitter. For example, the terms long, short, put and call have special meanings in the stock market. Another example is the term underestimate, which is negative in general but can suggest an opportunity to buy when used in stock market messages. Therefore, in this study we also build word embeddings specifically from StockTwits messages.
The context of an entity affects its polarity value, and an entity usually has both a left context and a right one, unless it is at the beginning or end of a message. Both the context information and the interaction between these two contexts are included as features in our approach. In this task, the financial microblogs come from StockTwits and Twitter, so we incorporate features generated from WE-Twitter, SSWE and WE-StockTwits to represent these contexts, since they complement each other.

Previous Studies
Sentence or Message Level Sentiment: Traditional sentiment analysis approaches use sentiment lexicons (Mohammad et al., 2013; Thelwall et al., 2012; Turney, 2002) to generate various features. Pang et al. (2002) treat sentiment classification as a special case of text categorization by applying learning algorithms. Many studies follow Pang's approach by designing features and applying different learning algorithms to them (Feldman, 2013; Liu, 2012). Go et al. (2009) proposed a distant supervision approach to derive features from tweets collected using positive and negative emoticons. Several studies (Hu et al., 2013; Liu, 2012; Pak and Paroubek, 2010) follow this approach. Feature engineering plays an important role in microblog sentiment analysis; Mohammad et al. (2013) implemented hundreds of handcrafted features for tweet sentiment analysis.
Deep learning has been used in sentiment analysis tasks, mainly by applying word embeddings (Collobert et al., 2011; Mikolov et al., 2013). Learning the compositionality of phrases and sentences and then using them in sentiment classification has also been explored (Hermann and Blunsom, 2013; Socher et al., 2011; Socher et al., 2013). Using general word embeddings directly in sentiment analysis may not be effective, since they mainly model a word's semantic context, ignoring the sentiment clues in the text. Tang et al. (2014) propose a sentiment-specific word embedding method by extending the word embedding algorithm from (Collobert et al., 2011) and incorporating sentiment data in the learning of word embeddings.
Entity or Target Level Sentiment: Jiang et al. (2011) use both entity-dependent and entity-independent features, generated based on a set of rules, to assign polarity to entities. Using POS features and the CRF algorithm, Mitchell et al. (2013) identify polarities for people and organizations in tweets. Dong et al. (2014) apply an adaptive recursive neural network to entity-level sentiment classification. These two approaches use syntax parsers to parse the tweet and generate related features. In our approach, we consider both the left and right contexts of a target when generating features.

Methodology
In this section, we describe the three main components used in our method, the WE-Twitter, SSWE and WE-StockTwits models, and explain how the learning features are generated from them and integrated together.

WE-Twitter and WE-StockTwits Word Embedding
Word Embedding: A word embedding is a dense, low-dimensional, real-valued vector for a word. The embedding of a word captures both the syntactic structure and the semantics of the word. Traditional bag-of-words and bag-of-ngrams representations hardly capture the semantics of words. Word embeddings have been used in many NLP tasks. The C&W model (Collobert et al., 2011) and the word2vec model (Mikolov et al., 2013), which is used in this study to generate the WE-Twitter and WE-StockTwits embeddings, are the two most popular models. The embeddings are learned to optimize an objective function defined on the original text, such as the likelihood of word occurrences. One implementation is word2vec from Mikolov et al. (2013). This model has two training options, continuous bag of words and the Skip-gram model. The Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships, and it is used in our method for building the WE-Twitter and WE-StockTwits models. Generating word embeddings from a text corpus is an unsupervised process; to get high-quality embedding vectors, a large amount of training data is necessary. After training, each word, including all hashtags in the case of tweets, is represented by a low-dimensional, dense, real-valued vector.
WE-Twitter model construction: The tweets for building the WE-Twitter model include tweets obtained through Twitter's public streaming API and the Decahose data (10% of Twitter's streaming data) obtained from Twitter. Only English tweets are included in this study; in total there are about 200 million tweets. Each tweet text is preprocessed to get a clean version, following the steps below:
- all URLs and mentions are removed.
- dates are converted to a special symbol.
- all ratios are replaced by a special symbol.
- integers and decimals are normalized to two special symbols.
- all special characters, except hashtags, cashtags, emoticons, question marks and exclamation marks, are removed.
Stop words are not removed, since they provide important information on how other words are used. In total, about 2.9 billion words were used to train the WE-Twitter model. Based on our pilot experiments, we set the embedding dimension size, word frequency threshold and window size to 300, 5 and 8, respectively. There are about 1.9 million unique words in this model.
WE-StockTwits model construction: StockTwits is a financial social network for sharing ideas among traders. Anyone on StockTwits can contribute content: short messages limited to 140 characters that cover ideas on specific investments. Most messages have a cashtag, which is a stock symbol, such as $aapl, to specify the entity (stock) the message is about. We received permission from StockTwits to access their historical message archive from 2011 to 2016. We extract 20 million messages from this data set to build the WE-StockTwits model. Some preprocessing steps are performed to clean the messages:
- messages that contain only cashtags, URLs, or mentions are discarded, since they do not have meaningful terms.
- message text is converted to lower case.
- all URLs are removed.
- all mentions are converted to a special symbol, for privacy reasons.
- all cashtags are replaced by a special symbol, to prevent cashtags from acquiring a polarity value tied to a particular time period.
The embedding dimension size, word frequency threshold and window size are set to 300, 5 and 8, respectively.
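The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline from the paper: the placeholder symbols and regular expressions are our own assumptions.

```python
import re

def clean_stocktwits(text):
    """Rough sketch of the StockTwits cleaning steps described above.

    The placeholder tokens (<user>, <cashtag>) are illustrative
    assumptions; the paper does not specify the exact symbols used.
    """
    text = text.lower()                              # lower-case the message
    text = re.sub(r"https?://\S+", "", text)         # remove URLs
    text = re.sub(r"@\w+", "<user>", text)           # mentions -> special symbol
    text = re.sub(r"\$[a-z]+\b", "<cashtag>", text)  # cashtags -> special symbol
    return re.sub(r"\s+", " ", text).strip()

def is_meaningful(cleaned):
    """Discard messages left with no real terms after cleaning."""
    residue = cleaned.replace("<user>", "").replace("<cashtag>", "").strip()
    return bool(residue)
```

For example, `clean_stocktwits("$AAPL to the moon! http://t.co/x")` yields `"<cashtag> to the moon!"`, and a message containing only a cashtag, URL and mention would be discarded by `is_meaningful`.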

Sentiment-Specific Word Embedding
SSWE: The C&W model (Collobert et al., 2011) learns word embeddings based on the syntactic contexts of words. It replaces the center word of an n-gram with a random word to derive a corrupted n-gram. The training objective is that the original n-gram should obtain a higher language model score than the corrupted n-gram. The original and corrupted n-grams are each fed as input to a feed-forward neural network. SSWE extends the C&W model by incorporating sentiment information into the neural network, so that the learned embeddings capture the sentiment information of sentences as well as the syntactic contexts of words. Given an original (or corrupted) n-gram and the sentiment polarity of a tweet as input, it predicts a two-dimensional vector (f_0, f_1) for each input n-gram, where f_0 and f_1 are the language model score and the sentiment score of the input n-gram, respectively. The training objectives are twofold: the original n-gram should get a higher language model score than the corrupted n-gram, and the polarity score of the original n-gram should be more aligned with the polarity label of the tweet than that of the corrupted one. The loss function is a linear combination of two losses, where loss_0(t, t') is the syntactic loss and loss_1(t, t') is the sentiment loss:

loss(t, t') = α · loss_0(t, t') + (1 − α) · loss_1(t, t')
SSWE model construction: The SSWE model used in this study was trained from massive distant-supervised tweets, collected using positive and negative emoticons, such as :), =), :( and :-(. A total of 10 million tweets were collected, where 5 million contain positive emoticons and the other 5 million contain negative ones. The embedding dimension size was set to 50 and the window size to 3.
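The combined objective can be illustrated with a small sketch. We use hinge-style ranking losses for the two components, in the spirit of the C&W ranking setup; the exact component definitions in SSWE are simplified here.

```python
def hinge(original_score, corrupted_score):
    """Ranking hinge loss: the original n-gram should outscore the corrupted one
    by a margin of 1."""
    return max(0.0, 1.0 - original_score + corrupted_score)

def sswe_loss(syn_orig, syn_corr, sent_orig, sent_corr, alpha=0.5):
    """Linear combination of the syntactic loss (loss_0) and sentiment loss (loss_1),
    weighted by alpha, as in loss(t, t') = a*loss_0(t, t') + (1-a)*loss_1(t, t')."""
    loss0 = hinge(syn_orig, syn_corr)    # language-model ranking loss
    loss1 = hinge(sent_orig, sent_corr)  # sentiment ranking loss
    return alpha * loss0 + (1 - alpha) * loss1
```

When alpha is 1 the objective reduces to the purely syntactic C&W loss; when alpha is 0 only the sentiment ranking matters.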

Features
Given a message and the target entity, nine types of features are generated based on the WE-Twitter, SSWE, and WE-StockTwits models, and they are integrated together to train the algorithm. Three types of features (left context, right context, and whole message) are generated from each model. Figure 1 shows the nine types of features: the red ones are SSWE embeddings, and the blue ones are WE-Twitter and WE-StockTwits embeddings. The subscripts L and R refer to the left and right side of an entity, respectively. These features are described below:
WE-Twitter_L and WE-Twitter_R: These are the WE-Twitter embeddings for the text on the left side and right side of the target entity, respectively. In this task, occasionally the given cashtag (company stock symbol) does not appear in the message text. In this case, the whole tweet text is used for both the left and right contexts; this case is handled in the same way when generating WE-StockTwits_L, WE-StockTwits_R, SSWE_L and SSWE_R, described below.
WE-StockTwits_L and WE-StockTwits_R: These are the WE-StockTwits embeddings for the text on the left side and right side of the target entity, respectively.
SSWE_L and SSWE_R: These are the SSWE embeddings for the text on the left side and right side of the target entity, respectively.
WE-Twitter, WE-StockTwits and SSWE: These are the embeddings generated from the whole message text, which means they are entity-independent features. We use these three features to capture the whole message, which reflects the interaction between the left and right sides of the entity.
Together, these nine types of embeddings capture the different types of information we are interested in: the entity's left and right contexts, the interaction between the two sides, the sentiment-specific word embedding information, and the general word embedding information learned from Twitter and StockTwits.
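The left/right context split, including the fallback when the cashtag is absent, can be sketched as follows. This is a simplified illustration over pre-tokenized text; the helper name and tokenization are our own assumptions.

```python
def split_contexts(tokens, target):
    """Return (left, right) token lists around the target cashtag.

    If the target does not appear in the message, the whole message is
    used for both contexts, as described above.
    """
    target = target.lower()
    lowered = [t.lower() for t in tokens]
    if target in lowered:
        i = lowered.index(target)
        return tokens[:i], tokens[i + 1:]
    return tokens[:], tokens[:]
```

Each of the two context token lists is then mapped through the WE-Twitter, WE-StockTwits and SSWE lookup tables to produce the six context features, and the full token list produces the three entity-independent features.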

Text Representation from Term Embeddings
A message or text segment, such as the left/right context of an entity, contains multiple words, and each word has its own embedding vector. How to combine them to represent the message, so that all messages have an embedding vector of the same size, needs to be explored. There are different ways to do this, e.g., for each embedding dimension, taking the max value over all the words. In our approach, we use a concatenation layer, which concatenates the max, min and average pooling layers of the word embeddings, because this combination gave the best performance in our pilot experiments. The concatenation layer is expressed as follows:

Z(t) = [Z_max(t), Z_min(t), Z_ave(t)]
where Z(t) is the representation of text segment t.
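A minimal sketch of this concatenation layer, pooling dimension-wise over the per-word embedding vectors (variable names are ours):

```python
def pool_concat(word_vectors):
    """Concatenate dimension-wise max, min and average pooling:
    Z(t) = [Z_max(t), Z_min(t), Z_ave(t)].

    word_vectors: list of equal-length embedding vectors, one per word.
    Returns a fixed-size vector of length 3 x embedding dimension.
    """
    dims = list(zip(*word_vectors))        # transpose: one tuple per dimension
    z_max = [max(d) for d in dims]
    z_min = [min(d) for d in dims]
    z_ave = [sum(d) / len(d) for d in dims]
    return z_max + z_min + z_ave
```

Because the pooling is per dimension, the output size depends only on the embedding dimension, not on the number of words in the segment.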

Experiment and Result
For this task, the training data were provided by the task organizers: 1,704 tweets and StockTwits messages, which we downloaded from Twitter and StockTwits. To build our model, we split this data set into two parts: 80% as training data and 20% as development data. Since the predicted output in this task is a real value, we use a linear regression algorithm in our approach. Based on the cosine similarity metric and the evaluation data set, which consists of 800 tweets and StockTwits messages (Cortis et al., 2017), the score of our approach is 0.7153, and our team is ranked #6 among the 38 submissions.
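The evaluation metric, cosine similarity between the vector of predicted scores and the vector of gold scores, can be computed as follows (a standard implementation, not the organizers' exact scoring script):

```python
import math

def cosine_similarity(pred, gold):
    """Cosine of the angle between the predicted and gold score vectors.

    pred, gold: equal-length sequences of sentiment scores in [-1, 1].
    """
    dot = sum(p * g for p, g in zip(pred, gold))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_g = math.sqrt(sum(g * g for g in gold))
    return dot / (norm_p * norm_g)
```

A score of 1.0 means the predictions are perfectly proportional to the gold scores; orthogonal prediction and gold vectors score 0.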

Conclusion
This paper describes the approach we used for SemEval-2017 Task 5: Fine-Grained Sentiment Analysis on Financial Microblogs. We use three types of word embeddings in our algorithm: general word embeddings learned from 200 million tweets, sentiment-specific word embeddings learned from 10 million tweets using distant supervision, and word embeddings learned from 20 million StockTwits messages. We treat the task as a target-dependent sentiment analysis problem and consider the left and right contexts of the target company.