HCS at SemEval-2017 Task 5: Polarity detection in business news using convolutional neural networks

Task 5 of SemEval-2017 involves fine-grained sentiment analysis on financial microblogs and news. Our solution for determining the sentiment score extends an earlier convolutional neural network for sentiment analysis in several ways. We explicitly encode a focus on a particular company, we apply a data augmentation scheme, and use a larger data collection to complement the small training data provided by the task organizers. The best results were achieved by training a model on an external dataset and then tuning it using the provided training dataset.


Introduction
This paper describes our approach to Task 5 of the SemEval-2017 Challenge: fine-grained sentiment analysis on financial microblogs and news. The task is to determine the sentiment score (positive or negative) of a mention of a given company in a business-related text document, either a microblog message (Track 1) or a news headline (Track 2).
Our solution, "HCS," is a convolutional neural network that predicts sentiment scores. The model's input comprises two kinds of information: the text of an article and a list of focus points, i.e., positions in the text where a given company is mentioned. Foci allow the model to distinguish company mentions within the text and to assign different scores to them.
The data provided by the task organisers (Handschuh et al., 2016) consists of short, one-sentence messages, each with a given focus company. To train the model on additional data, we use the Named Entity (NE) recognition module of PULS (Yangarber and Steinberger, 2009; Atkinson et al., 2011), a news monitoring system, to find company mentions in arbitrary text.

Data
The SemEval training set contains 1700 sentences for the microblog track and 1300 news headlines for the headline track, which is a very limited resource for training flexible models. To compensate for the small size of the provided training sets, we built an extended training set. The PULS news monitoring system collects articles from a range of sources of business news (Pivovarova et al., 2013; Du et al., 2016). One of our data sources is a collection of news summaries written by business analysts, which contain metadata annotations.
The metadata does not include sentiment scores. However, it does provide labels that indicate business events mentioned in the article, e.g., Investment, Fraud or Merger. The labels are not mutually exclusive, and some documents may have more than one label. There are approximately 300 labels, some of which imply, or weakly imply, positive or negative sentiment. However, most labels do not. We selected only those labels with the clearest sentiment implications: e.g., Investment, New Product, Sponsorship, etc., are considered "positive," while Fraud, Layoff, Bankruptcy, etc., are considered "negative." In total, we used 26 positive and 12 negative labels.
Using these labels, we collected a training set from the corpus of short articles. We selected only documents for which we can infer a clear sentiment score; if a document has several labels with conflicting sentiment, it is not used for training. Further, we used only those documents whose headline and first sentence mention exactly one company. The rationale for this is that two companies mentioned together may have different scores. Since our event labels do not provide such detailed information, we avoid these cases to keep the training data as clean as possible. A positive label is assigned a score of 1 and a negative label a score of -1.
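The selection rule above can be illustrated with a minimal sketch. The label sets below contain only the six labels named in the text (the full lists of 26 positive and 12 negative labels are not given here), and the function names and the `n_companies` parameter are hypothetical, introduced only for illustration.

```python
from typing import Iterable, Optional

# Assumption: partial label lists, restricted to the labels named in the text.
POSITIVE_LABELS = {"Investment", "New Product", "Sponsorship"}
NEGATIVE_LABELS = {"Fraud", "Layoff", "Bankruptcy"}


def infer_score(labels: Iterable[str]) -> Optional[int]:
    """Return 1 or -1 when the labels imply a clear sentiment, else None."""
    label_set = set(labels)
    has_positive = bool(label_set & POSITIVE_LABELS)
    has_negative = bool(label_set & NEGATIVE_LABELS)
    if has_positive == has_negative:
        # Either no sentiment-bearing label, or conflicting labels.
        return None
    return 1 if has_positive else -1


def keep_document(labels: Iterable[str], n_companies: int) -> bool:
    """Keep a document only if it has a clear score and exactly one company.

    n_companies is assumed to be the number of distinct companies found by
    NE recognition in the headline and first sentence (hypothetical input).
    """
    return n_companies == 1 and infer_score(labels) is not None
```

Under these assumptions, a document labelled with both Investment and Fraud is discarded, as is any document that mentions two companies.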
The dataset produced in this fashion is highly skewed: 90% of the data are positive. We apply a random undersampling strategy (Stamatatos, 2008; Erenel and Altınçay, 2013) by randomly selecting a subset of positive documents so that positive and negative training data are more balanced. In our corpus, 100,000 documents have a negative label and mention exactly one company. Thus, with a matching number of sampled positive documents, the total dataset consists of 200,000 documents. Of these, 10% are used as a development set to determine when to stop training.
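A minimal sketch of random undersampling followed by a development split is given below. It assumes an exact 1:1 balance, which matches the reported totals but is not stated as the exact procedure; the function names, the fixed seed and the shuffling step are illustrative choices.

```python
import random
from typing import List, Sequence, Tuple


def undersample(positives: Sequence[str], negatives: Sequence[str],
                seed: int = 0) -> List[Tuple[str, int]]:
    """Randomly drop documents from the larger class to balance the data."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    kept_positive = rng.sample(list(positives), n)
    kept_negative = rng.sample(list(negatives), n)
    data = [(doc, 1) for doc in kept_positive]
    data += [(doc, -1) for doc in kept_negative]
    rng.shuffle(data)  # mix classes before splitting
    return data


def split_dev(data: List[Tuple[str, int]], dev_fraction: float = 0.1):
    """Hold out a fraction of the (already shuffled) data for development."""
    n_dev = round(len(data) * dev_fraction)
    return data[n_dev:], data[:n_dev]
```

For example, with 90 positive and 10 negative documents, `undersample` returns 20 labelled documents, and `split_dev` holds out 2 of them.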

Approach
Our model is based on a convolutional neural network (Kim, 2014), which demonstrated state-of-the-art performance on sentiment analysis (Tai et al., 2015). The original model is relatively simple, and we adapt it for determining the sentiment score for a given company. We add an indicator of focus to the input, i.e., the position of the company of interest, for which we wish to determine a sentiment score. We also augment the network by incorporating additional convolutional layers.
An overview of our model is shown in Figure 1. The inputs are fed into the network as zero-padded sentences of a fixed size, where each word is represented as a fixed-dimensional embedding, complemented with a scalar indicator of focus. The inputs are fed into a layer of convolutional filters with multiple widths, optionally followed by deeper convolutional layers. The results of the last convolutional layer are max-pooled, producing a vector with one scalar per filter, which is then fed into a fully-connected layer with dropout regularisation, and a soft-max output layer. The output is a 2-dimensional vector that is interpreted as a probability distribution over two possible outcomes: positive and negative. Thus, an instance with a sentiment score of -1 is mapped to [1, 0], and an instance with a score of 1 is mapped to [0, 1]. A cross-entropy loss function is computed between the network's output and the true value to update the network weights via back-propagation.
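The target encoding and loss described above can be sketched in a few lines of plain Python. This is only an illustration of the output layer and the loss for a single instance, not of the full network; the function names and the small constant `eps` are assumptions added for numerical safety.

```python
import math
from typing import List, Sequence


def encode_target(score: int) -> List[float]:
    """Map a score of -1 to [1, 0] and a score of 1 to [0, 1]."""
    if score not in (-1, 1):
        raise ValueError("expected a score of -1 or 1")
    return [1.0, 0.0] if score == -1 else [0.0, 1.0]


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn two raw outputs into a probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], target: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between predicted probabilities and the target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))
```

With these definitions, a confident correct prediction yields a loss close to 0, while an uninformative prediction of [0.5, 0.5] yields a loss of about log 2.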
Next, we briefly describe the details of the components of the model.
Embeddings: Words are represented by 128-dimensional embeddings. The initial embeddings were trained using GloVe (Pennington et al., 2014) on a corpus of 5 million business news articles.
Each document was pre-processed using lemmatisation and named entity (NE) recognition. All NEs of a certain type are mapped to the same token, e.g., all company names have the same embedding.
Following the suggestion of Kim (2014), we tune the embeddings during training by updating them at each iteration. This allows the model to learn word properties that are significant for sentiment detection, such as the difference between antonyms, that are not necessarily captured well in the initial embeddings.
Focus: One crucial extension beyond the model in (Kim, 2014) is the focus vector, indicating the position(s) of a given company in the text. The focus vector is shown in darker grey in Figure 1, with the company position in a red frame. This provides an additional dimension to the word embedding, and helps to distinguish between training instances that differ only in focus and sentiment.
The reason for introducing focus is that sentiment is not a feature of the text as a whole, but of each company mention. Two mentions in the same text may have different sentiments, and a model needs to be able to distinguish them. In this sense, this task is similar to aspect-based sentiment analysis (Pontiki et al., 2016), where the task is not to classify a text or sentence, but an entity within the text. The notion of focus is similar to attention (Bahdanau et al., 2016; Yin et al., 2016), with the difference that attention is learned during training whereas focus is given as an additional input.
We experiment with three alternative representations for focus. The baseline model has no focus, and uses only lexical features without NEs. In the binary strategy, the focus vector contains ones in positions where the target company appears, and zeros elsewhere. In the smoothed strategy, the focus value for each word indicates the proximity of the current word to the position of the nearest mention of the target company. Proximity is computed according to the formula:

$$\mathrm{proximity}(p) = \frac{1}{1 + |p - m|}$$

where p is the position of the current word and m is the position of the nearest mention of the target company. Thus, proximity is 1 for a company mention, 1/2 for its immediate neighbours, 1/3 for the next neighbours, etc. It is never 0, which allows a convolution filter to use information about focus points even when the nearest mention lies farther away than the filter width.

Data augmentation: Since the training set contains only "simple" instances, i.e., those that mention exactly one company, as described in Section 2, we introduce a method for data augmentation which allows us to generate more realistic data. By feeding our model instances that mention several companies, we force the network to make use of the focus information, so it can learn to handle more complex test instances, producing a better model.
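The binary and smoothed focus strategies described above can be sketched as follows. The smoothed variant uses 1 / (1 + |p - m|), which is our reading of the values listed in the text (1 at a mention, 1/2 for immediate neighbours, 1/3 for the next ones); the function names and the zero-based token positions are assumptions.

```python
from typing import List, Sequence


def binary_focus(length: int, mentions: Sequence[int]) -> List[float]:
    """Ones at positions of the target company, zeros elsewhere."""
    mention_set = set(mentions)
    return [1.0 if p in mention_set else 0.0 for p in range(length)]


def smoothed_focus(length: int, mentions: Sequence[int]) -> List[float]:
    """Proximity of each position to the nearest mention of the target."""
    if not mentions:
        raise ValueError("at least one mention position is required")
    return [
        # Distance to the nearest mention; the value is never 0.
        1.0 / (1 + min(abs(p - m) for m in mentions))
        for p in range(length)
    ]
```

For a five-token sentence with the company at position 2, `binary_focus` marks only that position, while `smoothed_focus` decays symmetrically on both sides. Either vector would be appended to the word embeddings as one extra input dimension per token.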
To augment the data, we randomly select two simple instances, which gives them a 50% chance of having different sentiments, and concatenate them. We then randomly decide which of them should receive focus. As a result, we get an instance that mentions a focus company and a distractor company either on the left or on the right of the focus. We expect that, using these examples, the model will learn to ignore sentiment signals if they are far removed from the focus.
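The augmentation step can be sketched as below. The `Instance` container, the function names and the fixed seed are hypothetical; the sketch assumes each simple instance carries its tokens, the token positions of its focus company, and a score of 1 or -1.

```python
import random
from typing import List, NamedTuple


class Instance(NamedTuple):
    tokens: List[str]
    focus: List[int]  # positions of the focus company in tokens
    score: int        # 1 or -1


def augment(first: Instance, second: Instance,
            rng: random.Random) -> Instance:
    """Concatenate two simple instances and keep the focus of one of them."""
    tokens = first.tokens + second.tokens
    if rng.random() < 0.5:
        # Focus on the left part; the right part acts as a distractor.
        return Instance(tokens, list(first.focus), first.score)
    # Focus on the right part; shift its positions past the left part.
    offset = len(first.tokens)
    return Instance(tokens, [p + offset for p in second.focus], second.score)


def augment_corpus(simple: List[Instance], n_new: int,
                   seed: int = 0) -> List[Instance]:
    """Generate n_new augmented instances from randomly chosen pairs."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        first, second = rng.sample(simple, 2)
        augmented.append(augment(first, second, rng))
    return augmented
```

The augmented instance inherits the score of whichever part receives focus, so the other company's sentiment cues appear in the input but do not affect the label.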
Model tuning: We have two different corpora: a large one collected by us and a small training set provided by the task organisers. We used a two-stage learning procedure, where the model is first trained using the large corpus and then refined using the shared task data. The core idea is that the first stage is used to learn a coarse solution for the problem on rich data, while the second stage is used to fine-tune the model for the specific task at hand. In particular, in the second stage the model should calibrate its output to the exact values of the scores, since in the first stage all instances are labelled using only 1 or -1.
For the first training phase we used 10K sentences as a development set to determine when to stop training. For the second phase we take another approach, since we want to use as much data as possible for training. First, we split the data into two halves and tune the model on the first half, using the second half as a development set to determine the number of steps before the model overfits. Then we tune the model using the entire training set (and no development set) and allow it to train for the same number of epochs, which means the model has seen each training instance the same number of times.

Table 1 shows the results for a selection of models trained on our data and tested on the shared task's training set. For the experiments we use only English microblogs. The evaluation is done in terms of cosine similarity between a model's output and the correct answer, as well as accuracy.

We explore several hyper-parameters of the model: the number of convolution layers and the number and size of convolution filters. We also report the effect of using (or not) the data augmentation scheme described above. For instances in which the same company is mentioned several times, we either keep a single instance with several foci or split it into several instances with one focus point each.
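The two evaluation measures mentioned above can be sketched as follows. The sketch assumes that predictions and gold scores are compared as two vectors with one entry per test instance, and that accuracy counts agreement in polarity; both are assumptions about the evaluation setup, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(predicted: Sequence[float],
                      gold: Sequence[float]) -> float:
    """Cosine between the vector of predictions and the vector of gold scores."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    dot = sum(p * g for p, g in zip(predicted, gold))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_g = math.sqrt(sum(g * g for g in gold))
    if norm_p == 0 or norm_g == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_p * norm_g)


def sign_accuracy(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Fraction of instances where prediction and gold agree in polarity.

    Assumption: a score of exactly 0 is treated as non-positive.
    """
    hits = sum((p > 0) == (g > 0) for p, g in zip(predicted, gold))
    return hits / len(gold)
```

Under this reading, a model can reach high accuracy with low cosine similarity if it predicts the right polarity but poorly calibrated magnitudes.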

Results
As shown in the table, the data augmentation scheme does not help the performance for this particular task. Thus, we submitted a solution without augmentation. Using foci increases performance for microblogs but not for headlines, probably because most instances in the task have only one mention. However, we submitted a solution with (smoothed) focus since we believe it will be crucial in more realistic settings. It can also be seen from the table that, although the results for headlines and microblogs have comparable accuracy, microblog classification is substantially worse in terms of cosine similarity.
The model we chose for the SemEval submission (for both subtasks) is highlighted in blue in the table. For each subtask, we made two submissions: one without tuning, using only our data, and one with the tuning step, in which we continue refining the model using the headline and microblog data, respectively. The final results of the shared task are shown in Table 3. As can be seen in the table, tuning provides a substantial improvement: 16% for headlines and 24% for microblogs. Table 2 shows some examples of the more problematic cases that we found during error analysis.
Example 1 would require processing of long-distance dependencies. In this sentence the key phrase accounting scandal is far from the focus company Tesco, so none of the convolutional filters is applied to the company name and the phrase at the same time. The focus mechanism reduces the weight of the phrase, since another company name appears between the focus and the phrase, which may indicate a drawback of our model on such short input strings. Some sentences are incorrectly classified due to a complicated syntactic structure.
Example 2 contains a string of strongly negative cues (breaks, downward slide, cutting, sales decline), which should cancel each other out, but correct processing of such sentences would require deeper semantic analysis. Note that in this task we have rather short pieces of text; in a more realistic setting the model should classify an entire document, where the company of interest would be mentioned multiple times with different keywords in context.