SSN_MLRG1 at SemEval-2017 Task 4: Sentiment Analysis in Twitter Using Multi-Kernel Gaussian Process Classifier

The SSN MLRG1 team for SemEval-2017 Task 4 has applied a Gaussian Process classifier, with bag-of-words feature vectors and fixed-rule multi-kernel learning, to sentiment analysis of tweets. Since tweets on the same topic, made at different times, may exhibit different emotions, their properties such as smoothness and periodicity also vary with time. Our experiments show that, compared to a single kernel, multiple kernels are effective in learning the simultaneous presence of multiple properties.


Introduction
Twitter is a huge microblogging service with more than 500 million tweets per day, posted from different locations around the world and in different languages (Nabil et al., 2016). Sentiment analysis of Twitter data has been applied in various domains such as commerce (Jansen et al., 2009), disaster management (Verma et al., 2011) and health (Chew and Eysenbach, 2010). The task is challenging because of the informal writing style, the semantic diversity of the content, and the "unconventional" grammar. These challenges in building a classification model can be handled by using proper approaches to feature generation and machine learning.
The heart of every Gaussian Process model is a covariance kernel. Multiple Kernel Learning (MKL), using multiple kernels instead of a single one, can be useful in two ways:
• Different kernels correspond to different notions of similarity; instead of trying to find which works best, a learning method does the picking for us, or may use a combination of them. Using a specific kernel may be a source of bias, which is avoided by allowing the learner to choose from among a set of kernels.
• Different kernels may use inputs coming from different representations, possibly from different sources or modalities.
(Gönen and Alpaydın, 2011) and (Wilson and Adams, 2013) explain how multiple kernels can give powerful performance. (Gönen and Alpaydın, 2011) also describe in detail various methodologies to combine kernels. (Wilson and Adams, 2013) introduce simple closed-form kernels that can be used with Gaussian Processes to discover patterns and enable extrapolation. These kernels support a broad class of stationary covariances, while Gaussian Process inference remains simple and analytic.
We studied the possibility of using multiple kernels to explain the relation between the input data and the labels. While there is a body of work on applying MKL to numerical data and images, applying MKL to text is still largely unexplored.

Gaussian Process
A Gaussian Process (GP) is a non-parametric Bayesian model for the supervised setting. Formally, a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). Using a Gaussian process, we can define a distribution over functions, f(x) ~ GP(m(x), k(x, x′)), where m(x) is the mean function, usually defined to be zero, and k(x, x′) is the covariance function (or kernel function) that defines the prior properties of the functions considered for inference. The Gaussian Process has the following main advantages (Cohn and Specia, 2013; Cohn et al., 2014).
• The kernel hyper-parameters can be learned via evidence maximization.
• GP provides fully probabilistic predictions, including an estimate of the uncertainty in each prediction.
• Unlike SVMs, which need an unbiased version of the dataset for probabilistic prediction and still do not take into account the uncertainty of f(x), GP does not suffer from this problem.
• GP can be easily extended and incorporated into a hierarchical Bayesian model.
• GP works well when combined with kernel models.
• GP also works well for small datasets.
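The defining property above, that any finite set of inputs induces a joint Gaussian, can be illustrated by drawing sample functions from a zero-mean GP prior. This is a minimal NumPy sketch; the function name, inputs, and kernel are illustrative, not part of our system:

```python
import numpy as np

def gp_prior_sample(xs, kernel, n_samples=3, jitter=1e-8):
    # Any finite set of inputs xs induces a joint Gaussian with zero
    # mean and covariance K[i, j] = k(x_i, x_j); sampling from that
    # Gaussian gives function values drawn from the GP prior.
    K = np.array([[kernel(a, b) for b in xs] for a in xs])
    K += jitter * np.eye(len(xs))  # small jitter for numerical stability
    rng = np.random.default_rng(0)
    return rng.multivariate_normal(np.zeros(len(xs)), K, size=n_samples)
```

Each row of the returned array is one function drawn from the prior, evaluated at the inputs xs; the choice of kernel controls how smooth those draws look.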

Gaussian Process Classification
In Gaussian Process Classification (GPC), we place a GP prior over a latent function f(x) and then "squash" this prior through the logistic function to obtain a prior on π(x) ≜ p(y = +1|x) = σ(f(x)). Note that π is a deterministic function of f, and since f is stochastic, so is π.
Inference is divided into two steps: first, computing the distribution of the latent variable corresponding to a test case, where p(f|X, y) = p(y|f) p(f|X) / p(y|X) is the posterior over the latent variables, and subsequently using this distribution over the latent variable to produce a probabilistic prediction. In classification, the non-Gaussian likelihood makes the posterior integral analytically intractable, and the predictive integral can likewise be intractable for certain sigmoid functions. Therefore, we need an analytical approximation of the integrals. We can approximate the non-Gaussian joint posterior with a Gaussian one using the Expectation Propagation (EP) method (Minka, 2001). EP uses the probit likelihood, under which the posterior is still analytically intractable.
To overcome this hurdle in the EP framework, the likelihood is approximated by a local likelihood approximation in the form of an un-normalized Gaussian function in the latent variable f_i, which defines the site parameters Z̃_i, μ̃_i and σ̃²_i.
The posterior p(f|X, y) is approximated by a Gaussian q(f|X, y). A practical implementation of Gaussian Process Classification (GPC) for the binary case (Rasmussen and Williams, 2006) is outlined in the following algorithm.
Algorithm: Predictions for Expectation Propagation GPC.
Input: ν̃, τ̃ (natural site parameters), X (training inputs), y (training targets), k (covariance function), x* (test input).
Output: Predictive class probability π̄*.
1. L := cholesky(I_n + S̃^{1/2} K S̃^{1/2})
2. z := S̃^{1/2} Lᵀ \ (L \ (S̃^{1/2} K ν̃))
3. f̄* := k(x*)ᵀ (ν̃ − z)
4. v := L \ (S̃^{1/2} k(x*))
5. V[f*] := k(x*, x*) − vᵀv
6. π̄* := Φ(f̄* / √(1 + V[f*]))
7. return π̄* (predictive class probability)
The natural site parameters ν̃ and τ̃ for Expectation Propagation GPC are found using the EP approximation algorithm. Multi-class classification can be performed using either one-versus-rest or one-versus-one training and prediction. For Gaussian Process classification, one-versus-one might be computationally cheaper, so we have used it for subtasks A and C.
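The prediction steps above can be sketched in NumPy as follows. The natural site parameters are taken as given (in a real system they come from the EP approximation loop), and the function name is ours:

```python
import math
import numpy as np

def ep_gpc_predict(K, k_star, k_ss, nu_tilde, tau_tilde):
    # Predictive class probability for binary EP GPC, following the
    # algorithm above; K is the training covariance matrix, k_star
    # the covariances between training inputs and the test input,
    # and k_ss = k(x*, x*).
    n = K.shape[0]
    s = np.sqrt(tau_tilde)                               # S~^{1/2}, kept as a vector
    B = np.eye(n) + s[:, None] * K * s[None, :]
    L = np.linalg.cholesky(B)                            # step 1
    w = s * (K @ nu_tilde)
    z = s * np.linalg.solve(L.T, np.linalg.solve(L, w))  # step 2
    f_star = k_star @ (nu_tilde - z)                     # step 3: latent test mean
    v = np.linalg.solve(L, s * k_star)                   # step 4
    var_star = k_ss - v @ v                              # step 5: latent test variance
    u = f_star / math.sqrt(1.0 + var_star)
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))    # step 6: probit squash
```

As a quick sanity check, setting ν̃ = 0 makes the latent test mean zero, so the predicted probability is exactly 0.5.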

Multiple Kernel Gaussian Process
The covariance kernel k of a Gaussian Process directly specifies the covariance between every pair of input points in the dataset. The particular choice of covariance function determines properties, such as smoothness, length scale, and amplitude, of the functions drawn from the GP prior.
We have used the Exponential kernel and the Multi-Layer Perceptron kernel, each combined with the Squared Exponential kernel, and found the combinations to give better results. The text data used in sentiment analysis is collected over a period of time. Comments on the same topic may exhibit different emotions depending on the time they were made, and hence their properties, such as smoothness and periodicity, also vary with time. Since any one kernel learns only certain properties well, multiple kernels are effective in detecting the simultaneous presence of different emotions in the data.
MKL algorithms use different learning methods for determining the kernel combination function; these fall into five major categories: fixed rules, heuristic approaches, optimization approaches, Bayesian approaches and boosting approaches. The kernels can be combined in one of two basic ways, either using linear combination or using non-linear combination. Linear combination seems more promising (Gönen and Alpaydın, 2011) and has two basic categories: unweighted sum (i.e., using the sum or mean of the kernels as the combined kernel) and weighted sum. Non-linear combination uses non-linear functions of kernels, namely multiplication, power, and exponentiation. In this work we have studied the fixed-rule linear combination, which for base kernels k_1, …, k_P can be represented as the unweighted sum
k(x, x′) = Σ_{m=1}^{P} k_m(x, x′)    (7)
For training, we have used a one-step method together with the simultaneous approach: one-step methods calculate, in a single pass, both the parameters of the combination function and those of the combined base learner, and the simultaneous approach ensures that both sets of parameters are learned together.
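The fixed-rule unweighted sum needs no learned combination weights, so it can be sketched in a few lines of Python (the toy base kernels below are illustrative only, not our feature kernels):

```python
import math

def fixed_rule_sum(kernels):
    # Fixed-rule linear MKL: the combined kernel is the plain
    # (unweighted) sum of the base kernels, so no combination
    # parameters need to be learned.
    def combined(x, y):
        return sum(k(x, y) for k in kernels)
    return combined

# Toy scalar base kernels for illustration
k_rbf = lambda x, y: math.exp(-0.5 * (x - y) ** 2)
k_exp = lambda x, y: math.exp(-abs(x - y))
k = fixed_rule_sum([k_rbf, k_exp])
```

Because each base kernel equals 1 at zero distance, the combined kernel gives k(x, x) = 2; a valid sum of kernels is itself a valid kernel.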

System Overview
The system comprises the following modules: data extraction, preprocessing, feature vector generation, and multi-kernel Gaussian Process model building. The data is preprocessed with tokenization and lemmatization using the NLTK toolkit. Each training label is then mapped to an integer value. A data dictionary is built from the training sentences, and feature vectors for the training set are generated by encoding the bag-of-words (BoW) representation. These feature vectors are given as input to build the MKGPC model.
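The dictionary and feature-vector steps can be sketched as follows (plain Python on pre-tokenized tweets; in our pipeline the tokens would come from NLTK, which is omitted here, and the helper names are ours):

```python
def build_dictionary(tokenized_tweets):
    # Map every word seen in the training sentences to a column index.
    vocab = {}
    for tokens in tokenized_tweets:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow_vector(tokens, vocab):
    # Encode one tweet as a bag-of-words count vector over the
    # dictionary; words unseen at training time are simply ignored.
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec
```

The resulting count vectors are the inputs over which the covariance kernels of the MKGPC model are computed.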
The Multi-Kernel Gaussian Process Classification (MKGPC) model building is outlined in the following algorithm. There are different kernels that can be used to build a GPC model. The Squared Exponential (SE) kernel, sometimes called the Gaussian or Radial Basis Function (RBF) kernel, has become the default kernel in GPs. To model the long-term, smoothly rising trend, we use a Squared Exponential covariance term
k_SE(x, x′) = σ² exp(−‖x − x′‖² / (2l²))
where σ² is the variance and l is the length-scale. The Exponential kernel is also particularly common in machine learning and hence is used in GPs as well. Kernel methods of this kind perform tasks such as statistical classification, regression analysis, and cluster analysis on data in an implicit feature space.
The Multi-Layer Perceptron kernel has also found use in GPs, as it can learn the periodicity property present in the dataset; its covariance is given by
k_MLP(x, x′) = σ² (2/π) arcsin( (σ_w² xᵀx′ + σ_b²) / √((σ_w² xᵀx + σ_b² + 1)(σ_w² x′ᵀx′ + σ_b² + 1)) )
where σ_w² holds the variances of the prior over the input weights and σ_b² is the variance of the prior over the bias parameters. The kernel can learn more effectively because of the additional parameters σ_w² and σ_b².
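The three covariance functions combined in our models can be written directly in NumPy. This is a sketch with scalar variance parameters; the function and parameter names are ours:

```python
import numpy as np

def k_se(x, y, sigma2=1.0, l=1.0):
    # Squared Exponential (RBF) kernel: smooth, the default GP choice.
    r2 = np.sum((x - y) ** 2)
    return sigma2 * np.exp(-r2 / (2.0 * l ** 2))

def k_exp(x, y, sigma2=1.0, l=1.0):
    # Exponential kernel: like SE but with the unsquared distance,
    # giving rougher sample functions.
    r = np.sqrt(np.sum((x - y) ** 2))
    return sigma2 * np.exp(-r / l)

def k_mlp(x, y, sigma2=1.0, sw2=1.0, sb2=1.0):
    # Multi-Layer Perceptron (arcsine) kernel; sw2 and sb2 are the
    # variances of the priors over input weights and biases.
    num = sw2 * (x @ y) + sb2
    den = np.sqrt((sw2 * (x @ x) + sb2 + 1.0) * (sw2 * (y @ y) + sb2 + 1.0))
    return sigma2 * (2.0 / np.pi) * np.arcsin(num / den)
```

A fixed-rule combined covariance is then simply, e.g., k_se(x, y) + k_mlp(x, y), evaluated over the BoW feature vectors.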

Results and Discussion
The output submitted for the task was obtained using MKGPC with the Radial Basis Function kernel and the Exponential kernel. We also built a classifier with the sum of the RBF and Multi-Layer Perceptron kernels. We observe from Table 1 that though the macro-averaged precision of the MKGPC models is the same as that of the single-kernel GPC (SGPC), their macro-averaged recall and F-measure are better than SGPC's (except for MKGPC(R+E)), because the Multi-Layer Perceptron kernel learns the periodicity better than the RBF and Exponential kernels do. These models, when evaluated on the datasets for subtask A and subtask C, exhibited performance similar to that on subtask B. The system underperforms compared to the baseline system in subtask C, and to logistic regression on 1-grams in subtasks A and B, since only a small fraction of the dataset was used for training.

Official Evaluation
Our system scored a macro-averaged recall of 0.431 (ranked 35) for subtask A, a macro-averaged recall of 0.586 (ranked 20) for subtask B, and a macro-averaged mean absolute error of 1.325 (ranked 15) for subtask C.