Aicyber at SemEval-2016 Task 4: i-vector based sentence representation

This paper introduces aicyber's systems for SemEval 2016, Task 4A. The first system is built on a vector space model (VSM); the second is built on a new framework for estimating sentence vectors, inspired by the i-vector from the speaker verification domain. Both systems are evaluated on SemEval 2016 (Task 4A) as well as on the IMDB dataset. Evaluation results show that the i-vector based sentence vector is an alternative approach to representing sentences.


Introduction
SemEval 2016 Task 4 is sentiment analysis in tweets. Subtask A focuses on classifying tweets into three classes: positive, negative or neutral sentiment (Nakov et al., 2016). This paper first presents the submitted system used by team aicyber. Then a new framework for estimating sentence vectors is introduced and evaluated.

The aicyber system
This section introduces the submitted system of team aicyber. The text data is first processed by a tweet tokenizer; emoticons are preserved as tokens. Bag-of-ngram features are extracted and filtered by TF-IDF (Salton, 1991) selection. The resulting feature dimension is around 3800; it is then reduced to 400 by truncated singular value decomposition (SVD) (Klema and Laub, 1980; Halko et al., 2009). This process is also known as Latent Semantic Analysis (LSA) or the Vector Space Model (Turney and Pantel, 2010). Finally, a Linear Discriminant Analysis (LDA) classifier (Hastie et al., 2009) is trained to classify the test data.
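The pipeline above can be sketched with scikit-learn. This is a minimal illustration, not our submitted system: the toy tweets, n-gram range and min_df are assumptions, and the SVD is reduced from the paper's 400 components to fit the toy vocabulary.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy tweets standing in for the tokenized SemEval training data.
pos = ["I love this movie", "great film and great acting", "what a wonderful day"]
neg = ["this is terrible", "awful movie really bad", "I hate this phone"]
neu = ["it is a phone", "the movie is on tv", "they went to the store"]
tweets = (pos + neg + neu) * 20
labels = (["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3) * 20

# Bag-of-ngram features with TF-IDF weighting, reduced by truncated SVD (LSA),
# then classified with LDA. The paper reduces ~3800 features to 400; the toy
# vocabulary here only supports a handful of components.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("svd", TruncatedSVD(n_components=5, random_state=0)),
    ("lda", LinearDiscriminantAnalysis()),
])
pipeline.fit(tweets, labels)
print(pipeline.predict(["I love this film"])[0])
```

The same fitted SVD projection can later be reused to embed unseen tweets, which is how the VSM vectors are combined with i-vectors in Section 5.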
The SemEval 2016 training dataset, which contains 3887 tweets, is used to train the TF-IDF selection, the SVD and the LDA. The development dataset is used for tuning parameters, and the develop-test dataset is used for local testing.
Task A adopts the Macro-F1 measure as evaluation metric (Nakov et al., 2016), averaged over the positive and negative classes:

Macro-F1 = (F1_positive + F1_negative) / 2    (1)

The Macro-F1 of the aicyber system measured on the development, develop-test and 2016 Tweet sets is 0.4514, 0.4787 and 0.4024, respectively.
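The metric can be computed with scikit-learn by restricting the macro average to the positive and negative classes; the gold and predicted labels below are purely illustrative.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for illustration only.
gold = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral", "negative", "neutral"]

# Macro-F1 over the positive and negative classes only; the neutral class
# does not enter the average, although neutral errors still affect the
# precision and recall of the other two classes.
macro_f1 = f1_score(gold, pred, labels=["positive", "negative"], average="macro")
print(round(macro_f1, 4))  # 0.7333
```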
It is obvious that this classification problem has not been satisfactorily solved. One possible reason is that the unbalanced training data biases the system towards the positive class: there are 2101 positive, 1292 neutral but only 494 negative tweets.
Another reason is that the labeled training set, with only 3887 tweets, is too small to cover a reasonable amount of words. As a result, the bag-of-ngram features learned from this training set do not generalize well. This motivates us to seek an alternative feature representation of tweets, namely the sentence vector.
Sentence vector
(Le and Mikolov, 2014) proposed the sentence vector or paragraph vector (PV), which learns a continuous distributed vector representation for text of variable length, and achieved promising results on movie review texts. It is inspired by the word2vec embedding (Mikolov et al., 2013), which captures rich notions of semantic relatedness and compositionality between words. (Mesnil et al., 2014) shows that an ensemble of the sentence vector, an RNN language model (Mikolov et al., 2010) and NB-SVM (Wang and Manning, 2012) achieved a new state-of-the-art result. (Dai et al., 2015) extends the study and provides a more thorough comparison of PV to other tasks such as document modeling, concluding that PV can effectively be used for measuring semantic similarity between long pieces of text.
However, such an approach assumes that the testing data is known while the vector representation is being learned. Access to the testing data during training may not be allowed for certain machine learning tasks, and it is not practical for real applications.
Thus we introduce a new approach that estimates the sentence vector or PV for variable-length texts from word embeddings by using the i-vector framework.

i-vector framework in speech domain
The i-vector (Dehak et al., 2011) is one of the dominant approaches in speaker verification (SV) research in recent years. It projects variable-length speech utterances into a fixed-size low-dimensional vector, namely the i-vector. Its development advanced from traditional techniques such as Gaussian Mixture Models (GMMs) (Rose and Reynolds, 1990; Reynolds and Rose, 1995), adapted GMMs (Reynolds et al., 2000), GMM supervectors (Campbell et al., 2006) and Joint Factor Analysis (Kenny et al., 2007).
To understand the i-vector, we first need to understand the speech data: speech data is a sequence of frames. At each frame a fixed-size feature vector is extracted, such as MFCC (Davis and Mermelstein, 1980), with the addition of dynamic features such as "delta" and "delta-delta" (Furui, 1986). Thus one utterance can be viewed as a list of continuous-valued vectors, as shown in Fig. 1. A large amount of utterances is used to train the background GMM in the i-vector framework. Secondly, we need to understand the supervector. Its aim is to convert a spoken utterance of arbitrary duration into a fixed-length vector. The supervector mentioned here is specifically the GMM supervector constructed by stacking the means of the mixture components. For example, a GMM with 2048 components built on 60-dimensional feature vectors results in a 122880 (2048*60) dimensional supervector. A review of GMMs and supervectors is presented in (Kinnunen and Li, 2010).
Given an utterance, its GMM supervector s can be represented as follows: where m denotes Universal Background Gaussian Mixture Model's (UBM) supervector, T is the totalvariability matrix, which is used to represent the primary directions' variation across a large amount of training data. The coefficients w of this total variability is known as identity vector or simply i-vector. Extraction of this i-vector can be done as follows (Given a SV system built on F dimensional MFCC features and UBM with C Gaussian components): where I is F × F identity matrix, N is a CF × CF diagonal matrix whose diagonal blocks are N c I(c=1,2,....C), and the supervector A is generated by the concatenation of the centralized first-order Baum-Welch statistics. Σ is the covariance matrix of the residual variability not captured by T . The ivector's dimension is usually 400, much lower than that of supervectors. This thus allows the use of various techniques that were not practical in high dimensional space. To give a completed review of i-vector is out of scope of this paper, interested individuals are strongly encouraged to read (Kenny et al., 2008;Dehak et al., 2011). 3.2 i-vector framework for Natural Language Processing task (Shepstone et al., 2016) points out that the central to computing the total variability matrix and i-vector extraction, is the computation of the posterior distribution for a latent variable conditioned on an observed feature sequence of an utterance. In Natural Language Processing(NLP) when observations (words) could be represented as sequences of feature vectors, will the same methodology apply? This motivated us to bring i-vector from speech to NLP domain. Fig. 2 illustrates the fundamental principle of proposed i-vector framework for NLP task, where a sentence is represented through its word embedding during data preprocessing. Delta and delta-delta feature is added during training in order to capture the context information. Compare to Fig. 
1 where spoken utterance is viewed as list of frame level MFCC feature vectors. The proposed i-vector framework replace MFCC features by word vectors, and trained using similar implementation as of speech data.
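The i-vector extraction step can be sketched in NumPy. This is a toy illustration only: it assumes the total-variability matrix T, the residual covariance Σ, the zeroth-order statistics N and the centralized first-order statistics A have already been accumulated, and all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

C, F, R = 4, 3, 2                      # toy sizes: C Gaussians, F-dim features, R-dim i-vector
T = rng.standard_normal((C * F, R))    # total-variability matrix (CF x R)
Sigma_inv = np.eye(C * F)              # inverse residual covariance (identity here for simplicity)
N_c = rng.uniform(1.0, 5.0, size=C)    # zeroth-order Baum-Welch statistics per Gaussian
N = np.diag(np.repeat(N_c, F))         # CF x CF diagonal matrix with blocks N_c * I
A = rng.standard_normal(C * F)         # centralized first-order statistics supervector

# Posterior mean of the latent variable w (the i-vector):
# w = (I + T' Sigma^-1 N T)^-1  T' Sigma^-1 A
precision = np.eye(R) + T.T @ Sigma_inv @ N @ T
w = np.linalg.solve(precision, T.T @ Sigma_inv @ A)
print(w.shape)  # (2,)
```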

The implementation of i-vector framework
Many implementations of the i-vector framework have been developed recently: (Glembek et al., 2011) uses the standard GMM approach, while (Snyder et al., 2015) incorporated a time-delay deep neural network (TDNN) trained on a speech recognition task into the i-vector framework. A 50% relative improvement is obtained when a TDNN instead of a GMM is used to collect the first-order Baum-Welch statistics. In this paper we use the lightweight conventional GMM approach for proof-of-concept purposes.

Experiment and evaluation
Training of the i-vector system is completely unsupervised; it includes training of word2vec and training of the i-vector extractor. Evaluation is done on IMDB, similar to (Maas et al., 2011; Le and Mikolov, 2014; Mesnil et al., 2014), and on SemEval 2016 Task 4A.

Word2vec training
Word2vec training is done using gensim 1. The training dataset is selected from IMDB; it contains 25000 labeled training samples and 50000 unlabeled samples, a total of 75000 movie reviews. The training uses a context window of 7 and a minimum word count of 10, and the resulting word vectors have 20 dimensions.

i-vector extractor training
The same data as for word2vec training is used for i-vector extractor training. As illustrated in Fig. 2, the data preprocessing involves word-to-vector conversion. Words that do not appear in the word2vec model are ignored. Each review is treated as one sentence. A sentence is saved as a list of word vectors, which can be viewed as an M × 20 matrix, where M denotes the number of words in that sentence.
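The word-to-vector conversion can be sketched as follows, with a toy lookup table standing in for the trained word2vec model; out-of-vocabulary words are skipped, as described above.

```python
import numpy as np

# Toy 20-dimensional embeddings standing in for the trained word2vec model.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(20) for w in ["the", "movie", "was", "great"]}

def sentence_to_matrix(tokens, embeddings):
    """Return an M x 20 matrix of word vectors, skipping OOV words."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.vstack(vectors)

matrix = sentence_to_matrix(["the", "movie", "was", "truly", "great"], embeddings)
print(matrix.shape)  # (4, 20) -- "truly" is out of vocabulary and ignored
```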
The proposed i-vector extractor training system 2 is developed using the Kaldi Speech Recognition Toolkit (Povey et al., 2011) 3. The feature used is a 60-dimensional vector consisting of the 20-dimensional word vector and its delta and double-delta. Mean and variance normalization is not applied. During training, a universal background model (UBM) with 2048 Gaussian mixture components is trained on 64000 sentences; each Gaussian in the UBM has a full covariance matrix. After the UBM is trained, the total variability matrix is similarly trained with all 75000 sentences. The learned i-vector extractor is then used to estimate vectors for the IMDB and SemEval 2016 datasets. The output dimension of the i-vector is 200. Note that the test data of IMDB and all data in SemEval 2016 are not observed during training.
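Appending delta and delta-delta features triples the 20-dimensional word vectors to 60 dimensions. The sketch below uses a simple first-difference approximation; Kaldi's add-deltas uses a regression over a sliding window, so the values (though not the shapes) would differ.

```python
import numpy as np

def add_deltas(feats):
    """Append delta and delta-delta features along the time axis.

    Simplified first-difference scheme; Kaldi's add-deltas fits a
    regression over a window of neighboring frames instead.
    """
    delta = np.gradient(feats, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([feats, delta, delta2])

# An M x 20 word-vector matrix for a 5-word sentence (random for illustration).
sentence = np.random.default_rng(0).standard_normal((5, 20))
feats = add_deltas(sentence)
print(feats.shape)  # (5, 60)
```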

Evaluation of proposed framework
The i-vector framework is first evaluated on the IMDB dataset then on the SemEval 2016 dataset.

Evaluation on IMDB
The evaluation metric is accuracy measured on the IMDB database. Table 1 shows the performance of different systems. The current state-of-the-art system, an ensemble of an RNN language model, sentence vectors and NB-SVM, achieved 92.57% testing accuracy (Mesnil et al., 2014). The sentence vector system, one of the sub-systems used in that ensemble, achieved 88.73% accuracy alone. Aicyber's system is the same VSM approach described in Section 2; it obtained 88.38%. For a fair comparison, the same type of classifier, LDA, is used to train and classify the i-vector system, which achieves 87.52% accuracy. By concatenating the i-vector with the vector from the VSM, an accuracy of 89.94% can be achieved.
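The combination step is plain feature concatenation followed by a single LDA classifier. The sketch below uses random matrices in place of the real 200-dim i-vectors and 400-dim VSM vectors, so it only illustrates the shapes and wiring, not the reported accuracy.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 800
ivectors = rng.standard_normal((n, 200))   # stand-ins for the 200-dim i-vectors
vsm = rng.standard_normal((n, 400))        # stand-ins for the 400-dim LSA/VSM vectors
labels = rng.integers(0, 2, size=n)        # binary IMDB sentiment labels

# Concatenate the two representations and train one LDA classifier on the result.
features = np.hstack([ivectors, vsm])      # n x 600
clf = LinearDiscriminantAnalysis().fit(features, labels)
print(features.shape)  # (800, 600)
```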

Evaluation on SemEval 2016
The evaluation metric for the SemEval 2016 task is the Macro-F1 introduced in Equation 1. During the evaluation period we validated performance on the development and develop-test datasets. The results shown in Table 2 indicate that the i-vector system is worse than our baseline system, so we only submitted the baseline system.

Discussion
Judging from the IMDB evaluation, the idea of the i-vector from the speech domain is successfully applied to an NLP task. Though it didn't beat the state-of-the-art, it is well known that basic machine learning techniques can yield strong baselines (Wang and Manning, 2012). The weaker result on SemEval is likely due to data mismatch: the word2vec model and the i-vector extractor are trained on movie review texts, which are much longer and more formal than tweets and use a different vocabulary. Further improvement can be made in the following aspects.
1. The word-to-vector conversion is done via word2vec. Using GloVe word vectors (Pennington et al., 2014), either alone or concatenated with the word2vec vectors, for system training is worth exploring.
2. The parameters of word2vec, GMM and i-vector training are not yet optimized for the task. For example, the word vector dimension is set to 20, much smaller than the 300-dimensional Google word vectors. Finding the best parameters will bring more insight into the semantic meaning of the sentence vector.
3. In this paper, the implementation of the i-vector framework is a GMM-based approach. Incorporating a deep neural network (Snyder et al., 2015) or a convolutional neural network (Kalchbrenner et al., 2014) for the NLP task is worth investigating.

Conclusion
This paper presented a vector space model approach from team aicyber and a new idea for estimating sentence vectors. The proposed i-vector framework was evaluated on SemEval 2016 as well as on the IMDB dataset. Results show that the i-vector based sentence vector is an alternative approach to representing sentences.