INESC-ID at SemEval-2016 Task 4-A: Reducing the Problem of Out-of-Embedding Words

We present the INESC-ID system for the 2016 edition of the SemEval Twitter Sentiment Analysis shared task (subtask 4-A). The system is based on the Non-Linear Sub-space Embedding (NLSE) model developed for last year's competition. This model trains a projection of pre-trained embeddings into a small sub-space using the available supervised data. Despite its simplicity, the system attained performance comparable to the best systems of the last edition with no need for feature engineering. One limitation of this model was the assumption that a pre-trained embedding is available for every word. In this paper, we investigate different strategies to overcome this limitation by exploiting character-level embeddings and learning representations for out-of-embedding vocabulary words. The resulting approach outperforms our previous model by a relatively small margin, while still attaining strong results and consistently good performance across all the evaluation datasets.


Introduction
Pre-trained word embeddings provide a simple means to attain semi-supervised learning in Natural Language Processing (NLP) tasks (Collobert et al., 2011). They can be trained on large amounts of unsupervised data and then fine-tuned as the initial building block of a semi-supervised system. However, in domains with a significant number of typos, slang and abbreviations, such as social media, the high number of singletons leads to poor fine-tuning of the embeddings. In previous work, we addressed this by learning a projection of the embeddings into a small sub-space (Astudillo et al., 2015b). This allowed us to attain representations also for Out-Of-Vocabulary (OOV) words, provided that embeddings for those words are available. However, even if the embeddings are estimated from large amounts of unlabeled text, in noisy domains such as Twitter a significant number of words will not be seen and therefore will not have an embedding. We refer to those words as the Out-of-Embedding Vocabulary (OOEV).
In this paper, we focus on the problem of obtaining good representations for OOEV words. We experimented with character-to-word (C2W) models and investigated different strategies for initializing and updating OOEV representations from the available training data. The best results were attained by using the labeled data to perform small updates to these representations in the first few epochs of training. The resulting system outperforms that of the previous evaluation, although by a small margin. It ranks fourth in the 2016 evaluation, with consistently high performance across all years.

NLSE Model Overview
In this section, we briefly review the approach introduced in the 2015 evaluation (Astudillo et al., 2015a). For a particular regression or classification task, only a subset of all the latent aspects captured by the word embeddings will be useful. Therefore, instead of updating the embeddings directly with the available labeled data, we estimate a projection of these embeddings into a low-dimensional sub-space. This simple method brings two fundamental advantages. Firstly, the lower-dimensional embeddings require fewer parameters, better fitting the complexity of the target task and the amount of available training data. Secondly, the learned projection can be used to adapt the representations of all words with an embedding, even if they do not occur in the labeled dataset.
Assume we are given a matrix of pre-trained embeddings E ∈ R^{e×|V|}, where each column represents a word from a vocabulary V and e is the number of latent dimensions. We define the adapted embedding matrix as the factorization S · E, where S ∈ R^{s×e} with s ≪ e. The parameters of matrix S are estimated using the labeled dataset, while E is kept fixed. In other words, we determine the optimal projection of the embedding matrix E into a sub-space of dimension s. In what follows, we refer to this approach as the Non-Linear Sub-space Embedding (NLSE) model.
The NLSE can be interpreted as a simple feed-forward neural network model (Rumelhart et al., 1985) with a single hidden layer that utilizes the embedding sub-space approach. Let m = (w_1, …, w_n) denote a message of n words. Each column w ∈ {0, 1}^{v×1} of m represents a word in one-hot form, that is, a vector of zeros of the size of the vocabulary v with a 1 in the i-th entry. Let y denote a categorical random variable over K classes. The NLSE model estimates the probability of each possible category y = k ∈ K given a message m as

p(y = k | m; θ) = softmax_k(Y · h · 1),

with parameters θ = {S, Y}. Here, h ∈ [0, 1]^{s×n} holds the activations of the hidden layer for each word, given by

h = σ(S · E · m),

where σ(·) is a sigmoid function acting on each element of the matrix. The matrix Y ∈ R^{K×s} (here K = 3) maps the embedding sub-space to the classification space, and 1 ∈ 1^{n×1} is a vector of ones that sums the scores of all words together, prior to normalization. This is equivalent to a bag-of-words assumption. Finally, the softmax function turns these summed scores into a probability distribution over the K classes.
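As a sketch, the forward pass described above can be written as follows; the dimension sizes, function name and index-based lookup (equivalent to multiplying E by the one-hot matrix m) are our own illustrative choices:

```python
import numpy as np

def nlse_forward(E, S, Y, word_ids):
    """Forward pass of the NLSE model (a sketch; dimensions follow the text).

    E: pre-trained embeddings, shape (e, |V|)  -- kept fixed
    S: sub-space projection,   shape (s, e)    -- learned
    Y: classification layer,   shape (K, s)    -- learned
    word_ids: indices of the n words in the message
    """
    # Select the embedding columns of the message (equivalent to E @ m for one-hot m)
    z = E[:, word_ids]                    # shape (e, n)
    # Project into the sub-space and apply the element-wise sigmoid
    h = 1.0 / (1.0 + np.exp(-(S @ z)))    # shape (s, n)
    # Sum the per-word scores (bag-of-words assumption), then softmax over K classes
    scores = (Y @ h).sum(axis=1)          # shape (K,)
    scores -= scores.max()                # numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

Note that only S and Y would receive gradient updates during training, while E stays frozen.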

Out-of-Embedding Vocabulary Words
Despite the fact that word embeddings are typically estimated from very large amounts of unlabeled data, it is often the case that a number of words appearing in the training or test sets are not present in the unlabeled corpus. These words will not be represented in E. This problem is even more significant in social media environments like Twitter, where there is significant lexical variation and where novel words, expressions and slang can be introduced over time. In Table 1, we show the percentage of OOV and OOEV words in each Twitter dataset.
The naïve way of dealing with this issue is to simply set the embeddings of unknown words to zero, essentially ignoring them. As we will see later, a better approach is to treat these words as model parameters and use the training signal to learn a better representation for them.

Character-level Embeddings
One natural way of avoiding OOEVs in neural network models is to learn character-level embeddings and define a composition function that combines them into word representations, thus obtaining a representation for any given word.
We experimented with C2W, a simple compositional model for learning word representations from character embeddings. Given a word w, the C2W model generates a set of character n-grams {c_1, …, c_m} and projects each n-gram c_i into a vector e_{c_i} ∈ R^d, where d is the number of latent dimensions. The individual character representations are then combined to obtain a fixed-size representation for word w as e_w = e_{c_1} ⊕ … ⊕ e_{c_m}, where ⊕ denotes pointwise sum. These word representations can be used as the input to a standard neural language model, whose parameters are estimated from unlabeled data by learning to predict words within a context.
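A minimal sketch of the compositional step just described; the boundary markers, the seeding scheme, and the handling of unseen n-grams are our own illustrative choices (in the actual model the n-gram vectors are learned from unlabeled data rather than fixed at random):

```python
import numpy as np

def word_ngrams(word, n=3):
    """Character n-grams of a word, with boundary markers (an illustrative choice)."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def compose_word_embedding(word, ngram_table, d=50, n=3):
    """Compose a word vector as the pointwise sum of its character n-gram vectors.

    ngram_table maps an n-gram string to a vector in R^d; unseen n-grams get a
    deterministic random vector here (hypothetical handling, for the sketch only).
    """
    vec = np.zeros(d)
    for g in word_ngrams(word, n):
        if g not in ngram_table:
            seed = sum(ord(ch) for ch in g)  # deterministic seed for the sketch
            ngram_table[g] = np.random.default_rng(seed).normal(scale=0.1, size=d)
        vec += ngram_table[g]
    return vec
```

Because every word decomposes into character n-grams, this composition yields a vector for any string, including OOEV words.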

Mapping C2W to SSG Embeddings
Unfortunately, the C2W embeddings performed very poorly in our model. Therefore, to have embeddings for all words, we employed an approach similar to that of Mikolov et al. (2013): we learn a mapping between the embedding spaces induced by C2W and SSG, allowing us to compute an approximate SSG embedding for every word. To this end, we first obtained C, the set of words present in both embedding spaces. Then, we learned a linear map T by solving the following objective:

T* = argmin_T Σ_{w ∈ C} ‖T · c_w − s_w‖²,

where c_w denotes the C2W embedding of word w and s_w denotes the SSG embedding of word w. This mapping was then used to compute an SSG embedding for each OOEV word as s_w = T · c_w.
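This least-squares objective has a closed-form solution; a minimal sketch (the function name and matrix layout are ours):

```python
import numpy as np

def fit_linear_map(C_mat, S_mat):
    """Least-squares estimate of T minimizing sum_w ||T c_w - s_w||^2.

    C_mat: (N, d_c) C2W embeddings of the shared words, one row per word
    S_mat: (N, d_s) SSG embeddings of the same words, in the same order
    Returns T of shape (d_s, d_c), such that s_w ≈ T @ c_w.
    """
    # lstsq solves C_mat @ T_t ≈ S_mat for T_t, i.e. the transpose of T
    T_t, *_ = np.linalg.lstsq(C_mat, S_mat, rcond=None)
    return T_t.T
```

For an OOEV word, the approximate SSG embedding is then `fit_linear_map(C, S) @ c_w`.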

Partial Update of Embeddings during Training
Given the small amount of supervised data, directly updating the embeddings with the SemEval training set leads to very poor results. It is, however, possible to update only the OOEV words present in the training set simultaneously with the computation of the sub-space (Astudillo et al., 2015a). To obtain positive results with this approach, it was also necessary to reduce the effect of training by lowering the learning rate to 0.001 and updating the embeddings only in the first two iterations.
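A sketch of this partial-update rule; the function name and masking scheme are our own, while the learning rate and the two-epoch cutoff follow the text:

```python
import numpy as np

def partial_embedding_update(E, grad_E, ooev_ids, epoch, lr=0.001, max_epochs=2):
    """Update only OOEV embedding columns, and only in the first few epochs.

    E: (e, |V|) embedding matrix; grad_E: gradient of the loss w.r.t. E;
    ooev_ids: column indices of out-of-embedding words seen in training.
    """
    if epoch >= max_epochs:
        return E  # embeddings frozen after the first two iterations
    mask = np.zeros(E.shape[1], dtype=bool)
    mask[ooev_ids] = True
    E[:, mask] -= lr * grad_E[:, mask]  # pre-trained columns stay untouched
    return E
```

The sub-space parameters S and the classification layer Y would keep receiving normal updates in every epoch; only the embedding matrix is restricted in this way.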

Main Improvements over the 2015 NLSE
One complication with Twitter-based evaluations is the need for participants to retrieve the tweets themselves, since some of the tweets may no longer be available. The INESC-ID system presented in 2015 employed a training set of 8604 tweets, considerably smaller than the original dataset (11328 tweets). For this edition, it was possible to get hold of the full dataset, as utilized by Severyn and Moschitti (2015). For reproducibility and comparison purposes, our systems this year were developed with this dataset.

The system presented in 2015 was very simple, both in its structure and in its number of hyperparameters. Furthermore, tuning and selection of candidate systems was performed without automatic grid search. It was therefore expected that our previous setup would readily produce better results when trained on a larger dataset. Disappointingly, this was not the case. In fact, the NLSE configuration optimized for the 2015 competition seemed to sit in a local optimum that was difficult to escape. To overcome this problem, we introduced two modifications to the training procedure.

First, the NLSE is trained by minimizing the negative log-likelihood. This cost function is sub-optimal with respect to the evaluation metric, as it weights positive, negative and neutral predictions equally. A simple improvement over this cost is an asymmetric weighting that penalizes the predictions of neutral tweets. This was incorporated as a multiplicative factor on the log-likelihood, with values 4/3, 4/3 and 1/3 for the positive, negative and neutral classes, respectively. Second, to reduce the risk of getting trapped in a local minimum, the training data was shuffled before each training epoch. The asymmetric cost and randomization led to a slower, less consistent convergence. For this reason, the number of iterations was increased from 8 to 12, and the learning rate was changed from 0.01 to 0.005. Table 2 shows the effect of these improvements on the submitted system.
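The asymmetric cost above can be sketched as follows; the class ordering and the function name are our own choices:

```python
import numpy as np

# Assumed class order: 0 = positive, 1 = negative, 2 = neutral (our labeling choice)
CLASS_WEIGHTS = np.array([4/3, 4/3, 1/3])

def weighted_nll(probs, gold):
    """Negative log-likelihood scaled by the asymmetric class weight of the gold label.

    probs: predicted distribution over the 3 classes; gold: gold class index.
    Neutral errors are down-weighted (1/3) relative to positive/negative (4/3).
    """
    return -CLASS_WEIGHTS[gold] * np.log(probs[gold])
```

Since the neutral class does not enter the averaged positive/negative F1 used in the evaluation, down-weighting its term pushes the model toward the metric that actually matters.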
After introducing these two improvements, we investigated the different methods to address the problem of OOEV words described in the previous section, namely exploiting C2W embeddings, mapping C2W embeddings to SSG embeddings, and training the embeddings for OOEVs. The results of these strategies are displayed in Table 3.

Table 3: Comparison of strategies to address the problem of OOEV words.

The Submitted System
As mentioned in the previous section, the submitted system is an improvement over our 2015 system (Astudillo et al., 2015b). It therefore shares the same training characteristics as the previous model. The 52 million tweets used by Owoputi et al. (2013) and the tokenizer described in the same work were used to train the Structured Skip-Gram (SSG) word embeddings. For this submission, the C2W embeddings were also trained using a publicly available toolkit. The annotated SemEval training data was pre-processed as follows: lower-casing, replacing Twitter user mentions and URLs with special tokens, and reducing any character repetition to at most 3 characters. Following Astudillo et al. (2015a), we used embeddings with 600 dimensions and set the sub-space size to 10 dimensions.
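The pre-processing steps just described can be sketched as follows; the `<user>`/`<url>` token strings and the specific regular expressions are our own illustrative choices:

```python
import re

def preprocess_tweet(text):
    """Pre-processing as described in the text: lower-case, replace user mentions
    and URLs with special tokens, and cap character repetitions at 3.
    (The <user>/<url> placeholder strings are our own choice.)"""
    text = text.lower()
    text = re.sub(r"@\w+", "<user>", text)                # user mentions
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)  # URLs
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)          # "sooooo" -> "sooo"
    return text
```

Applying the same normalization to the unlabeled tweets and the labeled data keeps the vocabularies aligned, which reduces spurious OOEV words.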
To train the model, the development set was split into 80% for parameter learning and 20% for model evaluation and selection, maintaining the original relative class proportions in each set. The weights were all initialized uniformly at random in the ranges [−0.001, 0.001], [−0.1, 0.1] and [−0.7, 0.7] for the OOEV embeddings, sub-space and classification layers, respectively. The training procedure minimized the negative log-likelihood over the training data with respect to the parameters, using standard Stochastic Gradient Descent (Rumelhart et al., 1985) with a fixed learning rate of 0.005 and a minibatch of size 1, i.e., updating the weights after each message was processed. We reshuffled the training data before each epoch.

Priority was given to models that presented consistently high performance on all the datasets. In retrospect, judging from the evaluation results, this was most probably a suboptimal decision. Table 4 displays the performance of the top 5 systems in SemEval 2016 task 4-A (Nakov et al., 2016). The NLSE system (labeled INESC-ID) ranks fourth, with stable performance across all years. The results are particularly strong for 2013, with a difference of 0.017 points over the next best performing system in the top five. This is consistent with the divide noticed during system selection between performance on the 2013 and 2015 data. Systems performing well in 2014, and particularly in 2013, do not appear to perform equally well in more recent years.
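The stratified 80/20 split described above can be sketched as follows; the function name and seeding are our own choices:

```python
import random
from collections import defaultdict

def stratified_split(labels, dev_frac=0.2, seed=0):
    """Split example indices 80/20 while preserving relative class proportions.

    labels: list of class labels, one per example.
    Returns (train_idx, dev_idx) index lists over the examples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, dev_idx = [], []
    for y in sorted(by_class):          # deterministic class order
        idxs = by_class[y]
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * dev_frac))
        dev_idx.extend(idxs[:cut])      # 20% of each class for selection
        train_idx.extend(idxs[cut:])    # 80% of each class for learning
    return train_idx, dev_idx
```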

Conclusions
We presented the INESC-ID system for the SemEval 2016 task 4-A, built on top of the successful Non-Linear Subspace Embedding model. We found that training with a larger dataset required a more careful procedure to avoid overfitting. Reproducing the best results obtained in SemEval 2015 required shuffling the data before each training epoch and adapting the cost function to better reflect the evaluation metric.
To address the problem of out-of-embedding words, we tried to introduce character-level embeddings in our model but found these to be detrimental. We obtained better results by learning embeddings for these words during training. Even though the performance gains were not very pronounced, our system still attained very strong results across all the evaluation datasets.