NEUROSENT-PDI at SemEval-2018 Task 3: Understanding Irony in Social Networks Through a Multi-Domain Sentiment Model

This paper describes the NeuroSent system that participated in SemEval 2018 Task 3. Our system takes a supervised approach that builds on neural networks and word embeddings. Word embeddings were built starting from a repository of user-generated reviews; thus, they are specific to sentiment analysis tasks. Tweets are then converted into the corresponding vector representation and given as input to the neural network with the aim of learning the different semantics contained in each emotion taken into account by the SemEval task. The output layer has been adapted based on the characteristics of each subtask. Preliminary results obtained on the provided training set are encouraging for pursuing the investigation in this direction.


Introduction
Sentiment Analysis is a natural language processing (NLP) task (Dragoni et al., 2015a) which aims at classifying documents according to the opinion expressed about a given subject (Federici and Dragoni, 2016a,b). Many works available in the literature address the sentiment analysis problem without taking into account the specific domain context of the documents when sentiment models are built.
The need to investigate this problem from a multi-domain perspective stems from the different influence that a term might have in different contexts. The idea of adapting term polarity to different domains emerged only in the last decade (Blitzer et al., 2007; Dragoni and Petrucci, 2017). Multi-domain sentiment analysis approaches discussed in the literature focus on building models for transferring information between pairs of domains (Dragoni, 2015; Petrucci and Dragoni, 2015). While on the one hand such approaches make it possible to propagate domain-specific information to other domains, their drawback is the necessity of building new transfer models every time a new domain has to be analyzed. Thus, such approaches generalize poorly, because transfer models are limited to the N domains used for building the models.
The problem of detecting irony in text can be considered from a multi-domain perspective. The development of the social web has stimulated creative and figurative language use like irony. This frequent use of irony on social media has important implications for natural language processing tasks, which struggle to maintain high performance when applied to ironic text (Liu and Zhang, 2012; Maynard and Greenwood, 2014; Ghosh and Veale, 2016). Although different definitions of irony co-exist, it is often identified as a trope or figurative language use whose actual meaning differs from what is literally enunciated. As such, modeling irony has a large potential for applications in various research areas, including text mining, author profiling, detecting online harassment, and perhaps one of the most popular applications at present, sentiment analysis. As described by Joshi et al. (2017), recent approaches to irony can roughly be classified into rule-based and machine learning-based methods. While rule-based approaches mostly rely upon lexical information and require no training, machine learning invariably makes use of training data and exploits different types of information sources, including bags of words, syntactic patterns, sentiment information or semantic relatedness. Recently, deep learning techniques have gained increasing popularity for this task, as they make it possible to integrate semantic relatedness by making use of, for instance, word embeddings.
In this paper, we discuss how the NeuroSent tool has been applied in SemEval 2018 Task 3 (Hee et al., 2018). The tool builds on the following pillars: (i) the use of word embeddings for representing each word contained in raw sentences; (ii) the generation of word embeddings from an opinion-based corpus instead of a general purpose one (like news or Wikipedia); (iii) the design of a deep learning technique exploiting the generated word embeddings for training the sentiment model; and (iv) the use of multiple output layers for combining domain overlap scores with domain-specific polarity predictions.
The last point enables the exploitation of linguistic overlaps between domains, which can be considered one of the pivotal assets of our approach. This way, the overall polarity of a document is computed by aggregating, for each domain, the domain-specific polarity value multiplied by a belonging degree representing the overlap between the embedded representation of the whole document and the domain itself.
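The aggregation described above can be illustrated with a minimal sketch. It assumes that the aggregation is a plain weighted sum and that the per-domain polarity values and belonging degrees are already available; the function name, the domain names, and the numeric values are hypothetical and serve only as an illustration.

```python
def aggregate_polarity(domain_polarity, domain_overlap):
    """Combine domain-specific polarities with per-domain belonging degrees.

    Both arguments are dicts keyed by domain name (hypothetical inputs).
    Assumption: the aggregation is a plain sum of polarity * belonging degree.
    """
    return sum(domain_polarity[d] * domain_overlap[d] for d in domain_polarity)


# Toy values: polarity in [-1, 1], belonging degree in [0, 1].
polarity = {"books": 0.8, "movies": -0.2}
overlap = {"books": 0.75, "movies": 0.25}
score = aggregate_polarity(polarity, overlap)
```

In this sketch, a domain whose belonging degree is close to zero contributes almost nothing to the final score, which reflects the role that the domain overlap plays in the description above.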

Related Work
Sentiment analysis from the multi-task and multi-domain perspective is a research field which started to be explored only in the last decade. According to the nomenclature widely used in the literature (see (Blitzer et al., 2007; Dragoni and Petrucci, 2017)), we call domain a set of documents about similar topics, e.g. a set of reviews about similar products like mobile phones, books, movies, etc. The massive availability of multi-domain corpora in which similar opinions are expressed about different topics opened the scenario for new challenges. Researchers tried to train models capable of acquiring knowledge from a specific domain and then exploiting such knowledge for working on documents belonging to different ones. This strategy was called domain adaptation. The use of domain adaptation techniques demonstrated that opinion classification is highly sensitive to the domain from which the training data is extracted. The reason is that when using the same words, and even the same language constructs, we may obtain different opinions, depending on the domain. The classic scenario occurs when the same word has positive connotations in one domain and negative connotations in another one, as we showed within the examples presented in Section 1.
While on the one hand such approaches demonstrated their effectiveness in working in a multi-domain environment, on the other hand they suffered from the limitation of being influenced by the linguistic overlap between domains. Indeed, such an overlap leads learning algorithms to infer similar polarity values for domains that are similar from the linguistic perspective.
The adoption of evolutionary algorithms within the sentiment analysis research field is quite recent. The first studies focused on the use of evolutionary solutions for modeling financial indicators starting from investors' sentiments (Yamada and Ueda, 2005; Chen and Chang, 2005; Huang et al., 2012; Yang et al., 2017; Simoes et al., 2017). Here, the evolutionary component was used for learning the trend of financial indicators with respect to the sentiment information extracted from opinions provided by the investors. With respect to these papers, we propose an approach applying evolutionary computation at a more fine-grained level, where the evolutionary component also affects the polarities of opinion concepts.
Studies considering the use of evolutionary algorithms for optimizing the polarity values of opinion concepts have been proposed only recently (Ferreira et al., 2015; Onan et al., 2016, 2017). However, these works focused on learning candidate refinements of the polarity of opinion concepts without considering the context dimension associated with them. A variant of this problem is the use of polarity adaptation strategies in the field of social media and microblogs (Alahmadi and Zeng, 2015; Wang et al., 2014; Keshavarz and Abadeh, 2017; Hu et al., 2016; Fu et al., 2016; Gong et al., 2016).
With respect to the state of the art, this work represents the first exploration of evolutionary algorithms for multi-domain sentiment analysis with the aim of learning multiple dictionaries of opinion concepts. Moreover, we differ from the literature by not considering the propagation of polarity information across domains (i.e., we keep them completely separated) in order to avoid the drawbacks of transfer learning.

System Implementation
NeuroSent has been entirely developed in Java with the support of the Deeplearning4j library and is composed of the following two main phases:
• Generation of Word Vectors (Section 3.1): raw text, appropriately tokenized using the Stanford CoreNLP Toolkit, is provided as input to a two-layer neural network implementing the skip-gram approach with the aim of generating word vectors.
• Learning of the Sentiment Model (Section 3.2): word vectors are used for training a recurrent neural network with an output layer customized based on the addressed subtask. The customizations are explained in Section 4.
In the following subsections, we describe in more detail each phase by providing also the settings used for managing our data.

Generation of Word Vectors
The generation of the word vectors has been performed by applying the skip-gram algorithm to the raw natural language text extracted from the smaller version of the SNAP dataset (McAuley and Leskovec, 2013). The choice of this dataset rests on three reasons:
• the dataset contains only opinion-based documents. This way, we are able to build word embeddings describing only opinion-based contexts.
• the dataset is multi-domain. Information contained in the generated word embeddings comes from specific domains; thus, it is possible to evaluate how general the proposed approach is by testing the performance of the created model on test sets containing documents coming from the domains used for building the model or from other domains.
• the dataset is smaller than other corpora used in the literature for building other word embeddings that are currently available.
Considering only opinion-based information for generating word embeddings is one of the peculiarities of our system. While currently available embeddings are created from big corpora of general purpose texts (like news archives or Wikipedia pages), ours are generated by using a smaller corpus containing documents strongly related to the problem that the model is designed for. On the one hand, this aspect may be considered a limitation of the proposed solution, because a new model has to be trained whenever the problem changes. On the other hand, the usage of dedicated resources would lead to the construction of more effective models.
Word embeddings have been generated by the Word2Vec implementation integrated into the Deeplearning4j library. The algorithm has been set up with the following parameters: a vector size of 64, a window size of 5 for the input of the skip-gram algorithm, and a minimum word frequency of 1. We kept the minimum word frequency at 1 to avoid losing rare but important words that can occur in domain-specific documents.
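To make the role of the window size and of the minimum word frequency concrete, the following is a simplified sketch of how skip-gram training pairs could be extracted from tokenized text. It is not the Deeplearning4j implementation: the function name and the toy corpus are hypothetical, and the sketch uses a fixed symmetric window without any subsampling.

```python
from collections import Counter


def skipgram_pairs(sentences, window=5, min_count=1):
    """Return (center, context) pairs from tokenized sentences.

    window and min_count mirror the settings reported in the text (5 and 1);
    everything else is a simplifying assumption for illustration only.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    pairs = []
    for sent in sentences:
        # Drop tokens that are rarer than min_count (none when min_count is 1).
        kept = [t for t in sent if counts[t] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


corpus = [["great", "battery", "life"], ["battery", "died", "fast"]]
pairs = skipgram_pairs(corpus)
```

With a minimum frequency of 1 every token is retained, whereas a higher threshold would remove words that appear only once in this toy corpus, which is the loss of rare words that the chosen setting avoids.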

Learning of The Sentiment Model
The sentiment model is built by starting from the word embeddings generated during the previous phase.
The first step consists in converting each textual sentence contained within the dataset into the corresponding numerical matrix $S$, where each row is the word vector representing a single word of the sentence and each column is an embedding feature. Given a sentence $s$, we extract all tokens $t_i$, with $i \in [0, n]$, and we replace each $t_i$ with the corresponding embedding $w$. During the conversion of each word into its corresponding embedding, if such an embedding is not found, the word is discarded. At the end of this step, each sentence contained in the training set is converted into a matrix $S = [w_1, \ldots, w_n]$.
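The conversion step can be sketched as follows. The lookup table is represented here as a plain dictionary with toy two-dimensional vectors; the function name and the values are hypothetical, while the real system uses 64-dimensional embeddings.

```python
def sentence_to_matrix(tokens, embeddings):
    """Map each token to its embedding, discarding tokens without one."""
    return [embeddings[t] for t in tokens if t in embeddings]


# Toy lookup table (hypothetical values).
embeddings = {"love": [0.1, 0.3], "patience": [0.2, -0.4]}
S = sentence_to_matrix(["love", "my", "patience"], embeddings)
```

Here the token "my" has no embedding and is therefore dropped, so the resulting matrix has two rows instead of three.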
Before giving all matrices as input to the neural network, we need to include both padding and masking vectors in order to train our model correctly. Padding and masking allow us to support different training situations depending on the number of input vectors and on the number of predictions that the network has to provide at each time step. In our scenario, we work in a many-to-one situation where our neural network has to provide one prediction (sentence polarity and domain overlap) as the result of the analysis of many input vectors (word embeddings).
Padding vectors are required because we have to deal with sentences of different lengths. Indeed, the neural network needs to know the number of time steps that the input layer has to import. This problem is solved by including, if necessary, null word vectors into each matrix $S_k$, with $k \in [0, z]$ and $z$ the number of sentences contained in the training set; these null vectors fill the empty word slots. They are accompanied by a further vector telling the neural network whether the data contained in a specific position has to be considered as an informative embedding or not.
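A minimal sketch of padding and masking is given below. It assumes that all matrices are padded to the length of the longest sentence and that the mask uses 1 for informative positions and 0 for null vectors; the function name and both conventions are assumptions made for illustration, not details taken from the system.

```python
def pad_and_mask(matrices, dim):
    """Pad sentence matrices to a common length and build matching masks.

    matrices: list of sentence matrices (each a list of word vectors).
    dim: embedding size, used to build the null (all-zero) word vectors.
    """
    max_len = max(len(m) for m in matrices)
    padded, masks = [], []
    for m in matrices:
        n_pad = max_len - len(m)
        # Null vectors fill the empty word slots at the end of the sentence.
        padded.append(m + [[0.0] * dim for _ in range(n_pad)])
        # 1 marks an informative embedding, 0 marks a padding position.
        masks.append([1] * len(m) + [0] * n_pad)
    return padded, masks


short = [[0.1, 0.3]]
long_ = [[0.2, -0.4], [0.5, 0.5], [0.0, 0.9]]
padded, masks = pad_and_mask([short, long_], dim=2)
```

After this step every matrix has the same number of rows, and the mask lets the network ignore the positions that were added only for alignment.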
A final note concerns the backpropagation of the error. Training recurrent neural networks can be quite computationally demanding when each training instance is composed of many time steps. A possible optimization is the use of truncated backpropagation through time (BPTT), which was developed for reducing the computational complexity of each parameter update in a recurrent neural network. On the one hand, this strategy reduces the time needed for training our model. On the other hand, there is the risk that the gradients do not flow backward through the full unrolled network, which prevents the full update of all network parameters. For this reason, even though we work with recurrent neural networks, we decided not to adopt the truncated BPTT approach and to use the default backpropagation implemented in the DL4J library.
Concerning the network structure, the input layer was composed of 64 neurons (i.e. the embedding vector size), the hidden RNN layer was composed of 128 nodes, and the output layer had a different number of nodes depending on the addressed subtask. The network has been trained by using Stochastic Gradient Descent for 1000 epochs with a learning rate of 0.002.

The Tasks
SemEval 2018 Task 3 is composed of two different subtasks for the automatic detection of irony on Twitter. For the first subtask, participants have to determine whether a tweet is ironic or not by assigning a binary value, 0 or 1. For the second subtask, participants have to identify, for the ironic tweets, which of the three classes into which ironic tweets are further split each tweet belongs to.
Subtask #1: Ironic vs. Non-ironic
The first subtask is a binary classification task where the system has to predict whether a tweet is ironic or not. Examples of an ironic and a non-ironic tweet are presented below, respectively:
• I just love when you test my patience!! #not
• Had no sleep and have got school now #not happy
The output layer of our neural network is composed of a single neuron implementing the SIGMOID activation function.
Subtask #2: Different types of irony
The second subtask is a multiclass classification task where the system has to predict one out of four labels describing: i. verbal irony realized through a polarity contrast; ii. verbal irony without such a polarity contrast; iii. descriptions of situational irony; and iv. non-irony. Instances of the category Verbal irony by means of a polarity contrast contain an evaluative expression whose polarity (positive, negative) is inverted between the literal and the intended evaluation. An example of this category is the following: "I really love this year's summer; weeks and weeks of awful weather." Instead, instances of the category Verbal irony without such a polarity contrast do not show polarity contrast between the literal and the intended evaluation, but are nevertheless ironic. An example of this category is the following: "Human brains disappear every day. Some of them have never even appeared." Then, instances of the Situational irony category describe situations that fail to meet some expectations. An example is the following: "Just saw a non-smoking sign in the lobby of a tobacco company." Finally, the Non-ironic category contains instances that are clearly not ironic, or which lack context to be sure that they are ironic.
The output layer of our neural network is composed of four neurons, and the SOFTMAX strategy has been implemented for selecting the most probable class.
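The two output configurations can be sketched as follows. The decision threshold of 0.5 for the sigmoid output and the choice of the class with the highest softmax probability are assumptions; the function names and the raw scores used in the example are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a raw score into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_subtask1(raw_score, threshold=0.5):
    """Binary decision from a single output neuron (threshold is assumed)."""
    return 1 if sigmoid(raw_score) >= threshold else 0


def predict_subtask2(raw_scores):
    """Index of the most probable of the four classes."""
    probs = softmax(raw_scores)
    return max(range(len(probs)), key=probs.__getitem__)


binary_label = predict_subtask1(2.0)
multiclass_label = predict_subtask2([0.1, 2.0, -1.0, 0.3])
```

The single sigmoid neuron suits the binary decision of the first subtask, while the softmax over four neurons yields one probability per label for the second subtask.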
The NeuroSent system has been applied to both subtasks. In Section 5, we report the preliminary results obtained by NeuroSent on the training set compared with a set of baselines. The NeuroSent approach has been preliminarily evaluated by adopting the Dranziera protocol (Dragoni et al., 2016).

In-Vitro Evaluation
The validation procedure relies on a five-fold cross-validation setting in order to validate the robustness of the proposed solution. The approach has been compared with four baselines: Support Vector Machine (SVM) (Chang and Lin, 2011), Naive Bayes (NB) and Maximum Entropy (ME) (McCallum, 2002), and Convolutional Neural Network (Chaturvedi et al., 2016).
In Table 1, we provide the average Pearson correlation obtained on the five folds into which the training set has been split.
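The averaged measure can be sketched as follows. The way the folds are formed (interleaved indices) is an assumption, as the splitting strategy is not specified, and the predictions are assumed to have been produced for each fold while it was held out; all function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fold_indices(n_items, k=5):
    """Split item indices into k interleaved folds (assumed strategy)."""
    return [list(range(i, n_items, k)) for i in range(k)]


def average_pearson(gold, predicted, k=5):
    """Average the per-fold Pearson correlation over k folds."""
    scores = []
    for fold in fold_indices(len(gold), k):
        scores.append(pearson([gold[i] for i in fold],
                              [predicted[i] for i in fold]))
    return sum(scores) / len(scores)


gold = [float(i) for i in range(10)]
predicted = [2.0 * g + 1.0 for g in gold]
avg = average_pearson(gold, predicted)
```

Note that the coefficient is undefined when the gold or predicted values in a fold are constant, so a real evaluation would need to handle that case explicitly.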
The obtained results demonstrated the suitability of NeuroSent with respect to the adopted baselines. We may also observe how solutions based on neural networks obtained a significant improvement with respect to the others for both tasks.
We performed a detailed error analysis concerning the performance of NeuroSent. In general, we observed that our strategy tends to provide false negative predictions. An in-depth analysis of some incorrect predictions highlighted that the embedded representations of some positive opinion words are very close to the space region of negative opinion words. Even if we may state that the confidence about positive predictions is very high, this scenario leads to a predominantly negative classification of borderline instances.
On the one hand, a possible action for improving the effectiveness of our strategy is to increase the granularity of the embeddings (i.e. augmenting the size of the embedding vectors) in order to increase the distance between the space regions of positive and negative polarities. On the other hand, by increasing the size of the embedding vectors, the computational time for building, or updating, the model and for evaluating a single instance increases as well. Part of the future work will be the analysis of more efficient neural network architectures able to manage augmented embedding vectors without negatively affecting the efficiency of the platform.

Conclusion
In this paper, we described the NeuroSent system presented at SemEval 2018 Task 3. Our system makes use of artificial neural networks to classify tweets by polarity or for detecting emotion levels. The results obtained on the training set demonstrated that the adopted solution is promising and worthy of investigation. Therefore, future work will focus on improving the system by exploring the integration of sentiment knowledge bases (Dragoni et al., 2015a) in order to move toward a more cognitive approach.
Twitter-based recommender system to address coldstart: A genetic algorithm based trust modelling and probabilistic sentiment analysis.