Sentiment Analysis on Monolingual, Multilingual and Code-Switching Twitter Corpora

We address the problem of performing polarity classification on Twitter over different languages, focusing on English and Spanish and comparing three techniques: (1) a monolingual model that knows the language in which the opinion is written, (2) a monolingual model that acts based on the decision provided by a language identification tool and (3) a multilingual model trained on a multilingual dataset that does not need any language recognition step. Results show that multilingual models are even able to outperform the monolingual models on some monolingual sets. We also introduce the first code-switching corpus with sentiment labels, showing the robustness of a multilingual approach.


Introduction
Noisy social media, such as Twitter, are especially interesting for sentiment analysis (SA) and polarity classification tasks, given the amount of data and their popularity in different countries, where users simultaneously publish opinions about the same topic in different languages (Cambria et al., 2013a; Cambria et al., 2013b). Some expressions are written in more than one language, making polarity classification harder. In this context, handling texts in different languages becomes a real need. We evaluate three machine learning models, considering Spanish (es), English (en) and their multilingual combination, English-Spanish (en-es):

1. Multilingual approach (en-es model): A model that does not need to recognise the language of the text. The en and es training and development corpora are merged to train a single en-es sentiment classifier.

2. Monolingual approach (en and es models): The ideal case where the language of the text is known and the right model is executed. Each language model is trained and tuned on a monolingual corpus.

3. Monolingual pipeline with language detection (pipe model): Given an unknown text, we first identify the language of the message through lang.py (Lui and Baldwin, 2012). The output language set was constrained to Spanish and English to make sure every tweet is classified and to guarantee a fair comparison with the rest of the approaches. The training was done in the same way as in the monolingual approach, since the language of the training texts is known; lang.py is only needed at evaluation time: the language is predicted, the corresponding monolingual classifier is called and the outputs are joined to compare them against the gold standard.
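The routing logic of the pipe model can be sketched as follows. This is a minimal illustration, not the paper's system: a toy keyword heuristic stands in for lang.py, and `clf_en`/`clf_es` are hypothetical monolingual sentiment classifiers.

```python
# Sketch of the pipe model: route each tweet to a monolingual classifier
# based on a language-identification step. The paper uses lang.py; here a
# toy keyword heuristic stands in for it, and clf_en/clf_es are
# hypothetical monolingual sentiment classifiers, not the trained models.

SPANISH_HINTS = {"que", "muy", "pero", "el", "la", "no"}

def detect_language(text):
    """Toy stand-in for lang.py, constrained to {'en', 'es'}."""
    tokens = text.lower().split()
    return "es" if any(tok in SPANISH_HINTS for tok in tokens) else "en"

def pipe_classify(text, clf_en, clf_es):
    """Predict the language, then call the corresponding monolingual model."""
    lang = detect_language(text)
    clf = clf_es if lang == "es" else clf_en
    return lang, clf(text)

# Hypothetical monolingual models, for illustration only.
clf_en = lambda t: "positive" if "love" in t.lower() else "none"
clf_es = lambda t: "positive" if "encanta" in t.lower() else "none"

print(pipe_classify("I love this song", clf_en, clf_es))       # ('en', 'positive')
print(pipe_classify("me encanta pero no se", clf_en, clf_es))  # ('es', 'positive')
```

Constraining the detector's output set to {en, es} guarantees that every tweet is routed to some classifier, as described above.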
The approaches are evaluated on: (1) an English monolingual corpus, (2) a Spanish monolingual corpus, (3) a multilingual corpus which combines the two monolingual collections and (4) a code-switching (Spanish-English) corpus, which is introduced together with this paper.

Related work
The problem of multilingual polarity classification has already been addressed from different perspectives, such as monolingual sentiment analysis in a multilingual setting (Boiy and Moens, 2009), cross-lingual sentiment analysis (Brooke et al., 2009) or multilingual sentiment analysis. Banea et al. (2010) show that including multilingual information can improve the performance of subjectivity classification in English by almost 5%. Davies and Ghahramani (2011) propose a language-independent model for sentiment analysis of Twitter messages relying only on emoticons, which outperformed a bag-of-words Naive Bayes approach. Cui et al. (2011) consider that not only emoticons, but also character and punctuation repetitions, are language-independent emotion tokens. A different way of evaluating multilingual SA systems has also been proposed: the English SemEval 2013 corpus (Nakov et al., 2013) was translated into Spanish, Italian, French and German by means of machine translation (MT) systems. The resulting datasets were revised by non-native and native speakers independently, finding that using machine-translated data achieves results similar to using native-speaker translations.

Multilingual sentiment analysis
Our goal is to compare the performance of supervised models based on bag-of-words features, as often used in SA tasks. We trained our classifiers using L2-regularised logistic regression (Fan et al., 2008).
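The training objective can be sketched as follows. The paper relies on the LIBLINEAR implementation (Fan et al., 2008); this plain-Python batch gradient descent on toy data only illustrates the bag-of-words representation and the L2-regularised logistic loss, not the actual implementation.

```python
# Minimal sketch of this family of classifiers: a bag-of-words model trained
# with L2-regularised logistic regression. Toy stand-in for LIBLINEAR.
import math
from collections import Counter

def vectorise(texts):
    """Map texts to raw term-frequency vectors over a shared vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = [[Counter(t.lower().split()).get(w, 0) for w in vocab] for t in texts]
    return rows, vocab

def train_logreg(X, y, lam=0.01, lr=0.5, epochs=300):
    """Minimise cross-entropy + (lam/2) * ||w||^2 by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0   # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1 / (1 + math.exp(-z)) - yi  # sigmoid(z) - label
            gw = [g + err * xj / n for g, xj in zip(gw, xi)]
            gb += err / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

texts = ["good great love", "love this", "bad awful", "awful hate this"]
labels = [1, 1, 0, 0]
X, vocab = vectorise(texts)
w, b = train_logreg(X, labels)
print([predict(w, b, x) for x in X])  # the toy training set is separable
```

The L2 penalty keeps the weights small, which matters for sparse, high-dimensional bag-of-words features extracted from short tweets.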

Feature Extraction
We apply Natural Language Processing (NLP) techniques for extracting linguistic features, using their total occurrence as the weighting factor (Vilares et al., 2014). Four atomic sets of features are considered:

• Words (W): Simple statistical model that counts the frequencies of words in a text.

• Lemmas (L): Each term is lemmatised to reduce sparsity, using lexicon-based methods that rely on the Ancora corpus (Taulé et al., 2008) for Spanish, and Multext (Ide and Véronis, 1994) and a set of rules for English.

• Psychometric properties (P): Emotions, psychological concepts (e.g. anger) or topics (e.g. job) that commonly appear in messages. We rely on the LIWC dictionaries (Pennebaker et al., 2001) to detect them.

• Part-of-speech tags (T): The grammatical categories were obtained using the Stanford Maximum Entropy model (Toutanova and Manning, 2000). We trained an en and an es tagger using the Google universal PoS tagset (Petrov et al., 2011) and joined the Spanish and English corpora to train a combined en-es tagger. The aim was to build a model that does not need any language detection to tag samples written in different languages, or even code-switching sentences. Table 1 shows how the three taggers work on a real code-switching sentence from Twitter, illustrating how the en-es tagger effectively tackles it. The accuracy of the en and es taggers was 98.12% and 96.03%, respectively; the multilingual tagger obtained 98.00% and 95.88% over the same monolingual test sets.
These atomic sets of features can be combined to obtain a rich linguistic model that improves performance (Section 4).
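A combination of the atomic sets can be sketched as a single frequency-weighted bag of prefixed features. Each token is assumed to arrive pre-annotated with its lemma, psychometric categories and PoS tag; the annotations below are illustrative placeholders, not actual LIWC/Ancora output.

```python
# Sketch of combining the atomic feature sets (W, L, P, T) into one bag,
# using total occurrence counts as the weighting factor.
from collections import Counter

def extract_features(tokens, sets=("W", "L", "P", "T")):
    """Count every active feature, prefixed with its feature-set name."""
    bag = Counter()
    for tok in tokens:
        if "W" in sets:
            bag["W=" + tok["word"]] += 1
        if "L" in sets:
            bag["L=" + tok["lemma"]] += 1
        if "P" in sets:
            for cat in tok["psych"]:
                bag["P=" + cat] += 1
        if "T" in sets:
            bag["T=" + tok["tag"]] += 1
    return bag

# Illustrative pre-annotated tweet (annotations are hypothetical).
tweet = [
    {"word": "loving", "lemma": "love", "psych": ["posemo"], "tag": "VERB"},
    {"word": "this", "lemma": "this", "psych": [], "tag": "DET"},
    {"word": "job", "lemma": "job", "psych": ["work"], "tag": "NOUN"},
]
print(extract_features(tweet)["L=love"], extract_features(tweet)["P=work"])
```

Prefixing each feature with its set name keeps, for instance, the word "love" and the lemma "love" as distinct dimensions when sets are combined.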

Syntactic features
Dependency parsing is defined as the process of obtaining a dependency tree for a given sentence. Let S = [s_1 s_2 ... s_{n-1} s_n] be a sentence of length n, where s_i indicates the token at the i-th position; a dependency tree is a graph of binary relations, G = {(s_j, m_jk, s_k)}, where s_j and s_k are the head and dependent tokens, and m_jk represents the syntactic relation between them. To obtain such trees, we trained an en, an es and an en-es parser (Vilares et al., 2015b) using MaltParser (Nivre et al., 2007). In order to obtain competitive results for a specific language, we relied on MaltOptimizer (Ballesteros and Nivre, 2012). The parsers were trained on the Universal Dependency Treebanks v2.0 (McDonald et al., 2013) and evaluated against the monolingual test sets. The Labeled Attachment Score (LAS) of the Spanish and English monolingual parsers was 80.54% and 88.35%, respectively. The multilingual model achieved a LAS of 78.78% and 88.65% (a significant improvement with respect to the monolingual model, using Bikel's randomised parsing evaluation comparator and p < 0.05). Figure 1 shows an example of how the en, es and en-es parsers work on a code-switching sentence.
In the next step, words, lemmas, psychometric properties and PoS tags are used to extract enriched generalised triplet features (Vilares et al., 2015a). Let (s_j, m_jk, s_k) be a triplet with s_j, s_k ∈ W, and let g : W → {W, L, P, T} be a generalisation function; a generalised triplet is defined as (g(s_j), m_jk, g(s_k)).
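The generalisation step can be sketched as follows: a dependency triplet over words is mapped to a triplet over any of the four abstraction levels. The annotation table is an illustrative placeholder, not real lexicon output.

```python
# Sketch of generalised triplet features: (s_j, m_jk, s_k) is mapped to
# (g(s_j), m_jk, g(s_k)), where g generalises a word to its word form (W),
# lemma (L), psychometric class (P) or PoS tag (T).

ANNOTATIONS = {  # word -> value at each generalisation level (illustrative)
    "loving": {"W": "loving", "L": "love", "P": "posemo", "T": "VERB"},
    "job":    {"W": "job",    "L": "job",  "P": "work",   "T": "NOUN"},
}

def generalise_triplet(triplet, g_head, g_dep):
    """Apply (possibly different) generalisation levels to head and dependent."""
    head, rel, dep = triplet
    return (ANNOTATIONS[head][g_head], rel, ANNOTATIONS[dep][g_dep])

# Generalise the head to its PoS tag and the dependent to its psychometric class.
print(generalise_triplet(("loving", "dobj", "job"), "T", "P"))  # ('VERB', 'dobj', 'work')
```

Generalising head and dependent independently is what makes the features "enriched": a single parsed triplet yields several abstracted variants, reducing sparsity.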

Experimental framework
The proposed sets of features and models are evaluated on standard monolingual corpora, taking accuracy as the reference metric. These monolingual collections are then joined to create a multilingual corpus, which helps us compare the performance of the approaches when tweets come from two different languages. An evaluation over a code-switching test set is also carried out.

Monolingual corpora
Two corpora are used to compare the performance of monolingual and multilingual models:

Multilingual corpora
These two test sets were merged to create a synthetic multilingual corpus. The aim was to compare the multilingual approach and the monolingual approach with language detection under this configuration. The unbalanced sizes of the test sets result in a higher overall performance when the majority language is correctly classified. We do not consider this a methodological problem, but rather a challenge of monitoring social networks in real environments, where the number of tweets in each language is not necessarily balanced.

Code-switching corpus
We created a polarity corpus with code-switching tweets based on the training collection (en-es) presented by Solorio et al. (2014). Each word in that corpus is labelled with its language, serving as the starting point to obtain a collection of multilingual tweets. We first filtered the tweets containing both Spanish and English words, obtaining 3,062 tweets. These were manually labelled by three annotators according to the SentiStrength strategy, a dual score (p, n) from 1 to 5, where p and n indicate the positive and the negative sentiment, respectively (Thelwall et al., 2010). Krippendorff's alpha coefficient indicated an inter-annotator agreement from 0.629 to 0.664 for negative sentiment and from 0.500 to 0.693 for positive sentiment. To obtain the final score, we applied an averaging strategy with regular rounding: if p > n the tweet is labelled as positive, if p < n it is labelled as negative, and otherwise it is labelled as none. After the transformation to this trinary scheme, we obtained a corpus where the positive class represents 31.45% of the tweets, the negative class 25.67% and the remaining 42.88% belongs to the none class.
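The aggregation of the three annotators' dual scores into a trinary label can be sketched as follows (the score values shown are illustrative, not taken from the corpus):

```python
# Sketch of the label aggregation: each annotator gives a SentiStrength-style
# dual score (p, n) in 1..5; scores are averaged with regular rounding
# (half always rounds up, unlike Python's built-in round) and the rounded
# p and n are compared to assign positive, negative or none.
import math

def std_round(x):
    """Regular rounding: .5 always rounds up."""
    return math.floor(x + 0.5)

def aggregate(annotations):
    """annotations: one (p, n) pair per annotator."""
    k = len(annotations)
    p = std_round(sum(a[0] for a in annotations) / k)
    n = std_round(sum(a[1] for a in annotations) / k)
    if p > n:
        return "positive"
    if p < n:
        return "negative"
    return "none"

print(aggregate([(4, 1), (3, 1), (4, 2)]))  # positive
print(aggregate([(1, 4), (2, 5), (1, 4)]))  # negative
print(aggregate([(1, 1), (2, 2), (1, 1)]))  # none
```

Note that ties between the rounded positive and negative strengths fall into the none class, which is consistent with it being the largest class in the resulting corpus.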
To the best of our knowledge, this is the first code-switching corpus with sentiment annotations, and it presents several challenges. It is an especially noisy corpus, where many grammatical errors occur in each tweet. There is also an overuse of subjective clauses and abbreviations (e.g. 'lol', 'lmao', . . . ) whose subjectivity was considered a controversial issue by the annotators. Finally, a predominant use of English was detected (lang.py classified 59.29% of the tweets as English). We believe this is because the Solorio et al. (2014) corpus was collected by downloading tweets from people in Texas and California.

Results on the English corpus

On the English test sets, the pipe model obtains results very close to those of the en model. This is due to the high performance of lang.py on this corpus, where only 6 tweets were misclassified as Spanish tweets. Despite this issue, the en-es classifier performs very competitively on the English monolingual test sets, and the differences with respect to the en model range from 0.2 to 1.05 percentage points. With certain sets of features consisting of triplets, the multilingual model even outperforms both monolingual models, reinforcing the validity of this approach.

Table 3: Accuracy (%) on the TASS test sets

Results on the Spanish corpus
With respect to the evaluation on the TASS 2014-1k corpus, the tendency seems to remain, as illustrated in Table 3. In general terms, the es model obtains the best results, followed by the pipe and the en-es models. In this version of the corpus, the system misclassified 17 of the manually labelled tweets, so the impact on the monolingual pipeline with language detection is also small. The results obtained on the TASS 2014 general set give us more information, since a significant number of tweets from this collection (842) were classified as English. Some of these tweets actually were short phrases in English, some presented code-switching and others were simply misclassified. Under this configuration, the multilingual model outperforms the monolingual models with most of the proposed features. This suggests that multilingual models present advantages when messages in different languages need to be analysed.
Experimental results allow us to conclude that the multilingual models proposed in this work are a competitive option when applying polarity classification to a medium where messages in different languages might appear. The results are coherent across different languages and corpora, and also robust over a number of sets of features. In this respect, the performance of contextual features was low in all cases, due to the small size of the training corpus employed; Vilares et al. (2015a) explain how this kind of feature becomes useful when the training data becomes larger.

Table 4: Accuracy (%) on the multilingual test set

Table 4 shows the performance of both the multilingual approach and the monolingual pipeline with language detection when analysing texts in different languages. On the one hand, the results show that using a multilingual model is the best option when Spanish is the majority language, probably due to a high presence of English words in Spanish tweets. On the other hand, combining monolingual models with language detection is the best-performing approach when English is the majority language. The English corpus contains only a few Spanish terms, suggesting that the advantages of having a multilingual model cannot be exploited under this configuration. The tendency remains when the atomic sets of features are combined, outperforming the monolingual approaches in most cases.

Table 5: Accuracy (%) on the code-switching set

Table 5 shows the performance of the three proposed approaches on the code-switching test set. The accuracy obtained by the proposed models on this corpus is lower than on the monolingual corpora, suggesting that analysing subjectivity on tweets with code-switching presents additional challenges. The best performance (59.34%) is obtained by the en-es model using lemmas and psychometric properties as features. In general terms, atomic sets of features such as words, psychometric properties or lemmas, and their combinations, perform competitively under the en-es configuration.

Results on the code-switching corpus
The pipeline model performs worse than the multilingual one on the code-switching test set for most of the sets of features. These results, together with those obtained on the monolingual corpora, indicate that a multilingual approach like the one proposed in this article is more robust in environments containing code-switching tweets and tweets in different languages. The es model performs poorly, probably due to the smaller presence of Spanish words in the corpus; the annotators also noticed that the Spanish terms present a larger frequency of grammatical errors than the English ones. Surprisingly, the en model performed well in many cases. We hypothesise that this is due to the higher presence of English phrases, which made it possible to extract the sentiment of the texts in many cases.

Conclusions
We compared different machine learning approaches to multilingual polarity classification in three different environments: (1) where monolingual tweets are evaluated separately, (2) where texts in different languages need to be analysed and (3) where code-switching texts appear. The proposed approaches were: (a) a purely monolingual model, (b) a simple pipeline which uses language identification techniques to determine the language of unseen texts and (c) a multilingual model trained on a corpus that joins the two monolingual corpora. Experimental results reinforce the robustness of the multilingual approach under the three configurations.