Towards a Universal Sentiment Classifier in Multiple languages

Existing sentiment classifiers usually work for only one specific language, and different classification models are used for different languages. In this paper, we aim to build a universal sentiment classifier with a single classification model for multiple languages. To achieve this goal, we propose to learn multilingual sentiment-aware word embeddings simultaneously, based only on labeled reviews in English and unlabeled parallel data available in a few language pairs. Parallel data need not exist between English and every other language, because the sentiment information can be transferred into any language via pivot languages. We present the evaluation results of our universal sentiment classifier in five languages, and the results are very promising even when the parallel data between English and the target languages are not used. Furthermore, we compare the universal classifier with several cross-language sentiment classifiers that rely on direct parallel data between the source and target languages, and the results show that its performance is competitive with theirs in multiple target languages.


Introduction
Nowadays, a large amount of user-generated content (UGC), such as tweets, comments and product reviews, appears online every day. Sentiment classification on these data has become a popular research topic over the past few years (Pang et al., 2002; Blitzer et al., 2007; Agarwal et al., 2011; Liu, 2012). Distributed representations of words, or word embeddings, have been widely explored and have proved highly useful for the sentiment classification task (Xu et al., 2015; Bollegala et al., 2016; Ferreira et al., 2016).
Most existing sentiment classifiers rely on labeled training data and the data are usually language-dependent. In other words, a sentiment classifier is learned from a labeled dataset in a specific language and this sentiment classifier can be used for sentiment classification in this language. However, labeled training data for sentiment classification are not available or not easy to obtain in many languages in the world (e.g., Malaysian, Mongolian, Uighur). Without reliable labeled data, it is hard to build a sentiment classifier in these resource-poor languages.
Fortunately, a few studies have investigated the task of cross-language sentiment classification (Banea et al., 2008; Wan, 2009; Meng et al., 2012; Xiao and Guo, 2013; Gao et al., 2015; Li et al., 2017; Zhou et al., 2016a,b), which aims to make use of the labeled data in a source language (English in most cases) to build a sentiment classifier in a target language. However, cross-language sentiment classification methods rely on parallel data between the source and target languages. For a resource-poor language, parallel data between this language and the source language may not be available or may not be easy to obtain. In this circumstance, previous cross-language sentiment classification methods will fail to work.
Another shortcoming of previous cross-language sentiment classification research is that an individual cross-language sentiment classifier has to be built for each target language, even when we want to perform sentiment classification in several languages at the same time.
In this study, instead of building a sentiment classifier for each target language, we aim to build a universal sentiment classifier for multiple languages. This universal classifier learns only one sentiment classification model, which can be applied to sentiment classification in many languages.
In order to achieve this goal, we propose an approach to learn multilingual sentiment-aware word embeddings simultaneously based only on the labeled reviews in English and unlabeled parallel data available in a few language pairs. As mentioned earlier, in some resource-poor languages, there do not exist direct parallel data between these languages and the source English language. In order to address this problem, we propose a pivot-based model to transfer the sentiment information from the source language to any resource-poor language via pivot languages. Finally, a universal sentiment classifier can be built because the multilingual word embeddings are in the same semantic space.
We build three different models (Bilingual Model, Pivot-Driven Bilingual Model and Universal Multilingual Model) and compare them empirically in order to answer two questions in this paper: 1) Can pivot-based models learn bilingual sentiment-aware word embeddings effectively? 2) Can an effective universal sentiment classifier be built for multiple languages?
Without loss of generality, we present and compare the evaluation results of the models in five languages. Evaluation results show that pivot-driven bilingual models perform as well as the bilingual model using direct parallel data, which lays a solid foundation for our universal model. Moreover, our universal sentiment classifier works well in all five languages, and it achieves very promising classification results compared with several typical cross-language sentiment classification models.
The main contributions of our study are summarized as follows:
• We are the first to build a universal sentiment classifier in multiple languages by learning multilingual sentiment-aware word embeddings, a problem that cannot be addressed by previous research on cross-language sentiment classification.
• We propose pivot-based models to bridge two languages between which no parallel data exist, so that the sentiment information can be transferred to any target language.
• Evaluation results on five languages demonstrate the efficacy of our proposed pivot-based models and the universal sentiment classifier.

Our Approach
In order to build a universal sentiment classifier, we propose an approach to learn multilingual sentiment-aware word embeddings simultaneously, and then train a universal sentiment classification model in the embedding space by averaging the word embeddings in a document as the document representation. Note that in this study we use only the labeled data in English and do not make use of any labeled data in other languages, which makes the task more challenging. Formally, we aim to build a single sentiment classifier that can perform sentiment classification in many languages $\{S, T_1, T_2, \ldots, T_N\}$, where $S$ refers to English, and $T_1$ to $T_N$ refer to $N$ other languages.
In our approach, the multilingual sentiment-aware word embeddings play the key role in building the universal sentiment classifier, so the question is how to learn them. Inspired by previous studies on cross-lingual sentiment classification and bilingual word embedding learning, we can leverage the labeled data in $S$ (i.e., English) and unlabeled parallel data between $S$ and a language $T$ to learn bilingual sentiment-aware word embeddings in both English and $T$ with a bilingual model. However, such unlabeled parallel data are not always easy to obtain for all other languages. For a specific language $T$, if unlabeled parallel data between $T$ and $S$ do not exist, the bilingual model cannot be applied. To address this problem, we propose a pivot-driven bilingual model that leverages pivot languages to bridge $T$ and $S$. We choose a pivot language $P$ for which the parallel data between $P$ and $S$ and the parallel data between $P$ and $T$ are easy to obtain, and then leverage them to learn the multilingual sentiment-aware word embeddings in the three languages $P$, $T$ and $S$.
Furthermore, we can leverage more parallel data between multiple languages, some of which are parallel data between $S$ and other languages, and some of which are parallel data among the other languages, to build a universal multilingual model. The sentiment information is then directly or indirectly transferred to each language, and thus we obtain multilingual sentiment-aware word embeddings in many languages.
The bilingual model, the pivot-driven bilingual model and the universal multilingual model are described in the following sections, respectively.
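To make the idea of a single classifier over a shared embedding space concrete, the following is a minimal sketch rather than the implementation used in this paper. The 2-dimensional embedding values, the helper names (`doc_vector`, `train_logreg`, `predict`) and the toy documents are all hypothetical; the sketch only assumes that words from different languages already live in one space.

```python
import math

# Hypothetical 2-d embeddings that already live in one shared space.
# The values are made up purely for illustration.
EMB = {
    "good": [1.0, 0.2], "bad": [-1.0, 0.1], "film": [0.0, 0.5],
    "gut": [0.9, 0.3], "schlecht": [-0.9, 0.2],
}

def doc_vector(tokens, emb=EMB):
    """Average the embeddings of the known words in a document."""
    vecs = [emb[t] for t in tokens if t in emb]
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(docs, labels, epochs=200, lr=0.5):
    """Plain SGD on the logistic log-likelihood (labels are 0 or 1)."""
    dim = len(doc_vector(docs[0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for tokens, y in zip(docs, labels):
            x = doc_vector(tokens)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = y - p  # gradient of the log-likelihood w.r.t. the logit
            w = [wi + lr * g * xi for wi, xi in zip(w, x)]
            b += lr * g
    return w, b

def predict(tokens, w, b):
    x = doc_vector(tokens)
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Train on labeled English documents only.
english_docs = [["good", "film"], ["bad", "film"]]
w, b = train_logreg(english_docs, [1, 0])
```

Because the German words in this toy vocabulary sit close to their English counterparts, the classifier trained only on English can be applied unchanged to German input such as `predict(["gut", "film"], w, b)`.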

Bilingual Model
The bilingual model tries to induce bilingual word embeddings from a parallel corpus, and in the meantime make similar words from the two languages share adjacent vector representations in the same vector space.
Formally, we assume a source language $S$ with $|S|$ words and a target language $T$ with $|T|$ words. We use $s$ and $t$ to represent a word from $S$ and $T$, respectively. A bilingual parallel corpus $C$ between languages $S$ and $T$ can be divided into a corpus $C_S$ in language $S$ and a corpus $C_T$ in language $T$. We use the notation $S-T$ to indicate a parallel corpus between languages $S$ and $T$.
Previous studies have proposed several models for learning bilingual word embeddings, and we extend the well-performing BiSkip model into our Bilingual Model (BM). This model requires word alignment information, which in this study is obtained automatically from parallel sentences by using a word alignment tool. In our bilingual model, every word $s$ in language $S$ is required to predict the adjacent words of itself and of its aligned word $t$ in the target language $T$. For corpus $C_S$, equation 1 gives the monolingual constraint on itself ($C_S \to C_S$) and equation 2 gives the cross-lingual constraint on $C_T$ ($C_S \to C_T$), where $s \leftrightarrow t$ means that word $s \in C_S$ is aligned to word $t \in C_T$, and $\mathrm{adj}(s)$ and $\mathrm{adj}(t)$ denote the adjacent words of $s$ and $t$, respectively.
Similarly, for corpus $C_T$ we obtain the monolingual constraint ($C_T \to C_T$) and the cross-lingual constraint on $C_S$ ($C_T \to C_S$) as equations 3 and 4. Combining equations 1, 2, 3 and 4, each weighted by a scalar parameter, gives the objective for obtaining bilingual word embeddings from the parallel corpus; the weights are $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$.
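As a reading aid, the four constraints and their weighted combination can be sketched as follows. This is one form consistent with the description above, assuming a standard skip-gram log-likelihood in which $p(w \mid s)$ is the probability of predicting word $w$ from word $s$. The notation $\mathrm{Obj}(C_B \mid C_A)$ for the constraint $C_A \to C_B$, the exact summation form and the assignment of the $\alpha$ indices to the terms are assumptions, not necessarily the authors' formulation.

```latex
\begin{align*}
\mathrm{Obj}(C_S \mid C_S) &= \sum_{s \in C_S} \sum_{w \in \mathrm{adj}(s)} \log p(w \mid s) \\
\mathrm{Obj}(C_T \mid C_S) &= \sum_{s \in C_S} \sum_{t:\, s \leftrightarrow t} \sum_{w \in \mathrm{adj}(t)} \log p(w \mid s) \\
\mathrm{Obj}(C_T \mid C_T) &= \sum_{t \in C_T} \sum_{w \in \mathrm{adj}(t)} \log p(w \mid t) \\
\mathrm{Obj}(C_S \mid C_T) &= \sum_{t \in C_T} \sum_{s:\, s \leftrightarrow t} \sum_{w \in \mathrm{adj}(s)} \log p(w \mid t) \\
\mathrm{Obj}(C) &= \alpha_1\,\mathrm{Obj}(C_S \mid C_S) + \alpha_2\,\mathrm{Obj}(C_T \mid C_S)
                 + \alpha_3\,\mathrm{Obj}(C_T \mid C_T) + \alpha_4\,\mathrm{Obj}(C_S \mid C_T)
\end{align*}
```

Under this reading, the two cross-lingual terms are what pull aligned words from the two languages towards nearby positions in the shared vector space.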
We still have to incorporate the sentiment information into the bilingual word embeddings. Similar to previous studies, we make use of the sentiment polarity of texts as supervision in the learning process. Given a labeled sentiment corpus $C_L$, we use $S^*$ to represent a sentence in $C_L$ and $w$ to represent a word in $S^*$, and $x^T$ is the sum of the word embeddings in $S^*$. We simply adopt a logistic regression classifier to enforce the sentiment constraint, so that the bilingual word embeddings absorb the corresponding sentiment information. In the resulting objective function, $y$ is the label of the sentence $S^*$, $W$ is a weight vector and $b$ is a bias. The overall objective for inducing bilingual sentiment-aware word embeddings is to maximize the combination of the bilingual objective and this sentiment objective.
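The sentiment constraint can be illustrated with a short sketch. It assumes the usual logistic-regression log-likelihood over labeled sentences, with the sentence vector formed by summing word embeddings as described above; the function names and the toy embeddings in the usage note are hypothetical and not taken from the paper.

```python
import math

def sentence_vector(words, emb):
    """x: element-wise sum of the embeddings of the words in a sentence."""
    dim = len(next(iter(emb.values())))
    x = [0.0] * dim
    for word in words:
        # Words without an embedding contribute nothing (an assumption).
        for i, value in enumerate(emb.get(word, [0.0] * dim)):
            x[i] += value
    return x

def sentiment_log_likelihood(corpus, emb, W, b):
    """Sum over labeled sentences of y*log(p) + (1-y)*log(1-p),
    where p = sigmoid(W . x + b) and y is 0 or 1."""
    total = 0.0
    for words, y in corpus:
        x = sentence_vector(words, emb)
        z = sum(wi * xi for wi, xi in zip(W, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

For example, with `emb = {"great": [1.0, 0.0], "awful": [-1.0, 0.0]}` and the labeled corpus `[(["great"], 1), (["awful"], 0)]`, the weight vector `[2.0, 0.0]` scores higher than `[-2.0, 0.0]`. Maximizing this quantity with respect to the embeddings, and not only the classifier weights, is what makes the embeddings sentiment-aware.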

Pivot-Driven Bilingual Model
For some resource-poor target language $T$, it is quite expensive to obtain a direct parallel corpus between $T$ and the source language $S$. Without such a parallel corpus, it is not possible to apply the above bilingual model to learn bilingual sentiment-aware word embeddings. To address this problem, we propose our Pivot-Driven Bilingual Model (PDBM), which uses a pivot language to bridge $T$ and $S$. The model is inspired by Wu and Wang (2007), in which pivot languages are used for phrase-based SMT. A pivot language $P$ is chosen if the parallel corpus between $P$ and $S$ and the parallel corpus between $P$ and $T$ are easy to obtain. Given two parallel corpora, $S-P$ and $P-T$, our PDBM model learns trilingual word embeddings by putting constraints on the two corpora. Under these constraints, the pivot language $P$ can pass the sentiment information from the source language $S$ to the target language $T$. As before, we assume that the pivot language $P$ has $|P|$ words, and use $C_P$ to denote the corpus in language $P$.
We design constraints on the two parallel corpora $S-P$ and $P-T$, instead of direct constraints on $S$ and $T$. Derived from the BM model, we get three monolingual constraints, $C_S \to C_S$, $C_T \to C_T$ and $C_P \to C_P$, and four bilingual constraints, $C_S \to C_P$, $C_T \to C_P$, $C_P \to C_S$ and $C_P \to C_T$. The final objective function for learning the trilingual word embeddings combines these seven constraints, weighted by the scalar parameters $\beta_1$, $\beta_2$, ..., $\beta_7$. The objective for enforcing the sentiment constraint is the same as equation 5, so we combine the two to get the overall objective function. Through the pivot language, the sentiment information can be passed from a source language to a target language by maximizing this overall objective function.
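The structure of the pivot-driven objective can be sketched as a weighted sum of the seven constraints listed above. Here $\mathrm{Obj}(C_B \mid C_A)$ is an assumed shorthand for the constraint $C_A \to C_B$, the name $\mathrm{Obj}_{\mathrm{pivot}}$ is hypothetical, and the assignment of the $\beta$ indices to individual terms simply follows the order in which the constraints are listed; $L(C_L)$ denotes the sentiment objective on the labeled corpus.

```latex
\begin{align*}
\mathrm{Obj}_{\mathrm{pivot}}(C) ={}& \beta_1\,\mathrm{Obj}(C_S \mid C_S) + \beta_2\,\mathrm{Obj}(C_T \mid C_T) + \beta_3\,\mathrm{Obj}(C_P \mid C_P) \\
 &+ \beta_4\,\mathrm{Obj}(C_P \mid C_S) + \beta_5\,\mathrm{Obj}(C_P \mid C_T)
  + \beta_6\,\mathrm{Obj}(C_S \mid C_P) + \beta_7\,\mathrm{Obj}(C_T \mid C_P) \\
\text{overall objective: }\quad & \mathrm{Obj}_{\mathrm{pivot}}(C) + L(C_L)
\end{align*}
```

Note that no term links $C_S$ and $C_T$ directly: the labeled data shape the embeddings of $S$, the $S-P$ terms tie $P$ to $S$, and the $P-T$ terms tie $T$ to $P$.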

Universal Multilingual Model
The bilingual model and the pivot-driven bilingual model lay the foundations for building a universal multilingual model for sentiment classification in many languages. Suppose we are given a source language $S$ and a few other languages $\{T_1, T_2, \ldots, T_N\}$. If there exist parallel data between a language $T_i$ and $S$, then the bilingual sentiment-aware word embeddings can be learned by the bilingual model. If the parallel data between $T_i$ and $S$ are not available, a pivot language can be selected and the pivot-driven model can be applied to learn the trilingual sentiment-aware word embeddings. Even when a single pivot language cannot be found for $T_i$ and $S$, we can still find two or more pivot languages $\{P_1, P_2, \ldots, P_M\}$ to form a pivot chain, and the sentiment information in the source language can be passed through the pivot chain ($S - P_1 - \ldots - P_M - T_i$) to the target language.
Therefore, in this model, we make use of all parallel corpora between any pair of languages (including parallel corpora between the source language and any other language, and parallel corpora between other languages) and learn the sentiment-aware word embeddings in all the languages simultaneously. The monolingual objective in each language and the cross-lingual objective for any available parallel corpus are defined in the same way as in the above models. We sum all the objectives, denote the sum as $\mathrm{Obj}_{\mathrm{universal}}(C)$, and combine it with the sentiment constraint as follows:
$$\mathrm{Obj}_{\mathrm{universal}}(C) + L(C_L)$$
By maximizing the above objective function, the sentiment-aware word embeddings in all the languages are learned.
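Whether a pivot chain exists for a given target language depends only on which parallel corpora are available. The following sketch treats languages as nodes and parallel corpora as undirected edges, and uses breadth-first search to find a shortest chain. The function name `pivot_chain` is hypothetical, and the corpus list mirrors the language pairs used in the experiments; the model itself trains on all corpora jointly and does not need this search.

```python
from collections import deque

def pivot_chain(source, target, corpora):
    """Return a shortest chain of languages linking source to target
    through available parallel corpora, or None if no chain exists."""
    graph = {}
    for a, b in corpora:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Sort neighbours so that the result is deterministic.
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Language pairs with parallel data (illustrative list).
CORPORA = [("en", "fr"), ("fr", "de"), ("en", "zh"), ("zh", "jp"), ("zh", "fr")]
```

With these pairs, German is reached from English through French and Japanese through Chinese, while a language with no parallel corpus at all yields `None`.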

Dataset
Without loss of generality, we evaluate our models in five languages (including three western languages and two Asian languages): English (en), German (de), French (fr), Japanese (jp) and Chinese (zh). Among these languages, English is the source language with labeled training data, and we do not use any labeled data in the other languages.
In particular, we use the multilingual multi-domain Amazon review dataset provided by Prettenhofer and Stein (2010) and the NLPCC2013 dataset. The review dataset provided by Prettenhofer and Stein (2010) contains labeled data in four languages: English, German, French and Japanese, and the NLPCC2013 dataset further provides labeled data in Chinese. The reviews in each language are divided into three domains: dvd, music and books. For every language except Chinese, each domain contains a balanced training set and a balanced test set, each consisting of 1000 positive and 1000 negative reviews. For Chinese, the test set consists of 2000 positive and 2000 negative reviews. We use only the English training data as the labeled data in the experiments.
For the PDBM model, we use en-fr (∈ UN v1.0) with fr-de (∈ Eu v7) to get the case en-fr-de (fr acts as a pivot), en-zh (∈ CN-EN) with zh-jp (∈ CN-JP) to build the case en-zh-jp (zh acts as a pivot), en-zh (∈ CN-EN) with zh-fr (∈ UN v1.0) to build en-zh-fr (zh acts as a pivot), and en-fr (∈ Eu v7) with zh-fr (∈ UN v1.0) to build en-fr-zh (fr acts as a pivot). Note that any pivot language can be selected if the parallel corpora between the pivot language and other languages can be obtained, but in our experiments, we only use one pivot language in each test case to validate the feasibility of our proposed model. In practice, a popular language (such as English, Chinese) can be used as the pivot because it can act as a link between two unpopular languages.
For the UMM model, we use all the corpora used in PDBM to build a universal model. All the details can be found in Table 1.

Comparison Methods
In addition to the comparison between our models, we further compare them with popular cross-lingual (CL) sentiment classification methods.
For comparison in German, French and Japanese, we adopt a few typical CL classification methods, and the results are directly borrowed from the corresponding published papers:
MT-BOW: It is a simple model that trains a linear classifier on bag-of-words features and uses a machine translator to translate the test data into the source language (Prettenhofer and Stein, 2010).
CL-SCL: It is the cross-lingual structural correspondence learning algorithm proposed by Prettenhofer and Stein (2010), in which the features in the two languages are mapped to a unified space.
BSE: It is introduced by Tang and Wan (2014) and forces the representations of words from both the source and target languages to share the same feature space. In this way, bilingual word embeddings are learned for cross-lingual sentiment classification.
CR-RL: It is the bilingual word representation learning method of Xiao and Guo (2013). It learns different representations for words in different languages. Part of the word vector is shared among different languages and the rest is language-dependent. The document representation is calculated by averaging over all words in the document.
Bi-PV: It extends the paragraph vector model to the bilingual setting by sharing the document representation of a pair of parallel documents.
For comparison in Chinese, we adopt several typical CL classification methods:
MT-LR and MT-SVM: We use logistic regres-
BSWE: It uses the bilingual sentiment word embedding algorithm based on denoising autoencoders to learn word representations. Each document is then represented by the sentiment words and the corresponding negation words.

Settings and Preprocessing
We utilize cdec (Dyer et al., 2010) as an alignment tool to get word-level alignments, and we also use it to lowercase the characters in western languages. We use the stanford-segmenter 11 to segment Chinese words, and Mecab 12 to segment Japanese words. SnowNLP 13 is used to convert traditional Chinese characters to simplified ones. In addition, we remove all the irregular characters (e.g., ©, £, ♥) in the texts.
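The word alignments obtained in this step feed the cross-lingual constraints of the bilingual model. The sketch below shows how one aligned sentence pair could be turned into (word, context) training pairs. It assumes alignments are given as space-separated `i-j` index pairs (the Pharaoh-style format commonly emitted by alignment tools; treat the format as an assumption here), and the function names are hypothetical.

```python
def parse_alignment(line):
    """Parse tokens such as '0-0 1-2' into (source_index, target_index) pairs."""
    return [tuple(int(k) for k in tok.split("-")) for tok in line.split()]

def neighbours(tokens, i, window):
    """Words within `window` positions of index i, excluding position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def training_pairs(src, tgt, alignment, window=5):
    """Build (word, context) pairs for the two monolingual and the two
    cross-lingual constraints of one aligned sentence pair."""
    pairs = []
    # Monolingual constraints: each word predicts its own neighbours.
    for i, s in enumerate(src):
        pairs += [(s, w) for w in neighbours(src, i, window)]
    for j, t in enumerate(tgt):
        pairs += [(t, w) for w in neighbours(tgt, j, window)]
    # Cross-lingual constraints: each word predicts the neighbours
    # of the word it is aligned to in the other language.
    for i, j in alignment:
        pairs += [(src[i], w) for w in neighbours(tgt, j, window)]
        pairs += [(tgt[j], w) for w in neighbours(src, i, window)]
    return pairs
```

For the sentence pair `["good", "movie"]` and `["guter", "film"]` with alignment `"0-0 1-1"`, the pair `("good", "film")` appears among the cross-lingual examples, which is the kind of signal that draws the two vocabularies into one space.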
For all three models, we use stochastic gradient descent (SGD) for learning, with a default learning rate of 0.025, negative sampling with 30 samples, skip-gram with a context window of size 5, and a subsampling rate of 1e-4. The embedding size is set to 400, and the number of training epochs is set to 10. All the α and β parameters used in the three models are simply set to 1. The word embeddings in a document are averaged to get the document representation, and then a logistic regression classifier is adopted for sentiment classification.
11 http://nlp.stanford.edu/software/segmenter.shtml
12 http://taku910.github.io/mecab/
13 https://github.com/isnowfy/snownlp

Results
The sentiment classification results of our three models and the CL classification methods in the three domains and in the German, French and Japanese languages are presented in Table 2. The results in the Chinese language are presented in Table 3. Note that the results of the CL methods are not reported on English test sets, and we only compare our three models on English test sets in Table 3.
First and most importantly, we compare our three models. The BM model relies on the direct parallel data between the source and target languages, and it generally works slightly better than the other models, namely the PDBM model and the UMM model. The reason is that direct parallel data can be used for transferring the sentiment information from the source language to the target language directly. However, the performance achieved by the PDBM model is very close to that of the BM model in most test cases. In some cases (DE-DVD, JP-book and EN-music), the PDBM model even outperforms the BM model. Note that the PDBM model does not leverage the direct parallel data between the source and target languages, but uses a pivot language as a bridge. The results demonstrate that the pivot-driven model is very effective for learning bilingual/trilingual sentiment-aware word embeddings. The results also verify the feasibility of using pivot languages to address the problem of sentiment classification in resource-poor languages, which lays a good foundation for building a universal sentiment classifier in multiple languages. When comparing the UMM model with BM and PDBM, we find that the results of UMM are very close to those of BM and PDBM in most cases. Note that the UMM model does not use the direct parallel corpora of en-de and en-jp, but relies on pivot-based methods for bridging language gaps. We also find that the parallel corpora used by the UMM model differ in quality and genre; when they are used at the same time, they may have some negative influence on each other, and thus the learned word embeddings are not always better than those of the BM and PDBM models, which use only one or two parallel corpora. Moreover, the available parallel data in different language pairs are of various sizes (0.12M to 2.0M). Considering all these issues, the results of UMM are promising because the learned single sentiment classifier works generally well in multiple languages.
We believe that if more high-quality and balanced parallel data are used, the performance of the universal sentiment classifier will be improved.
Second, we compare our models with typical CL classification methods. In Table 2, we can see that our models outperform MT-BOW, CL-SCL and CR-RL in most test cases, and outperform BSE in the German language. Our models achieve results very close to those of the other sophisticated CL methods, including Bi-PV. In Table 3, we can see that our models generally outperform MT-LR and MT-SVM, and achieve results that are very competitive with other strong CL methods, including Bi-PV and BSWE. Most CL classification methods rely on commercial machine translation systems (e.g., Google Translate) for translating the reviews (including the training reviews, the test reviews and additional unlabeled reviews) to get parallel data. Compared with the large amount of parallel data used by commercial machine translation systems, the parallel data used by our models are very small. Although our models are simply based on word embeddings and use parallel data on a small scale, the performance they achieve is very competitive.
In Figure 1, we show the visualization of word embeddings learned by the UMM model for some example words. We can see that similar sentiment words in different languages appear near each other. The figure demonstrates that the UMM model is successful in learning sentiment-aware word embeddings in multiple languages.

Related Work
The most closely related work is cross-lingual sentiment classification, which aims to leverage the labeled sentiment data from a language with rich sentiment resources (e.g., English) to perform sentiment classification in a target language lacking sentiment resources (e.g., Japanese). Some studies tried to transfer labeled data from the source language to the target language (Banea et al., 2008; Wan, 2009; Gao et al., 2015), and some other studies tried to build a unified feature/semantic space for the two languages (Prettenhofer and Stein, 2010; Xiao and Guo, 2013; Zhou et al., 2016a,b; Li et al., 2017). In the latter case, the sentiment classifier learned in the source language can be used for sentiment classification in both languages. In particular, Wan (2009) used machine translation to translate the source language to the target language to bridge the gap and applied the co-training approach. Prettenhofer and Stein (2010) provided a CL-SCL model based on structural correspondence learning (SCL) for sentiment classification. Lu et al. (2011) explored augmenting the labeled data in both the source and target languages with additional unlabeled parallel data. Xiao and Guo (2013) learned cross-lingual discriminative word embeddings for multiple document classification tasks. Their approach is based on a carefully designed log-loss function, which aims to increase the probability of the documents with their labels. Like Lu et al. (2011), Meng et al. (2012) also proposed a cross-lingual mixture model to leverage an unlabeled parallel dataset. They intended to learn previously unseen sentiment words from the large parallel corpus.
Some studies have attempted to address multilingual sentiment classification (Deriu et al., 2017). Different from our study, they directly leverage training data in multiple languages, assuming that the training data can be obtained in each language directly or through distant supervision, and they do not consider the resource or data transfer problem at all.
Word embeddings have shown great practical value in many natural language processing tasks, such as information retrieval (Diaz et al., 2016; Zuccon et al., 2015), machine translation (Shi et al., 2016; Zhang et al., 2014), sentiment analysis (Xu et al., 2015), and so on. Bilingual word embeddings have been induced for cross-lingual NLP tasks (Vulić and Moens, 2015; Guo et al., 2014; Zou et al., 2013). In particular, the BiSkip model was proposed to induce bilingual word embeddings; it extends the monolingual skip-gram model in word2vec to a bilingual model. It adds mutual constraints between the source language and the target language, while the monolingual model only has constraints on a single language. Another approach learns bilingual sentiment word embeddings by using the sentiment information of text as supervision, based on labeled corpora and their translations. Ferreira et al. (2016) formulated a single optimization problem that combines a co-regularizer for the bilingual embeddings with a task-specific loss. However, these methods for inducing bilingual word embeddings usually rely on direct parallel corpora.

Conclusion and Future Work
In this paper, we proposed an approach to build a universal sentiment classifier in multiple languages. In particular, we proposed a pivot-based model to transfer the sentiment information from the source language to any resource-poor language via pivot languages. Evaluation results show that the pivot-based model can learn bilingual sentiment-aware word embeddings as well as the bilingual model using direct parallel data. Moreover, the universal sentiment classifier built for the five languages achieves promising results.
In future work, we will investigate using more advanced document embedding techniques (e.g., CNN, RNN) to directly model document-level sentiment information. We will also extend our model to other languages.