Bilingual Sentiment Embeddings: Joint Projection of Sentiment Across Languages

Sentiment analysis in low-resource languages suffers from a lack of annotated corpora to estimate high-performing models. Machine translation and bilingual word embeddings provide some relief through cross-lingual sentiment approaches. However, they either require large amounts of parallel data or do not sufficiently capture sentiment information. We introduce Bilingual Sentiment Embeddings (BLSE), which jointly represent sentiment information in a source and target language. This model only requires a small bilingual lexicon, a source-language corpus annotated for sentiment, and monolingual word embeddings for each language. We perform experiments on three language combinations (Spanish, Catalan, Basque) for sentence-level cross-lingual sentiment classification and find that our model significantly outperforms state-of-the-art methods on four out of six experimental setups, as well as capturing complementary information to machine translation. Our analysis of the resulting embedding space provides evidence that it represents sentiment information in the resource-poor target language without any annotated data in that language.


Introduction
Cross-lingual approaches to sentiment analysis are motivated by the lack of training data in the vast majority of languages. Even languages spoken by several million people, such as Catalan, often have few resources available to perform sentiment analysis in specific domains. We therefore aim to harness the knowledge previously collected in resource-rich languages. Previous approaches for cross-lingual sentiment analysis typically exploit machine-translation-based methods or multilingual models. Machine translation (MT) can provide a way to transfer sentiment information from a resource-rich to a resource-poor language (Mihalcea et al., 2007; Balahur and Turchi, 2014). However, MT-based methods require large parallel corpora to train the translation system, which are often not available for under-resourced languages.
Examples of multilingual methods that have been applied to cross-lingual sentiment analysis include domain adaptation methods (Prettenhofer and Stein, 2011), delexicalization (Almeida et al., 2015), and bilingual word embeddings (Mikolov et al., 2013; Hermann and Blunsom, 2014; Artetxe et al., 2016). These approaches, however, do not incorporate enough sentiment information to perform well cross-lingually, as we will show later.
We propose a novel approach to incorporate sentiment information in a model, which does not have these disadvantages. Bilingual Sentiment Embeddings (BLSE) are embeddings that are jointly optimized to represent both (a) semantic information in the source and target languages, which are bound to each other through a small bilingual dictionary, and (b) sentiment information, which is annotated on the source language only. We only need three resources: (i) a comparatively small bilingual lexicon, (ii) an annotated sentiment corpus in the resource-rich language, and (iii) monolingual word embeddings for the two involved languages.
We show that our model outperforms previous state-of-the-art models in nearly all experimental settings across six benchmarks. In addition, we offer an in-depth analysis and demonstrate that our model is aware of sentiment. Finally, we provide a qualitative analysis of the joint bilingual sentiment space. Our implementation is publicly available at https://github.com/jbarnesspain/blse.

Related Work
Machine Translation: Early work in cross-lingual sentiment analysis found that machine translation (MT) had reached a point of maturity that enabled the transfer of sentiment across languages. Researchers translated sentiment lexicons (Mihalcea et al., 2007; Meng et al., 2012) or annotated corpora and used word alignments to project sentiment annotation and create target-language annotated corpora (Banea et al., 2008; Duh et al., 2011; Demirtas and Pechenizkiy, 2013; Balahur and Turchi, 2014).
Several approaches included a multi-view representation of the data (Banea et al., 2010; Xiao and Guo, 2012) or co-training (Wan, 2009; Demirtas and Pechenizkiy, 2013) to improve over a naive implementation of machine translation, where only the translated data is used. There are also approaches which only require parallel data (Meng et al., 2012; Zhou et al., 2016; Rasooli et al., 2017), instead of machine translation.
All of these approaches, however, require large amounts of parallel data or an existing high quality translation tool, which are not always available. A notable exception is the approach proposed by Chen et al. (2016), an adversarial deep averaging network, which trains a joint feature extractor for two languages. They minimize the difference between these features across languages by learning to fool a language discriminator, which requires no parallel data. It does, however, require large amounts of unlabeled data.
Bilingual Embedding Methods: Recently proposed bilingual embedding methods (Hermann and Blunsom, 2014; Chandar et al., 2014) offer a natural way to bridge the language gap. These particular approaches to bilingual embeddings, however, require large parallel corpora in order to build the bilingual space, which are not available for all language combinations.
An approach to create bilingual embeddings that has a less prohibitive data requirement is to create monolingual vector spaces and then learn a projection from one to the other. Mikolov et al. (2013) find that vector spaces in different languages have similar arrangements. Therefore, they propose a linear projection which consists of learning a rotation and scaling matrix. Artetxe et al. (2016, 2017) improve upon this approach by requiring the projection to be orthogonal, thereby preserving the monolingual quality of the original word vectors.
Given source embeddings $S$, target embeddings $T$, and a bilingual lexicon $L$, Artetxe et al. (2016) learn a projection matrix $W$ by minimizing the sum of squared Euclidean distances
$$\arg\min_{W} \sum_{i} \lVert S'_{i} W - T'_{i} \rVert^{2}$$
where $S' \in S$ and $T' \in T$ are the word embedding matrices for the tokens in the bilingual lexicon $L$. This is solved using the Moore-Penrose pseudoinverse $S'^{+} = (S'^{\top} S')^{-1} S'^{\top}$ as $W = S'^{+} T'$, which can be computed using SVD. We refer to this approach as ARTETXE.
Gouws and Søgaard (2015) propose a method to create a pseudo-bilingual corpus with a small task-specific bilingual lexicon, which can then be used to train bilingual embeddings (BARISTA). This approach requires a monolingual corpus in both the source and target languages and a set of translation pairs. The source and target corpora are concatenated and then every word is randomly kept or replaced by its translation with a probability of 0.5. Any kind of word embedding algorithm can be trained with this pseudo-bilingual corpus to create bilingual word embeddings.
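The corpus-mixing step of BARISTA described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Gouws and Søgaard: the helper name `make_pseudo_bilingual`, the toy sentences, and the choice to always keep words that have no lexicon entry are all hypothetical.

```python
import random

def make_pseudo_bilingual(source_sents, target_sents, src2tgt, tgt2src,
                          p_replace=0.5, seed=0):
    """Concatenate two tokenized corpora and randomly swap words for translations.

    Hypothetical helper: a word that has a lexicon entry is replaced by its
    translation with probability p_replace; all other words are kept.
    """
    rng = random.Random(seed)
    mixed = []
    for sents, lexicon in ((source_sents, src2tgt), (target_sents, tgt2src)):
        for sent in sents:
            new_sent = []
            for word in sent:
                if word in lexicon and rng.random() < p_replace:
                    new_sent.append(lexicon[word])
                else:
                    new_sent.append(word)
            mixed.append(new_sent)
    return mixed

# Toy data (assumed for illustration only).
src2tgt = {"good": "bueno", "bad": "malo"}
tgt2src = {t: s for s, t in src2tgt.items()}
english = [["the", "hotel", "was", "good"], ["bad", "service"]]
spanish = [["el", "hotel", "era", "bueno"], ["servicio", "malo"]]

corpus = make_pseudo_bilingual(english, spanish, src2tgt, tgt2src)
for sentence in corpus:
    print(" ".join(sentence))
```

The resulting list of token lists could then be passed to any word embedding trainer, which is the property the method relies on.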
These last techniques have the advantage of requiring relatively little parallel training data while taking advantage of larger amounts of monolingual data. However, they are not optimized for sentiment.
Sentiment Embeddings: Maas et al. (2011) first explored the idea of incorporating sentiment information into semantic word vectors. They proposed a topic modeling approach similar to latent Dirichlet allocation in order to collect the semantic information in their word vectors. To incorporate the sentiment information, they included a second objective whereby they maximize the probability of the sentiment label for each word in a labeled document. Tang et al. (2014) exploit distantly annotated tweets to create Twitter sentiment embeddings. To incorporate distributional information about tokens, they use a hinge loss and maximize the likelihood of a true n-gram over a corrupted n-gram. They include a second objective where they classify the polarity of the tweet given the true n-gram. While these techniques have proven useful, they are not easily transferred to a cross-lingual setting. Zhou et al. (2015) create bilingual sentiment embeddings by translating all source data to the target language and vice versa. This requires the existence of a machine translation system, which is a prohibitive assumption for many under-resourced languages, especially if it must be open and freely accessible. This motivates approaches which can use smaller amounts of parallel data to achieve similar results.

Model
In order to project not only semantic similarity and relatedness but also sentiment information to our target language, we propose a new model, namely Bilingual Sentiment Embeddings (BLSE), which jointly learns to predict sentiment and to minimize the distance between translation pairs in vector space. We detail the projection objective in Section 3.1, the sentiment objective in Section 3.2, and the full objective in Section 3.3. A sketch of the model is depicted in Figure 1.

Cross-lingual Projection
We assume that we have two precomputed vector spaces $S \in \mathbb{R}^{v \times d}$ and $T \in \mathbb{R}^{v' \times d'}$ for our source and target languages, where $v$ ($v'$) is the size of the source (target) vocabulary and $d$ ($d'$) is the dimensionality of the source (target) embeddings. We also assume that we have a bilingual lexicon $L$ of length $n$ which consists of word-to-word translation pairs $L = \{(s_1, t_1), (s_2, t_2), \ldots, (s_n, t_n)\}$ which map from source to target.
In order to create a mapping from both original vector spaces $S$ and $T$ to shared sentiment-informed bilingual spaces $z$ and $\hat{z}$, we employ two linear projection matrices, $M$ and $M'$. During training, for each translation pair in $L$, we first look up their associated vectors, project them through their associated projection matrix and finally minimize the mean squared error of the two projected vectors. This is very similar to the approach taken by Mikolov et al. (2013), but includes an additional target projection matrix.
The intuition for including this second matrix is that a single projection matrix does not support the transfer of sentiment information from the source language to the target language. Without $M'$, any signal coming from the sentiment classifier (see Section 3.2) would have no effect on the target embedding space $T$, and optimizing $M$ to predict sentiment and projection would only be detrimental to classification of the target language. We analyze this further in Section 6.3. Note that in this configuration, we do not need to update the original vector spaces, which would be problematic with such small training data.
The projection quality is ensured by minimizing the mean squared error
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (z_i - \hat{z}_i)^2$$
where $z_i = S_{s_i} \cdot M$ is the dot product of the embedding for source word $s_i$ and the source projection matrix, and $\hat{z}_i = T_{t_i} \cdot M'$ is the same for the target word $t_i$.
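To make the projection objective concrete, the sketch below computes the mean squared error between projected translation pairs in plain Python. It is an illustration only: the helper names, the toy embeddings, and the decision to sum the squared differences over the dimensions of each pair before averaging over pairs are assumptions, and the actual implementation may normalize differently.

```python
def vec_mat(vec, mat):
    """Multiply a row vector (length d) by a matrix (d x k), giving length k."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def projection_mse(S, T, M, M_prime, lexicon):
    """Average squared distance between projected source and target words.

    S, T: dicts mapping words to embeddings (toy stand-ins for the
    pretrained spaces); M is d x k and M_prime is d' x k.
    """
    total = 0.0
    for s_word, t_word in lexicon:
        z = vec_mat(S[s_word], M)              # projected source word
        z_hat = vec_mat(T[t_word], M_prime)    # projected target word
        total += sum((a - b) ** 2 for a, b in zip(z, z_hat))
    return total / len(lexicon)

# Toy spaces with d = 2 and d' = 3, projected into a shared 2-d space.
S = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
T = {"bueno": [1.0, 0.0, 0.0], "malo": [0.0, 1.0, 0.0]}
lexicon = [("good", "bueno"), ("bad", "malo")]
M = [[1.0, 0.0], [0.0, 1.0]]
M_aligned = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
M_off = [[0.5, 0.0], [0.0, 1.0], [0.0, 0.0]]

loss_aligned = projection_mse(S, T, M, M_aligned, lexicon)
loss_off = projection_mse(S, T, M, M_off, lexicon)
print(loss_aligned, loss_off)
```

Because the source and target embeddings may have different dimensionality, the two matrices have different numbers of rows but the same number of columns, which is what places both languages in one space.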

Sentiment Classification
We add a second training objective to optimize the projected source vectors to predict the sentiment of source phrases. This inevitably changes the projection characteristics of the matrix $M$, and consequently of $M'$, and encourages $M'$ to learn to predict sentiment without any training examples in the target language.
To train $M$ to predict sentiment, we require a source-language corpus $C_{source} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i)\}$ where each sentence $x_i$ is associated with a label $y_i$.
For classification, we use a two-layer feed-forward averaging network, loosely following Iyyer et al. (2015). For a sentence $x_i$ we take the word embeddings from the source embedding $S$ and average them to $a_i \in \mathbb{R}^{d}$. We then project this vector to the joint bilingual space $z_i = a_i \cdot M$. Finally, we pass $z_i$ through a softmax layer $P$ to get our prediction $\hat{y}_i = \mathrm{softmax}(z_i \cdot P)$.
To train our model to predict sentiment, we minimize the cross-entropy error of our predictions
$$H = -\sum_{i} y_i \log \hat{y}_i - (1 - y_i) \log (1 - \hat{y}_i).$$
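The forward pass of the sentiment component (average, project, softmax) and its loss can be sketched as below. This is a simplified illustration, not the released implementation: the function names and toy parameters are hypothetical, the sentence is assumed to contain at least one in-vocabulary word, and the loss is written in its categorical form, the negative log-probability of the gold class, which coincides with the binary cross-entropy when there are two classes.

```python
import math

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def vec_mat(vec, mat):
    """Multiply a row vector (length d) by a matrix (d x k)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_loss(tokens, gold_class, S, M, P):
    """Cross-entropy of one source sentence; returns (loss, probabilities)."""
    a = average([S[w] for w in tokens if w in S])   # averaged embedding
    z = vec_mat(a, M)                               # joint bilingual space
    y_hat = softmax(vec_mat(z, P))                  # class distribution
    return -math.log(y_hat[gold_class]), y_hat

# Toy parameters: class 0 = positive, class 1 = negative (assumed).
S = {"great": [1.0, 0.0], "awful": [0.0, 1.0], "room": [0.5, 0.5]}
M = [[1.0, 0.0], [0.0, 1.0]]
P = [[2.0, 0.0], [0.0, 2.0]]

loss, probs = sentence_loss(["great", "room"], 0, S, M, P)
print(round(loss, 4), [round(p, 4) for p in probs])
```

In training, the gradient of this loss would flow into both `P` and `M`, which is how the sentiment signal reshapes the shared space.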

Joint Learning
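In order to jointly train both the projection component and the sentiment component, we combine the two loss functions to optimize the parameter matrices $M$, $M'$, and $P$ by
$$J = \sum_{(x, y) \in C_{source}} \sum_{(s, t) \in L} \alpha H(x, y) + (1 - \alpha) \cdot \mathrm{MSE}(s, t)$$
where $\alpha$ is a hyperparameter that weights sentiment loss vs. projection loss.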
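The role of $\alpha$ can be illustrated with a short sketch. It assumes that the two losses are combined as a convex weighting, with $\alpha$ on the sentiment loss and $1 - \alpha$ on the projection loss, which is what the description of $\alpha$ suggests; the function name and the loss values are made up for the example.

```python
def joint_loss(sentiment_loss, projection_loss, alpha):
    """Weighted combination of the two objectives (assumed convex weighting)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * sentiment_loss + (1.0 - alpha) * projection_loss

# Made-up per-batch losses, for illustration only.
H, MSE = 0.8, 0.2
for alpha in (0.0, 0.5, 1.0):
    print(alpha, joint_loss(H, MSE, alpha))
```

At the extremes the model reduces to a pure projection objective (alpha = 0) or a pure source-language classifier (alpha = 1); intermediate values trade off the two, which is why $\alpha$ is tuned on development data.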

Target-language Classification
For inference, we classify sentences from a target-language corpus $C_{target}$. As in the training procedure, for each sentence, we take the word embeddings from the target embeddings $T$ and average them to $a_i \in \mathbb{R}^{d'}$. We then project this vector to the joint bilingual space $\hat{z}_i = a_i \cdot M'$. Finally, we pass $\hat{z}_i$ through a softmax layer $P$ to get our prediction $\hat{y}_i = \mathrm{softmax}(\hat{z}_i \cdot P)$.
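Target-language inference follows the same steps as training, but uses the target embeddings and the target projection matrix while sharing the softmax layer. The sketch below illustrates this with toy values; all names and parameters are hypothetical, and each sentence is assumed to contain at least one in-vocabulary word.

```python
import math

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def vec_mat(vec, mat):
    """Multiply a row vector (length d) by a matrix (d x k)."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_target(tokens, T, M_prime, P, labels):
    """Predict a label for one target-language sentence."""
    a = average([T[w] for w in tokens if w in T])   # averaged target embedding
    z_hat = vec_mat(a, M_prime)                     # joint bilingual space
    probs = softmax(vec_mat(z_hat, P))              # shared softmax layer
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Toy target space with d' = 3; P is the layer trained on source data.
T = {"bueno": [1.0, 0.0, 0.0], "malo": [0.0, 1.0, 0.0], "hotel": [0.0, 0.0, 1.0]}
M_prime = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
P = [[2.0, 0.0], [0.0, 2.0]]
labels = ["positive", "negative"]

pred_pos, _ = classify_target(["hotel", "bueno"], T, M_prime, P, labels)
pred_neg, _ = classify_target(["hotel", "malo"], T, M_prime, P, labels)
print(pred_pos, pred_neg)
```

No target-language labels appear anywhere in this path: the only target-specific parameter is `M_prime`, which is learned from the bilingual lexicon.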

OpeNER and MultiBooked
To evaluate our proposed model, we conduct experiments using four benchmark datasets and three bilingual combinations. We use the OpeNER English and Spanish datasets (Agerri et al., 2013) and the MultiBooked Catalan and Basque datasets (Barnes et al., 2018). All datasets contain hotel reviews which are annotated for aspect-level sentiment analysis. The labels include Strong Negative (−−), Negative (−), Positive (+), and Strong Positive (++). We map the aspect-level annotations to sentence level by taking the most common label and remove instances of mixed polarity. We also create a binary setup by combining the strong and weak classes. This gives us a total of six experiments. The details of the sentence-level datasets are summarized in Table 1. For each of the experiments, we take 70 percent of the data for training, 20 percent for testing and the remaining 10 percent are used as development data for tuning.
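The label mapping and data split described above can be sketched as follows. Several details are assumptions made for illustration: "mixed polarity" is taken to mean that a sentence carries both positive and negative aspect labels, ties between equally frequent labels are broken by first occurrence, and the split is taken in order without shuffling. The function names are hypothetical.

```python
from collections import Counter

def sentence_label(aspect_labels):
    """Most common aspect label, or None if the sentence should be dropped.

    Labels are the strings "--", "-", "+", "++". Sentences without labels
    or with both positive and negative labels are dropped (assumption).
    """
    polarities = {label[0] for label in aspect_labels}
    if len(polarities) != 1:
        return None
    return Counter(aspect_labels).most_common(1)[0][0]

def to_binary(label):
    """Merge strong and weak classes: "++" -> "+", "--" -> "-"."""
    return label[0]

def split_70_20_10(items):
    """Return (train, test, dev) using integer arithmetic for the cut points."""
    n = len(items)
    n_train = n * 7 // 10
    n_test = n * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])

print(sentence_label(["+", "+", "++"]))   # majority label within one polarity
print(sentence_label(["+", "-"]))         # mixed polarity, dropped
train, test, dev = split_70_20_10(list(range(10)))
print(len(train), len(test), len(dev))
```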

Monolingual Word Embeddings
For BLSE, ARTETXE, and MT, we require monolingual vector spaces for each of our languages. For English, we use the publicly available GoogleNews vectors. For Spanish, Catalan, and Basque, we train skip-gram embeddings using the Word2Vec toolkit with 300 dimensions, subsampling of $10^{-4}$, window of 5, negative sampling of 15 based on a 2016 Wikipedia corpus (sentence-split, tokenized with IXA pipes (Agerri et al., 2014) and lowercased). The statistics of the Wikipedia corpora are given in Table 2.

Bilingual Lexicon
For BLSE, ARTETXE, and BARISTA, we also require a bilingual lexicon. We use the sentiment lexicon from Hu and Liu (2004) (to which we refer in the following as Bing Liu) and its translation into each target language. We translate the lexicon using Google Translate and exclude multi-word expressions. This leaves a dictionary of 5700 translations in Spanish, 5271 in Catalan, and 4577 in Basque. We set aside ten percent of the translation pairs as a development set in order to check that the distances between translation pairs not seen during training are also minimized during training.

Setting
We compare BLSE (Sections 3.1-3.3) to ARTETXE (Section 2) and BARISTA (Section 2) as baselines, which have similar data requirements, and to machine translation (MT) and monolingual (MONO) upper bounds, which require more resources. For all of these comparison models (MONO, MT, ARTETXE, BARISTA), we take the average of the word embeddings in the source-language training examples and train a linear SVM. We report this instead of using the same feed-forward network as in BLSE as it is the stronger upper bound. We choose the parameter c on the target language development set and evaluate on the target language test set.
Upper Bound MONO. We set an empirical upper bound by training and testing a linear SVM on the target language data. As mentioned in Section 5.1, we train the model on the averaged embeddings from target language training data, tuning the c parameter on the development data. We test on the target language test data.
Upper Bound MT. To test the effectiveness of machine translation, we translate all of the sentiment corpora from the target language to English using the Google Translate API. Note that this approach is not considered a baseline, as we assume not to have access to high-quality machine translation for low-resource languages of interest.
Baseline ARTETXE. We compare with the approach proposed by Artetxe et al. (2016) which has shown promise on other tasks, such as word similarity. In order to learn the projection matrix $W$, we need translation pairs. We use the same word-to-word bilingual lexicon mentioned in Section 3.1. We then map the source vector space $S$ to the bilingual space $\hat{S} = SW$ and use these embeddings.
Baseline BARISTA. We also compare with the approach proposed by Gouws and Søgaard (2015). The bilingual lexicon used to create the pseudo-bilingual corpus is the same word-to-word bilingual lexicon mentioned in Section 3.1. We follow the authors' setup to create the pseudo-bilingual corpus. We create bilingual embeddings by training skip-gram embeddings using the Word2Vec toolkit on the pseudo-bilingual corpus using the same parameters from Section 4.2.
Our method: BLSE. We implement our model BLSE in PyTorch (Paszke et al., 2016) and initialize the word embeddings with the pretrained word embeddings $S$ and $T$ mentioned in Section 4.2. We use the word-to-word bilingual lexicon from Section 4.3, tune the hyperparameters $\alpha$, training epochs, and batch size on the target development set and use the best hyperparameters achieved on the development set for testing. ADAM (Kingma and Ba, 2014) is used in order to minimize the average loss of the training batches.
Table 3: Precision (P), Recall (R), and macro $F_1$ of four models trained on English and tested on Spanish (ES), Catalan (CA), and Basque (EU). The bold numbers show the best results for each metric per column and the highlighted numbers show where BLSE is better than the other projection methods, ARTETXE and BARISTA (** p < 0.01, * p < 0.05).
Ensembles. We create an ensemble of MT and each projection method (BLSE, ARTETXE, BARISTA) by training a random forest classifier on the predictions from MT and each of these approaches. This allows us to evaluate to what extent each projection model adds complementary information to the machine translation approach.

Results
In Figure 2, we report the results of all four methods. Our method substantially outperforms the other projection methods (the baselines ARTETXE and BARISTA) on four of the six experiments. It performs only slightly worse than the more resource-costly upper bounds (MT and MONO). This is especially noticeable for the binary classification task, where BLSE performs nearly as well as machine translation and significantly better than the other methods. We perform approximate randomization tests (Yeh, 2000) with 10,000 runs and highlight the results that are statistically significant (** p < 0.01, * p < 0.05) in Table 3.
In more detail, we see that MT generally performs better than the projection methods (79-69 $F_1$ on binary, 52-44 on 4-class). BLSE (75-69 on binary, 41-30 on 4-class) has the best performance of the projection methods and is comparable with MT on the binary setup, with no significant difference on binary Basque. ARTETXE (67-46 on binary, 35-21 on 4-class) and BARISTA (61-55 on binary, 40-34 on 4-class) are significantly worse than BLSE on all experiments except Catalan and Basque 4-class. On the binary experiment, ARTETXE outperforms BARISTA on Spanish (67.1 vs. 61.2) and Catalan (60.7 vs. 60.1) but suffers more than the other methods on the four-class experiments, with a maximum $F_1$ of 34.9. BARISTA is relatively stable across languages.
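For readers unfamiliar with approximate randomization, the following sketch shows one common paired variant applied to macro $F_1$. It is not the evaluation script used for Table 3: the function names, the two-sided test statistic, the add-one smoothing of the p-value, and the toy predictions are assumptions made for illustration.

```python
import random

def macro_f1(gold, pred):
    """Macro-averaged F1 over the labels that occur in the gold standard."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def approximate_randomization(gold, pred_a, pred_b, runs=10000, seed=0):
    """Estimate a p-value for the difference in macro F1 of two systems."""
    rng = random.Random(seed)
    observed = abs(macro_f1(gold, pred_a) - macro_f1(gold, pred_b))
    at_least_as_large = 0
    for _ in range(runs):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:      # swap the two systems' outputs
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(macro_f1(gold, shuffled_a) - macro_f1(gold, shuffled_b))
        if diff >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (runs + 1)

# Toy example: system A is perfect, system B is always wrong.
gold = ["+", "-"] * 20
system_a = list(gold)
system_b = ["-" if g == "+" else "+" for g in gold]
p_value = approximate_randomization(gold, system_a, system_b, runs=200)
print(p_value)
```

The test makes no distributional assumptions about the metric, which is convenient for macro $F_1$, where per-example contributions are not independent.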
ENSEMBLE performs the best, which shows that BLSE adds complementary information to MT. Finally, we note that all systems perform successively worse on Catalan and Basque. This is presumably due to the quality of the word embeddings, as well as the increased morphological complexity of Basque.

Model and Error Analysis
We analyze three aspects of our model in further detail: (i) where most mistakes originate, (ii) the effect of the bilingual lexicon, and (iii) the effect and necessity of the target-language projection matrix $M'$.

Phenomena
In order to analyze where each model struggles, we categorize the mistakes and annotate all of the test phrases with one of the following error classes: vocabulary (voc), adverbial modifiers (mod), negation (neg), external knowledge (know) or other. Table 4 shows the results.
Vocabulary: The most common way to express sentiment in hotel reviews is through the use of polar adjectives (as in "the room was great") or the mention of certain nouns that are desirable ("it had a pool"). Although this phenomenon has the largest total number of mistakes (an average of 71 per model on binary and 167 on 4-class), this is mainly due to its prevalence. MT performed the best on the test examples which according to the annotation require a correct understanding of the vocabulary (81 $F_1$ on binary / 54 $F_1$ on 4-class), with BLSE (79/48) slightly worse. ARTETXE (70/35) and BARISTA (67/41) perform significantly worse. This suggests that BLSE is better than ARTETXE and BARISTA at transferring sentiment of the most important sentiment-bearing words.
Negation: Negation is a well-studied phenomenon in sentiment analysis (Pang et al., 2002; Wiegand et al., 2010; Zhu et al., 2014; Reitan et al., 2015). Therefore, we are interested in how these four models perform on phrases that include the negation of a key element, for example "In general, this hotel isn't bad". We would like our models to recognize that the combination of two negative elements "isn't" and "bad" leads to a Positive label.
Given the simple classification strategy, all models perform relatively well on phrases with negation (all reach nearly 60 $F_1$ in the binary setting). However, while BLSE performs the best on negation in the binary setting (82.9 $F_1$), it has more problems with negation in the 4-class setting (36.9 $F_1$).
Adverbial Modifiers: Phrases that are modified by an adverb, e.g., "the food was incredibly good", are important for the four-class setup, as they often differentiate between the base and Strong labels. In the binary case, all models reach more than 55 $F_1$. In the 4-class setup, BLSE only achieves 27.2 $F_1$ compared to 46.6 or 31.3 of MT and BARISTA, respectively. Therefore, presumably, our model does not currently capture the semantics of the target adverbs well. This is likely due to the fact that it assigns too much sentiment to functional words (see Figure 6).
External Knowledge Required: These errors are difficult for any of the models to get correct. Many of these include numbers which imply positive or negative sentiment (350 meters from the beach is Positive while 3 kilometers from the beach is Negative). BLSE performs the best (63.5 $F_1$) while MT performs comparably well (62.5). BARISTA performs the worst (43.6).
Binary vs. 4-class: All of the models suffer when moving from the binary to 4-class setting; an average of 26.8 in macro $F_1$ for MT, 31.4 for ARTETXE, 22.2 for BARISTA, and 36.6 for BLSE. The two vector projection methods (ARTETXE and BLSE) suffer the most, suggesting that they are currently more apt for the binary setting.

Effect of Bilingual Lexicon
We analyze how the number of translation pairs affects our model. We train on the 4-class Spanish setup using the best hyper-parameters from the previous experiment.
Figure 4 (caption truncated): The x-axis shows training epochs. We see that BLSE is able to learn that sentiment synonyms should be close to one another in vector space and sentiment antonyms should not.
Research into projection techniques for bilingual word embeddings (Mikolov et al., 2013; Lazaridou et al., 2015; Artetxe et al., 2016) often uses a lexicon of the most frequent 8-10 thousand words in English and their translations as training data. We test this approach by taking the 10,000 word-to-word translations from the Apertium English-to-Spanish dictionary (footnote 9). We also use the Google Translate API to translate the NRC hashtag sentiment lexicon (Mohammad et al., 2013) and keep the 22,984 word-to-word translations. We perform the same experiment as above and vary the amount of training data from 0, 100, 300, 600, 1000, 3000, 6000, 10,000 up to 20,000 training pairs. Finally, we compile a small hand-translated dictionary of 200 pairs, which we then expand using target language morphological information, finally giving us 657 translation pairs (footnote 10). The macro $F_1$ score for the Bing Liu dictionary climbs constantly with the increasing translation pairs. Both the Apertium and NRC dictionaries perform worse than the translated lexicon by Bing Liu, while the expanded hand-translated dictionary is competitive, as shown in Figure 3.
While for some tasks, e.g., bilingual lexicon induction, using the most frequent words as translation pairs is an effective approach, for sentiment analysis, this does not seem to help. Using a translated sentiment lexicon, even if it is small, gives better results.
Footnote 9: http://www.meta-share.org
Footnote 10: The translation took approximately one hour. We can extrapolate that hand translating a sentiment lexicon the size of the Bing Liu lexicon would take no more than 5 hours.
Figure 5 (caption truncated): "Translation" lines show the average cosine similarity between translation pairs. The remaining lines show $F_1$ scores for the source and target language with both variants of BLSE. The modified model cannot learn to predict sentiment in the target language (red lines). This illustrates the need for the second projection matrix $M'$.

Analysis of M
The main motivation for using two projection matrices $M$ and $M'$ is to allow the original embeddings to remain stable, while the projection matrices have the flexibility to align translations and separate these into distinct sentiment subspaces. To justify this design decision empirically, we perform an experiment to evaluate the actual need for the target language projection matrix $M'$: we create a simplified version of our model without $M'$, using $M$ to project from the source to target and then $P$ to classify sentiment.
The results of this model are shown in Figure 5. The modified model does learn to predict in the source language, but not in the target language. This confirms that $M'$ is necessary to transfer sentiment in our model.

Qualitative Analyses of Joint Bilingual Sentiment Space
In order to understand how well our model transfers sentiment information to the target language, we perform two qualitative analyses. First, we collect two sets of 100 positive sentiment words and one set of 100 negative sentiment words. An effective cross-lingual sentiment classifier using embeddings should learn that two positive words should be closer in the shared bilingual space than a positive word and a negative word. We test if BLSE is able to do this by training our model and after every epoch observing the mean cosine similarity between the sentiment synonyms and sentiment antonyms after projecting to the joint space. We compare BLSE with ARTETXE and BARISTA by replacing the linear SVM classifiers with the same multi-layer classifier used in BLSE and observing the distances in the hidden layer. Figure 4 shows this similarity in both source and target language, along with the mean cosine similarity between a held-out set of translation pairs and the macro $F_1$ scores on the development set for both source and target languages for BLSE, BARISTA, and ARTETXE. From this plot, it is clear that BLSE is able to learn that sentiment synonyms should be close to one another in vector space and antonyms should have a negative cosine similarity. While the other models also learn this to some degree, jointly optimizing both sentiment and projection gives better results.
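The synonym and antonym similarity measure used in this analysis can be sketched as follows. How the word pairs are formed is not specified above, so the sketch assumes that every word in the first positive set is compared with every word in the second positive set (synonyms) and with every word in the negative set (antonyms). The function names and the toy two-dimensional vectors, which stand in for projected embeddings, are hypothetical.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine(set_a, set_b):
    """Mean cosine similarity over all cross pairs of two vector sets."""
    sims = [cosine(u, v) for u, v in product(set_a, set_b)]
    return sum(sims) / len(sims)

def synonym_antonym_similarity(pos_a, pos_b, neg):
    """Return (synonym similarity, antonym similarity) in the joint space."""
    return mean_cosine(pos_a, pos_b), mean_cosine(pos_a, neg)

# Toy projected vectors (assumed): positive words point right, negative left.
pos_a = [[1.0, 0.0], [0.9, 0.1]]
pos_b = [[1.0, 0.1]]
neg = [[-1.0, 0.0]]

syn_sim, ant_sim = synonym_antonym_similarity(pos_a, pos_b, neg)
print(round(syn_sim, 3), round(ant_sim, 3))
```

Tracking both numbers after each epoch gives curves of the kind described above: a sentiment-aware space should show high synonym similarity and low or negative antonym similarity.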
Secondly, we would like to know how well the projected vectors compare to the original space. Our hypothesis is that some relatedness and similarity information is lost during projection. Therefore, we visualize six categories of words in t-SNE (Van der Maaten and Hinton, 2008): positive sentiment words, negative sentiment words, functional words, verbs, animals, and transport.
The t-SNE plots in Figure 6 show that the positive and negative sentiment words are rather clearly separated after projection in BLSE. This indicates that we are able to incorporate sentiment information into our target language without any labeled data in the target language. However, the downside of this is that functional words and transportation words are highly correlated with positive sentiment.
Figure 6: t-SNE-based visualization of the Spanish vector space before (Original) and after projection with BLSE. There is a clear separation of positive and negative words after projection, despite the fact that we have used no labeled data in Spanish.

Conclusion
We have presented a new model, BLSE, which is able to leverage sentiment information from a resource-rich language to perform sentiment analysis on a resource-poor target language. This model requires less parallel data than MT and performs better than other state-of-the-art methods with similar data requirements, by an average of 14 percentage points in $F_1$ on binary and 4 percentage points on 4-class cross-lingual sentiment analysis. We have also performed a phenomena-driven error analysis which showed that BLSE is better than ARTETXE and BARISTA at transferring sentiment, but assigns too much sentiment to functional words. In the future, we will extend our model so that it can project multi-word phrases, as well as single words, which could help with negations and modifiers.