Improving Cross-Lingual Sentiment Analysis via Conditional Language Adversarial Nets

Sentiment analysis has come a long way for high-resource languages due to the availability of large annotated corpora, but it still suffers from a lack of training data for low-resource languages. To tackle this problem, we propose the Conditional Language Adversarial Network (CLAN), an end-to-end neural architecture for cross-lingual sentiment analysis without cross-lingual supervision. CLAN differs from prior work in that its adversarial training is conditioned on both the learned features and the sentiment prediction, which increases the discriminativity of the learned representations in the cross-lingual setting. Experimental results demonstrate that CLAN outperforms previous methods on the multilingual multi-domain Amazon review dataset. Our source code is released at https://github.com/hemanthkandula/clan.


Introduction
Recent success in sentiment analysis (Sun et al., 2019; Howard and Ruder, 2018; Brahma, 2018) is largely due to the availability of large-scale annotated datasets (Maas et al., 2011; Zhang et al., 2015; Rosenthal et al., 2017). However, such success cannot be replicated for low-resource languages because of the lack of labeled data for training machine learning models.
As it is prohibitively expensive to obtain training data for all languages of interest, cross-lingual sentiment analysis (CLSA) (Barnes et al., 2018; Zhou et al., 2016b; Xu and Wan, 2017; Wan, 2009; Demirtas and Pechenizkiy, 2013; Xiao and Guo, 2012; Zhou et al., 2016a) offers the possibility of learning sentiment classification models for a target language using only annotated data from a different source language where large annotated corpora are available. These models often rely on bilingual lexicons, pre-trained cross-lingual word embeddings, or Machine Translation to bridge the gap between the source and target languages.
CLIDSA/CLCDSA (Feng and Wan, 2019) are the first end-to-end CLSA models that require no cross-lingual supervision, which may not be available for low-resource languages.
In this paper, we propose the Conditional Language Adversarial Network (CLAN) for end-to-end CLSA. Like prior work, CLAN performs CLSA without using any cross-lingual supervision. Unlike prior work, CLAN incorporates conditional language adversarial training to learn language-invariant features by conditioning on both the learned feature representations (or features for short) and the sentiment predictions, which increases the features' discriminativity in the cross-lingual setting. Our contributions are threefold:
• We develop the Conditional Language Adversarial Network (CLAN), which is designed to learn language-invariant features that are also discriminative for sentiment classification.
• Experiments on the multilingual multi-domain Amazon review dataset (Prettenhofer and Stein, 2010) show that CLAN outperforms all previous methods on both in-domain and cross-domain CLSA tasks.
• t-SNE visualization of held-out examples shows that the learned features align well across languages, indicating that CLAN is able to learn language-invariant features.

Related Work
Cross-lingual sentiment analysis (CLSA): Several CLSA methods (Wan, 2009; Demirtas and Pechenizkiy, 2013; Xiao and Guo, 2012; Zhou et al., 2016a) rely on Machine Translation (MT) to provide supervision across languages. MT, often trained from parallel corpora, may not be available for low-resource languages. Other CLSA methods (Barnes et al., 2018; Zhou et al., 2016b; Xu and Wan, 2017) use bilingual lexicons or cross-lingual word embeddings (CLWE) to project words with similar meanings from different languages into nearby spaces, enabling the training of cross-lingual sentiment classifiers. CLWE often depends on a bilingual lexicon (Barnes et al., 2018) or on parallel or comparable corpora (Mogadala and Rettinger, 2016; Vulić and Moens, 2016). Recently, CLWE methods (Lample and ) that rely on no parallel resources have been proposed, but they require very large monolingual corpora to train. The work most closely related to ours is (Feng and Wan, 2019), which does not rely on cross-lingual resources. Unlike the language adversarial network used in (Feng and Wan, 2019), our work performs cross-lingual sentiment analysis using conditional language adversarial training, which allows the language-invariant features to be specialized for sentiment class predictions.

[Figure 1: CLAN architecture, illustrated with source language $l_s$ = English (solid line) and target language $l_t$ = French (dotted line). $x_{l_s}$ and $x_{l_t}$ are sentences in $l_s$ and $l_t$; $f_{l_s}$ and $f_{l_t}$ are the features the language model extracts from $x_{l_s}$ and $x_{l_t}$; $g_{l_s}$ and $g_{l_t}$ are the corresponding sentiment predictions. The sentiment classification loss $J^{l_s}_{senti}$ is trained only on $x_{l_s}$, for which sentiment labels are available, while the language discriminator is trained on both $x_{l_s}$ and $x_{l_t}$.]
Adversarial training for domain adaptation: Our approach draws inspiration from Domain-Adversarial Training of Neural Networks (DANN) (Ganin et al., 2016) and Conditional Adversarial Domain Adaptation (CDAN) (Long et al., 2018). DANN trains a feature generator to minimize the classification loss and a domain discriminator to distinguish which domain the input instances come from; it thereby learns domain-invariant features that deceive the domain discriminator while still predicting the correct class labels. CDAN additionally conditions the discriminator on both the extracted features and the class predictions to improve discriminativity.
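In practice, this adversarial objective is commonly implemented with the gradient reversal layer proposed by Ganin et al. (2016). Below is a minimal PyTorch sketch of that trick (ours, not the authors' released code): the forward pass is the identity, while the gradient flowing back to the feature generator is negated and scaled, so a single backward pass trains the discriminator to succeed and the generator to fool it.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed, scaled gradient for x; None for the lambd argument.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```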

Conditional Language Adversarial Networks for Sentiment Analysis
Figure 1 shows the architecture of CLAN. It has three components: a multilingual language model (LM) that extracts features from the input sentences, a sentiment classifier built atop the features extracted by the LM, and a conditional language adversarial trainer that forces the features to be language invariant. All three components are jointly optimized in a single end-to-end neural architecture, allowing CLAN to learn cross-lingual features and to capture multiplicative interactions between the features and the sentiment predictions. The resulting cross-lingual features are specialized for each sentiment class.

CLAN aims to solve the cross-lingual multi-domain sentiment analysis task. Formally, given a set of domains $D$ and a set of languages $L$, the sentiment classifier is trained on $(l_s, d_s)$, for which sentiment labels are available, and tested on $(l_t, d_t)$, for which no sentiment labels exist, where $l_s, l_t \in L$, $l_s \neq l_t$, and $d_s, d_t \in D$. CLAN works for both variants of the CLSA problem: in-domain CLSA, where $d_s = d_t$, and cross-domain CLSA, where $d_s \neq d_t$. Unlabeled text is available for each pair $(l, d)$ in which $l \in L$ and $d \in D$, and the language IDs are known.

Language Model (LM): For a sentence $x$, we compute the probability of seeing a word $w_k$ given the previous words: $p(x) = \prod_{k=1}^{|x|} P(w_k \mid w_1, \ldots, w_{k-1})$. We first pass the input words through the embedding layer of language $l$, parameterized by $\theta^l_{emb}$; the embedding of word $w_k$ is $\mathbf{w}_k$. We then pass the word embeddings through two LSTM layers, parameterized by $\theta_1$ and $\theta_2$ and shared across all languages and domains, to generate hidden states $(z_1, z_2, \ldots, z_{|x|})$ that serve as features for CLSA: $h_k = \mathrm{LSTM}(h_{k-1}, \mathbf{w}_k; \theta_1)$ and $z_k = \mathrm{LSTM}(z_{k-1}, h_k; \theta_2)$. Finally, a linear decoding layer parameterized by $\theta^l_{dec}$, followed by a softmax, predicts the next word. To summarize, the LM objective for language $l$ is
$$J^l_{lm} = \mathbb{E}_{x \sim \mathcal{L}_l}\left[-\log p(x)\right],$$
where $x \sim \mathcal{L}_l$ indicates that $x$ is sampled from text in language $l$.

Sentiment Classifier: We use a linear classifier that takes the averaged final hidden states $\frac{1}{|x|}\sum_{k=1}^{|x|} z_k$ as input features, followed by a softmax over sentiment labels. The objective is
$$J^l_{senti} = \mathbb{E}_{(x,y) \sim \mathcal{C}^l_{senti}}\left[-\log P(y \mid x)\right],$$
where $(x, y) \sim \mathcal{C}^l_{senti}$ indicates that the sentence $x$ and its label $y$ are sampled from the labeled examples in language $l$, and $\theta^l_{senti}$ denotes the parameters of the linear sentiment classifier.
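To make the shapes concrete, here is a minimal PyTorch sketch of the LM and sentiment classifier described above. The class and variable names (SharedLM, vocab_sizes) are ours, not from the released code; the 600-dimensional sizes follow Section 4, and we share one linear sentiment classifier across languages so that source-language predictions transfer to the target.

```python
import torch
import torch.nn as nn

class SharedLM(nn.Module):
    def __init__(self, vocab_sizes, emb_dim=600, hid_dim=600, n_classes=2):
        super().__init__()
        # Per-language embedding (theta^l_emb) and decoding (theta^l_dec) layers.
        self.emb = nn.ModuleDict({l: nn.Embedding(v, emb_dim)
                                  for l, v in vocab_sizes.items()})
        self.dec = nn.ModuleDict({l: nn.Linear(hid_dim, v)
                                  for l, v in vocab_sizes.items()})
        # Two LSTM layers (theta_1, theta_2) shared across languages and domains.
        self.lstm1 = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.lstm2 = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        # Linear sentiment classifier over the averaged hidden states.
        self.senti = nn.Linear(hid_dim, n_classes)

    def forward(self, x, lang):
        w = self.emb[lang](x)            # (batch, len, emb_dim)
        h, _ = self.lstm1(w)             # h_k states
        z, _ = self.lstm2(h)             # z_1 .. z_|x|, the CLSA features
        logits_lm = self.dec[lang](z)    # next-word prediction logits
        f = z.mean(dim=1)                # 1/|x| * sum_k z_k
        logits_senti = self.senti(f)     # sentiment prediction logits
        return logits_lm, f, logits_senti
```

A forward pass on a batch of token IDs thus returns the next-word logits for $J^l_{lm}$, the pooled features $f(x)$, and the sentiment logits whose softmax gives $g(x)$.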
Conditional Language Adversarial Training: To force the features to be language invariant, we adopt conditional adversarial training (Long et al., 2018): a language discriminator is trained to predict the language ID from the features by minimizing a cross-entropy loss, while the LM is trained to fool the discriminator by maximizing that loss:
$$J^l_{adv\_lang} = \mathbb{E}_{x \sim \mathcal{L}_l}\left[-\log P\!\left(l \mid f(x) \otimes g(x)\right)\right],$$
where $f(x)$, $g(x)$, and $l \in L$ are the features extracted by the LM for input sentence $x$, its sentiment prediction, and its language ID, respectively; $\theta_{emb} = \theta^1_{emb} \oplus \theta^2_{emb} \oplus \cdots \oplus \theta^{|L|}_{emb}$ denotes the parameters of all embedding layers; and $\theta_{dis\_lang}$ denotes the parameters of the language discriminator. We use the multilinear conditioning of (Long et al., 2018), conditioning the prediction of $l$ on the cross-covariance $f(x) \otimes g(x)$.
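A minimal sketch of such a discriminator, assuming the multilinear map is realized as the flattened outer product of features and softmaxed predictions fed to a small MLP (the MLP depth and width here are our assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

class LanguageDiscriminator(nn.Module):
    """Predicts the language ID from the flattened outer product f(x) (x) g(x)."""
    def __init__(self, feat_dim, n_classes, n_langs, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim * n_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_langs),
        )

    def forward(self, f, g):
        # f: (batch, feat_dim); g: (batch, n_classes) softmaxed predictions.
        # Outer product -> (batch, feat_dim, n_classes), flattened for the MLP.
        cond = torch.bmm(f.unsqueeze(2), g.unsqueeze(1)).flatten(start_dim=1)
        return self.net(cond)
```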
A key innovation is the conditional language adversarial training: the multilinear conditioning lets the model exploit multiplicative interactions between features and class predictions. These interactions capture the cross-covariance between the language-invariant features and the classifier predictions, improving discriminability.
The Full Model: Putting all components together, the final objective function is
$$J(\theta_{emb}, \theta_{lstm}, \theta_{dec}, \theta_{senti}, \theta_{dis\_lang}) = \sum_{(l,d)} J^l_{lm} + \alpha J^l_{senti} - \beta J^l_{adv\_lang},$$
where $\theta_{lstm} = \theta_1 \oplus \theta_2$ denotes the parameters of the LSTM layers, $\theta_{dec} = \theta^1_{dec} \oplus \theta^2_{dec} \oplus \cdots \oplus \theta^{|L|}_{dec}$ denotes the parameters of all decoding layers, and $\alpha$ and $\beta$ are hyperparameters controlling the relative importance of the sentiment classification and language adversarial training objectives. The parameters $\theta_{dis\_lang}$ are trained to maximize the full objective function, while all other parameters are trained to minimize it:
$$\hat{\theta}_{dis\_lang} = \arg\max_{\theta_{dis\_lang}} J, \qquad \left(\hat{\theta}_{emb}, \hat{\theta}_{lstm}, \hat{\theta}_{dec}, \hat{\theta}_{senti}\right) = \arg\min_{\theta_{emb}, \theta_{lstm}, \theta_{dec}, \theta_{senti}} J.$$
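One common way to realize this min-max in code is the gradient reversal trick sketched in Section 2: scaling the reversed gradient by $\beta$ and the sentiment loss by $\alpha$ reproduces the weighting above, and a single optimizer over all parameters then performs both the maximization and minimization steps. The sketch below reuses SharedLM, LanguageDiscriminator, and grad_reverse from the earlier sketches; the batch layout and the detach on the conditioning term are our assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def train_step(model, disc, batches, opt, alpha=0.02, beta=0.1):
    """One joint update. `batches` maps language -> (x, next_words, y, lang_id);
    y is None for languages without sentiment labels. `opt` covers model + disc."""
    opt.zero_grad()
    loss = 0.0
    for lang, (x, next_words, y, lang_id) in batches.items():
        logits_lm, f, logits_senti = model(x, lang)
        # J_lm: next-word cross-entropy on text of every (language, domain) pair.
        loss = loss + F.cross_entropy(logits_lm.flatten(0, 1), next_words.flatten())
        # alpha * J_senti: only the source language carries sentiment labels.
        if y is not None:
            loss = loss + alpha * F.cross_entropy(logits_senti, y)
        # J_adv_lang: the discriminator descends on this term, while the
        # reversed, beta-scaled gradient makes the LM ascend on it.
        g = F.softmax(logits_senti, dim=1).detach()
        lang_logits = disc(grad_reverse(f, beta), g)
        lang_target = torch.full((x.size(0),), lang_id, dtype=torch.long)
        loss = loss + F.cross_entropy(lang_logits, lang_target)
    loss.backward()
    opt.step()
    return float(loss)
```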

Experiments
Datasets: We evaluate CLAN on the Webis-CLS-10 dataset (Prettenhofer and Stein, 2010), which consists of Amazon product reviews in 4 languages and 3 domains. Following prior work, we use English as the source language and the other languages as target languages. For each language-domain pair there are 2,000 training documents, 2,000 test documents, and 9,000-50,000 unlabeled documents, depending on the pair (details are in Prettenhofer and Stein, 2010).

Implementation details: The models are implemented in PyTorch (Paszke et al., 2019) and trained on four NVIDIA 1080ti GPUs. We tokenized text using NLTK (Loper and Bird, 2002). For each language, we kept the 15,000 most frequent words in the vocabulary, since a bigger vocabulary leads to under-fitting and much longer training time. We set the word embedding size to 600 for the language model and use 300 neurons for the hidden layer of the sentiment classifier. We set α = 0.02 and β = 0.1 for all experiments. All weights of CLAN were trained end-to-end using the Adam optimizer with a learning rate of 0.03. We train for a maximum of 50,000 iterations with early stopping (training typically stops at 3,000-4,000 iterations) to avoid over-fitting.
Experiment results: We follow the experimental setting described in (Feng and Wan, 2019). Tables 1a and 1b show the accuracy of CLAN compared to prior methods on the in-domain and cross-domain CLSA tasks, respectively. We compare CLAN to the following methods: CL-SCL, BiDRL, UMM, CLDFA, CNN-BE (Ziser and Reichart, 2018), PBLM-BE (Ziser and Reichart, 2018), and A-SCL (Ziser and Reichart, 2018), which are methods that require cross-lingual supervision. As shown in Tables 1a and 1b, CLAN outperforms all prior methods in 11 out of 12 settings for cross-domain CLSA and in 8 out of 9 settings for in-domain CLSA. On average, CLAN achieves state-of-the-art performance on all language pairs for both the in-domain and cross-domain CLSA tasks.
Analysis of results: To understand what features CLAN learns to enable CLSA, we probed CLAN by visualizing, via t-SNE (Maaten and Hinton, 2008), the distribution of features extracted by the language model from held-out examples. The plots are in Figure 2. They show that the feature distributions for sentences in the source and target languages align well, indicating that CLAN is able to learn language-invariant features. To look further into what CLAN learns, we manually inspected 50 examples that CLAN classified correctly but the prior models did not. For example, in the German books domain, CLAN correctly classified "unterhaltsam und etwas lustig" ("entertaining and a little funny") as positive, and also correctly classified the following text as positive: "ein buch dass mich gefesselt hat...Dieses Buch ist absolut nichts für schwache Nerven oder Moralisten" ("a book that captivated me...this book is absolutely not for the faint of heart or moralists!"). This indicates that CLAN is able to learn better lexical, syntactic, and semantic features.
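As an illustration of this probe (a sketch, not the authors' plotting code), one can project held-out features from both languages with scikit-learn's t-SNE and visually check whether the two clouds overlap. Here feats_en and feats_fr are assumed to be (N, feat_dim) NumPy arrays of averaged hidden states from the trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_alignment(feats_en, feats_fr):
    # Embed source and target features jointly so the projection is shared.
    pts = TSNE(n_components=2, init="pca").fit_transform(
        np.vstack([feats_en, feats_fr]))
    n = len(feats_en)
    plt.scatter(pts[:n, 0], pts[:n, 1], s=5, label="English (source)")
    plt.scatter(pts[n:, 0], pts[n:, 1], s=5, label="French (target)")
    plt.legend()
    plt.show()
```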

Conclusion
We present Conditional Language Adversarial Networks (CLAN) for cross-lingual sentiment analysis and show that CLAN achieves state-of-the-art performance on the multilingual multi-domain Amazon review dataset.