Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text

Multilingual writers and speakers often alternate between two languages in a single discourse. This practice is called “code-switching”. Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is relatively readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting the scarce labeled code-switched text with plentiful synthetic labeled code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11%, 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). The improvement is also significant in hate speech detection, where we achieve a 4% improvement using only synthetic code-switched data (6% with data augmentation).


Introduction
Sentiment analysis on social media is critical for commerce and governance. Multilingual social media users often use code-switching, particularly to express emotion [28]. However, a basic requirement to train any sentiment analysis (SA) system is the availability of large sentiment-labeled corpora. These are extremely challenging to obtain [8,39,2], requiring volunteers fluent in multiple languages.
We present CSGen, a system which provides supervised SA algorithms with an unlimited supply of synthesized sentiment-tagged code-switched text, without involving human labelers of code-switched text, or any linguistic theory or grammar for code-switching. These texts can then train state-of-the-art SA algorithms which, until now, primarily worked with monolingual text.
A common scenario in code-switching is that a resource-rich source language is mixed with a resource-poor target language. Given a sentiment-labeled source corpus, we first create a parallel corpus by translating to the target language, using a standard translator. Although existing neural machine translators (NMTs) can translate a complete source sentence to a target sentence with good quality, it is difficult to translate only designated source segments in isolation because of missing context and lack of coherent semantics.
Among our key contributions is a suite of approaches to automatic segment conversion. Broadly, given a source segment selected for code-switching, we propose intuitive ways to select a corresponding segment from the target sentence, based on maximum similarity or minimum dissimilarity with the source segment, so that the segment blends naturally in the outer source context. Finally, the generated synthetic sentence is tagged with the same sentiment label as the source sentence. The source segment to replace is carefully chosen based on an observation that, apart from natural switching points dictated by syntax, there is a propensity to code-switch between highly opinionated segments.
Extensive experiments show that augmenting scarce natural labeled code-switched text with plentiful synthetic text associated with 'borrowed' source labels enriches the feature space, enhances its coverage, and improves sentiment detection accuracy, compared to using only natural text. On four natural corpora having gold sentiment tags, we demonstrate that adding synthetic text can improve accuracy by 5.11% in English-Spanish, 7.20% in English-Bengali and (1.5%, 0.97%) in English-Hindi (Twitter, Facebook). The synthetic code-switched text, even when used by itself to train SA, performs almost as well as natural text in several cases. Hate speech is an extreme emotion often expressed on social media. On an English-Hindi gold-tagged hate speech benchmark, we achieve 6% absolute F1 improvement with data augmentation, partly because synthetic text mitigates label imbalance present in scarce real text.

Related Work
Recent SA systems are trained on labeled text [33,38,13]. For European and Indian code-switched sentiment analysis, several shared tasks have been initiated [2,27,24,32,34]. Some of these involve human annotations on code-switched text. [38] have annotated the data set released for POS tagging by [35]. [13] had Hindi-English code-switched Facebook text manually annotated and developed a deep model for supervised prediction.
In a different direction, synthetic monolingual text has been created by Generative Adversarial Networks (GAN) [14,41,42,18], or Variational Autoencoders (VAE) [6]. Some of these models can be used to generate sentiment-tagged synthetic text. However, most of them are not directly suitable for generating bilingual code-mixed text, due to the unavailability of a sufficient volume of gold-tagged code-mixed text. [29] proposed a generative method using a handful of gold-tagged data, but they cannot produce sentence-level tags. Recently, [25] used linguistic constraints arising from Equivalence Constraint Theory to design a code-switching grammar that guides text synthesis. Earlier, [4] presented similar ideas, but without empirical results. In contrast, CSGen uses a data-driven combination of word alignment weights, similarity of word embeddings between source and target, and attention [1].

Generation of code-switched text
CSGen takes a sentiment-labeled source sentence s and translates it into a target language sentence t. Then it generates text with language switches on particular constituent boundaries. This involves two sub-steps: select a segment in s (§3.1), and then select text from t that can replace it (§3.2-§3.3). This generation process is sketched in Algorithm 1.

Sentiment-oriented source segment selection
In this step, our goal is to select a contiguous segment from the source sentence that could potentially be replaced by some segment in the target sentence. (Allowing non-contiguous target segments usually led to unnatural sentences.) Code switching tends to occur at constituent boundaries [30], an observation that holds even for social media texts [3]. Therefore, we apply a constituency parser to the source sentence. Specifically, we use the Stanford CoreNLP shift-reduce parser [43] to generate a parse tree. Then we select segments under non-terminals, i.e., subtrees, having certain properties, chosen using heuristics informed by patterns observed in real code-switched text.
NP and VP: We allow as candidates all subtrees rooted at NP (noun phrase) and VP (verb phrase) nonterminals, which may cover multiple words. Translating single-word spans is more likely to result in ungrammatical output [30].
SBAR: Bilingual writers often use a clause to provide a sentiment-neutral part and then switch to another language in another sentence-piece to express an opinion or vice-versa. An example is "Ramdhanu ended with tears kintu sesh ta besh onho rokom etar" (Ramdhanu ended with tears but the ending was quite different). Here the constituent "but the ending was quite different" comes under the subtree of SBAR.
Highly opinionated segments: We also include segments which have a strong opinion polarity, as detected by a (monolingual) sentiment analyzer [12]. E.g., the tweet "asimit khusi prasangsakako ke beech . . . as India won the world cup after 28 years" translates to "Unlimited happiness among fans . . . as India won the world cup after 28 years".

Algorithm 1 (surviving steps):
    ...
    a ← AttentionScore(s, t), g ← GizaScore(s, t)
    /* Source segment selection */
    P ← SentimentOrientedSegmentSelection(s)
    for each segment p_s ∈ P to replace do
        /* Target segment selection */
        q̂_1, q̂_2 ← MaxSimTargetSeg(s, p_s, t, a, g)
        q̂_3, q̂_4 ← MinDissimTargetSeg(s, p_s, t, 1−a, 1−g)
        /* Code-switched text generation */
        ...
    end for
    end for
    C ← Threshold(C) /* Retain only best replacements */

An example sentence, its parse tree, and its candidate replacement segments are shown in Figure 1. In Algorithm 1, P denotes the set of candidate replacement subtrees, which correspond to segments, and p_s ∈ P is one such candidate. For each candidate segment, we generate a code-switched version of the source sentence, as described next.
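The syntactic part of this selection step can be illustrated with a short sketch. It is not CSGen's implementation: it assumes the parse is already available as a Penn-Treebank-style bracketed string, uses a small hand-written reader in place of CoreNLP, and leaves out the opinion-polarity test. The function names and the `min_words` cut-off are hypothetical.

```python
CANDIDATE_LABELS = {"NP", "VP", "SBAR"}


def read_tree(text):
    """Parse a bracketed constituency string into (label, children) tuples.

    Leaves are plain strings; every other node is a (label, [children]) pair.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]


def candidate_segments(tree, min_words=2):
    """Return (label, start, end) for NP/VP/SBAR subtrees, end exclusive.

    min_words=2 is an assumption that mirrors the preference for
    multi-word spans; it is not a value taken from the paper.
    """
    found = []

    def walk(t, start):
        if isinstance(t, str):
            return start + 1
        end = start
        for child in t[1]:
            end = walk(child, end)
        if t[0] in CANDIDATE_LABELS and end - start >= min_words:
            found.append((t[0], start, end))
        return end

    walk(tree, 0)
    return found
```

Each returned triple gives word offsets into `leaves(tree)`, so a candidate segment can be recovered by slicing the word list with `start` and `end`.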

Target segment selection
Given a source sentence $s$, the corresponding target $t$, and one (contiguous) source segment $p_s = \{w_s^i \cdots w_s^{i+x}\}$, the goal is to identify the best possible contiguous target segment $q_t = \{w_t^j \cdots w_t^{j+y}\}$ that could be used to replace $p_s$ to create a realistic code-switched sentence. We adopt two approaches to achieve this goal: (a) selecting a target segment that has maximum similarity with $p_s$, and (b) selecting a target segment having minimum dissimilarity with $p_s$, for various definitions of similarity and dissimilarity. Below, we first describe several alignment scores and then the methods that use them. Overall, these lead to the target segments $\hat{q}_t^1, \hat{q}_t^2, \ldots$ shown in Algorithm 1, with the subscript $t$ removed for clarity.

Word alignment signals
Signals based on word alignment methods are part of the recipe in choosing the best possible q t given the sentence pair and p s .
GIZA score: The standard machine translation word alignment tool Giza++ [22] uses IBM statistical word alignment models 1-5 [10,31,7,26]. This tool incorporates principled probabilistic formulations of the IBM models and gives a correspondence score $G[w_t^i, w_s^j]$ between target and source words for a given sentence pair. This word-pair score is used as a signal to find the best $\hat{q}_t$.
NMT attention score: Given an attention-guided trained sequence-to-sequence neural machine translation (NMT) model [1,17] and sentence pair $s, t$, we use the attention score matrix $A[w_t^i, w_s^j]$ as an alignment signal.
Inverse document frequency (rarity): The inverse document frequency (IDF) of a word in a corpus signifies its importance in the sentence [36]. We define $I(w) = \sigma(a\,\mathrm{IDF}(w) - b)$ as a shifted, squashed IDF that normalizes the raw corpus-level score. Here $\sigma$ is the sigmoid function and parameters $a$ and $b$ are empirically tuned. This IDF-based signal is optionally incorporated while choosing $\hat{q}_t$.
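The squashed IDF is simple to compute once document frequencies are known. The sketch below assumes the common definition IDF(w) = log(N / df(w)) over a tokenized corpus; the paper does not state which IDF variant it uses, and the default values of `a` and `b` here are placeholders rather than the tuned ones.

```python
import math
from collections import Counter


def idf_table(documents):
    """documents: list of token lists. Returns word -> log(N / df(word))."""
    n_docs = len(documents)
    doc_freq = Counter(w for doc in documents for w in set(doc))
    return {w: math.log(n_docs / count) for w, count in doc_freq.items()}


def squashed_idf(word, idf, a=1.0, b=0.0):
    """I(w) = sigmoid(a * IDF(w) - b); a and b are tuned in practice."""
    z = a * idf[word] - b
    return 1.0 / (1.0 + math.exp(-z))
```

With this definition a word that occurs in every document has IDF 0 and, for `a=1, b=0`, a squashed value of 0.5, while rarer words move towards 1.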

Target segment with maximum similarity
Given word-pair scores derived from either Giza++ or NMT attention described in §3.2.1, we formulate two methods for identifying target segments. First, we identify the best target segment given Giza++ scores $G[\cdot,\cdot]$: for each word in $q_t$, we compute the total alignment score concentrated in $p_s$, and then multiply these per-word totals as if they were independent. Second, we use the attention score learned by the NMT system of [16] (a bidirectional LSTM model with attention). Given the attention score $A[\cdot,\cdot]$ between target and source words, we select the target segment $q_t$ whose attention is maximally concentrated in the given $p_s$.
Initial exploration of the above method revealed that the attention of a target word may spread out over several related but less appropriate source words, and accrue better overall similarity than a single more appropriate word. Here IDF can come to the rescue, the intuition being that words $w_t^i$ and $w_s^j$ with very different IDFs are less likely to align, because (barring polysemy and synonymy) rare (common) words in one language tend to translate to rare (common) words in another. This intuition is embodied in an improved, IDF-aware formulation. Informally, if a source segment contains many rare words, the target segment should also have a similar number of rare words from the target domain, and vice-versa.
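The product-of-sums scoring described for the Giza++ variant can be sketched as a search over contiguous target spans. This is only an illustration under stated assumptions: `score[i][j]` is taken to hold the alignment (or attention) score between target word i and source word j, the IDF-aware correction is omitted, and the `min_len` and `max_len` limits are hypothetical knobs that the paper does not describe.

```python
def max_similarity_segment(score, source_span, min_len=1, max_len=None):
    """Pick the contiguous target span with the largest product of per-word
    alignment mass falling inside the source segment.

    score[i][j]: score between target word i and source word j.
    source_span: (start, end) word offsets of the source segment, end exclusive.
    Returns ((start, end), value) for the best target span.
    """
    s_start, s_end = source_span
    # Total score each target word places on the chosen source segment.
    mass = [sum(row[s_start:s_end]) for row in score]
    best_span, best_value = None, float("-inf")
    for start in range(len(mass)):
        value = 1.0
        for end in range(start + 1, len(mass) + 1):
            length = end - start
            if max_len is not None and length > max_len:
                break
            value *= mass[end - 1]    # treat target words as independent
            if length >= min_len and value > best_value:
                best_span, best_value = (start, end), value
    return best_span, best_value
```

Because every factor is at most 1 for normalized scores, a plain product can only shrink as the span grows, so this sketch favours short spans unless a minimum length is imposed.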

Target segment with minimum dissimilarity
We examine an alternative method for identifying target segments that leverages the Earth Mover's Distance (EMD) [37]. [15] extended EMD to the Word Mover's Distance to measure the dissimilarity between documents by 'transporting' word vectors from one document to the vectors of the other. In the same spirit, we define a dissimilarity measure between $p_s$ and candidate target segments using EMD. We present EMD as a minimization over a fractional transportation matrix $F \in \mathbb{R}^{|q_t| \times |p_s|}$:
$$\mathrm{EMD}(q_t, p_s) = \min_{F \ge 0} \sum_{i,j} F_{i,j}\, d_{i,j} \qquad (3)$$
where $\sum_j F_{i,j} = 1/|q_t|$ and $\sum_i F_{i,j} = 1/|p_s|$, so that each target word ships mass $1/|q_t|$ and each source word receives mass $1/|p_s|$, and $d_{i,j}$ is a distance metric between a target and a source word pair, given suitable representations. Finally, we choose the target segment which is least dissimilar to a given source segment, as measured by the EMD. We compute $d_{i,j}$ in two ways, described below.
Attention-based distance: Here the distance between a target and a source word is defined from the attention score $A$ (Eq. 4).
Giza-based distance: Similarly, we can compute the distance using the Giza score $G$ (Eq. 5).
Given the two types of distances in Eq. (4)-(5) and the definition of EMD in Eq. (3), we can formulate two methods for identifying target segments, one per distance. Algorithm 1 passes these dissimilarities as $1-a$ and $1-g$.
We can also use Euclidean distance as $d_{i,j}$. However, this requires multilingual word embeddings for every word. The volume of labeled source text we can use is usually smaller than the vocabulary size, making it difficult to learn reliable word embeddings. Also, if these corpora contain informal social media text like the ones described in §4.1, then publicly available pretrained word embeddings exclude a significant percentage of their words.
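A dependency-free sketch of EMD-based selection is given below. It assumes uniform word masses on both sides and a precomputed distance matrix `dist[i][j]` between target word i and source word j (for example one minus an alignment score, in line with the `1−a` and `1−g` arguments of Algorithm 1). The transport problem is solved with a small successive-shortest-path min-cost flow; this solver choice and the `max_len` limit are illustrative assumptions, not the authors' implementation.

```python
def emd(dist):
    """EMD between n target words and m source words with uniform masses.

    Masses are scaled to integers: each target word ships m units and each
    source word receives n units, so the result is total cost / (n * m).
    """
    n, m = len(dist), len(dist[0])
    src, sink = 0, n + m + 1
    graph = [[] for _ in range(n + m + 2)]   # edge: [to, capacity, cost, rev_index]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i in range(n):
        add_edge(src, 1 + i, m, 0.0)
        for j in range(m):
            add_edge(1 + i, 1 + n + j, n * m, dist[i][j])
    for j in range(m):
        add_edge(1 + n + j, sink, n, 0.0)

    flow, total_cost = 0, 0.0
    while flow < n * m:
        # Bellman-Ford shortest path in the residual graph.
        best = [float("inf")] * len(graph)
        prev = [None] * len(graph)
        best[src] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if best[u] == float("inf"):
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and best[u] + cost < best[v] - 1e-12:
                        best[v], prev[v], changed = best[u] + cost, (u, k), True
            if not changed:
                break
        # Push as much flow as the path allows.
        push, v = n * m - flow, sink
        while v != src:
            u, k = prev[v]
            push, v = min(push, graph[u][k][1]), u
        v = sink
        while v != src:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        flow += push
        total_cost += push * best[sink]
    return total_cost / (n * m)


def min_dissimilarity_segment(dist, source_span, max_len=4):
    """Return ((start, end), emd) for the contiguous target span closest to the source segment."""
    s_start, s_end = source_span
    best_span, best_value = None, float("inf")
    for start in range(len(dist)):
        for end in range(start + 1, min(len(dist), start + max_len) + 1):
            sub = [row[s_start:s_end] for row in dist[start:end]]
            value = emd(sub)
            if value < best_value:
                best_span, best_value = (start, end), value
    return best_span, best_value
```

Unlike the product score, EMD penalizes both uncovered source words and surplus target words, which is one reason it can cope with segments that contract or expand in translation.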

Projecting target segments
Given a source sentence $s$ with designated segment $p_s$ to replace, and target sentence $t$, we have by now identified four possible target segments $\hat{q}_t^k$, where $k \in \{1, \ldots, 4\}$, as described in §3.2.2-§3.2.3. We now project the target segment to the source sentence, meaning, (a) replace the source segment with the target segment and (b) transliterate the replacement to the source script using the Google Transliteration API. This creates four possible synthetic code-switched sentences for each instance of $(s, p_s, t)$. Finally, we transfer the labels of the original monolingual corpora to the generated synthetic text corpora.
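The projection step amounts to a span substitution followed by label transfer. In the sketch below, `transliterate` is a hypothetical callable standing in for the transliteration service, whose interface is not described here; spans are word offsets with an exclusive end.

```python
def project_segment(source_tokens, source_span, target_tokens, target_span,
                    transliterate=None):
    """Replace the source span by the target span, optionally transliterated."""
    s_start, s_end = source_span
    t_start, t_end = target_span
    replacement = target_tokens[t_start:t_end]
    if transliterate is not None:
        # Hypothetical hook: map each target-script token to the source script.
        replacement = [transliterate(token) for token in replacement]
    return source_tokens[:s_start] + replacement + source_tokens[s_end:]


def make_labeled_example(source_tokens, label, source_span, target_tokens,
                         target_span, transliterate=None):
    """Build one synthetic sentence and give it the source sentence's label."""
    tokens = project_segment(source_tokens, source_span, target_tokens,
                             target_span, transliterate)
    return " ".join(tokens), label
```

Calling `make_labeled_example` once per selected target segment yields the four labeled candidates for a given source sentence and source segment.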

Best candidate via reverse translation
From these four code-switched sentences $c_1, \ldots, c_4$, we wish to keep the one that preserves most of the syntactic structure of the source sentence. Each code-switched sentence $c_k$ has an associated score as defined in §3.2. We use two empirically tuned thresholds, a lower cut-off for the similarity score of $c_1, c_2$ and an upper cut-off for the dissimilarity score of $c_3, c_4$, to improve the quality of candidates retained. These scores are not normalized and cannot be compared across different methods. Therefore, we translate each candidate back to the source language using the Google translation API. We retain the candidate whose retranslated version has the highest BLEU score [23] with respect to $s$. In case of a tie, we select the candidate with maximum word overlap with $s$.
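Candidate selection by reverse translation can be sketched as follows, assuming the back-translations have already been obtained and tokenized. The BLEU function is a simplified sentence-level variant with add-one smoothing, which is an assumption; the paper does not say which BLEU implementation or smoothing it uses. `min_bleu` is a hypothetical name for the tuned BLEU threshold.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified smoothed sentence-level BLEU against a single reference."""
    if not hypothesis:
        return 0.0
    logs = []
    for n in range(1, max_n + 1):
        ref, hyp = ngrams(reference, n), ngrams(hypothesis, n)
        total = sum(hyp.values())
        if total == 0:                # hypothesis shorter than n words
            break
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        logs.append(math.log((overlap + 1) / (total + 1)))   # add-one smoothing
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(logs) / len(logs))


def pick_best_candidate(source, candidates, retranslations, min_bleu=0.0):
    """Return the index of the best candidate, or None if it scores below min_bleu.

    candidates[k] is a code-switched token list and retranslations[k] its
    back-translation into the source language. Ties on BLEU are broken by
    the candidate's word overlap with the source sentence.
    """
    source_words = set(source)

    def key(k):
        return (sentence_bleu(source, retranslations[k]),
                len(source_words & set(candidates[k])))

    best = max(range(len(candidates)), key=key)
    return best if key(best)[0] >= min_bleu else None
```

Returning `None` when the winner falls below the threshold lets the caller drop the source sentence altogether instead of keeping a poor candidate.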

Thresholding and stratified sampling
In addition to retaining only the best among code-switched candidates $c_1, \ldots, c_4$, we discard the winner if its BLEU score is below a tuned threshold. Further, we sample source sentences such that the surviving populations of sentiment labels of the code-switched sentences match the populations in the low-resource evaluation corpus. Another tuned system parameter is the amount of synthetic text to generate to supplement the gold text.
We do not depend on any domain coherence between the source corpus used to synthesize text and the gold 'payload' corpus; this is the more realistic situation. Our expectation, therefore, is that adding some amount of synthetic text should improve sentiment prediction, but excessive amounts of off-domain synthetic text may hurt it. In our experiments we grid search the synthetic:gold ratio between 1/4 and 2 using 3-fold cross validation.
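Label-matched sampling of the surviving synthetic sentences might look like the sketch below. It assumes the label proportions of the evaluation corpus are known and that the synthetic pool is a list of (text, label) pairs; the function name and the fixed random seed are illustrative choices only.

```python
import random
from collections import defaultdict


def stratified_sample(synthetic, target_proportions, total, seed=0):
    """Draw about `total` synthetic examples whose label mix follows target_proportions.

    synthetic: list of (text, label) pairs.
    target_proportions: label -> fraction of that label in the gold corpus.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in synthetic:
        by_label[example[1]].append(example)
    sample = []
    for label, fraction in target_proportions.items():
        pool = by_label.get(label, [])
        k = min(len(pool), round(total * fraction))   # cannot draw more than exist
        sample.extend(rng.sample(pool, k))
    rng.shuffle(sample)
    return sample
```

For a given synthetic:gold ratio r from the grid, `total` would be set to r times the size of the gold training fold.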

Experiments
We demonstrate the effectiveness of augmenting gold code-switched text with synthetic code-switched text. We also measure the usefulness of synthetic text without gold text. In this section, we will first describe the data sets used to generate the synthetic text and then the resource-poor labeled code-switched text used for evaluation. Next, we will present the method used for sentiment detection, baseline performance, and finally our performance, along with a detailed comparative analysis.

Source corpora for text synthesis
We use publicly available monolingual sentiment-tagged (positive, negative or neutral) gold corpora in the source language.
Union: This is the union of the data sets mentioned above.
We picked Spanish, which is comparatively similar to English, and Hindi and Bengali, which are comparatively dissimilar to English, for our experiments. We translated these monolingual tweets to Spanish, Hindi and Bengali using the Google Translation API and used the resulting sentence pairs as a parallel corpus to train attention-based NMT models and a statistical MT model (Giza++) to learn the word alignment signals described in §3.2.1.

Preliminary qualitative analysis
Analysis of texts synthesized by the various mechanisms proposed in §3.2 shows that similarity-based methods contribute 82-85% of the best candidates and the rest come from dissimilarity-based methods. Similarity-based methods using NMT attention and Giza perform well because the segments selected for replacement often consist of nouns and entity mentions, which have a very strong alignment in the corresponding target segment. NMT attention and Giza-EMD perform well when segments contract or expand in translation.

Low-resource evaluation corpora
To evaluate the usefulness of the generated synthetic tagged sentences as a training set for sentiment analysis, we have used data sets covering three different code-switched language pairs. Each data set below was divided into 70% training, 10% validation and 20% testing folds. The training fold was (or was not) augmented with synthetic labeled text to train sentiment classifiers, which were then applied on the test fold to judge the quality of synthesis.

BN-EN (Bengali-English): This is another shared task from ICON 2017 [24] with 2499 instances.
HI-EN, Hatespeech: [5] published 4000 manually-labeled code-switched Hindi-English tweets: 1500 exhibiting hate speech and 2500 normal. We also found a significant number of abusive tweets marked hate speech. For uniformity, we merged hate speech tweets and abusive tweets.

Sentiment classifier
We adopt the sub-word-LSTM system of [13]. We prefer this over feature-based methods because (a) feature extraction for code-switched text is very difficult, and varies widely across language pairs, (b) the vocabulary is large and informal, with many tokens outside standard (full-) word embedding vocabularies, and (c) sub-word-LSTM captures semantic features via convolution and pooling.
Loss functions: If the sentiment labels {−1, 0, +1} are regarded as categorical, cross-entropy loss is standard. However, prediction errors between the extreme polarities {−1, +1} need to be penalized more than errors between {−1, 0} or {0, +1}. Hence, we use ordinal cross-entropy loss [21], which multiplies the cross-entropy loss by a weight factor proportional to the intended ordinal penalty. On the test fold, we report 0/1 accuracy and per-class micro-averaged F1 score.
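One plausible form of such an ordinal weighting is sketched below for a single example. The exact weighting of [21] is not reproduced here: the linear weight 1 + |predicted − gold| / 2 is an assumption chosen only to show how an error between the extreme polarities is penalized more than an error involving the neutral class.

```python
import math

LABELS = [-1, 0, 1]


def cross_entropy(probs, gold):
    """Categorical cross-entropy for one example; probs follows the order of LABELS."""
    return -math.log(probs[LABELS.index(gold)])


def ordinal_cross_entropy(probs, gold):
    """Cross-entropy scaled by how far the predicted label is from the gold label.

    The weight is 1 for a correct prediction, 1.5 for an adjacent-class error
    and 2 for a −1 versus +1 error (illustrative values, not those of [21]).
    """
    predicted = LABELS[max(range(len(probs)), key=probs.__getitem__)]
    weight = 1.0 + abs(predicted - gold) / (LABELS[-1] - LABELS[0])
    return weight * cross_entropy(probs, gold)
```

For a correctly classified example the weight is 1, so the two losses coincide and differ only on misclassified examples.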
Baseline and prior art: Our baseline scenario is a self-contained train-dev-test split of the gold corpus. The primary prior art is the work of [25].
Feature space coherence: Our source corpora are quite unrelated to the gold corpora. Table 1 shows that the average Euclidean distance between the feature spaces of gold training and testing texts is much lower than that between gold and synthetic texts. While this may be inescapable in a low-resource situation, the gold baseline does not pay for such decoherence, which can lead to misleading conclusions.
Training regimes: Absence of coherent tagged gold text may lead to substantial performance loss. Hence, along with demonstrating the usefulness of augmenting natural with synthetic text, we also measure the efficacy of synthetic text on its own. We train the SA classifier with three labeled corpora: (a) limited gold code-switched text, (b) gold code-switched text augmented with synthetic text and (c) only synthetic text. Then we evaluate the resulting models on the labeled gold code-switched test fold.
Comparison with [25]: They depend on finding correspondences between constituency parses of the source and target sentences. However, the common case is that a constituency parser is unavailable or ineffective for the target language, particularly for informal social media. They are thus restricted to synthesizing text from only a subset of monolingual data. Training SA with natural text augmented with their synthesized text leads to poorer accuracy, albeit by a small amount, than using CSGen. The performance is worse for target languages that are more resource-poor.

Sentiment detection accuracy
Ordinal vs. categorical loss: Table 2 shows that ordinal loss helps when the neutral label dominates. However, neither is a clear winner and the gains are small. Therefore, we use categorical loss henceforth.
Choice of monolingual corpus: Across all monolingual corpora, Election performs consistently well. Best test performance on HI-EN (TW) was obtained by synthesizing from the Mukherjee corpus. Text synthesized from Election provides the best results for HI-EN (FB) for both setups. The performance of Union is also good but not the best. Although Union offers a larger and more diverse pool of data, the Euclidean distance between the test data and some of its constituent corpora is still large.

Sentiment detection F1 score
Beyond 0/1 accuracy, Table 3 shows F1 score gains. Election yields consistently good results. For brevity, we report the F1 score gain for different sentiment classes only for Election in Table 3. Augmenting gold data with synthetic data yields a better F1 score than training only with gold-tagged data. Also, it is interesting to observe that there is a sharp drop in F1 score for the HI-EN (FB) and BN-EN data sets for gold data while training with the ordinal cross-entropy function, across all the sentiment labels. As described in §4.5, this is due to non-discriminative features. However, mixing them with synthetic data helps in achieving better results.

Table 5 (error examples):
    Text: "tomorrow I will conquer the will" forever be an amazing song not because they dedicated it to me but because my momma always jams to it.
    Text: twin brothers lost in fair reunited in adulthood amidst dramatic circumstances ei themer movie akhon ar viewers der attract kore na
        Labels: Neutral, Positive
        Translation: twin brothers lost in fair reunited in adulthood amidst dramatic circumstances, this theme does not yet attract viewers.
    Ambiguous overall meaning
    Text: elizaibq ellen quiere entrevistar julianna margulies clooney says she is tough cookie she is hard one to crack
        Labels: Negative, Neutral
        Translation: elizaibq ellen wants to interview julianna margulies clooney says she is tough cookie she is hard one to crack
    Text: hum kam se kam fight ker haaray lekin tum loog zillat ki maut maaray gaye
        Labels: Positive, Negative
        Translation: We lost at least after a fight, but you died a terrible death.

Performance of standalone synthetic data
The accuracy obtained by training on synthetic data alone is reported in Table 4. For HI-EN (TW) and EN-ES, training on synthetic data comes very close to training on gold data (lagging by 5.04% and 2.84%). However, it performs poorly for the HI-EN (FB) and BN-EN data sets. This is because there is a heavy mismatch between the generated synthetic text and the test data distribution (Table 1) in these two data sets. The Pearson correlation coefficient between the distance (between test and training set) and the relative accuracy gain is strongly negative, −0.66.
To further establish the importance of domain coherence, we report on an experiment performed with the HI-EN (FB) gold data set. This data set has texts corresponding to two different entities, namely Narendra Modi and Salman Khan. Training SA with natural text corresponding to one entity and testing on the rest leads to a steep accuracy drop from 65.37% to 52.32%.

Error analysis
We found two dominant error modes where synthetic augmentation confuses the system. Table 5 shows a few examples. The first error mode can be triggered by the presence of words of different polarities, one polarity more common than the other, and the gold label being the minority polarity. The second error mode is prevalent when the emotion is weak or mixed. Either there is no strong opinion, or there are two agents, one regarded positively and the other negatively.
Table 6 shows hate speech detection results. Training with only synthetic text after thresholding and stratified sampling outperforms training with only gold-tagged text by 4% F1, and using both gold and synthetic text gives an F1 boost of 6% beyond using gold alone. Remarkably, synthetic text alone outperforms gold text, because gold text has high class imbalance, leading to poorer prediction. Because we can create arbitrary amounts of synthetic text, we can balance the labels to achieve better prediction.
Table 6: Hate speech results (3-fold cross val.). In most cells we show performance without thresholding and stratification (within brackets, with thresholding and stratification). Green: Best performance in each column.

Conclusion
Code-mixing is an important and rapidly evolving mechanism of expression among multilingual populations on social media. Monolingual sentiment analysis techniques perform poorly on code-mixed text, partly because code-mixed text often involves resource-poor languages. Starting from sentiment-labeled text in resource-rich source languages, we propose an effective method to synthesize labeled code-mixed text without designing switching grammars. Augmenting scarce natural text with synthetic text improves sentiment detection accuracy.