Subsentential Sentiment on a Shoestring: A Crosslingual Analysis of Compositional Classification

Sentiment analysis has undergone a shift from document-level analysis, where labels express the sentiment of a whole document or whole sentence, to subsentential approaches, which assess the contribution of individual phrases, in particular the composition of sentiment terms with phrases such as negators and intensifiers. Starting from a small sentiment treebank modeled after the Stanford Sentiment Treebank of Socher et al. (2013), we investigate suitable methods to perform compositional sentiment classification for German in a data-scarce setting, harnessing cross-lingual methods as well as existing general-domain lexical resources.


Introduction
In sentiment classification, we find a general tendency from document-level classification towards more fine-grained approaches that yield a more detailed appraisal of the judgement performed in the text, in particular by using composition over syntactic structure to obtain sentiment judgements for individual phrases.
For English movie reviews, work using the Stanford Sentiment Treebank (SSTb) has shown that such subsentential sentiment information can yield approaches with both very high accuracy (Socher et al., 2013; Dong et al., 2014; Hall et al., 2014) and precise information about the role of each phrase, information which can subsequently be used for extracting or summarizing the sentiment expressed in the text.
The effort for creating a sentiment treebank such as the SSTb, however, seems prohibitive if we wanted to create such a resource for each pair of relevant domain and language: Compared to document-level annotations for sentiment, which are easy to come by (e.g., star ratings), annotating individual syntactic phrases requires considerable effort.
The main focus of this paper is the question of whether and how it is possible to reach sensible performance for compositional sentiment classification when we only have limited resources to spend on an in-language, in-domain sentiment treebank. For this goal, we use a new resource, the Heidelberg Sentiment Treebank (HeiST), which is a German-language counterpart to the Stanford Sentiment Treebank in the sense that it makes explicit the composition of sentiment expression over syntactic phrases. Our experiments on HeiST provide a direct comparison of different techniques for harnessing cross-lingual, cross-domain, or cross-task information, and are the first of this kind to specifically target compositional sentiment analysis. Figure 1 shows a schematic overview of the experiments: beyond supervised baseline experiments using SVM classification and a supervised RNTN model (section 3), we evaluated crosslingual projection (section 4), lexicon-based approaches (section 5), as well as semi-supervised approaches based on word clusters (section 6).

Related Work
The starting point for our research is the idea that the sentiment of larger stretches of text can be calculated through composition over smaller stretches of text, which was investigated in a learning framework by both Yessenalina and Cardie (2011) and Socher et al. (2011, 2012), both of which learn sentiment compositionally. Wang and Manning (2012) later demonstrated that unigram and bigram features in an SVM-based classification framework can reach a greater accuracy than the earlier recursive neural network approach of Socher et al. (2011, 2012), which calls into question the assumption that sentiment composition can be learned purely from sentence-level annotations.
Compositionality through Tensors In subsequent work, Socher et al. (2013) introduce the Stanford Sentiment Treebank, which contains detailed annotations of sentiment values for individual syntactic phrases in a binarized tree, and an approach based on recursive neural tensor networks (RNTN) which yields significant improvements over the earlier approaches using token-level features.
The RNTN represents the contribution of individual nodes as real-valued vectors and achieves its precision by using a tensor $V^{[1:d]} \in \mathbb{R}^{2d \times 2d \times d}$ as well as a matrix $W \in \mathbb{R}^{2d \times d}$ to capture second-order dependencies between the two children of a node in the tree (with vectors $a$, $b$): it first computes a vector $h$ from the two child vectors and then applies a monotonic nonlinear function (here: tanh) to $h$ to yield the vector for this node. The sentiment label of a node is then obtained by multiplying this hidden vector by a matrix $W_s$, yielding a five-dimensional vector with the classification. Using hidden vectors for each node and capturing second-order interaction between the two child vectors $a$ and $b$, the RNTN model achieves descriptive power greater than that of TreeCRFs (Nakagawa et al., 2010), and similar to latent-variable models that have been very successful in syntactic parsing (Petrov et al., 2006). In later work, Zhu et al. (2014) show that the RNTN's lexicalized modeling of negators and their behaviour leads to increased descriptive power of the model, which results in an improved treatment of negation. Dong et al. (2014) introduce an approach that chooses between multiple composition tensors (AdaMC-RNTN), which yields further gains with respect to RNTN performance.
In contrast to the lexicalized and high-dimensional RNTN model, there are several lines of work that attempt to work in a more data-scarce setting.
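The composition step described in this paragraph can be written out as a sketch following the formulation of Socher et al. (2013). The transposition of $W$ is an assumption made here so that the product matches the shape $W \in \mathbb{R}^{2d \times d}$ given above; the omission of bias terms, the symbols $p$ (node vector) and $y$ (label distribution), and the softmax normalization of the five-dimensional output are likewise assumptions rather than details stated in this paper.

```latex
\begin{align}
h_k &= \begin{bmatrix} a \\ b \end{bmatrix}^{\top} V^{[k]} \begin{bmatrix} a \\ b \end{bmatrix}
      + \left( W^{\top} \begin{bmatrix} a \\ b \end{bmatrix} \right)_k ,
      \qquad k = 1, \dots, d \\
p &= \tanh(h) \\
y &= \operatorname{softmax}(W_s \, p)
\end{align}
```

Here $V^{[k]} \in \mathbb{R}^{2d \times 2d}$ denotes the $k$-th slice of the tensor, so each component of $h$ has its own bilinear form over the concatenated child vectors.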
Lexicon-based approaches The classical approach for performing sentiment classification in a setting where training data is sparse can be seen in the SO-CAL approach of Taboada et al. (2011): Using a manually curated dictionary with sentiment values for multiple parts of speech, and a set of heuristics that predict how intensifiers, negators/shifters as well as nonveridical moods affect the sentiment of a phrase, they show that it is possible to reach good results across different domains. Choi and Cardie (2009) show that it is possible to adapt an existing general-domain sentiment lexicon to a specific domain using an approach that optimizes a joint objective of classification loss, sparsity of the changes made to the lexicon, and ambiguity of lexicon entries. Their approach yields appreciable gains over the general-domain lexicon, both with CRF-based machine learning classification and with a simpler "vote & flip" algorithm that is based on majority voting and negators.
Crosslingual Sentiment Analysis involves the usage of a dataset in one language to perform sentiment analysis in another language; in their work, Banea et al. (2013) show that translating text in the target language to the source language and applying a well-tuned sentiment classification system works better than either translating the training corpus or the lexicon used by the system.
In research by other groups, Wan (2009) advocates a bootstrapping approach that combines source-side and target-side features in one classifier; Duh et al. (2011) note that crosslingual sentiment analysis techniques always incur a loss due to the shift in language from the source language texts to the target language even though the general domain is the same. Popat et al. (2013) argue that full machine translation is not useful for resource-scarce languages, and propose to use cross-lingual clustering both to improve the generalization capability within a single language as well as for crosslingual projection, which works better than machine translation with the English-Hindi language pair.
It should be noted that most of the work presented in the last two paragraphs deals with document-level sentiment, or (in the case of Choi and Cardie) with shallower annotations, and that sentiment composition poses additional challenges.

Low-Budget Treebanking for Sentiment
For both supervised training and for evaluation, we created a German dataset that is close in domain to the Stanford Sentiment Treebank (Socher et al., 2013), covering opinionated sentences from movie reviews with phrase-level sentiment annotations. The original Stanford Sentiment Treebank is based on the dataset of Pang and Lee (2005), which includes 10,662 sentences from excerpts of movie reviews published on rottentomatoes.com. It should be noted that these excerpts are much more likely to express an opinion than general text or even the main body of a movie review since they contain precisely a summary of the opinion.
In order to match both domain and role of these sentences as closely as possible, we collected Creative Commons-licensed reviews from a German movie review site, filmrezensionen.de, and used only the summary part of these documents, yielding 1184 sentences, for which we crowdsourced annotation for each individual phrase in the binary tree (see Figure 2 for an example tree fragment).
For the purpose of getting binary phrase trees, sentences were processed with the Berkeley Parser (Petrov et al., 2006), NP nodes were added inside PPs (Samuelsson and Volk, 2004) and the resulting parse trees binarized using the head table in CoreNLP (Rafferty and Manning, 2008), yielding 14,321 unique phrases.
Annotation was outsourced via the CrowdFlower service, which collects three judgements for each phrase and computes an end result through voting, using unambiguous test items (which we composed from strongly positive or strongly negative adjective-noun combinations) to filter out annotators lacking the requisite understanding of German. The HeiST treebank, as well as the code used in these experiments, are available for research purposes. 1

Selecting Subjective Phrases
One possible approach to reduce annotation effort would be to annotate only those phrases that a classification model deems to have sentiment content in the first place. As a more extreme example of such an approach, consider the MLSA sentiment dataset for German, where 270 sentences were selected that already contained two words from an existing sentiment lexicon (Clematide et al., 2012), with the goal of getting sentences with interesting interactions between sentiment words. Given the potential benefits (getting more data for the same annotation effort), an approach that filters out non-interesting (confidently objective) phrases would be highly appealing.
For the pre-classification experiment, we used cross-validation on 20 to assess the potential impact of strategies for saving annotation effort. For the corresponding classifier, we used features from a German general-domain sentiment lexicon, a regression model for document-level sentiment (see section 5.2), as well as part-of-speech tag features in a gradient boosting classifier. As seen in table 1, the sentiment lexicon, especially in conjunction with the regression model and a POS-based filter, would allow us to detect uninteresting (objective) phrases with high accuracy. We limit ourselves to the 50% of most confident classifications, and as a measure of caution, the filter is bypassed for any phrase that contains a word in one of several sentiment dictionaries (see section 5). The classifier has a precision of 96.5% for objective phrases while catching about 66.7% of all objective nodes. While this would correspond to substantial savings (about a quarter of all nodes would be assigned the "neutral" label and not annotated), we would also lose a fraction of the non-neutral phrases and introduce an unwanted bias (towards lexicon-based resources) into our dataset.

Baseline results
We use the existing RNTN implementation of Socher et al. (2013) to train and test supervised learning for sentiment composition, using cross-validation. For parameter tuning, we varied the number of vector dimensions as well as the size of the minibatches used in training, and found that the resulting classifier yields very sensible results compared to a similarly-sized sample from SSTb (see Tables 2 and 3). We evaluate our results as in Socher et al. (2013): we consider the recall of positive and negative nodes while ignoring both neutral nodes and the difference between positive (+) and strongly positive (++) or between negative (-) and strongly negative (--) nodes, respectively. Socher et al. remove sentences with neutral overall sentiment in training as well as in testing, which seems to worsen the RNTN performance on our dataset (see Table 2), although other methods seem to be less affected by it. For reasons of comparability, all other reported figures are based on Socher et al.'s setting with non-neutral sentences only. In the comparison results on SSTb (see Table 3), classification experiments on the English data also show poor results for the RNTN classifier at small data sizes, in line with anecdotal evidence that recurrent neural networks have trouble with small datasets. 2
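The evaluation measure described above can be illustrated with a short sketch. It assumes that node labels are encoded as integers from -2 (strongly negative) to 2 (strongly positive) and that recall is pooled over positive and negative nodes; the encoding and the names `collapse` and `polar_recall` are illustrative and not taken from the RNTN implementation.

```python
def collapse(label):
    """Map a five-way label in {-2, ..., 2} to its polarity: -1, 0 or +1."""
    return (label > 0) - (label < 0)


def polar_recall(gold, predicted):
    """Share of gold positive/negative nodes whose predicted polarity matches.

    Gold-neutral nodes are skipped, and the distinction between
    + and ++ (or - and --) is ignored on both sides.
    """
    pairs = [(collapse(g), collapse(p))
             for g, p in zip(gold, predicted) if g != 0]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy example (hypothetical labels for six nodes).
gold = [2, 1, 0, -1, -2, 0]
pred = [1, 0, 1, -2, 1, 0]
print(polar_recall(gold, pred))  # 2 of the 4 non-neutral gold nodes match
```

Note that a neutral prediction for a polar gold node counts as an error under this reading, while predictions on gold-neutral nodes do not enter the score at all.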

Crosslingual Projection for Compositional Sentiment
Our crosslingual approach follows Banea et al. (2013) in assuming that machine translation of the target documents to the source language, then applying a source-language sentiment analysis, and finally projecting the result back to the target side will yield usable sentiment classification. In contrast to previous approaches for cross-lingual sentiment analysis, however, our annotation transfer concerns not just analysis results for the complete sentence, but for individual syntactic nodes. After translating the target-language trees using the Google Translate API, we parsed the sentences using the English model of the Stanford parser, and applied the RNTN model of Socher et al. (2013) trained on the English Stanford Sentiment Treebank, yielding a labeling for each syntactic node with a sentiment value. We then performed word alignment using the PostCAT word aligner (Ganchev et al., 2008) with a model trained on the OPUS version of the EuroParl corpus (Tiedemann, 2012), and alignment of syntactic nodes using the Lingua::Align toolbox for tree alignment (Tiedemann, 2010) with a model trained on the Smultron parallel treebank (Volk et al., 2010).
2 Alec Radford (2015): General Sequence Learning using Recurrent Neural Nets, https://indico.io/blog/general-sequence-learning-using-recurrent-neural-nets/

Using the word alignment and our Lingua::Align model, we are able to map 98.6% of the target-language nodes to a corresponding node on the source (English) side, whereas the remaining nodes are assigned the same sentiment label as the root. As can be seen in table 2, a model that uses simpler features for Lingua::Align works less well than the full feature model. Considering that the RNTN on the Stanford Sentiment Treebank reaches 87.6% node accuracy and 85.4% root accuracy, we see that the crosslingual projection step induces a loss in accuracy, but still performs well in comparison to the approaches that use the HeiST training data.
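The final projection step, including the fallback for unaligned nodes, might look like the following minimal sketch. It assumes that the tree aligner's output has already been read into a dictionary from target node ids to source node ids, and that the fallback uses the label of the root of the English tree; the function name, the data layout and the toy node ids are hypothetical and do not reflect the Lingua::Align output format.

```python
def project_labels(target_nodes, alignment, source_labels, source_root):
    """Copy sentiment labels from aligned source nodes to target nodes.

    Target nodes without an aligned source node receive the label of
    the source-side root, mirroring the fallback described in the text.
    """
    root_label = source_labels[source_root]
    projected = {}
    for node in target_nodes:
        source_node = alignment.get(node)  # None if the node is unaligned
        projected[node] = source_labels.get(source_node, root_label)
    return projected


# Toy example: the German NP has no aligned English node.
source_labels = {"S": 1, "NP": 0, "VP": 2}
alignment = {"s": "S", "vp": "VP"}
print(project_labels(["s", "np", "vp"], alignment, source_labels, "S"))
```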

Lexicon-based Approaches
Considering that the size of HeiST creates a sparse data problem for the RNTN learner, it is natural to ask whether we can improve the generalization capabilities of the system by either using a less-supervised approach or by generalizing over individual word forms to alleviate the sparse data problem.

General-domain lexicon
Several general-domain sentiment lexicons exist for German, including those of Klenner et al. (2009), Waltinger (2010a), Remus et al. (2010), and Emerson and Declerck (2014). Klenner et al. (2009) created their polarity lexicon by semiautomatic extension of an existing one: starting from a set of 2866 adjective seeds, they looked for adjectives that often co-occur in coordinations with known sentiment-bearing adjectives, which were added to the lexicon after a manual filtering step. The current version of Klenner et al.'s PolArt lexicon also contains other parts of speech, and a list of shifters and intensifiers that interact with subjective terms.
The GermanPolarityClues lexicon of Waltinger (2010a) combines translation from English lexicons with a semi-automatic approach for merging and manually correcting lexicon entries.
The SentiWS lexicon (Remus et al., 2010) contains entries from the English General Inquirer (Stone et al., 1966) that were translated via Google Translate, as well as a small number of terms that were mined from positive and negative product reviews, expanded using a collocation dictionary.
Finally, the SentiMerge lexicon (Emerson and Declerck, 2014) has been constructed as a Bayesian combination (i.e., averaging with imputation for missing entries) of the three resources above together with the German SentiSpin resource of Waltinger (2010b), which contains automatic (dictionary-based) translations of the SentiSpin lexicon of Takamura et al. (2005).
We tested all lexicons using two approaches: In the vote-only approach, the sentiment of a phrase is determined by the sum of the scores of the words in that phrase as they are assigned in the sentiment lexicon. In the vote-and-flip approach, we consider the average of the sentiment terms, but invert the sentiment value whenever a term from the shifter category of Klenner et al.'s lexicon is found within the yield of the node. A similar strategy was used in many papers on sentiment composition, usually with a performance rather close to the best system (see e.g. the CompoMC baseline in Choi and Cardie, 2008, or the Vote-and-Flip baseline in Choi and Cardie, 2009).
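Both heuristics are simple enough to state in a few lines. The following is a minimal sketch under the assumption that the lexicon is a dictionary from word forms to real-valued scores and that shifters are given as a set of word forms; the function names and the toy entries are hypothetical and not taken from any of the lexicons above.

```python
def vote_only(words, lexicon):
    """Sum the lexicon scores of all words in the phrase (unknown words count 0)."""
    return sum(lexicon.get(word, 0.0) for word in words)


def vote_and_flip(words, lexicon, shifters):
    """Average the scores of the sentiment terms; invert if a shifter occurs.

    The value is flipped once if any shifter is found in the yield of
    the node, regardless of how many shifters there are.
    """
    scores = [lexicon[word] for word in words if word in lexicon]
    average = sum(scores) / len(scores) if scores else 0.0
    if any(word in shifters for word in words):
        average = -average
    return average


# Toy lexicon and shifter list (hypothetical entries).
lexicon = {"gut": 1.0, "langweilig": -1.0}
shifters = {"nicht"}
phrase = ["nicht", "gut"]
print(vote_only(phrase, lexicon), vote_and_flip(phrase, lexicon, shifters))
```

The sketch makes the difference visible on "nicht gut": vote-only keeps the positive score of "gut", while vote-and-flip inverts it.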

Near-domain lexicon construction
While the filmrezensionen web site offers a good number of reviews, the final collection is rather small.
To complement our small in-domain dataset we use the most common way of getting text with document-level annotations, namely customer-written reviews from the movies section of the amazon.de web site.
Perhaps expectedly, customer reviews do not focus exclusively on the film and its performance. Rather, it often occurs that customer reviews include a discussion of the physical (or other) medium that the film came on: 3
(1) I am with Lovefilm (now Prime) and tried to stream the series. Terrible! Always [issues with] loading time and loss of the stream. It seems that Amazon hasn't come to terms with the technology yet.
Other reviews on Amazon match our domain fairly well, as in the following:
(2) If this is truly a sequel to "Speed", it only shows in the second hour of the film. It's only then that deBont shows why he would be an action [film] specialist. Admittedly, even then we don't get the same tension as in the predecessor, but in any case it's better than the first hour of the film.
While we found that a small quantity of data (20+20 hand-classified sentences) together with a 300-class LDA representation was sufficient to reach 100% accuracy in separating content-related versus media-related text, we found that filtering out the irrelevant texts made no difference for the mean square error, in sharp contrast to L1/Lasso regularization, which allows us to learn a sparse lexicon.
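A sparse lexicon of this kind could be learned roughly as follows. This is a minimal sketch and not the setup used in the experiments: it assumes bag-of-words count features, ratings centred on their mean, and a plain coordinate-descent solver for the L1-penalized squared error; the function names, the penalty value and the toy reviews are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_lexicon(docs, ratings, penalty=0.1, sweeps=100):
    """Fit word weights minimizing mean squared error / 2 + penalty * L1 norm.

    docs is a list of token lists, ratings the document-level scores.
    Returns only the words whose weight is nonzero (the sparse lexicon).
    """
    vocab = sorted({word for doc in docs for word in doc})
    n = len(docs)
    columns = [[doc.count(word) for doc in docs] for word in vocab]
    mean = sum(ratings) / n
    residual = [r - mean for r in ratings]  # all weights start at zero
    weights = [0.0] * len(vocab)
    for _ in range(sweeps):
        for j, column in enumerate(columns):
            scale = sum(x * x for x in column) / n
            # Correlation of feature j with the residual, ignoring its own effect.
            rho = sum(x * (r + x * weights[j])
                      for x, r in zip(column, residual)) / n
            new_weight = soft_threshold(rho, penalty) / scale
            delta = new_weight - weights[j]
            if delta:
                residual = [r - x * delta for x, r in zip(column, residual)]
                weights[j] = new_weight
    return {word: w for word, w in zip(vocab, weights) if w != 0.0}


# Toy reviews with star ratings (hypothetical data).
docs = [["great", "film"], ["boring", "film"],
        ["great", "story"], ["boring", "story"]]
ratings = [5, 1, 5, 1]
print(lasso_lexicon(docs, ratings))
```

On the toy data, the words that carry no rating information ("film", "story") end up with a weight of exactly zero and are dropped, which is the sparsity effect that makes the L1 penalty attractive for lexicon induction.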

Variants of the RNTN Model
While the RNTN model certainly performs well on the full Stanford Sentiment Treebank, it is likely that its performance on HeiST is suffering from sparse data problems, and that both words and particular constructions can be novel and unseen. In syntactic parsing, Koo et al. (2008) and Candito and Seddah (2010) have shown that using Brown clusters can be beneficial for alleviating sparse data problems in parsing. In a similar vein, Popat et al. (2013) use word clusters to improve generalization in sentiment classification. In our experiments, we follow Candito and Seddah (2010) in simply replacing words by clusters: in their experiments, even this simple procedure can yield an improvement, with improved results when the unlabeled data stems from the target domain. Since Brown clusters are mostly syntactic/semantic in nature and do not automatically distinguish positive or negative sentiment, we additionally performed multiple experiments to use clusters while incorporating additional sentiment information: On one hand, we try to incorporate the judgements on the Amazon near-domain dataset more directly into the clusters by using the repeated bisecting K-Means algorithm as implemented in CLUTO (Zhao and Karypis, 2005), with previous/next word, part-of-speech tag, and the score of the containing review as features. On the other hand, we split the Brown clusters according to the sentiment value that they have in a particular sentiment lexicon (e.g. SentiMerge), yielding three clusters 01101+, 01101- and 01101? instead of the original cluster 01101.
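Splitting Brown clusters by lexicon polarity, as described above, amounts to appending a suffix to the cluster id. The sketch below assumes that Brown clusters are available as a dictionary from word forms to bit-string ids, that "?" is used both for words missing from the lexicon and for words with a zero score, and that words without a cluster get a placeholder id; all names and toy entries are hypothetical.

```python
def split_cluster_id(word, brown_clusters, lexicon, unknown="UNK"):
    """Return the Brown cluster id of a word, refined by lexicon polarity.

    The suffix is "+" for a positive score, "-" for a negative score,
    and "?" if the word has no (or a zero) score in the lexicon.
    """
    cluster = brown_clusters.get(word, unknown)
    score = lexicon.get(word, 0.0)
    if score > 0:
        return cluster + "+"
    if score < 0:
        return cluster + "-"
    return cluster + "?"


# Toy example: three adjectives sharing one Brown cluster.
brown_clusters = {"gut": "01101", "schlecht": "01101", "lang": "01101"}
lexicon = {"gut": 0.7, "schlecht": -0.8}
for word in ["gut", "schlecht", "lang"]:
    print(word, split_cluster_id(word, brown_clusters, lexicon))
```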
As a final experiment, we consider replacing only sentiment words by a concatenation of their part-of-speech and the sentiment class (turning "a great film" into "a JJ++ film"), and leaving neutral words intact. As an upper baseline for this approach, we can get words' sentiment polarity directly from training and testing data, which yields the replacegold entry in table 5.

rule type   # in SSTb   # in HeiST
AVG         119468      19228
INV         2158        289
INT         6614        646
MWE         18235       1936

Table 6: Rule types in SSTb and HeiST
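The replacement of sentiment words by part-of-speech and sentiment class, as in the final experiment of this section, could be sketched as follows. The input format (a list of word/tag pairs) and the dictionary from words to class strings such as "++" are assumptions for illustration, as is the function name.

```python
def replace_sentiment_words(tagged_tokens, sentiment_classes):
    """Replace each sentiment word by POS tag + sentiment class.

    Words without an entry in sentiment_classes (neutral words)
    are left intact.
    """
    return [tag + sentiment_classes[word] if word in sentiment_classes else word
            for word, tag in tagged_tokens]


# The example from the text: "a great film" becomes "a JJ++ film".
tokens = [("a", "DT"), ("great", "JJ"), ("film", "NN")]
print(" ".join(replace_sentiment_words(tokens, {"great": "++"})))
```

For the replacegold upper baseline, the class dictionary would be filled from the gold annotations rather than from a lexicon.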

Results and Error analysis
Looking at the results in tables 2, 4 and 5, we see that simple support vector machine classification is very effective for reproducing the positive/negative sentiment of nodes and complete sentences, followed by crosslingual projection and a simple averaging approach. We also see that handling negation in the vote-and-flip approach seems to lower the score. Similarly, the best model with word clusters and splitting (using the GermanPolarityClues lexicon) performs better than the word-based RNTN approach, but less well than the lexicon by itself. Even the replacegold upper baseline (replacing sentiment-carrying words by their sentiment label, which raises the performance substantially) gives results below the simpler SVM approach, which is counterintuitive.

Is it about Compositionality?
One motivation for using sub-sentence structure, both in approaches for rule-based composition (e.g. Taboada et al. (2011) and other lexicon-based approaches) and in more complex learning approaches such as RNN (Socher et al., 2011) and RNTN (Socher et al., 2013), is the idea that such approaches are able to model the interaction between sentiment-bearing words and sentiment-modifying words. An example of investigations based on this assumption is the work by Zhu et al. (2014), who contrast different lexicon-based approaches for handling negation with an RNTN model of negation and a modification of said model. Given the results using a lexicon-based approach implementing the vote-and-flip heuristic in comparison to the vote-only heuristic, we found it worth investigating what specific types of interaction exist in compositional sentiment treebanks, also considering that Zhu et al.'s investigations yielded a more precise picture of the sentiment-shifting action of negators as a highly lexicalized phenomenon.
For our analysis, we grouped the production rules $s_p \rightarrow s_l\, s_r$ in a sentiment treebank into one of the following categories:
AVG A production is said to be averaging if the parent category lies within the range spanned by the daughter categories (e.g. mind-numbingly good would be the composition of a negative term and a positive term to a positive term, which still fits the averaging heuristic).
INV A production is said to be inverting if one daughter category is neutral and the other daughter category is on the opposite side of the spectrum from the parent (e.g. "not great" landing on the negative side).
INT A production is said to be intensifying if the parent category is on the same side of the scale as the daughters but more extreme.
MWE A production is said to be a multi-word production if the daughter categories are classified as neutral while the parent category is not. 4
As can be seen in table 6, the number of inverting and intensifying productions is dwarfed, both for SSTb and for HeiST, by the number of multi-word rules. While it is likely that these counts are slightly distorted by noise in the annotation (as both datasets are the product of crowdsourcing), this fact is remarkable and merits further investigation.
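The four production categories defined above can be turned into a small classifier over label triples. The sketch assumes labels encoded as integers from -2 to 2, checks the categories in the order MWE, AVG, INV, INT, and adds an OTHER class for productions that fit none of the definitions; the check order, the OTHER class and the function name are assumptions of this sketch and not part of the original analysis.

```python
from collections import Counter


def classify_production(parent, left, right):
    """Assign a production (parent, left daughter, right daughter) to a rule type."""
    if left == 0 and right == 0 and parent != 0:
        return "MWE"  # neutral daughters, non-neutral parent
    if min(left, right) <= parent <= max(left, right):
        return "AVG"  # parent within the range of the daughters
    if left == 0 or right == 0:
        polar_daughter = right if left == 0 else left
        if polar_daughter * parent < 0:
            return "INV"  # polar daughter and parent on opposite sides
    polar = [d for d in (left, right) if d != 0]
    if all(d * parent > 0 for d in polar) and abs(parent) > max(abs(d) for d in polar):
        return "INT"  # same side as the daughters, but more extreme
    return "OTHER"


# Toy productions: "not great", "mind-numbingly good", "very good", "plot holes".
productions = [(-1, 0, 1), (1, -1, 1), (2, 0, 1), (-1, 0, 0)]
print(Counter(classify_production(*p) for p in productions))
```

Counting the rule types over all productions of a treebank in this way yields a table of the same form as table 6, although the exact counts depend on how borderline cases are resolved.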
Types of multiword expressions If we try to group the nodes with a "multiword" production, we can distinguish at least the following categories:
• aspect descriptions: In some cases, an adjective is specifically used to describe a (positive or negative) aspect of the movie, such as an elaborate continuation, or an expanded vision, where individual words have a neutral sentiment label (and conceivably could have been used in a non-aspect-specific way to convey a neutral or negative sentiment, such as an elaborate perversion, or an expanded nightshift). Similarly, wenig Handlung (not much action) has a negative meaning as a construction despite "wenig" (few/not much) not having a negative meaning itself.
• expression strengthening is a phenomenon that occurs when a term is judged as neutral by annotators by itself, but gains a sentiment value when paired with an intensifier or negator. For example, intrusive was labeled as neutral in SSTb, but simply intrusive as negative.
• comparatives are a very regular construction where too much of something is almost always bad: too long, too insistent, too much, too many are all negative in SSTb, just as zu viel (too many) and zu wenig (not enough) and other counterparts in HeiST are negative.
• true constructions such as plot holes or historically significant in SSTb, or ruhigen Gewissens (with a calm conscience) and Finger weg (don't touch it) in HeiST are both a problem for approaches relying purely on composition and not regular enough that we would expect to model them as a regular construction.
Some of the neutral-to-positive or neutral-to-negative transitions don't seem well-motivated and may be regarded as artifacts of the crowdsourcing: for example, does n't, is n't and are n't are negative in SSTb, whereas 's not, do n't and did n't get a neutral label. In HeiST, nicht immer (not always) as well as nicht ganz (not quite) are negative, whereas auch nicht (neither) and nicht so (not as) or nicht unbedingt (not necessarily) are neutral.
The MWE productions seem to overlap with well-known linguistic phenomena: consider Fahrni and Klenner (2008) and their claim that most adjectives have a polarity that is dependent on the target they modify instead of having a 'prior' polarity that holds independently of the target, or the observation of Su and Markert (2009) that sentiment should be dependent on word senses instead of word forms (which would capture a large number of examples within the expression strengthening category). Others, however, may be idiosyncrasies introduced by the crowdsourcing process, and powerful learners such as RNTN or the approach of Hall et al. (2014) will gain performance from simply memorizing the idiosyncrasies of the data when there is enough of it. Because of the way the Stanford Sentiment Treebank is constructed, phrases always have the same (context-independent) label, while we may get a more accurate (and possibly different) picture from introducing additional means of quality control (which in turn increases the necessary investment for such a sentiment treebank).

Contrasting SVM and RNTN behaviour
In table 7, we tabulated the classification accuracy for the parent node in different types of productions in HeiST. In this evaluation, we counted a production as correct whenever the parent node has the right sentiment label (in parallel with the labeled recall in syntactic evaluation), ignoring for the moment the question whether the production produced by a system falls into the same category. It is easy to see that AVG-type productions are the least error-prone for all classifiers, whereas MWE and INV productions pose a significant challenge for the models.
We also see that on these challenging productions, the RNTN performs better than the other methods.

Summary
We presented a novel dataset for subsentential sentiment classification which uses the same conventions as the Stanford Sentiment Treebank (SSTb) and which is the only German resource of this type besides the smaller (270 sentences) MLSA corpus (Clematide et al., 2012). We performed a systematic exploration of supervised, cross-lingual, and lexicon-based approaches on this dataset and found that, paradoxically, the performance of state-of-the-art recursive neural tensor network (RNTN) models is severely impeded in this data-sparse situation, unlike latent-variable models for syntax, which can deal with such conditions quite well: Lavelli and Corazza (2009), for example, report that the best results for parsing on the very small TUT treebank (slightly more than 2000 sentences) can be achieved using a PCFG-LA model. We showed that a wide variety of models (from lexicon-based sentiment prediction over SVM with unigram features to crosslingual classification) perform better than the RNTN, and that methods to improve RNTN performance that work in other settings (syntax) do not offer any easy fix.
In a second step, we took a closer look at the crowdsourced data in order to explain certain counterintuitive results (such as the fact that most sentiment lexicons do not benefit from negation handling, or that the upper baseline achievable with the RNTN by getting gold-standard information on positive and negative words is at about the same level as our SVM classifier), and found that SSTb-type resources show marked differences from e.g., the MLSA dataset as they incorporate multiword items, but seem to be challenging for the study of compositionality due to noise that is not present in expert-annotated resources.