Neural Networks for Integrating Compositional and Non-compositional Sentiment in Sentiment Composition

This paper proposes neural networks for integrating compositional and non-compositional sentiment in the process of sentiment composition, a type of semantic composition that optimizes a sentiment objective. We enable individual composition operations in a recursive process to choose and merge information from these two types of sources. We propose our models in neural network frameworks with structures, in which the merging parameters can be learned in a principled way to optimize a well-defined objective. We conduct experiments on the Stanford Sentiment Treebank and show that the proposed models achieve better results than the model that lacks this ability.


Introduction
Automatically determining the sentiment of a phrase, a sentence, or even a longer piece of text is still a challenging problem. Data sparseness encountered in such tasks often requires factorizing the problem into smaller pieces, i.e., component words or phrases, for which much research has been performed on bag-of-words or bag-of-phrases models (Pang and Lee, 2008; Liu and Zhang, 2012). More recent work has started to model sentiment composition (Moilanen and Pulman, 2007; Choi and Cardie, 2008; Socher et al., 2012; Socher et al., 2013), a type of semantic composition that optimizes a sentiment objective. In general, the composition process is critical in the formation of the sentiment of a span of text, yet it has not been well modeled and leaves scope for further work.
Compositionality, or non-compositionality, of the senses of text spans is important for language understanding. Sentiment, as one of the major semantic differential categories (Osgood et al., 1957), faces the problem as well. For example, the phrase must see or must try in a movie or restaurant review often indicates a positive sentiment, which, however, may be hard to learn from the component words. More extreme examples, e.g., slang such as bad ass, are not rare in social media text. This particular example can actually convey a very positive sentiment even though its component words are very negative. In brief, a sentiment composition framework that can consider both compositional and non-compositional sentiment is theoretically interesting.
From a more pragmatic viewpoint, if one is able to reliably learn the sentiment of a text span (e.g., an ngram) holistically, it would be desirable for a composition model to be able to decide which source of knowledge it trusts more: the composition from the component words, the non-compositional source, or a soft combination of them. In such a situation, whether the text span is actually composable may be blurred or may not be a concern.
In general, the composition of sentiment is a rather complicated process. As a glimpse of evidence, the effect of negation words on changing the sentiment of their scopes appears to be a complicated function (Zhu et al., 2014). The recently proposed neural networks (Socher et al., 2013; Socher et al., 2011) are promising, for their capability of modeling complicated functions (Mitchell, 1997) in general, handling data sparseness by learning low-dimensional embeddings at each layer of composition, and providing a framework to optimize the composition process in a principled way. This paper proposes neural networks for integrating compositional and non-compositional sentiment in the process of sentiment composition. To achieve this, we enable individual composition operations in a recursive process to choose and merge information from these two types of sources. We propose our models in neural network frameworks with structures (Socher et al., 2013), in which the merging parameters can be learned in a principled way to optimize a well-defined objective. We conduct experiments on the Stanford Sentiment Treebank and show that the proposed models achieve better results than the model that does not consider this property.

Related work
Composition of sentiment. Early work on modeling sentiment does not examine semantic composition closely (Pang and Lee, 2008; Liu and Zhang, 2012), as mentioned above. Recent work has considered sentiment-oriented semantic composition (Moilanen and Pulman, 2007; Choi and Cardie, 2008; Socher et al., 2012; Socher et al., 2013), simply called sentiment composition in this paper. For example, Moilanen and Pulman (2007) used a collection of hand-written compositional rules to assign sentiment values to different granularities of text spans. Choi and Cardie (2008) proposed a learning-based framework. The more recent work of Socher et al. (2013) proposed models based on neural networks that do not rely on any heuristic rules. Such models work in a bottom-up fashion over a tree to infer the sentiment label of a phrase or sentence as a composition of the sentiment expressed by its constituent parts. The approach leverages a principled method, forward and backward propagation, to optimize the system performance. In this paper, we follow the neural network approach to integrate compositional and non-compositional sentiment in sentiment composition.
Prior knowledge of sentiment. Integrating non-compositional sentiment into the composition process can be viewed as introducing some prior sentiment knowledge, as in general the sentiment of a word or a phrase perceived independently of its context is often referred to as prior sentiment. Word-level prior sentiment is typically annotated in manual sentiment lexicons (Wilson et al., 2005; Hu and Liu, 2004; Mohammad and Turney, 2010), or learned in an unsupervised or semi-supervised way (Hatzivassiloglou and McKeown, 1997; Esuli and Sebastiani, 2006; Turney and Littman, 2003; Mohammad et al., 2009). More recently, sentiment indicators, such as emoticons and hashtags, have been utilized (Go et al., 2009; Davidov et al., 2010; Kouloumpis et al., 2011; Mohammad, 2012; Mohammad et al., 2013a). With enough data, such freely available (but noisy) annotation can be used to learn the sentiment of ngrams. In our study, we investigate the effect of automatically learned sentimental ngrams in the proposed composition models.

Prior-enriched semantic networks
In this paper, we propose several neural networks that enable each composition operation to choose and merge sentiment from lower-level composition and that from non-compositional sources. We call the networks Prior-Enriched Semantic Networks (PESN). We present several specific implementations based on RNTN (Socher et al., 2013); the latter has been shown to be a state-of-the-art sentiment composition framework. However, the realization of a PESN node is not necessarily tied to RNTN. Figure 1 shows a piece of PESN. Each of the three big nodes, i.e., $N_1$, $N_2$, and $N_3$, corresponds to a node in a constituency parse tree; e.g., $N_3$ may correspond to the phrase not a must try, where $N_1$ and $N_2$ are not and a must try, respectively. We extend each of the nodes to consider sentiment from lower-level composition and non-compositional sources. In node $N_3$, knowledge from the lower-level composition is represented in the hidden vector $i_3$, which is merged with non-compositional knowledge represented in $e_3$, and the merged information is saved in $m_3$. The black box in the center performs the actual merging, which integrates the two knowledge sources in order to minimize an overall objective function that we will discuss in detail later. The recursive neural networks and the forward-backward propagation over structures (Socher et al., 2013; Goller and Küchler, 1996) provide a principled way to optimize the whole network.
Figure 1: A prior-enriched semantic network (PESN) for sentiment composition. The three nodes, $N_1$, $N_2$, and $N_3$, correspond to three nodes in a constituency parse tree, and each of them considers sentiment from lower-level composition ($i_1$, $i_2$, $i_3$) and from non-compositional sentiment ($e_1$, $e_2$, $e_3$).

Regular bilinear merging
The most straightforward way of implementing a PESN node is probably through a regular bilinear merging. Take node $N_3$ in Figure 1 as an example; the node vector $m_3$ is simply merged from $i_3$ and $e_3$ as follows:
$$m_3 = \tanh\left(W_m \begin{bmatrix} i_3 \\ e_3 \end{bmatrix}\right) \tag{1}$$
Again, vector $i_3$ contains the knowledge from the lower-level composition; $e_3$ is a vector representing non-compositional sentiment information, which can come either from human annotation or from automatically learned resources. Note that in the network, all hidden vectors $m$ and $i$ (including word embedding vectors) have the same dimensionality $d$, but the non-compositional nodes, i.e., the nodes $e$, do not necessarily have the same number of elements, and we let $l$ be their dimensionality. The merging matrix $W_m$ is $d$-by-$(d+l)$.
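The merging step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the sizes, the random values, the function name `bilinear_merge`, and the absence of a bias term are all assumptions made here.

```python
import numpy as np

def bilinear_merge(W_m, i_vec, e_vec):
    """Merge compositional vector i (size d) with prior vector e (size l).

    W_m has shape (d, d + l); no bias term is used (an assumption).
    """
    stacked = np.concatenate([i_vec, e_vec])   # [i; e], shape (d + l,)
    return np.tanh(W_m @ stacked)              # m, shape (d,)

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
d, l = 4, 5
W_m = rng.normal(scale=0.1, size=(d, d + l))
i3 = np.tanh(rng.normal(size=d))               # stand-in for a composed vector
e3 = np.zeros(l)
e3[1] = 1.0                                    # one-hot prior vector
m3 = bilinear_merge(W_m, i3, e3)
```

Because `e3` is one-hot here, the prior simply selects one column of the right-hand block of `W_m` and adds it to the contribution of `i3` before the tanh.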
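As in this paper we discuss PESN in the framework of RNTN, computation outside the nodes $N_1$, $N_2$, $N_3$ follows that for the standard three-way tensors in RNTN. That is, the hidden vector $i_3$ is computed with the following formula:
$$i_3 = \tanh\left(\begin{bmatrix} m_1 \\ m_2 \end{bmatrix}^{T} V_r^{[1:d]} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} + W_r \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}\right) \tag{2}$$
where $W_r \in \mathbb{R}^{d \times (d+d)}$ and $V_r \in \mathbb{R}^{(d+d) \times (d+d) \times d}$ are the matrix and tensor of the composition function used in RNTN, respectively, each of which is shared over the whole tree in computing vectors $i_1$, $i_2$, and $i_3$.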
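The RNTN composition of two child vectors can be sketched as follows. The layout of the tensor as a `(2d, 2d, d)` array, the function name `rntn_compose`, and the random values are assumptions for illustration; the form of the computation follows the description of RNTN in Socher et al. (2013).

```python
import numpy as np

def rntn_compose(V_r, W_r, left, right):
    """RNTN-style composition of two child vectors of size d.

    V_r has shape (2d, 2d, d): slice V_r[:, :, k] is one bilinear form.
    W_r has shape (d, 2d).
    """
    x = np.concatenate([left, right])                 # [m1; m2], shape (2d,)
    bilinear = np.einsum("i,ijk,j->k", x, V_r, x)     # x^T V_r[:, :, k] x for each k
    return np.tanh(bilinear + W_r @ x)

rng = np.random.default_rng(1)
d = 3
V_r = rng.normal(scale=0.1, size=(2 * d, 2 * d, d))
W_r = rng.normal(scale=0.1, size=(d, 2 * d))
m1, m2 = rng.normal(size=d), rng.normal(size=d)
i3 = rntn_compose(V_r, W_r, m1, m2)
```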

Explicitly gated merging
Compared to the regular bilinear merging model, we here further explicitly control the input of the compositional and non-compositional semantics. Explicit gating in neural networks has been studied in the literature. For example, the long short-term memory (LSTM) utilizes input gates, together with output gates and forget gates, to guide memory blocks to remember/forget history (Hochreiter and Schmidhuber, 1997). For our purpose here, we explore an input gate to explicitly control the two different input sources. As shown in Figure 2, an additional gating layer $g_3$ is used to control $i_3$ and $e_3$ explicitly:
$$g_3 = \sigma\left(\begin{bmatrix} W_{ge}\, e_3 \\ W_{gi}\, i_3 \end{bmatrix}\right) \tag{3}$$
$$m_3 = \tanh\left(W_m \left(g_3 \otimes \begin{bmatrix} i_3 \\ e_3 \end{bmatrix}\right)\right) \tag{4}$$
The sign $\otimes$ is a Hadamard product; $\sigma$ is a logistic sigmoid function instead of a tanh activation, which keeps the gating signal $g_3$ in the range $[0, 1]$ so that it serves as a soft switch (not a hard binary 0/1 switch) to explicitly gate $i_3$ and $e_3$. Note that elsewhere in the network, we still use tanh as our activation function. In addition, $W_{ge} \in \mathbb{R}^{d \times l}$ and $W_{gi} \in \mathbb{R}^{l \times d}$ are the weight matrices used to calculate the gate vector.
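A minimal NumPy sketch of the gated merging is given below. The arrangement of the gate is an assumption inferred from the stated matrix shapes: the d-by-l matrix W_ge maps $e_3$ to the gate over $i_3$, and the l-by-d matrix W_gi maps $i_3$ to the gate over $e_3$. The function names, sizes, and values are hypothetical, and no bias terms are used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_merge(W_m, W_ge, W_gi, i_vec, e_vec):
    """Gate [i; e] element-wise, then merge.

    W_ge: (d, l) maps e to the gate over i; W_gi: (l, d) maps i to the gate over e.
    """
    gate = sigmoid(np.concatenate([W_ge @ e_vec, W_gi @ i_vec]))  # in (0, 1), shape (d + l,)
    gated = gate * np.concatenate([i_vec, e_vec])                 # Hadamard product
    return np.tanh(W_m @ gated), gate

rng = np.random.default_rng(2)
d, l = 4, 5
W_m = rng.normal(scale=0.1, size=(d, d + l))
i3, e3 = rng.normal(size=d), np.eye(l)[3]
# With zero gate weights every gate value is 0.5, i.e. both inputs are halved.
m3, g3 = gated_merge(W_m, np.zeros((d, l)), np.zeros((l, d)), i3, e3)
```

With learned gate weights, entries of the gate close to 0 suppress the corresponding input element and entries close to 1 let it through, which is the soft-switch behavior described above.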

Confined-tensor-based merging
The third approach we use for merging compositional and non-compositional knowledge employs tensors, which are able to explore multiplicative combinations among variables. Tensors have already been successfully used in a wide range of NLP tasks to capture high-order interactions among variables. The forward computation of $m_3$ follows:
$$m_3 = \tanh\left(\begin{bmatrix} i_3 \\ e_3 \end{bmatrix}^{T} V_m^{[1:d]} \begin{bmatrix} i_3 \\ e_3 \end{bmatrix} + W_m \begin{bmatrix} i_3 \\ e_3 \end{bmatrix}\right) \tag{5}$$
where $V_m^{[1:d]} \in \mathbb{R}^{(d+l) \times (d+l) \times d}$ is the tensor that defines multiple bilinear forms, and the matrix $W_m$ is as defined in the previous models.
As we focus on the interaction between $i_3$ and $e_3$, we force each slice $V_m^{[k]}$ ($k \in \{1, \ldots, d\}$) of the tensor to have a restricted form: the top-right $d$-by-$l$ block and the bottom-left $l$-by-$d$ block are non-zero parameters, used to capture multiplicative, element-pair interactions between $i_3$ and $e_3$, while the remaining blocks are set to zero, to ignore interactions among the variables within $i_3$ and among those within $e_3$. This not only makes the model focus on the interaction between vectors $i$ and $e$, it also significantly reduces the number of parameters to estimate, which could otherwise lead to overfitting. We call this model confined-tensor-based merging.
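The block structure of the confined tensor, and the parameter saving it brings, can be illustrated with a mask. This is a sketch under assumptions: the `(d + l, d + l, d)` array layout, the names `confined_mask` and `confined_tensor_merge`, and the example sizes are chosen here for illustration.

```python
import numpy as np

def confined_mask(d, l):
    """1 on the top-right d-by-l and bottom-left l-by-d blocks of every slice, 0 elsewhere."""
    mask = np.zeros((d + l, d + l, d))
    mask[:d, d:, :] = 1.0     # i-e interactions
    mask[d:, :d, :] = 1.0     # e-i interactions
    return mask

def confined_tensor_merge(V_m, W_m, i_vec, e_vec):
    d, l = i_vec.size, e_vec.size
    x = np.concatenate([i_vec, e_vec])
    V = V_m * confined_mask(d, l)                  # zero the within-i and within-e blocks
    bilinear = np.einsum("i,ijk,j->k", x, V, x)
    return np.tanh(bilinear + W_m @ x)

d, l = 4, 5
free_full = (d + l) ** 2 * d                       # parameters in an unconstrained tensor
free_confined = int(confined_mask(d, l).sum())     # 2 * d * l free parameters per slice
```

For these example sizes the unconstrained tensor has 324 entries, while the confined one keeps 160 free parameters; the gap widens as d grows relative to l.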

Learning and inference
Objective. The overall objective function in learning PESN, following Socher et al. (2013), minimizes the cross-entropy error between the predicted distribution $y^{sen_i} \in \mathbb{R}^{c \times 1}$ at a node $i$ and the target distribution $t^{i} \in \mathbb{R}^{c \times 1}$ at that node, where $c$ is the number of sentiment categories. PESN learns the parameters that are used to merge the compositional and non-compositional sentiment so that the merging operations integrate the two sources in minimizing prediction loss. The neural network over structures provides a principled framework to optimize these parameters.
More specifically, the error over an entire sentence is calculated as a regularized sum:
$$E(\theta) = -\sum_{i}\sum_{j} t^{i}_{j} \log y^{sen_i}_{j} + \lambda \lVert \theta \rVert^{2} \tag{6}$$
where $\lambda$ is the regularization parameter, $j \in \{1, \ldots, c\}$ indexes the $j$-th element of the multinomial target distribution, $\theta$ are the model parameters that will be discussed below, and $i$ iterates over all nodes $i_x$ (e.g., $i_1$, $i_2$, and $i_3$) in Figure 1, where the model predicts sentiment labels.
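The regularized sentence-level error can be sketched as below. The softmax prediction through a classification matrix follows the RNTN setup of Socher et al. (2013); the function names, the sizes, and the choice of which parameters enter the regularizer are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def sentence_error(W_label, node_vectors, targets, params, lam):
    """Cross-entropy summed over labeled nodes plus lam * squared L2 norm of params."""
    cross_entropy = 0.0
    for h, t in zip(node_vectors, targets):
        y = softmax(W_label @ h)                  # predicted distribution, shape (c,)
        cross_entropy -= float(np.sum(t * np.log(y)))
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return cross_entropy + lam * l2

c, d = 5, 4
W_label = np.zeros((c, d))                        # zero weights give a uniform prediction
nodes = [np.ones(d), -np.ones(d), np.zeros(d)]
targets = [np.eye(c)[0], np.eye(c)[2], np.eye(c)[4]]
err = sentence_error(W_label, nodes, targets, [W_label], lam=0.01)
```

With zero classification weights each node contributes log 5 to the error, so `err` equals 3 log 5 in this toy setting.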
Backpropagation over the structures. To minimize $E(\theta)$, the gradient of the objective function with respect to each of the parameters in $\theta$ is calculated efficiently via backpropagation through structure (Socher et al., 2013; Goller and Küchler, 1996), after computing the prediction errors in forward propagation with the formulas described above.

Regular bilinear merging
The PESN implemented with simple bilinear merging has the following model parameters: $\theta = (V_r, W_r, W_m, W_{label}, L)$. As discussed above, $V_r$ and $W_r$ are the tensor and matrix in RNTN; $W_m$ is the weight matrix for merging the compositional and non-compositional sentiment vectors. $L$ denotes the vector representations of the word dictionary, and $W_{label}$ is the sentiment classification matrix used to predict the sentiment label at a node. Backpropagation on the regular bilinear merging node follows a standard derivative computation in a regular feed-forward network, which we skip here.
Explicitly gated merging. In this model, in addition to $W_m$, we further learn two weight matrices $W_{gi}$ and $W_{ge}$, as introduced in Formulas 3 and 4 above. Consider Figure 2 and let $\delta^{m_3}$ denote the error messages passed down to node $m_3$. The error messages are passed back to $i_3$ directly through the Hadamard product and also through the gate node $g_3$. The former, denoted as $\delta^{i_3,dir}$, is calculated with:
$$\delta^{i_3,dir} = \left(W_m^{T}\delta^{m_3} \otimes g_3\right)[1:d] \tag{7}$$
where $g_3$ is calculated with Formula 3 above in the forward process; $[1:d]$ means taking the first $d$ elements of the vector yielded by the Hadamard product; the remaining $[d+1:d+l]$ elements of the Hadamard product are discarded, as we do not update $e_3$, which is given as our prior knowledge. The error messages passed down to the gate vector $g_3$ are computed with:
$$\delta^{g_3} = \left(W_m^{T}\delta^{m_3} \otimes \begin{bmatrix} i_3 \\ e_3 \end{bmatrix}\right) \otimes \sigma'\left(\begin{bmatrix} W_{ge}\, e_3 \\ W_{gi}\, i_3 \end{bmatrix}\right) \tag{8}$$
where $\sigma'(.)$ is the element-wise derivative of the logistic function, which can be calculated using only $\sigma(.)$, as $\sigma(.)(1 - \sigma(.))$. The derivative of $W_{gi}$ can be calculated with:
$$\frac{\partial E}{\partial W_{gi}} = \delta^{g_3}[d+1:d+l]\; i_3^{T} \tag{9}$$
Similarly, partial derivatives over $W_{ge}$ can be calculated. These values are summed into the total derivatives of $W_{gi}$ and $W_{ge}$, respectively. With these notations, the error messages passed down to $i_3$ through the gate can then be computed with:
$$\delta^{i_3,gate} = W_{gi}^{T}\, \delta^{g_3}[d+1:d+l] \tag{10}$$
and the total error message to node $i_3$ is then:
$$\delta^{i_3} = \left(\delta^{i_3,dir} + \delta^{i_3,gate}\right) \otimes f'(i_3) + \delta^{i_3,local} \tag{11}$$
where $\delta^{i_3,local}$ is the local error message from the sentiment prediction errors at the node $i_3$ itself. The total error message for $i_3$ is in turn passed down through the regular RNTN tensor to the lower levels. $f'(.)$ is the element-wise derivative of the tanh function.
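The two error paths to $i_3$ described above (the direct path and the path through the gate) can be checked numerically. The sketch below assumes the gate is the sigmoid of the stacked vector [W_ge e3; W_gi i3] and the merge is tanh(W_m (g3 ⊗ [i3; e3])), a reading inferred from the stated matrix shapes. It also assumes that the error at $m_3$ is taken with respect to its pre-activation, and it uses a toy squared-error loss in place of the real prediction error. All names and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W_m, W_ge, W_gi, i_vec, e_vec):
    x = np.concatenate([i_vec, e_vec])
    g = sigmoid(np.concatenate([W_ge @ e_vec, W_gi @ i_vec]))
    m = np.tanh(W_m @ (g * x))
    return m, g, x

def backward(W_m, W_gi, delta_m, g, x, d):
    """delta_m is the error at m w.r.t. its pre-activation (an assumed convention)."""
    back = W_m.T @ delta_m                      # error reaching the gated input g * x
    delta_dir = (back * g)[:d]                  # direct path to i; the e part is discarded
    delta_g = back * x * g * (1.0 - g)          # error at the gate pre-activation
    grad_W_ge = np.outer(delta_g[:d], x[d:])    # gate over i is driven by e
    grad_W_gi = np.outer(delta_g[d:], x[:d])    # gate over e is driven by i
    delta_gate = W_gi.T @ delta_g[d:]           # path to i through the gate
    return delta_dir + delta_gate, grad_W_ge, grad_W_gi

def loss(W_m, W_ge, W_gi, i_vec, e_vec, target):
    m, _, _ = forward(W_m, W_ge, W_gi, i_vec, e_vec)
    return 0.5 * np.sum((m - target) ** 2)      # toy loss standing in for the prediction error

rng = np.random.default_rng(4)
d, l = 3, 5
W_m = rng.normal(size=(d, d + l))
W_ge, W_gi = rng.normal(size=(d, l)), rng.normal(size=(l, d))
i3, e3, target = rng.normal(size=d), np.eye(l)[1], rng.normal(size=d)

m, g, x = forward(W_m, W_ge, W_gi, i3, e3)
delta_m = (m - target) * (1.0 - m ** 2)         # includes the tanh derivative at m
grad_i, grad_W_ge, grad_W_gi = backward(W_m, W_gi, delta_m, g, x, d)

# Central finite differences with respect to each element of i3.
eps = 1e-6
numeric = np.zeros(d)
for k in range(d):
    step = np.zeros(d)
    step[k] = eps
    numeric[k] = (loss(W_m, W_ge, W_gi, i3 + step, e3, target)
                  - loss(W_m, W_ge, W_gi, i3 - step, e3, target)) / (2 * eps)
```

Here `grad_i` is the gradient with respect to the values of $i_3$. Multiplying it element-wise by the tanh derivative at $i_3$ and adding the local prediction error would give the message that is passed further down the tree.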
Confined-tensor-based merging. In confined-tensor-based merging, the error messages passed to the two children $i_3$ and $e_3$ are computed with:
$$\delta^{i_3,e_3} = W_m^{T}\delta^{m_3} + S \tag{12}$$
where
$$S = \sum_{k=1}^{d} \delta^{m_3}_{k}\left(V_m^{[k]} + \left(V_m^{[k]}\right)^{T}\right)\begin{bmatrix} i_3 \\ e_3 \end{bmatrix} \tag{13}$$
and the error messages to $i_3$ are the first $d$ elements of $\delta^{i_3,e_3}$. The remaining elements of $\delta^{i_3,e_3}$ are discarded; as mentioned above, we do not update $e_3$, as it is given as the prior knowledge. We skip the derivative for $W_m$. The derivative of each slice $k$ ($k = 1, \ldots, d$) of the tensor $V_m$ is calculated with:
$$\frac{\partial E}{\partial V_m^{[k]}} = \delta^{m_3}_{k}\begin{bmatrix} i_3 \\ e_3 \end{bmatrix}\begin{bmatrix} i_3 \\ e_3 \end{bmatrix}^{T} \tag{14}$$
Again, the full derivatives for $V_m$ and $W_m$ are the sums of their derivatives over the trees. After the error message passed from $m_3$ to $i_3$ is obtained, it is summed with the local error message from the sentiment prediction errors at the node $i_3$ itself to obtain the total error message for $i_3$, which is in turn used to calculate the error messages passed down as well as the derivatives in the lower-level tree.
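The error message to the children and the per-slice tensor derivative can likewise be checked against finite differences. The sketch below assumes the RNTN-style form from Socher et al. (2013), in which the message to the stacked children is the matrix term plus a sum over slices of delta_k (V[k] + V[k]^T) applied to [i; e]. It also assumes that the error at $m_3$ is taken with respect to its pre-activation and that gradients are masked to the non-zero blocks. The toy loss, names, and sizes are hypothetical.

```python
import numpy as np

def confined_mask(d, l):
    mask = np.zeros((d + l, d + l, d))
    mask[:d, d:, :] = 1.0
    mask[d:, :d, :] = 1.0
    return mask

def forward(V_m, W_m, x):
    return np.tanh(np.einsum("i,ijk,j->k", x, V_m, x) + W_m @ x)

def backward(V_m, W_m, delta_m, x, mask):
    """delta_m is the error at m w.r.t. its pre-activation (an assumed convention)."""
    # s_p = sum_k delta_k * ((V[k] + V[k]^T) x)_p
    s = (np.einsum("k,pjk,j->p", delta_m, V_m, x)
         + np.einsum("k,ipk,i->p", delta_m, V_m, x))
    delta_down = W_m.T @ delta_m + s                         # error for [i; e]
    grad_V = np.einsum("i,j,k->ijk", x, x, delta_m) * mask   # slice k: delta_k * x x^T
    return delta_down, grad_V

def loss(V_m, W_m, x, target):
    return 0.5 * np.sum((forward(V_m, W_m, x) - target) ** 2)  # toy loss

rng = np.random.default_rng(5)
d, l = 3, 5
mask = confined_mask(d, l)
V_m = rng.normal(scale=0.3, size=(d + l, d + l, d)) * mask
W_m = rng.normal(scale=0.3, size=(d, d + l))
x = np.concatenate([rng.normal(size=d), np.eye(l)[2]])      # [i3; e3] with a one-hot prior
target = rng.normal(size=d)

m = forward(V_m, W_m, x)
delta_m = (m - target) * (1.0 - m ** 2)
delta_down, grad_V = backward(V_m, W_m, delta_m, x, mask)
delta_i = delta_down[:d]                                     # the e part is discarded
```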

Data
We use the Stanford Sentiment Treebank (Socher et al., 2013) in our experiments. The data contain about 11,800 sentences from the movie reviews that were originally collected by Pang and Lee (2005).
The sentences were parsed with the Stanford parser (Klein and Manning, 2003). Phrases at all the tree nodes were manually annotated with sentiment values. We use the same split of the training and test data as in (Socher et al., 2013) to predict the sentiment categories of the roots (sentences) and the phrases, and use the same evaluation metric, classification accuracy, to measure the performances.

Obtaining non-compositional sentiment
In our experiments, we explore in sentiment composition the effect of two different types of non-compositional sentiment: (1) sentiment of ngrams automatically learned from an external, much larger corpus, and (2) sentiment of ngrams assigned by human annotators.
Following the method proposed in (Mohammad et al., 2013b), we learn sentimental ngrams from tweets. The unsupervised approach utilizes hashtags, which can be regarded as conveying freely available (but noisy) human annotation of sentiment. More specifically, certain words in tweets are specially marked with the hash character (#) to indicate the topic, sentiment polarity, or emotions such as joy, sadness, anger, and surprise. With enough data, such artificial annotation can be used to learn the sentiment of ngrams by their likelihood of co-occurring with such hashtagged words.
More specifically, a collection of 78 seed hashtags closely related to positive and negative such as #good, #excellent, #bad, and #terrible were used (32 positive and 36 negative). These terms were chosen from entries for positive and negative in Roget's Thesaurus. A set of 775,000 tweets that contain at least a positive hashtag or a negative hashtag were used as the learning corpus. A tweet was considered positive if it had one of the 32 positive seed hashtags, and negative if it had one of the 36 negative seed hashtags. The association score for an ngram $w$ was calculated from these pseudo-labeled tweets as follows:
$$\mathrm{score}(w) = \mathrm{PMI}(w, \mathrm{positive}) - \mathrm{PMI}(w, \mathrm{negative}) \tag{15}$$
where PMI stands for pointwise mutual information, and the two terms in the formula calculate the PMI between the target ngram and the pseudo-labeled positive tweets as well as that between the ngram and the negative tweets, respectively. Accordingly, a positive score indicates association with positive sentiment, whereas a negative score indicates association with negative sentiment.
We use in our experiments the bigrams and trigrams learned from the dataset that occur more than 5 times. We assign each of these ngrams to one of 5 bins according to its sentiment score obtained with Formula 15: $(-\infty, -2]$, $(-2, -1]$, $(-1, 1)$, $[1, 2)$, and $[2, +\infty)$. Each ngram is then given a one-hot vector, indicating the polarity and strength of its sentiment. For example, a bigram with a score of -1.5 will be assigned the 5-dimensional vector [0, 1, 0, 0, 0], indicating a weak negative. Note that PESN can also take in other forms of sentiment embeddings, such as those learned in (Tang et al., 2014).
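The scoring and binning steps can be sketched in plain Python. The PMI difference is written in terms of raw counts, in which the marginal count of the ngram cancels. The use of log base 2, the absence of smoothing for zero counts, and the function names are assumptions made for illustration.

```python
import math

def sentiment_score(count_pos, count_neg, total_pos, total_neg):
    """score(w) = PMI(w, positive) - PMI(w, negative), using log base 2.

    count_pos / count_neg: occurrences of the ngram in positive / negative tweets.
    total_pos / total_neg: total ngram counts in the positive / negative tweets.
    The marginal count of w cancels in the difference.
    All counts are assumed to be positive; smoothing is left out.
    """
    return math.log2((count_pos * total_neg) / (count_neg * total_pos))

def score_to_one_hot(score):
    """Map a score to the 5 bins (-inf,-2], (-2,-1], (-1,1), [1,2), [2,+inf)."""
    if score <= -2:
        index = 0
    elif score <= -1:
        index = 1
    elif score < 1:
        index = 2
    elif score < 2:
        index = 3
    else:
        index = 4
    return [1 if k == index else 0 for k in range(5)]

print(score_to_one_hot(-1.5))                             # [0, 1, 0, 0, 0]
print(score_to_one_hot(sentiment_score(8, 2, 100, 100)))  # score is 2.0 -> [0, 0, 0, 0, 1]
```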
In addition, the Stanford Sentiment Treebank contains manually annotated sentiment for each individual phrase in a parse tree, so we use this annotation rather than other manual lexicons, assuming that it fits the corpus itself best. Specifically, we use the bigram and trigram annotation in the treebank. Although longer ngrams are much sparser and probably less useful in general, one may learn sentiment for multi-word expressions of a larger length, which we leave as future work.

Results
Overall prediction performance. Table 1 shows the accuracies of different models on the Stanford Sentiment Treebank. We evaluate the models on 5-category sentiment prediction at both the sentence (root) level and at all nodes (including roots). The results reported in Table 1 are obtained by training the RNTN models with the default parameters and running the training from 5 different random initializations; we report the best results we observed. The rows in the table marked with auto are models using the automatically learned ngrams, and those marked with manu use manually annotated sentiment for bigrams and trigrams. Note that the non-compositional sentiment of a node is only used to predict the sentiment of phrases above it in the tree. For example, in Figure 1 discussed earlier, the effect of $e_1$ and $e_2$ will be used to predict the sentiment of $i_3$ and other nodes $i$ above, but not that of $i_1$ and $i_2$ themselves, avoiding the concern of using the annotation of a tree node to predict the sentiment of itself.
The models in general benefit from incorporating the non-compositional knowledge. The numbers in the bold font are the best performance achieved on the two tasks. While using the simple regular bilinear merging shows some gains, the more complicated models achieve further improvement.
Above we have seen the general performance of the models. Below, we take a closer look at the prediction errors at different depths of the sentiment treebank. The depth here is defined as the longest distance between a tree node and its descendant leaves. In Figure 3, the x-axis corresponds to different depths and the y-axis is the accuracy. The figure was drawn with the RNTN and the model (7) in Table 1, so as to study the compositional property in the ideal situation where the lexicon has full coverage of bigrams and trigrams. The figure shows that using the confined tensor to combine holistic sentiment information outperforms the original RNTN model that does not consider this, starting from depth 3, showing the benefit of using holistic bigram sentiment. The improvement increases at depth 4 (indicating the benefit of using trigram sentiment), and is then propagated to the higher levels of the tree. As discussed above, we only use the non-compositional sentiment of a node to predict the sentiment of the phrases above it in the tree but not the node itself, and the system still needs to balance which source it trusts more, by optimizing the overall objective.
Although the empirical improvement may depend on the percentage of non-compositional instances in a data set or the sentiment that needs to be learned holistically, we present here the first effort, to our knowledge, on studying the concern of integrating compositional and non-compositional sentiment in the semantic composition process.
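The depth measure used for this breakdown can be computed recursively. In the sketch below a tree is a nested tuple of strings, which is a hypothetical representation. Leaves are counted as depth 1, which is an assumption chosen so that a bigram node has depth 2 and the node directly above it has depth 3, matching the reading of the discussion above.

```python
def depth(node):
    """Longest distance from a node to its descendant leaves.

    A leaf is a string and an internal node is a tuple of children. Leaves are
    counted as depth 1 here (an assumption), so a bigram node has depth 2.
    """
    if isinstance(node, str):
        return 1
    return 1 + max(depth(child) for child in node)

tree = ("not", ("a", ("must", "try")))   # hypothetical binarized parse
print(depth(("must", "try")))            # 2
print(depth(tree))                       # 4
```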

Conclusions and future work
This paper proposes models for integrating compositional and non-compositional sentiment in the process of sentiment composition. To achieve this, we enable each composition operation to choose and merge information from these two types of sources. We propose to implement such models within neural network frameworks with structures (Socher et al., 2013), in which the merging parameters can be optimized in a principled way, to minimize a well-defined objective. We conduct experiments on the Stanford Sentiment Treebank and show that the proposed models achieve better results than the model that does not consider this property.
Although the empirical improvement may depend on the percentage of non-compositional instances in a data set or the sentiment that needs to be learned holistically, we present here the first effort, to our knowledge, on studying the basic concern of integrating compositional and non-compositional sentiment in composition. While we focus on sentiment in this paper, investigating compositional and non-compositional semantics for general semantic composition with neural networks is interesting to us as an immediate future problem, as such models provide a principled way to optimize the overall objective over the sentence structures when we consider both compositional and non-compositional semantics.