Learning Tag Embeddings and Tag-specific Composition Functions in Recursive Neural Network

Recursive neural network is one of the most successful deep learning models for natural language processing due to the compositional nature of text. The model recursively composes the vector of a parent phrase from those of child words or phrases, with a key component named the composition function. Although a variety of composition functions have been proposed, syntactic information has not been fully encoded in the composition process. We propose two models: Tag Guided RNN (TG-RNN for short), which chooses a composition function according to the part-of-speech tag of a phrase, and Tag Embedded RNN/RNTN (TE-RNN/RNTN for short), which learns tag embeddings and then combines tag and word embeddings together. On fine-grained sentiment classification, experimental results show that the proposed models obtain remarkable improvements: TG-RNN/TE-RNN obtain remarkable improvements over baselines, TE-RNTN obtains the second best result among all the top performing models, and all the proposed models have many fewer parameters and much lower complexity than their competitors.


Introduction
Among a variety of deep learning models for natural language processing, Recursive Neural Network (RNN) may be one of the most popular models. Thanks to the compositional nature of natural text, a recursive neural network utilizes the recursive structure of the input such as a phrase or sentence, and has been shown to be very effective for many natural language processing tasks including semantic relationship classification (Socher et al., 2012), syntactic parsing (Socher et al., 2013a), sentiment analysis (Socher et al., 2013b), and machine translation (Li et al., 2013).
The key component of RNN and its variants is the composition function: how to compose the vector representation of a longer text from the vectors of its child words or phrases. For instance, as shown in Figure 2, the vector of the node 'very interesting' is composed from the vectors of the node 'very' and the node 'interesting'; similarly, the vector of the node 'is very interesting' is composed from the vector of the phrase node 'very interesting' and that of the word node 'is'. It is worth mentioning that the composition process is conducted along the syntactic structure of the text, making RNN more interpretable than other deep learning models.
There have been various attempts to design the composition function in RNN (or related models). In RNN (Socher et al., 2011), a global matrix is used to linearly combine the elements of the vectors. In RNTN (Socher et al., 2013b), a global tensor is used to compute tensor products over dimensions to favor the association between different elements of the vectors. Sometimes it is challenging to find a single function to model the composition process. As an alternative, multiple composition functions can be used. For instance, in MV-RNN (Socher et al., 2012), different matrices are designed for different words, though the model suffers from too many parameters. In AdaMC RNN/RNTN (Dong et al., 2014), a fixed number of composition functions are linearly combined and the weight for each function is adaptively learned.

In spite of the success of RNN and its variants, the syntactic knowledge of the text is not yet fully employed in these models. Two ideas are motivated by the example shown in Figure 2. First, the composition function for the noun phrase 'the movie/NP' should be different from that for the adjective phrase 'very interesting/ADJP' since the two phrases are quite different syntactically. More specifically for sentiment analysis, a noun phrase is much less likely to express sentiment than an adjective phrase. Two notable works are worth mentioning here: (Socher et al., 2013a) proposed to combine the parsing and composition processes, but the purpose is parsing; (Hermann and Blunsom, 2013) designed composition functions according to the combinatory rules and categories in CCG grammar, but only marginal improvement over Naive Bayes was reported. Our proposed model, Tag Guided RNN (TG-RNN), is designed to use the syntactic tag of the parent phrase to guide the composition process over the child nodes. As an example, we design one function for composing noun phrases (NP) and another for adjective phrases (ADJP). This simple strategy obtains remarkable improvements over strong baselines.

Second, when composing the adjective phrase 'very interesting/ADJP' from the left node 'very/RB' and the right node 'interesting/JJ', the right node is obviously more important than the left one. Furthermore, the right node 'interesting/JJ' apparently contributes more to sentiment expression. To address this issue, we propose Tag Embedded RNN/RNTN (TE-RNN/RNTN), which learns an embedding vector for each word/phrase tag and concatenates the tag vector with the word/phrase vector as input to the composition function. For instance, we have tag vectors for DT, NN, RB, JJ, ADJP, NP, etc., and the tag vectors are then used in composing the parent's vector. The proposed TE-RNTN obtains the second best result among all the top performing models but with many fewer parameters and much lower complexity. To the best of our knowledge, this is the first time that tag embedding is proposed.
To summarize, the contributions of our work are as follows: • We propose tag-guided composition functions in recursive neural network (TG-RNN). Tag-guided RNN allocates a composition function to a phrase according to the part-of-speech tag of the phrase.
• We propose to learn embedding vectors for part-of-speech tags of words/phrases, and integrate the tag embeddings into RNN and RNTN respectively. The two models, TE-RNN and TE-RNTN, can leverage the syntactic information of child nodes when generating the vectors of parent nodes.
• The proposed models are efficient and effective. The scale of the parameters is well controlled. Experimental results on the Stanford Sentiment Treebank corpus show the effectiveness of the models. TE-RNTN obtains the second best result among all publicly reported approaches, but with many fewer parameters and much lower complexity.
The rest of the paper is structured as follows: in Section 2, we survey related work. In Section 3, we introduce the traditional recursive neural network as background. We present our ideas in Section 4. The experiments are introduced in Section 5. We summarize the work in Section 6.

Related Work
Different kinds of representations are used in sentiment analysis. Traditionally, bag-of-words representations are used for sentiment analysis (Pang and Lee, 2008). To exploit the relationship between words, word co-occurrence (Turney et al., 2010) and syntactic contexts (Padó and Lapata, 2007) are considered. In order to distinguish antonyms with similar contexts, neural word vectors (Bengio et al., 2003) were proposed, which can be learned in an unsupervised manner. Word2vec (Mikolov et al., 2013a) introduces a simpler network structure that makes computation more efficient and makes training on billions of samples feasible.
Semantic composition deals with representing a longer text from its shorter components, and has been extensively studied in recent years. In many previous works, a phrase vector is usually obtained by averaging (Landauer and Dumais, 1997), addition, element-wise multiplication (Mitchell and Lapata, 2008), or the tensor product (Smolensky, 1990) of word vectors. In addition to vector representations, matrices can also be used to represent phrases, and the composition process can then be done through matrix multiplication (Rudolph and Giesbrecht, 2010; Yessenalina and Cardie, 2011).
Recursive neural models utilize the recursive structure (usually a parse tree) of a phrase or sentence for semantic composition. In Recursive Neural Network (Socher et al., 2011), the tree with the least reconstruction error is built and the vectors of interior nodes are composed with a global matrix. Matrix-Vector Recursive Neural Network (MV-RNN) (Socher et al., 2012) assigns a matrix to every word so that it can capture the relationship between two children. In Recursive Neural Tensor Networks (RNTN) (Socher et al., 2013b), the composition process is performed on a parse tree in which every node is annotated with fine-grained sentiment labels, and a global tensor is used for composition. Adaptive Multi-Compositionality (Dong et al., 2014) uses multiple weighted composition matrices instead of sharing a single matrix.
The employment of syntactic information in RNN is still in its infancy. In (Socher et al., 2013a), the part-of-speech tags of child nodes are considered in combining the composition and parsing processes; however, the main purpose is better parsing with RNN, not sentiment analysis. In (Hermann and Blunsom, 2013), the authors designed composition functions according to the combinatory rules and categories in CCG grammar, but only marginal improvement over Naive Bayes was reported. Unlike (Hermann and Blunsom, 2013), our TG-RNN obtains remarkable improvements over strong baselines, and we are the first to propose tag embedded RNTN, which obtains the second best result among all reported approaches.

Background: Recursive Neural Models
In recursive neural models, the vector of a longer text (e.g., a sentence) is composed from those of its shorter components (e.g., words or phrases). To compose a sentence vector from word/phrase vectors, a binary parse tree has to be built with a parser. The leaf nodes represent words and the interior nodes represent phrases. Vectors of interior nodes are computed recursively by composing the vectors of their child nodes. In particular, the root vector is regarded as the sentence representation. The composition process is shown in Figure 1.
Formally, the vector of node $i$ is computed from its two children as

$$v_i = f\big(g(v_i^l, v_i^r)\big),$$

where $v_i^l$ and $v_i^r$ are the child vectors, $g$ is a composition function, and $f$ is a nonlinearity function, usually tanh. Different recursive neural models mainly differ in the composition function. For example, the composition function for RNN is

$$g(v_i^l, v_i^r) = W \begin{bmatrix} v_i^l \\ v_i^r \end{bmatrix} + b,$$

where $W \in \mathbb{R}^{d \times 2d}$ is a composition matrix and $b$ is a bias vector. The composition function for RNTN is

$$g(v_i^l, v_i^r) = \begin{bmatrix} v_i^l \\ v_i^r \end{bmatrix}^{\top} T^{[1:d]} \begin{bmatrix} v_i^l \\ v_i^r \end{bmatrix} + W \begin{bmatrix} v_i^l \\ v_i^r \end{bmatrix} + b,$$

where $W$ and $b$ are defined as in the previous model and $T^{[1:d]} \in \mathbb{R}^{2d \times 2d \times d}$ is the tensor that defines multiple bilinear forms. The node vectors are used as feature inputs to a softmax classifier; the posterior probability over class labels for a node vector $v_i$ is given by

$$\hat{y}_i = \mathrm{softmax}(W_s v_i).$$

The parameters in these models include the word table $L$, the composition matrix $W$ in RNN, $W$ and $T^{[1:d]}$ in RNTN, and the classification matrix $W_s$ of the softmax classifier.
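For concreteness, the following is a minimal NumPy sketch of the two composition functions above; the dimension d, the random initialization, and the variable names are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

d = 25  # dimension of word/phrase vectors (illustrative)

# Illustrative parameters; in the real models these are learned.
W = np.random.uniform(-0.01, 0.01, (d, 2 * d))          # RNN composition matrix
b = np.zeros(d)                                          # bias vector
T = np.random.uniform(-0.01, 0.01, (2 * d, 2 * d, d))    # RNTN tensor

def compose_rnn(v_left, v_right):
    """RNN composition: v_i = tanh(W [v_l; v_r] + b)."""
    c = np.concatenate([v_left, v_right])
    return np.tanh(W @ c + b)

def compose_rntn(v_left, v_right):
    """RNTN composition adds a bilinear (tensor) term to the RNN one."""
    c = np.concatenate([v_left, v_right])
    bilinear = np.einsum('i,ijk,j->k', c, T, c)  # c^T T^{[1:d]} c
    return np.tanh(bilinear + W @ c + b)

# Example: compose 'very interesting' from 'very' and 'interesting'.
v_very, v_interesting = np.random.randn(d), np.random.randn(d)
v_phrase = compose_rntn(v_very, v_interesting)
```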

Incorporating Syntactic Knowledge into Recursive Neural Model
The central idea of this paper is inspired by the fact that words/phrases with different part-of-speech tags play different roles in semantic composition. As discussed in the introduction, a noun phrase (e.g., a movie/NP) may be composed differently from a verb phrase (e.g., love movie/VP). Furthermore, when composing the phrase a movie/NP, the two child words, a/DT and movie/NN, may play different roles in the composition process. Unfortunately, previous RNN models neglect such syntactic information, though they do employ the parsing structure of a sentence. We propose two approaches to improve the composition process by leveraging the tags of parent and child nodes. One approach is to use different composition matrices for parent nodes with different tags, so that the composition process is guided by the phrase type; for example, the matrix for 'NP' is different from that for 'VP'. The other approach is to introduce 'tag embedding' for words and phrases, for example, to learn tag vectors for 'NP', 'VP', 'ADJP', etc., and then integrate the tag vectors with the word/phrase vectors during the composition process.

Tag Guided RNN (TG-RNN)
We propose Tag Guided RNN (TG-RNN) to respect the tag of a parent phrase during the composition process. The model chooses a composition function according to the part-of-speech tag of the phrase. For example, 'the movie' has tag NP and 'very interesting' has tag ADJP, so the two phrases have different composition matrices.
More formally, we parameterize the composition function g by the phrase tag of the parent node. The composition function becomes

$$v_i = f\big(g_{t_i}(v_i^l, v_i^r)\big) = f\left(W_{t_i} \begin{bmatrix} v_i^l \\ v_i^r \end{bmatrix} + b_{t_i}\right),$$

where $t_i$ is the phrase tag of node $i$, and $W_{t_i}$ and $b_{t_i}$ are the parameters of function $g_{t_i}$, defined as in Equation 2. In other words, phrase nodes with different tags have their own composition functions such as $g_{NP}$, $g_{VP}$, and so on. There are $k$ composition functions in total in this model, where $k$ is the number of phrase tags. When composing the child vectors, a function is chosen from the function pool according to the tag of the parent node.
The process is depicted in Figure 3, where 'very interesting' is composed with the highlighted $g_{ADJP}$ and 'is very interesting' with $g_{VP}$. We term this model Tag Guided RNN, TG-RNN for short.
However, some tags occur rarely in the corpus, and it is hard and of little use to train composition functions for those infrequent tags. We therefore simply choose the top k frequent tags and train k composition functions, while a common composition function is shared across phrases with all infrequent tags. The value of k depends on the size of the training set and the frequency of each tag. In particular, when k = 0, the model reduces to the traditional RNN.
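A minimal NumPy sketch of the tag-guided selection follows; the tag inventory, dimensions, and the fallback matrix are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

d = 25
top_k_tags = ['NP', 'VP', 'ADJP', 'PP', 'S', 'ADVP']  # illustrative top-k frequent phrase tags

# One composition matrix (and bias) per frequent tag, plus a shared fallback.
W_tag = {t: np.random.uniform(-0.01, 0.01, (d, 2 * d)) for t in top_k_tags}
b_tag = {t: np.zeros(d) for t in top_k_tags}
W_shared = np.random.uniform(-0.01, 0.01, (d, 2 * d))
b_shared = np.zeros(d)

def compose_tg_rnn(parent_tag, v_left, v_right):
    """Pick the composition function by the parent phrase tag (TG-RNN)."""
    W = W_tag.get(parent_tag, W_shared)   # infrequent tags fall back to the shared matrix
    b = b_tag.get(parent_tag, b_shared)
    c = np.concatenate([v_left, v_right])
    return np.tanh(W @ c + b)

# 'very interesting/ADJP' uses g_ADJP; an unseen tag uses the shared function.
v = compose_tg_rnn('ADJP', np.random.randn(d), np.random.randn(d))
```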

Tag Embedded RNN and RNTN (TE-RNN/RNTN)
In this section, we propose tag embedded RNN (TE-RNN) and tag embedded RNTN (TE-RNTN) to respect the part-of-speech tags of child nodes during composition. As mentioned above, the tags of parent nodes have an impact on composition. However, some phrases with the same tag should still be composed in different ways. For example, 'is interesting' and 'like swimming' have the same tag VP, but it is not reasonable to compose the two phrases with the previous model because the part-of-speech tags of their children are quite different. If we use different composition functions for children with different tags, as in TG-RNN, the number of tag pairs will amount to as many as k×k, which makes the models infeasible due to too many parameters.
In order to capture the compositional effects of the tags of child nodes, an embedding $e_t \in \mathbb{R}^{d_e}$ is created for every tag $t$, where $d_e$ is the dimension of the tag vector. The tag vector and phrase vector are concatenated during composition, as illustrated in Figure 4.
Formally, the phrase vector is composed by the function

$$v_i = f\left(W \begin{bmatrix} v_i^l \\ e_{t_i^l} \\ v_i^r \\ e_{t_i^r} \end{bmatrix} + b\right),$$

where $t_i^l$ and $t_i^r$ are the tags of the left and right child nodes respectively, $e_{t_i^l}$ and $e_{t_i^r}$ are their tag vectors, and $W \in \mathbb{R}^{d \times (2d_e + 2d)}$ is the composition matrix. We term this model Tag embedded RNN, TE-RNN for short. Similarly, this idea can be applied to the Recursive Neural Tensor Network (Socher et al., 2013b). In RNTN, the tag vectors and the phrase vectors can be interwoven through a tensor: the concatenated phrase and tag vectors are multiplied by the composition tensor. The composition function becomes

$$v_i = f\left(\begin{bmatrix} v_i^l \\ e_{t_i^l} \\ v_i^r \\ e_{t_i^r} \end{bmatrix}^{\top} T^{[1:d]} \begin{bmatrix} v_i^l \\ e_{t_i^l} \\ v_i^r \\ e_{t_i^r} \end{bmatrix} + W \begin{bmatrix} v_i^l \\ e_{t_i^l} \\ v_i^r \\ e_{t_i^r} \end{bmatrix} + b\right),$$

where the variables are analogous to those defined in Equation 3 and Equation 7, with the tensor now of size $T^{[1:d]} \in \mathbb{R}^{(2d_e+2d) \times (2d_e+2d) \times d}$. We term this model Tag embedded RNTN, TE-RNTN for short.
The phrase vector and its tag vector are used together as input to a softmax classifier, giving the posterior probability over labels via

$$\hat{y}_i = \mathrm{softmax}\left(W_s \begin{bmatrix} v_i \\ e_{t_i} \end{bmatrix}\right),$$

where $W_s \in \mathbb{R}^{N \times (d_e + d)}$ is the classification matrix and $N$ is the number of sentiment classes.
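The following NumPy sketch illustrates the TE-RNN composition and classification steps above; the tag set, dimensions, and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

d, d_e, n_classes = 25, 8, 5  # phrase-vector, tag-embedding, and label dimensions (illustrative)
tags = ['DT', 'NN', 'RB', 'JJ', 'NP', 'ADJP', 'VP']

E = {t: np.random.uniform(-0.01, 0.01, d_e) for t in tags}    # tag embedding table
W = np.random.uniform(-0.01, 0.01, (d, 2 * d + 2 * d_e))       # TE-RNN composition matrix
b = np.zeros(d)
W_s = np.random.uniform(-0.01, 0.01, (n_classes, d + d_e))     # softmax classifier

def compose_te_rnn(v_left, tag_left, v_right, tag_right):
    """Concatenate child vectors with their tag embeddings, then compose."""
    c = np.concatenate([v_left, E[tag_left], v_right, E[tag_right]])
    return np.tanh(W @ c + b)

def classify(v, tag):
    """Softmax over sentiment labels from the node vector and its tag embedding."""
    z = W_s @ np.concatenate([v, E[tag]])
    z = z - z.max()  # numerical stability
    return np.exp(z) / np.exp(z).sum()

# 'very/RB' + 'interesting/JJ' -> 'very interesting/ADJP'
v_phrase = compose_te_rnn(np.random.randn(d), 'RB', np.random.randn(d), 'JJ')
probs = classify(v_phrase, 'ADJP')
```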

Model Training
Let $y_i$ be the target distribution for node $i$ and $\hat{y}_i$ be the predicted sentiment distribution. Our goal is to minimize the cross-entropy error between $y_i$ and $\hat{y}_i$ over all nodes. The loss function is defined as

$$E(\theta) = -\sum_{i} \sum_{j} y_i^j \log \hat{y}_i^j + \lambda \lVert \theta \rVert_2^2,$$

where $j$ is the label index, $\lambda$ is the $L_2$-regularization weight, and $\theta$ is the parameter set. As in RNN, the parameters of our models include the word vector table $L$, the composition matrix $W$, and the sentiment classification matrix $W_s$. In addition, our models have some extra parameters, as discussed below.

TG-RNN: There are $k$ composition matrices for the top $k$ frequent tags, defined as $W_t \in \mathbb{R}^{k \times d \times 2d}$. The original composition matrix $W$ is used for all infrequent tags. As a result, the parameter set of TG-RNN is $\theta = (L, W, W_t, W_s)$.
TE-RNN: The parameters include the tag embedding table $E$, which contains the embeddings of the part-of-speech tags of words and phrases. Moreover, the composition matrix becomes $W \in \mathbb{R}^{d \times (2d + 2d_e)}$ and the softmax classifier becomes $W_s \in \mathbb{R}^{N \times (d_e + d)}$. The parameter set of TE-RNN is $\theta = (L, E, W, W_s)$.
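Below is a minimal sketch of the training objective described above, in NumPy; the per-node iteration and the flat list of parameter arrays are simplifying assumptions rather than the authors' implementation.

```python
import numpy as np

def cross_entropy_loss(targets, predictions, params, l2_weight=1e-4):
    """Sum of per-node cross-entropy plus L2 regularization over all parameters.

    targets, predictions: lists of length-5 probability vectors, one per tree node.
    params: list of parameter arrays (word table, composition matrices, classifier, ...).
    """
    ce = 0.0
    for y, y_hat in zip(targets, predictions):
        ce -= np.sum(y * np.log(y_hat + 1e-12))   # cross-entropy for one node
    l2 = sum(np.sum(p ** 2) for p in params)      # L2 term over the parameter set
    return ce + l2_weight * l2
```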

Dataset and Experiment Setting
We evaluate our models on the Stanford Sentiment Treebank, which contains fully labeled parse trees. It is built upon 10,662 reviews, and each sentence has a sentiment label on every node of its parse tree. The sentiment label set is {0,1,2,3,4}, where the numbers mean very negative, negative, neutral, positive, and very positive, respectively. We use the standard split (train: 8,544, dev: 1,101, test: 2,210) of the corpus in our experiments. In addition, we add the part-of-speech tag for each leaf node and the phrase-type tag for each interior node using the latest version of the Stanford Parser. Because the newer parser generates trees different from those provided in the dataset, 74/11/11 reviews in the train/dev/test sets are discarded. After removing these broken reviews, our dataset contains 10,566 reviews (train: 8,470, dev: 1,090, test: 2,199).
The word vectors were pre-trained on an unlabeled corpus (about 100,000 movie reviews) with word2vec (Mikolov et al., 2013b) and used as initial values; the other vectors are initialized by sampling from a uniform distribution U(−ϵ, ϵ), where ϵ is 0.01 in our experiments. The dimension of the word vectors is 25 for the RNN models and 20 for the RNTN models. Tanh is chosen as the nonlinearity function, and the output of node i is normalized after composition so that the resulting vector has a limited norm. The backpropagation algorithm (Rumelhart et al., 1986) is used to compute gradients, and we use mini-batch SGD with momentum as the optimization method, implemented with Theano (Bastien et al., 2012). We train all our models with a batch size of 30 examples, momentum of 0.9, L2-regularization weight of 0.0001, and a constant learning rate of 0.005.
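As a concrete illustration of the optimizer settings above, here is a minimal sketch of a mini-batch SGD step with momentum in NumPy; the parameter and gradient handling is simplified and not tied to the Theano implementation used in the paper.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.005, momentum=0.9, l2=1e-4):
    """One mini-batch update: velocities accumulate gradients (plus L2), then parameters move.

    params, grads, velocities: lists of NumPy arrays of matching shapes, updated in place.
    """
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v -= lr * (g + 2 * l2 * p)  # gradient of the L2 term is 2 * lambda * p
        p += v
```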

System Comparison
We compare our models with several methods which are evaluated on the Sentiment Treebank corpus. The baseline results are reported in (Dong et al., 2014) and (Kim, 2014).
We compare to the following baselines: • SVM. An SVM model with the bag-of-words representation (Pang and Lee, 2008).
• MV-RNN. Matrix Vector Recursive Neural Network (Socher et al., 2012) represents each word and phrase with a vector and a matrix. As reported, this model suffers from too many parameters.
• RNTN. Recursive Neural Tensor Network (Socher et al., 2013b) uses a tensor-based composition function which can model the meaning of longer phrases and capture negation rules.
• AdaMC. Adaptive Multi-Compositionality for RNN and RNTN (Dong et al., 2014) trains more than one composition functions and adaptively learns the weight for each function.
• DCNN/CNN. Dynamic Convolutional Neural Network (Kalchbrenner et al., 2014) and a simple Convolutional Neural Network (Kim, 2014). Although these models are of a different genre from RNN, we include them here for fair comparison since they are among the top performing approaches on this task.
• Para-Vec. A word2vec variant (Le and Mikolov, 2014) that encodes paragraph information into word embedding learning. A simple but very competitive model.
The comparative results are shown in Table 1. As illustrated, TG-RNN outperforms RNN, RNTN, MV-RNN, and AdaMC-RNN/RNTN. Compared with RNN, the fine-grained accuracy and binary accuracy of TG-RNN are improved by 3.8% and 3.9% respectively. Compared with AdaMC-RNN, the accuracy of our method rises by 1.2% on the fine-grained prediction. The results show that syntactic knowledge does facilitate phrase vector composition in this task.
As for TE-RNN/RNTN, the fine-grained accuracy of TE-RNN is boosted by 4.8% compared with RNN and that of TE-RNTN by 3.2% compared with RNTN. TE-RNTN also beats AdaMC-RNTN by 2.2% on the fine-grained classification task. TE-RNN is comparable to CNN and DCNN, another line of models for this task. TE-RNTN is better than CNN, DCNN, and Para-Vec, which are among the top performing approaches on this task. TE-RNTN is worse than DRNN, but the complexity of DRNN is much higher than that of TE-RNTN, which will be discussed in the next section. Furthermore, TE-RNN is also better than TG-RNN. This implies that learning tag embeddings for child nodes is more effective than simply using the tags of parent phrases in composition.
Note that the fine-grained accuracy is more convincing and reliable for comparing different approaches, for two reasons. First, for the binary classification task, some approaches train a separate binary classifier for positive/negative classification while other approaches, like ours, directly use the fine-grained classifier for this purpose. Second, how the neutral instances are processed is quite tricky, and the details are not reported in the literature. In our work, we simply remove neutral instances from the test data before the evaluation. Let the 5-dimensional vector $y$ be the probabilities of each sentiment label for a test instance. The prediction is positive if $\arg\max_{i, i \neq 2} y_i$ is greater than 2, and negative otherwise, where $i \in \{0, 1, 2, 3, 4\}$ denotes very negative, negative, neutral, positive, and very positive, respectively.
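A minimal sketch of this binary decision rule, in NumPy; the function name is ours and purely illustrative.

```python
import numpy as np

def binary_prediction(probs):
    """Map 5-class probabilities over labels {0..4} to positive/negative, ignoring neutral (2)."""
    non_neutral = [0, 1, 3, 4]
    best = non_neutral[int(np.argmax(probs[non_neutral]))]  # argmax over i != 2
    return 'positive' if best > 2 else 'negative'

print(binary_prediction(np.array([0.05, 0.10, 0.40, 0.30, 0.15])))  # -> positive
```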

Complexity Analysis
To gain a deeper understanding of the models presented in Table 1, we discuss the parameter scale of the RNN/RNTN models, since the prediction power of neural network models is highly correlated with the number of parameters.
The analysis is presented in Table 2 (the optimal values are adopted from the cited papers). The parameters of the word table have the same size $n \times d$ across all recursive neural models, where $n$ is the number of words and $d$ is the dimension of the word vectors. Therefore, we ignore this part and focus on the parameters of the composition functions, termed the model size. Our models, TG-RNN/TE-RNN, have many fewer parameters than RNTN and AdaMC-RNN/RNTN, but much better performance. Although TE-RNTN is worse than DRNN, the number of parameters of DRNN is almost 9 times that of ours. This indicates that DRNN is much more complex and requires much more data and time to train. As a matter of fact, our TE-RNTN takes only 20 epochs for training, about 10 times fewer than DRNN.

In Table 2, for AdaMC, c is the number of composition functions (15 is the optimal setting); for DRNN, l and h are the number of layers and the width of each layer (the optimal values are l = 4, h = 174); for our methods, k is the number of unshared composition matrices and $d_e$ is the dimension of the tag embedding (for the optimal settings, refer to Section 5.4).

Parameter Analysis
We have two key parameters to tune in the proposed models. For TG-RNN, the number of composition functions k is an important parameter, which corresponds to the number of distinct part-of-speech tags of phrases. Let us start from a corpus analysis. As shown in Table 3, the corpus contains 215,154 phrases, but the distribution of phrase tags is extremely imbalanced. For example, the phrase tag 'NP' appears 60,239 times while 'NAC' appears only 10 times. Hence, it is impossible to learn a composition function for the infrequent phrase tags. Each of the top k frequent phrase tags corresponds to a unique composition function, while all the other phrase tags share a common function. We compare different values of k for TG-RNN. The accuracy is shown in Figure 5. Our model obtains the best performance when k is 6, which is consistent with the statistics in Table 3.

For TE-RNN/RNTN, the key parameter to tune is the dimension of the tag vectors. In the corpus, we have 70 types of tags for leaf nodes (words) and interior nodes (phrases). Infrequent tags whose frequency is less than 1,000 are ignored, leaving 30 tags for which we learn embeddings. We vary the dimension of the embedding d_e from 0 to 30. Figure 6 shows the accuracy of TE-RNN and TE-RNTN with different values of d_e. Our models obtain the best performance when d_e is 8 for TE-RNN and 6 for TE-RNTN. The results show that too small a dimension may not be sufficient to encode the syntactic information of the tags, while too large a dimension damages the performance.

Tag Vectors Analysis
In order to show that the tag vectors obtained from the tag embedded models are meaningful, we inspect the similarity between tag vectors. For each tag vector, we find the nearest neighbors based on Euclidean distance, summarized in Table 4. Adjectives and verbs are of significant importance in sentiment analysis. Although 'JJ' is a word tag and 'ADJP' is a phrase tag, they have similar tag vectors because both play the role of an adjective in sentences. 'VP', 'VBD' and 'VBN', which all represent verbs, likewise have similar representations. It is also interesting that the nearest neighbor of the tag '.' is ':', probably because both of them are punctuation marks. Note that tag classification is not one of our training objectives, yet, surprisingly, the vectors of similar tags are clustered together, which provides additional information during sentence composition.

Conclusion
In this paper, we present two ways to leverage syntactic knowledge in Recursive Neural Networks.
The first way is to use different composition functions for phrases with different tags so that the composition process is guided by phrase type (TG-RNN). The second way is to learn tag embeddings and combine tag and word embeddings during composition (TE-RNN/RNTN). The proposed models are not only effective (w.r.t. competitive performance) but also efficient (w.r.t. a well-controlled parameter scale). Experimental results show that our models are among the top performing approaches to date, but with many fewer parameters and much lower complexity.