Sentence Modeling with Gated Recursive Neural Network

Recently, neural network based sentence modeling methods have achieved great progress. Among these methods, recursive neural networks (RecNNs) can effectively model the combination of the words in a sentence. However, RecNNs need a given external topological structure, such as a syntactic tree. In this paper, we propose a gated recursive neural network (GRNN) to model sentences, which employs a full binary tree (FBT) structure to control the combinations in the recursive structure. By introducing two kinds of gates, our model can better model complicated combinations of features. Experiments on three text classification datasets show the effectiveness of our model.

Among neural sentence modeling methods, recursive neural networks (RecNNs) have shown an excellent ability to model the word combinations in a sentence. However, RecNNs require a pre-defined topological structure, such as a parse tree, to encode a sentence, which limits the scope of their application. The gated recursive convolutional neural network (grConv) was proposed to model sentences using a directed acyclic graph (DAG) structure instead of a parse tree. However, the DAG structure is relatively complicated: the number of hidden neurons increases quadratically with the length of the sentence, so grConv cannot effectively deal with long sentences.
Figure 1: Example of Gated Recursive Neural Networks (GRNNs). Left is a GRNN using a directed acyclic graph (DAG) structure. Right is a GRNN using a full binary tree (FBT) structure. (The green nodes, gray nodes and white nodes illustrate the positive, negative and neutral sentiments respectively.) The example sentence is "I cannot agree with you more".
Inspired by grConv, we propose a gated recursive neural network (GRNN) for sentence modeling. Unlike grConv, we use the full binary tree (FBT) as the topological structure to recursively model the word combinations, as shown in Figure 1. The number of hidden neurons increases only linearly with the length of the sentence. Another difference is that we introduce two kinds of gates, reset and update gates (Chung et al., 2014), to control the combinations in the recursive structure. With these two gating mechanisms, our model can better model complicated combinations of features and capture long-distance dependencies.
In our previous work, we investigated several different topological structures (tree and directed acyclic graph) to recursively model the semantic composition from the bottom layer to the top layer, and applied them to Chinese word segmentation (Chen et al., 2015a) and dependency parsing (Chen et al., 2015b). However, these structures are not suitable for modeling sentences. In this paper, we adopt the full binary tree as the topological structure to reduce the model complexity.
Experiments on the Stanford Sentiment Treebank dataset (Socher et al., 2013b) and the TREC questions dataset (Li and Roth, 2002) show the effectiveness of our approach.

Architecture
A recursive neural network (RecNN) needs a topological structure, such as a syntactic tree, to model a sentence. In this paper, we use a full binary tree (FBT), as shown in Figure 2, to model the combinations of features for a given sentence.
In fact, the FBT structure can model the combinations of features by continuously mixing the information from the bottom layer to the top layer. Each neuron can be regarded as a complicated feature composition of the sub-sentence it governs. When the child nodes combine into their parent node, the combination information of the two child nodes is also merged and preserved by the parent node. As shown in Figure 2, we put all-zero padding vectors after the last word of the sentence until the length reaches $2^{\lceil \log_2 n \rceil}$, where $n$ is the length of the given sentence.
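The padding step can be illustrated with a minimal Python sketch. The function names `padded_length` and `pad_sentence` are hypothetical and not part of the paper; the sketch only assumes that word vectors are plain lists of floats of dimensionality d.

```python
def padded_length(n: int) -> int:
    """Return 2**ceil(log2(n)), the smallest power of two that is >= n."""
    if n < 1:
        raise ValueError("sentence length must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return 1 << (n - 1).bit_length()


def pad_sentence(vectors, d):
    """Append all-zero vectors after the last word until the length is a power of two.

    `vectors` is a list of word vectors (lists of floats) of dimensionality d.
    """
    target = padded_length(len(vectors))
    return vectors + [[0.0] * d for _ in range(target - len(vectors))]
```

With this convention, a sentence of five words is padded to eight leaves, so every parent in the tree has exactly two children.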
Inspired by the success of the gate mechanism of Chung et al. (2014), we further propose a gated recursive neural network (GRNN) by introducing two kinds of gates, namely the "reset gate" and the "update gate". Specifically, there are two reset gates, $r_L$ and $r_R$, partially reading the information from the left child and right child respectively. The update gates $z_N$, $z_L$ and $z_R$ decide what to preserve when combining the children's information. Intuitively, these gates decide how to update and exploit the combination information.
Figure 3: Our proposed gated recursive unit.
In the case of text classification, for each given sentence $x_i = w^{(i)}_{1:N^{(i)}}$ and the corresponding class $y_i$, we first represent each word $w^{(i)}_j$ as its embedding in $\mathbb{R}^d$, where $N^{(i)}$ indicates the length of the $i$-th sentence and $d$ is the dimensionality of the word embeddings. Then, the embeddings are sent to the first layer of the GRNN as inputs, whose outputs are recursively applied to upper layers until the network outputs a single fixed-length vector. Next, we obtain the class distribution $P(\cdot \mid x_i; \theta)$ for the given sentence $x_i$ by a softmax transformation of $u_i$, where $u_i$ is the top node of the network (a fixed-length vector representation):
$$P(\cdot \mid x_i; \theta) = \mathrm{softmax}(W_s u_i + b_s),$$
where $b_s \in \mathbb{R}^{|T|}$ and $W_s \in \mathbb{R}^{|T| \times d}$. Here $d$ is the dimensionality of the top node $u_i$, which is the same as the word embedding size, $T$ represents the set of possible classes, and $\theta$ represents the parameter set.
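The classification layer can be sketched as follows. This is an illustrative sketch only: it assumes the standard softmax over the affine scores W_s u_i + b_s, uses plain Python lists, and the function names `softmax` and `class_distribution` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_distribution(W_s, b_s, u):
    """Class distribution for a top-node vector u of dimensionality d.

    W_s is a |T| x d matrix (list of rows) and b_s is a bias vector of length |T|.
    """
    scores = [sum(w * x for w, x in zip(row, u)) + b for row, b in zip(W_s, b_s)]
    return softmax(scores)
```

The predicted class is then the index of the largest entry of the returned distribution.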

Gated Recursive Unit
A GRNN is built from minimal structures, gated recursive units, as shown in Figure 3. Assuming that the length of the sentence is $n$, we have recursion layers $l \in [1, \lceil \log_2 n \rceil + 1]$, where $\lceil q \rceil$ denotes the smallest integer $q^* \geq q$. At each recursion layer $l$, the activation of the $j$-th hidden node $h^l_j \in \mathbb{R}^d$ is computed as
$$h^l_j = z_N \odot \hat{h}^l_j + z_L \odot h^{l-1}_{2j} + z_R \odot h^{l-1}_{2j+1},$$
where $z_N$, $z_L$ and $z_R \in \mathbb{R}^d$ are the update gates for the new activation $\hat{h}^l_j$, the left child node $h^{l-1}_{2j}$ and the right child node $h^{l-1}_{2j+1}$ respectively, and $\odot$ indicates element-wise multiplication.
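The update gates can be formalized as:
$$\begin{bmatrix} z_N \\ z_L \\ z_R \end{bmatrix} = \begin{bmatrix} 1/Z \\ 1/Z \\ 1/Z \end{bmatrix} \odot \exp\left(U \begin{bmatrix} \hat{h}^l_j \\ h^{l-1}_{2j} \\ h^{l-1}_{2j+1} \end{bmatrix}\right),$$
where $U \in \mathbb{R}^{3d \times 3d}$ is the coefficient matrix of the update gates, $1/Z$ denotes the element-wise reciprocal, and $Z \in \mathbb{R}^d$ is the vector of normalization coefficients,
$$Z_k = \sum_{i=1}^{3} \left[\exp\left(U \begin{bmatrix} \hat{h}^l_j \\ h^{l-1}_{2j} \\ h^{l-1}_{2j+1} \end{bmatrix}\right)\right]_{d(i-1)+k},$$
where $1 \leq k \leq d$.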
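The new activation $\hat{h}^l_j$ is computed as:
$$\hat{h}^l_j = \tanh\left(W_{\hat{h}} \begin{bmatrix} r_L \odot h^{l-1}_{2j} \\ r_R \odot h^{l-1}_{2j+1} \end{bmatrix}\right),$$
where $W_{\hat{h}} \in \mathbb{R}^{d \times 2d}$, $r_L \in \mathbb{R}^d$ and $r_R \in \mathbb{R}^d$. $r_L$ and $r_R$ are the reset gates for the left child node $h^{l-1}_{2j}$ and the right child node $h^{l-1}_{2j+1}$ respectively, which can be formalized as:
$$\begin{bmatrix} r_L \\ r_R \end{bmatrix} = \sigma\left(G \begin{bmatrix} h^{l-1}_{2j} \\ h^{l-1}_{2j+1} \end{bmatrix}\right),$$
where $G \in \mathbb{R}^{2d \times 2d}$ is the coefficient matrix of the two reset gates and $\sigma$ indicates the sigmoid function. Intuitively, the reset gates control how to select the output information of the left and right children, which results in the current new activation $\hat{h}$. Through the update gates, the activation of a parent neuron can be regarded as a choice among the current new activation $\hat{h}$, the left child, and the right child. This choice allows the overall structure to change adaptively with respect to the inputs.
This gate mechanism is effective for modeling the combinations of features.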
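To make the data flow of the gated recursive unit concrete, the following is a minimal pure-Python sketch. It is not the authors' implementation. It assumes a tanh nonlinearity for the new activation, sigmoid reset gates, and update gates normalized per dimension so that z_N + z_L + z_R = 1, and it assumes the leaves have already been padded to a power-of-two length. The helper names (`matvec`, `gated_unit`, `grnn_encode`) are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_unit(h_left, h_right, W_h, G, U):
    """Combine two d-dimensional children into one d-dimensional parent.

    Shapes: W_h is d x 2d, G is 2d x 2d, U is 3d x 3d.
    """
    d = len(h_left)
    # Reset gates r_L and r_R decide how much of each child is read.
    r = [sigmoid(a) for a in matvec(G, h_left + h_right)]
    r_L, r_R = r[:d], r[d:]
    gated = [a * b for a, b in zip(r_L, h_left)] + [a * b for a, b in zip(r_R, h_right)]
    # New activation (tanh is an assumption of this sketch).
    h_hat = [math.tanh(a) for a in matvec(W_h, gated)]
    # Update gates: for each dimension k, normalize over the three candidates.
    e = [math.exp(a) for a in matvec(U, h_hat + h_left + h_right)]
    parent = []
    for k in range(d):
        Z_k = e[k] + e[d + k] + e[2 * d + k]
        z_N, z_L, z_R = e[k] / Z_k, e[d + k] / Z_k, e[2 * d + k] / Z_k
        parent.append(z_N * h_hat[k] + z_L * h_left[k] + z_R * h_right[k])
    return parent


def grnn_encode(leaves, W_h, G, U):
    """Apply the unit layer by layer over a full binary tree; return the top node.

    `leaves` must contain a power-of-two number of d-dimensional vectors.
    """
    layer = leaves
    while len(layer) > 1:
        layer = [gated_unit(layer[2 * j], layer[2 * j + 1], W_h, G, U)
                 for j in range(len(layer) // 2)]
    return layer[0]
```

Because the three update gates sum to one in every dimension, each coordinate of the parent is a convex combination of the new activation and the two children, which is what lets the unit pass a child through nearly unchanged when that is useful.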

Training
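We use the Maximum Likelihood (ML) criterion to train our model. Given the training set $(x_i, y_i)$ and the parameter set $\theta$ of our model, the goal is to minimize the loss function:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log P(y_i \mid x_i; \theta) + \frac{\lambda}{2m} \lVert \theta \rVert_2^2,$$
where $m$ is the number of training sentences.
Table 1: Hyper-parameters of our model.
Initial learning rate: $\alpha = 0.3$
Regularization: $\lambda = 10^{-4}$
Dropout rate on input layer: $p = 20\%$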
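The training objective can be sketched as a regularized negative log-likelihood. The sketch below is illustrative only: the function name `regularized_nll` is hypothetical, and the scaling of the L2 penalty by λ/(2m) is an assumption of this sketch rather than something this section confirms.

```python
import math


def regularized_nll(probs, labels, params, lam):
    """Average negative log-likelihood plus an L2 penalty.

    probs[i] is the predicted class distribution for sentence i,
    labels[i] is the index of its gold class, params is a flat list of
    all model parameters, and lam is the regularization weight.
    """
    m = len(labels)
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / m
    # Assumed penalty scaling: lam / (2 * m) times the squared L2 norm.
    l2 = sum(w * w for w in params)
    return nll + lam / (2 * m) * l2
```

Minimizing the first term is equivalent to maximizing the likelihood of the gold labels; the second term discourages large parameter values.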
For parameter initialization, we use random initialization within (-0.01, 0.01) for all parameters except the word embeddings. We adopt the pretrained English word embeddings from (Collobert et al., 2011) and fine-tune them during training.
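The initialization scheme can be sketched as below. This is a hedged illustration: the function names are hypothetical, only the three composition matrices with the shapes given in the Gated Recursive Unit section are covered (the softmax layer and the pretrained word embeddings are left out), and the seeding scheme is an arbitrary choice for reproducibility.

```python
import random


def init_matrix(rows, cols, scale=0.01, seed=0):
    """Matrix with entries drawn uniformly from the range [-scale, scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def init_grnn_params(d, seed=0):
    """Randomly initialize the composition matrices for hidden size d."""
    return {
        "W_h": init_matrix(d, 2 * d, seed=seed),          # new activation, d x 2d
        "G": init_matrix(2 * d, 2 * d, seed=seed + 1),    # reset gates, 2d x 2d
        "U": init_matrix(3 * d, 3 * d, seed=seed + 2),    # update gates, 3d x 3d
    }


def count_params(params):
    """Total number of scalar entries across all matrices."""
    return sum(len(M) * len(M[0]) for M in params.values())
```

Under these shapes the composition matrices hold 15d² entries in total, so the parameter count of the recursive part grows with the embedding size but not with the sentence length.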

Datasets
To evaluate our approach, we test our model on three datasets:
• SST-1 The movie reviews with five classes in the Stanford Sentiment Treebank (Socher et al., 2013b): negative, somewhat negative, neutral, somewhat positive, positive.
• SST-2 The movie reviews with binary classes in the Stanford Sentiment Treebank (Socher et al., 2013b): negative, positive.
• QC The TREC questions dataset (Li and Roth, 2002), which involves six different question types.
Table 1 lists the hyper-parameters of our model. We also use dropout (Srivastava et al., 2014) to avoid overfitting. In addition, we set the batch size to 20. We set the word embedding size to d = 50 on the TREC dataset and d = 100 on the Stanford Sentiment Treebank dataset. Table 2 shows the performance of our GRNN on the three datasets.

Result Discussion
Generally, our model performs better than previous recursive neural network based models (RecNTN, RAE, MV-RecNN and AdaSent), which indicates that it can better model the combinations of features with the FBT and our gating mechanism, even without an external syntactic tree.
Although we just use the top layer outputs as the feature for classification, our model still outperforms AdaSent.
Compared with the CNN based methods (MaxTDNN, DCNN and CNNs), our model achieves comparable performance with far fewer parameters. Although the CNN based methods outperform our model on SST-1 and SST-2, the number of parameters of GRNN ranges from 40K to 160K, while a CNN has about 400K parameters.

Related Work
grConv was proposed to model sentences for machine translation. Unlike our model, grConv uses a DAG as the topological structure. The number of internal nodes is $n^2/2$, where $n$ is the length of the sentence. Zhao et al. (2015) use the same structure to model sentences (called AdaSent), and utilize the information of the internal nodes for text classification. Unlike grConv and AdaSent, our model uses a full binary tree as the topological structure. The number of internal nodes is $2n$ in our model. Therefore, our model is more efficient for long sentences. In addition, we use only the top layer neurons for text classification.
Moreover, grConv and AdaSent exploit only one gating mechanism (update gate), which cannot sufficiently model complicated feature combinations. Unlike them, our model incorporates two kinds of gates and can better model the feature combinations. Hu et al. (2014) also proposed a similar architecture for matching problems, but they employed a convolutional neural network, which might be coarse in modeling the feature combinations.
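The difference in growth between the two topologies can be made concrete with a small sketch. It simply evaluates the two node-count estimates quoted in this section (n²/2 for the DAG and 2n for the full binary tree); the function name `node_counts` is hypothetical, and the figures are estimates rather than exact counts for any particular implementation.

```python
def node_counts(n):
    """Return (DAG estimate, FBT estimate) of internal nodes for a sentence of length n."""
    dag = n * n / 2   # quadratic growth, as quoted for grConv and AdaSent
    fbt = 2 * n       # linear growth, as quoted for the full binary tree
    return dag, fbt


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        dag, fbt = node_counts(n)
        print(f"n={n:3d}  DAG~{dag:7.0f}  FBT~{fbt:4d}  ratio~{dag / fbt:5.1f}")
```

The ratio between the two estimates is n/4, so the gap widens steadily as sentences get longer.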

Conclusion
In this paper, we propose a gated recursive neural network (GRNN) to recursively summarize the meaning of a sentence. GRNN uses a full binary tree as the recursive topological structure instead of an external syntactic tree. In addition, we introduce two kinds of gates to model complicated combinations of features. In future work, we would like to investigate other gating mechanisms for better modeling of the feature combinations.