Modularized Syntactic Neural Networks for Sentence Classification

This paper focuses on tree-based modeling for the sentence classification task. In existing works, aggregation over a syntax tree usually considers only the local information of sub-trees. In contrast, in addition to the local information, our proposed Modularized Syntactic Neural Network (MSNN) utilizes syntax category labels and takes advantage of the global context while modeling sub-trees. In MSNN, each node of a syntax tree is modeled by a label-related syntax module. Each syntax module aggregates the outputs of lower-level modules, and finally, the root module provides the sentence representation. We design a tree-parallel mini-batch strategy for efficient training and prediction. Experimental results on four benchmark datasets show that our MSNN significantly outperforms previous state-of-the-art tree-based methods on the sentence classification task.


Introduction
Text classification is an important and fundamental problem in natural language processing (NLP). With the increasing spread of the Internet, there are numerous applications for classifying short texts consisting of a single sentence, for example, classifying questions by which product or which part of the product architecture they concern, analyzing the sentiment of customer reviews or tweets, and quickly detecting categories from news titles. Different from document classification, where there are more topic words and writing-style features, a single sentence contains limited information. Thus, understanding the meaning of a sentence is vital.
Although sequential models like long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRU) (Cho et al., 2014b) have been widely used and provide excellent performance, it is hard for them to capture syntactic information, which is essential for understanding sentences (Linzen et al., 2016). To utilize syntactic information, some works proposed models taking parse trees or dependency trees as inputs (Le and Zuidema, 2015; Teng and Zhang, 2017; Socher et al., 2013; Zhu et al., 2015; Bowman et al., 2016). Previous researchers have empirically verified that these methods help to model sentences (Li et al., 2015). However, to improve model efficiency and simplify the implementation, these methods binarize trees (Wang et al., 2007; Huang, 2007) so that they can be traversed by recursive neural networks (RvNN) or as a sequence by an RNN. Although some models, such as Tree-LSTM (Tai et al., 2015; Looks et al., 2017; Ran and Zhong, 2019), theoretically support original parse trees, child nodes are simply summed and the relationships among them are not modeled; the authors only conduct experiments on binary trees. Binary trees weaken the syntactic information and conceal the relationships among nodes at the same or different levels. Latent trees are another way of modeling sentences, but they do not take full advantage of the syntactic information (Cho et al., 2014a; Choi et al., 2018; Williams et al., 2018; Addi et al., 2020). Current graph-based models specifically focus on the dependency tree (Marcheggiani and Titov, 2017; Zhang et al., 2018). Besides, these models do not consider the context information in the bottom-up aggregation, and syntax category labels are not fully utilized, even though both can affect the meanings of words and phrases and should therefore be considered.
In this work, a novel Modularized Syntactic Neural Network (MSNN) is proposed to model syntax trees of sentences. Each node in the tree is transformed into a syntax module in MSNN. The number of distinct syntax modules is the same as the number of distinct syntax category labels. A category label is the syntactic category of a sub-tree's root, e.g., "NP", "VP", etc. The modules are used to build networks according to tree structures, and there is a one-to-one correspondence between tree structures and network structures. The parameters of modules corresponding to the same category label are shared. Note that there is no limitation to binary trees, and our implementation of tree-parallel mini-batches based on the original parse trees provides excellent efficiency. Each syntax module aggregates the outputs of lower-level modules and outputs a representation vector of the sub-tree. Syntax category labels and global context information are encoded to guide the propagation and better infer the meaning of the sentence. The root module finally outputs the sentence representation, which is further used for classification. We test MSNN on four benchmark datasets, and the results show that it outperforms previous state-of-the-art methods.
The main contributions of this work are listed as follows:
• A novel Modularized Syntactic Neural Network (MSNN) is proposed to model syntax trees for sentence classification. Both category labels and global context information are utilized when modeling sub-trees.
• We provide a design of tree-parallel mini-batches so that binarization of trees is not necessary and structural information is better preserved.
• Experimental results on four benchmark datasets show that MSNN significantly outperforms previous state-of-the-art methods on the sentence classification task.

Modularized Syntactic Neural Networks
An example of Modularized Syntactic Neural Networks (MSNN) is shown in Figure 1.

Global Context Bi-LSTM
A single word in isolation carries no sequential information; the meaning of a word can be inferred from its context, so it is essential to represent words in a certain context. A global context bidirectional LSTM (Bi-LSTM) (Schuster and Paliwal, 1997) is used to generate context-enhanced word vectors and a global context vector. Suppose the input sentence s consists of a sequence of words s = {w_1, ..., w_t, ..., w_{|s|}}, where w_t is the t-th word in the sentence and |s| is the sentence length. We use bold fonts to represent the vectors of words and other objects. The word vectors w_t ∈ R^d can either be randomly initialized or pre-trained vectors, and d is the dimension of word vectors. To enrich word vectors with the context information in the sentence, a Bi-LSTM is applied to the sequence of words {w_t}_{t=1...|s|}. Let h^f_t denote the hidden state of the forward LSTM at position t, in which the past context information is included. By another backward LSTM, hidden states containing the future context, h^b_t, are formed. The initial hidden states h_0 are zero-initialized. Then the enriched word vector e_t at position t is

e_t = [h^f_t ; h^b_t],    (1)

and the global context vector c_s of sentence s is

c_s = [h^f_{|s|} ; h^b_1].    (2)

e_t and c_s are inputs of the syntax modules. Global context information is embedded into e_t, so that the information propagation in higher layers is guided by the context.
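As an illustration, a minimal PyTorch sketch of this step could look as follows; the linear projection back to d dimensions and the exact composition of c_s from the final forward and backward states are assumptions for illustration, not details specified above.

```python
import torch
import torch.nn as nn

class GlobalContextBiLSTM(nn.Module):
    def __init__(self, vocab_size: int, d: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)     # randomly initialized or pre-trained
        self.bilstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d, d)              # assumed projection back to d dims

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, |s|)
        w = self.embed(token_ids)                    # (batch, |s|, d)
        h, (h_n, _) = self.bilstm(w)                 # h: (batch, |s|, 2d)
        e = self.proj(h)                             # context-enhanced word vectors e_t
        # global context c_s: final forward state concatenated with final backward state
        c_s = self.proj(torch.cat([h_n[0], h_n[1]], dim=-1))
        return e, c_s

# usage sketch: two sentences of length 7
enc = GlobalContextBiLSTM(vocab_size=10000)
e, c_s = enc(torch.randint(0, 10000, (2, 7)))
```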

Syntax Modules
Previous works traverse trees with a single fixed structure, which does not consider the diversity of syntax category labels. In MSNN, syntax modules are used to construct different network structures according to the syntax trees. Each category label (including the POS tags of leaf nodes) l in the syntax tree is mapped to a module M_l(·). Each module takes the output vectors of its child nodes as inputs and outputs the representation of the sub-tree. The number of distinct syntax modules is the same as the number of distinct syntax category labels. For example, in Figure 1, seven different modules are used to assemble the network: "S", "NP", "VP", ".", "PRP$", "NN", "VBZ". According to whether a node is a leaf, modules are divided into two categories: leaf POS modules and root/intermediate modules.
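The label-to-module dispatch can be sketched as below; the label set is taken from the Figure 1 example, and the inner Linear layer is only a stand-in for a real syntax module.

```python
import torch
import torch.nn as nn

LABELS = ["S", "NP", "VP", ".", "PRP$", "NN", "VBZ"]   # labels from the Figure 1 example

def _key(label: str) -> str:
    # nn.ModuleDict keys may not contain '.', so punctuation labels are renamed
    return label.replace(".", "PUNCT")

class SyntaxModuleBank(nn.Module):
    def __init__(self, d: int = 300):
        super().__init__()
        # one module per category label; parameters are shared across all nodes
        # carrying the same label (a plain Linear stands in for the real module)
        self.by_label = nn.ModuleDict({_key(l): nn.Linear(d, d) for l in LABELS})

    def forward(self, label: str, child_summary: torch.Tensor) -> torch.Tensor:
        return self.by_label[_key(label)](child_summary)

bank = SyntaxModuleBank()
out = bank("NP", torch.randn(1, 300))   # dispatch by category label
```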
Each word w_t in the sentence has a POS tag p_t, which indicates the category of the word according to its function in the sentence. Words with the same POS have similar syntactic behaviors. In the POS module, the enhanced word vector e_t and the POS vector p_t ∈ R^d are combined to produce the output m_{p_t}. The vectors of POS labels are randomly initialized and learned during training. The output m_{p_t} is then the input of its parent module.
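Since the exact combination of e_t and p_t is not spelled out above, the following sketch assumes a concatenation followed by a linear layer and LeakyReLU; it is illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn

class POSModule(nn.Module):
    def __init__(self, num_pos_tags: int, d: int = 300):
        super().__init__()
        self.pos_embed = nn.Embedding(num_pos_tags, d)   # p_t, learned during training
        self.combine = nn.Linear(2 * d, d)               # assumed way of combining e_t and p_t
        self.act = nn.LeakyReLU()

    def forward(self, e_t: torch.Tensor, pos_id: torch.Tensor) -> torch.Tensor:
        p_t = self.pos_embed(pos_id)                     # (batch, d)
        m_pt = self.act(self.combine(torch.cat([e_t, p_t], dim=-1)))
        return m_pt                                      # fed to the parent module
```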
It is necessary to model relationships among sibling nodes, because information from the sisterhood of syntactic nodes may prove useful for sentence classification; examples are negation clauses or modifiers. When modeling the sentence as a sequence, it is not easy for an RNN, CNN, or other structures to identify their lexical scope, and in a binary tree, coordinate relations among nodes are diluted. The influence of negation clauses or modifiers on some nodes may be hard to capture, especially in long phrases and sentences. In the l-module, a local context Bi-LSTM is therefore applied to the representations of the child nodes m_{c_1}, ..., m_{c_n}:

e_{c_1}, ..., e_{c_n} = Bi-LSTM(m_{c_1}, ..., m_{c_n}),    (4)

where Bi-LSTM(·) has a similar structure to the global context Bi-LSTM in Section 2.1 but different parameters. The local context Bi-LSTM shares the same parameters among different modules, and the outputs e_{c_1}, ..., e_{c_n} are the enriched representations of the child nodes. The global context vector c_s is used to initialize the hidden state and cell state in order to guide the information propagation in the local syntactic node. The context can affect the semantic meanings of phrases, i.e., the representations of syntactic nodes.
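A possible PyTorch sketch of the shared local context Bi-LSTM is given below; broadcasting c_s to both directions' initial hidden and cell states and projecting the outputs back to d dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LocalContextBiLSTM(nn.Module):
    def __init__(self, d: int = 300):
        super().__init__()
        self.bilstm = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d, d)

    def forward(self, children: torch.Tensor, c_s: torch.Tensor) -> torch.Tensor:
        # children: (batch, n_children, d) child representations; c_s: (batch, d)
        h0 = c_s.unsqueeze(0).expand(2, -1, -1).contiguous()   # one copy per direction
        c0 = h0.clone()                                        # cell state also from c_s
        out, _ = self.bilstm(children, (h0, c0))               # (batch, n_children, 2d)
        return self.proj(out)                                  # enriched e_{c_1}, ..., e_{c_n}
```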
Different child nodes contribute more or less to the representation of their parent node. A syntax-aware attention network is then used to aggregate the child nodes:

k_{c_i} = δ(K e_{c_i} + b_k),
a_{c_i} = softmax_i(q_l^T k_{c_i}),
m_l = Σ_i a_{c_i} e_{c_i},

where K ∈ R^{d×d} is the global transformation weight matrix for attention keys, and b_k ∈ R^d is the bias vector. The label-related query vector q_l is computed from the sentence context c_s and the category label l, and is used to evaluate whether children are informative for the parent node given that context and label. Q_l ∈ R^{d×d} is the query transformation weight matrix of category label l, b_l is the bias vector, and l ∈ R^d is the vector of category label l. They are all label-related parameters, so that syntax modules of different labels have different parameters. δ(·) is the non-linear activation function, for which we use LeakyReLU (Maas et al., 2013). a_{c_i} is the attention weight normalized by a softmax layer. The l-module outputs the representation of the sub-tree, m_l, as a weighted sum. In this way, context and syntactic information guide the information propagation in sub-trees.
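The aggregation could be sketched as follows; the exact form of the query q_l (here a linear transform of c_s plus the label vector) is an assumption, and in the full model Q_l, b_l, and the label vector would be instantiated once per category label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxAttention(nn.Module):
    def __init__(self, d: int = 300):
        super().__init__()
        self.K = nn.Linear(d, d)                        # global key transformation K, b_k
        self.Q = nn.Linear(d, d)                        # label-related query transform Q_l, b_l
        self.label_vec = nn.Parameter(torch.randn(d))   # label vector l
        self.act = nn.LeakyReLU()

    def forward(self, e_children: torch.Tensor, c_s: torch.Tensor) -> torch.Tensor:
        # e_children: (batch, n, d) enriched child vectors; c_s: (batch, d)
        keys = self.act(self.K(e_children))                  # (batch, n, d)
        q_l = self.act(self.Q(c_s) + self.label_vec)         # assumed query form
        scores = torch.einsum("bd,bnd->bn", q_l, keys)       # (batch, n)
        a = F.softmax(scores, dim=-1)                        # attention weights a_{c_i}
        m_l = torch.einsum("bn,bnd->bd", a, e_children)      # weighted sum -> m_l
        return m_l
```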
The aggregation process goes from the bottom up, and finally, the root module outputs the sentence vector m_S for further classification.
A fully-connected layer followed by a softmax function is used to give the final classification predictions. Cross-entropy loss with ℓ2-regularization is used to train the model.
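A minimal sketch of the output layer and loss, assuming standard PyTorch components; the number of classes and the λ value are placeholders.

```python
import torch
import torch.nn as nn

d, num_classes, lam = 300, 5, 1e-4          # lam: L2 weight, one of the searched values
classifier = nn.Linear(d, num_classes)      # fully-connected output layer
criterion = nn.CrossEntropyLoss()           # applies log-softmax internally

def loss_fn(m_S: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    logits = classifier(m_S)                                    # (batch, num_classes)
    l2 = sum(p.pow(2).sum() for p in classifier.parameters())   # explicit L2 penalty
    return criterion(logits, labels) + lam * l2

loss = loss_fn(torch.randn(4, d), torch.tensor([0, 1, 2, 3]))
```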

Tree-Parallel Mini-Batch
Tree structures of sentences vary a lot. For a deep model, it is essential to construct mini-batches for effective and efficient training and testing. Previous tree-based models usually construct binary trees to simplify the implementation. However, to form the binarized tree, many intermediate nodes are inserted into the original tree. Nodes or phrases with a coordinate relation in the raw sentence or the original parse tree may end up on different levels of the binary tree, and their paths and path lengths to the root vary a lot. In such a situation, the parent-child and sibling relationships among nodes are captured only implicitly in the binary tree, and it is not easy to design features encoding information about node sisterhood. In contrast, we design tree-parallel mini-batches for MSNN, as shown in Figure 2. B is the batch size, i.e., the number of sentences in the batch. K is the maximum number of layers over all trees in the batch. When running a batch, the first step is that all sentences go through the global context Bi-LSTM in parallel to obtain enriched word vectors. In the second step, all nodes on the last (deepest) layer of the different trees are calculated simultaneously, then the second-to-last layer, and so on. As long as the previous layers have been calculated, all the information required by the current layer is available. For example, the tree in Figure 1 is shown as the B-th sentence in Figure 2. The number of iterations along layers depends on the maximum depth of the trees in the batch. Finally, the outputs of all root nodes are gathered for the output layer.
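The scheduling idea can be illustrated with a small, self-contained toy example; the node encoding and the stand-in computation are hypothetical, and in the real model all nodes of a layer would be processed in one batched call per category label rather than one at a time.

```python
from collections import defaultdict

# toy trees: each node is (tree_id, node_id, layer, children_ids)
batch = [
    [(0, 0, 1, [1, 2]), (0, 1, 2, []), (0, 2, 2, [])],   # tree 0
    [(1, 0, 1, [1]), (1, 1, 2, [2]), (1, 2, 3, [])],     # tree 1
]

# group all nodes in the batch by their layer (depth from the root)
by_layer = defaultdict(list)
for tree in batch:
    for node in tree:
        by_layer[node[2]].append(node)

outputs = {}
for layer in sorted(by_layer, reverse=True):              # deepest layer first
    for tree_id, node_id, _, children in by_layer[layer]: # all inputs are already available
        child_reps = [outputs[(tree_id, c)] for c in children]
        outputs[(tree_id, node_id)] = sum(child_reps) + 1  # stand-in for a syntax module

roots = [outputs[(t[0][0], t[0][1])] for t in batch]       # root representations m_S
print(roots)
```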
All models, including baselines, are trained with Adam (Kingma and Ba, 2014) in mini-batches of size 64. The learning rate is 1×10^-4, and early stopping is conducted according to the performance on the validation set. The weight of ℓ2-regularization, λ, is manually searched over {0, 1×10^-5, 1×10^-4, 1×10^-3, 1×10^-2}. Word vectors are randomly initialized. The dimension of word vectors and hidden layers, d, is 300. We run the experiments with 5 different random seeds and report the average accuracy and standard errors. All models are trained on a GPU (NVIDIA GeForce GTX 1080Ti).
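A sketch of this training configuration; the model variable is a placeholder for the full MSNN.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 5)                              # placeholder for the full MSNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch_size = 64
l2_candidates = [0, 1e-5, 1e-4, 1e-3, 1e-2]            # values searched for lambda
# early stopping: keep the checkpoint with the best validation accuracy
```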

Results and Discussion
The overall performances are shown in Table 2, in which "w/o" means "without". We can conclude that tree-based models like Tree-LSTM and Gumbel-Tree are better than the sequential models LSTM and Bi-LSTM, which shows the superiority of modeling the structure of the syntax tree for sentence classification. Gumbel-Tree is slightly better than Tree-LSTM because of its more flexible structure and a global context RNN similar to that of MSNN. However, it uses the Gumbel softmax (Jang et al., 2017) to form latent trees and does not take full advantage of the syntactic structure information. Although Gumbel-Tree is more flexible in integrating context information and constructing sentence representations, it produces unstable latent trees (Williams et al., 2018). MSNN outperforms these baselines because it utilizes the global context and syntax category labels to guide the information propagation in sub-trees. The meanings of words and phrases can be inferred from their context and syntactic roles. Besides, MSNN, based on the tree-parallel mini-batch design, is not limited to binary trees, which fully retains the syntactic information. The local RNN and attention network capture the relationships between nodes with the same parent. The improvements of MSNN are larger on DBpedia than on the other datasets. The reason is that most of the sentences in DBpedia are high-quality declarative sentences, and their tree structures are less complex than the reviews in the Amazon datasets, which contain much noise. Clean syntactic information on DBpedia results in a wider gap, not only between MSNN and Gumbel-Tree but also between tree-based methods and Bi-LSTM.
To study the ablation of different parts, we remove the global RNN, local RNN, attention mechanism, or category label information in MSNN. Results show that all these parts contribute to the excellent performance of MSNN. Global RNN enriches word vectors with context information and provides a global context representation of the sentence. Local RNN captures the relationships among nodes on the same level. The attention mechanism dynamically aggregates information from child nodes under the guidance of context and syntax category labels. Category labels also show the roles of words and phrases in the sentence.
The average training time per epoch on the largest ARF dataset under the same validation frequency is shown in Table 3. Generally, sequential methods are much faster than tree-based models because of their simple computation graphs. With the help of the tree-parallel mini-batch, MSNN largely reduces redundant calculations and is more efficient than Gumbel-Tree and Tree-LSTM. Binary trees usually have many more nodes than the original trees, and traversing them node by node is slower than calculating nodes in parallel. Besides, directly modeling the original trees largely retains the structural information.

Conclusion
In this work, a novel model, MSNN, is proposed to model syntax trees for sentence classification. It uses global context information and syntax category labels to improve the modeling of sub-trees and thus produce better sentence representations. A tree-parallel mini-batch strategy is further designed for efficient computation and support for non-binary trees. Our future work will include conducting experiments on dependency trees and more NLP tasks.