The Forest Convolutional Network: Compositional Distributional Semantics with a Neural Chart and without Binarization

According to the principle of compositionality, the meaning of a sentence is computed from the meaning of its parts and the way they are syntactically combined. In practice, however, the syntactic structure is computed by automatic parsers which are far from perfect and not tuned to the specifics of the task. Current recursive neural network (RNN) approaches for computing sentence meaning therefore run into a number of practical difficulties, including the need to carefully select a parser appropriate for the task, to decide how and to what extent syntactic context modifies the semantic composition function, and to transform parse trees to conform to the branching settings (typically, binary branching) of the RNN. This paper introduces a new model, the Forest Convolutional Network, that avoids all of these challenges by taking a parse forest as input, rather than a single tree, and by allowing arbitrary branching factors. We report improvements over the state of the art in sentiment analysis and question classification.


Introduction
For many natural language processing tasks we need to compute meaning representations for sentences from meaning representations of words. In a recent line of research on 'recursive neural networks' (e.g., Socher et al. (2010)), both the word and sentence representations are vectors, and the word vectors ("embeddings") are borrowed from work in distributional semantics or neural language modelling. Sentence representations, in this approach, are computed by recursively applying a neural network that combines two vectors into one (typically according to the syntactic structure provided by an external parser). The network, which thus implements a 'composition function', is optimized for delivering sentence representations that support a given semantic task: sentiment analysis (Irsoy and Cardie, 2014; Le and Zuidema, 2015), paraphrase detection (Socher et al., 2011), semantic relatedness (Tai et al., 2015), etc. Studies with recursive neural networks have yielded promising results on a variety of such tasks.
In this paper, we present a new recursive neural network architecture that fits squarely in this tradition, but aims to solve a number of difficulties that have arisen in existing work. In particular, the model we propose addresses three issues: 1. how to make the composition functions adaptive, in the sense that they operate adequately for the many different types of combinations (e.g., adjective-noun combinations are of a very different type than VP-PP combinations); 2. how to deal with different branching factors of nodes in the relevant syntactic trees (i.e., we want to avoid having to binarize syntactic trees, but also do not want ternary productions to be completely independent from binary productions); 3. how to deal with uncertainty about the correct parse inside the neural architecture (i.e., we do not want to work with just the best or k-best parses for a sentence according to an external model, but receive an entire distribution over possible parses). To solve these challenges we take inspiration from two other traditions: convolutional neural networks and classic parsing algorithms based on dynamic programming. Including convolution in our network provides a direct solution for issue (2) and turns out, somewhat unexpectedly, to also provide a solution for issue (1). Introducing the chart representation from classic parsing into our architecture then allows us to tackle issue (3). The resulting model, the Forest Convolutional Network, outperforms all other models on a sentiment analysis and a question classification task.

Background
This section introduces the recursive neural network (RNN) and convolutional neural network (CNN) models on which our work is based.

Recursive Neural Network
A recursive neural network (RNN) (Goller and Küchler, 1996) is a feed-forward neural network where, given a tree structure, we recursively apply the same weight matrices at each inner node in a bottom-up manner. In order to see how an RNN works, consider the following example. Assume that there is a constituent with parse tree (S I (VP like it)) (Figure 1), and that x_I, x_like, x_it ∈ R^d are the vectorial representations of the three words I, like and it, respectively. We use a neural network which consists of a weight matrix W_1 ∈ R^{d×d} for left children and a weight matrix W_2 ∈ R^{d×d} for right children to compute the vector for a parent node in a bottom-up manner. Thus, we compute

    x_VP = f(W_1 x_like + W_2 x_it + b)    (1)

where b is a bias vector and f is a (non-linear) activation function. Having computed x_VP, we can then move one level up in the hierarchy and compute

    x_S = f(W_1 x_I + W_2 x_VP + b)

This process is continued until we reach the root node.
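To make the recursion concrete, here is a minimal NumPy sketch of this bottom-up composition (toy dimension and randomly initialized parameters, not the authors' code):

```python
import numpy as np

d = 4                                      # toy embedding dimension
rng = np.random.default_rng(0)

# Parameters: one matrix for left children, one for right children.
W1 = rng.normal(scale=0.1, size=(d, d))    # left child
W2 = rng.normal(scale=0.1, size=(d, d))    # right child
b = np.zeros(d)
f = np.tanh                                # non-linear activation

def compose(left, right):
    """Equation 1: parent = f(W1 left + W2 right + b)."""
    return f(W1 @ left + W2 @ right + b)

# Word vectors (in practice, pre-trained embeddings).
x_I, x_like, x_it = rng.normal(size=(3, d))

# Bottom-up over the tree (S I (VP like it)):
x_VP = compose(x_like, x_it)               # combine "like" and "it"
x_S = compose(x_I, x_VP)                   # combine "I" with the VP vector
```

Because the same W_1, W_2 and b are reused at every inner node, the number of parameters is independent of the tree size.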
For classification tasks, we put a softmax layer on top of the root node, and compute the probability of assigning a class c to an input x by

    Pr(c|x) = softmax(c) = e^{u(c, y_top)} / Σ_{c'∈C} e^{u(c', y_top)}    (2)

where [u(c_1, y_top), ..., u(c_|C|, y_top)]^T = W_u y_top + b_u; C is the set of all possible classes; and W_u ∈ R^{|C|×d}, b_u ∈ R^{|C|} are a weight matrix and a bias vector.
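A corresponding sketch of this softmax layer (toy sizes of our choosing; the max-shift is a standard numerical-stability trick, not from the paper):

```python
import numpy as np

d, num_classes = 4, 3                      # toy dimensions
rng = np.random.default_rng(1)
W_u = rng.normal(scale=0.1, size=(num_classes, d))
b_u = np.zeros(num_classes)

def class_probs(y_top):
    """Equation 2: softmax over the class scores u = W_u y_top + b_u."""
    u = W_u @ y_top + b_u
    e = np.exp(u - u.max())                # shift for numerical stability
    return e / e.sum()

p = class_probs(rng.normal(size=d))        # a proper distribution over classes
```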
An RNN is trained with gradient descent to minimize an objective function J(θ). The gradient ∂J/∂θ is efficiently computed thanks to the back-propagation through structure algorithm (Goller and Küchler, 1996).
Departing from the original RNN model, many extensions have been proposed to enhance its compositionality (Socher et al., 2013; Irsoy and Cardie, 2014; Le and Zuidema, 2015) and applicability (Le and Zuidema, 2014b). The model we propose can be considered an extension of the RNN with the ability to solve the three issues introduced in Section 1.

Convolutional Neural Network
A convolutional neural network (CNN) (LeCun et al., 1998) is also a feed-forward neural network; it consists of one or more convolutional layers (often with a pooling operation) followed by one or more fully connected layers. This architecture was invented for computer vision and has since been widely applied to solve natural language processing tasks (Collobert et al., 2011; Kalchbrenner et al., 2014; Kim, 2014).
To illustrate how a CNN works, the following example uses a simplified model proposed by Collobert et al. (2011) which consists of one convolutional layer with the max pooling operation, followed by one fully connected layer (Figure 2). This CNN uses a kernel with window size 3; when we slide this kernel along the sentence "<s> I like it very much </s>", we get five vectors:

    v_i = W_c [x_{i-1}; x_i; x_{i+1}] + b_c    (i = 2, ..., 6)

where W_c is a weight matrix, [·; ·; ·] denotes vector concatenation, and b_c is a bias vector. The max pooling operation is then applied to the resulting vectors in an element-wise manner:

    z = max{v_2, ..., v_6}    (element-wise maximum)

Finally, a fully connected layer is employed:

    y = f(W z + b)

where W and b are a real weight matrix and a bias vector, respectively, and f is an activation function.
Intuitively, a window-size-k kernel extracts (local) features from k-grams, and is thus able to capture k-gram composition. The max pooling operation reduces dimensionality, forcing the network to discriminate important features from the others by assigning them high values. For instance, if the network is used for sentiment analysis, local features corresponding to k-grams containing the word "like" should receive high values in order to be propagated to the top layer.
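The simplified pipeline above — linear window-3 kernel, element-wise max pooling, fully connected layer — can be sketched as follows (toy sizes of our choosing, random vectors in place of real embeddings):

```python
import numpy as np

d, k = 4, 3                                # embedding dim, window size
m, c = 6, 2                                # feature maps, output units (illustrative)
rng = np.random.default_rng(2)
W_c = rng.normal(scale=0.1, size=(m, k * d))   # linear convolution kernel
b_c = np.zeros(m)
W = rng.normal(scale=0.1, size=(c, m))         # fully connected layer
b = np.zeros(c)
f = np.tanh

# "<s> I like it very much </s>": 7 token vectors.
sent = rng.normal(size=(7, d))

# Sliding the window-3 kernel gives 7 - 3 + 1 = 5 local feature vectors.
pool = np.stack([W_c @ sent[i:i + k].reshape(-1) + b_c
                 for i in range(len(sent) - k + 1)])

z = pool.max(axis=0)                       # element-wise max pooling
y = f(W @ z + b)                           # fully connected layer
```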

Forest Convolutional Network
We first propose a solution to issues (1) and (2) (i.e., making the composition functions adaptive and dealing with different branching factors), called the Recursive Convolutional Neural Network (RCNN), and then a solution to the third issue (i.e., dealing with uncertainty about the correct parse), called the Chart Neural Network (ChNN). Their combination, the Forest Convolutional Network (FCN), is introduced last.

Recursive Convolutional Neural Network

Given a subtree p → x_1 ... x_l, an RCNN (Figure 3),[2] like a CNN, slides a window-size-k kernel along the sequence of children (x_1, ..., x_l) to compute a pool of vectors. The max pooling operation followed by a fully connected layer is then applied to this pool to compute a vector for the parent p. This RCNN differs from the CNN introduced in Section 2.2 in two respects. First, we use a non-linear kernel: after linearly transforming the input vectors, an activation function is applied. Second, we put k − 1 padding tokens <b> at the beginning of the children sequence and k − 1 padding tokens <e> at the end. This guarantees that all children contribute equally to the resulting vector pool, which now has l + k − 1 vectors.
It is obvious that this RCNN can solve the second issue (i.e., dealing with different branching factors); we now show how it can make the composition functions adaptive. We first examine what happens if the window size k is larger than the number of children l, for instance k = 3 and l = 2. There are four vectors in the pool:

    u_1 = f(W_1 x_<b> + W_2 x_<b> + W_3 x_1 + b_c)
    u_2 = f(W_1 x_<b> + W_2 x_1 + W_3 x_2 + b_c)
    u_3 = f(W_1 x_1 + W_2 x_2 + W_3 x_<e> + b_c)
    u_4 = f(W_1 x_2 + W_2 x_<e> + W_3 x_<e> + b_c)

where W_1, W_2, W_3 are weight matrices, b_c is a bias vector, and f is an activation function. These four vectors correspond to four ways of composing the two children: (1) the first child stands alone (e.g., when the information of the second child is not important, it is better to ignore it); (2, 3) the two children are composed with two different weight matrix sets; (4) the second child stands alone.

[2] While finalizing the current paper we discovered a paper by Zhu et al. (2015) proposing a similar model which is evaluated on syntactic parsing. Our work goes substantially beyond theirs, however, as it takes a parse forest rather than a single tree as input.
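A sketch of the RCNN composition under simplifying assumptions: the padding vectors are fixed to zero (as done in the question classification experiments below) and the fully connected layer after pooling is omitted (as in the sentiment analysis experiments). Not the authors' code:

```python
import numpy as np

d, k = 4, 3                                # toy dimension, window size
rng = np.random.default_rng(3)
# One weight matrix per window position (W1, W2, W3 in the text).
W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(k)]
b_c = np.zeros(d)
f = np.tanh

pad = np.zeros(d)                          # <b>/<e> vectors, fixed to 0 here

def rcnn_compose(children):
    """Pad with k-1 tokens on each side, slide a non-linear window-k kernel,
    then take the element-wise max over the pool of l + k - 1 vectors."""
    seq = [pad] * (k - 1) + list(children) + [pad] * (k - 1)
    pool = [f(sum(W[j] @ seq[i + j] for j in range(k)) + b_c)
            for i in range(len(seq) - k + 1)]
    return np.stack(pool).max(axis=0), len(pool)

x1, x2 = rng.normal(size=(2, d))
parent, pool_size = rcnn_compose([x1, x2])   # pool_size == 2 + k - 1 == 4
```

The same function handles any branching factor l, which is exactly what makes binarization unnecessary.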
Now, imagine that we must handle binary syntactic rules with different head positions, such as S → NP VP (e.g., "Jones runs"), where the second child is the head, and VP → VBD NP (e.g., "ate spaghetti"), where the first child is the head. We can set the weight matrices such that multiplying W_2 by the vector of a head yields a vector with high-value entries, whereas multiplying W_2 by the vector of a non-head, or multiplying W_1 or W_3 by any vector, yields a vector with low-value entries. This is possible thanks to the max pooling operation and the fact that heads are often more informative than non-heads. If the window size k is smaller than the number of children l, the argument above is still valid in some cases, such as head position. However, there is no longer a direct interaction between any two children whose distance is larger than k.[3] In practice, this problem is not serious because rules with a large number of children are very rare.

Chart Neural Network
Unseen sentences are always parsed by an automatic parser, which is far from perfect and task-independent. Therefore, a good solution is to give the system a set of parses and let it decide which parse is the best, or to combine some of them. The RNN model handles one extreme, where this set contains only one parse. We now consider the other extreme, where the set contains all possible parses. Because the number of all possible binary parse trees of a length-n sentence is the n-th Catalan number, processing individual parses is not practical. We thus propose a new model working on charts in the CKY style (Younger, 1967), called the Chart Neural Network (ChNN). We describe this model with the following example. Given a phrase "ate pho with Milos", a ChNN processes its parse chart as in Figure 4. Because any 2-word constituent has only one parse, the computation for p_1, p_2, p_3 is identical to Equation 1. For the 3-word constituent p_4, because there are two possible productions p_4 → ate p_2 and p_4 → p_1 with, we compute one vector for each production and then apply the max pooling operation to these two vectors to compute p_4:

    u(p_4 → ate p_2) = f(W_1 x_ate + W_2 p_2 + b)
    u(p_4 → p_1 with) = f(W_1 p_1 + W_2 x_with + b)
    p_4 = max{u(p_4 → ate p_2), u(p_4 → p_1 with)}    (3)

where the max is taken element-wise. We do the same to compute p_5. Finally, at the top, there are three productions p_6 → ate p_5, p_6 → p_1 p_3 and p_6 → p_4 Milos. Similarly, we compute one vector for each production and employ the max pooling operation to compute p_6. Because this ChNN processes a chart like the CKY algorithm, its time complexity is O(n^2 d^2 + n^3 d), where n and d are the sentence length and the dimension of the vectors, respectively. A ChNN is thus notably more complex than an RNN, whose complexity is O(nd^2). As in chart parsing, the complexity can be reduced significantly by pruning the chart before applying the ChNN. This is discussed below.

[3] An indirect interaction can be set up through pooling.
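A minimal sketch of the full-chart ChNN computation for the example phrase (toy dimension, random vectors in place of real embeddings, binary productions only):

```python
import numpy as np

d = 4                                      # toy dimension
rng = np.random.default_rng(4)
W1, W2 = rng.normal(scale=0.1, size=(2, d, d))
b = np.zeros(d)
f = np.tanh

words = ["ate", "pho", "with", "Milos"]
n = len(words)
chart = {(i, i + 1): rng.normal(size=d) for i in range(n)}  # word vectors

for length in range(2, n + 1):             # CKY-style: shorter spans first
    for i in range(n - length + 1):
        j = i + length
        # One candidate vector per binary production (split point) ...
        cands = [f(W1 @ chart[(i, s)] + W2 @ chart[(s, j)] + b)
                 for s in range(i + 1, j)]
        # ... pooled element-wise into the cell's single vector (Equation 3).
        chart[(i, j)] = np.stack(cands).max(axis=0)

root = chart[(0, n)]                       # vector pooled over all binary parses
```

Each chart cell stores one d-dimensional vector regardless of how many parses cover its span, which is what keeps the computation polynomial despite the Catalan-number growth of the parse set.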

Forest Convolutional Network
We now introduce the Forest Convolutional Network (FCN) model, which is a combination of the RCNN and the ChNN. The idea is to use an automatic parser to prune a chart,[5] debinarize productions (if applicable), and then apply a ChNN in which the computation in Equation 3 is replaced by a convolutional layer followed by the max pooling operation and a fully connected layer, as in the RCNN. Figure 5 illustrates how the FCN works on the phrase "ate pho with Milos".

Figure 5: Forest of parses (left) and Forest Convolutional Network (right). ⊗ denotes a convolutional layer followed by the max pooling operation and a fully connected layer as in Figure 3.
The forest of parses, given by an external parser, comprises two parses (VP ate pho (PP with Milos)) (solid lines) and (VP ate (NP pho (PP with Milos))) (dash-dotted lines). The first parse is the preferred reading if Milos is a person, but the second is also possible (for instance, if Milos is the name of a sauce). Instead of forcing the external parser to decide which one is correct, we let the FCN do so, because it has more information about the context and domain, which is embedded in the training data. What the network should do is depicted in Figure 5 (right).
Training. Training an FCN is similar to training an RNN. We use the mini-batch gradient descent method to minimize an objective function J, which depends on the task the network is applied to. For instance, if the task is sentiment analysis, J is the cross-entropy over the training sentence set D plus an L2-norm regularization term:

    J(θ) = −(1/|D|) Σ_{s∈D} Σ_{p∈s} log Pr(c_p | p) + (λ/2) ||θ||^2

where θ is the parameter set, c_p is the sentiment class of phrase p, p is the vector representation at the node covering p, Pr(c_p|p) is computed by the softmax function, and λ is the regularization parameter.

[5] …score on section 23 of the Penn Treebank, whereas the resulting forests are very compact: the average number of hyperedges per forest is 123.1.
The gradient ∂J/∂θ is computed efficiently thanks to the back-propagation through structure (Goller and Küchler, 1996). We use the AdaGrad method (Duchi et al., 2011) to automatically update the learning rate for each parameter.
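A toy sketch of an AdaGrad update loop (illustrative objective and hyper-parameters of our choosing, not the actual training code):

```python
import numpy as np

def adagrad_step(theta, grad, hist, lr=0.1, eps=1e-8):
    """AdaGrad (Duchi et al., 2011): per-parameter learning rate scaled
    by the accumulated squared gradients."""
    hist = hist + grad ** 2
    theta = theta - lr * grad / (np.sqrt(hist) + eps)
    return theta, hist

# Minimize a toy quadratic J(theta) = 0.5 * ||theta - target||^2;
# in the FCN, grad would come from back-propagation through structure.
target = np.array([1.0, -2.0, 0.5])
theta, hist = np.zeros(3), np.zeros(3)
for _ in range(200):
    grad = theta - target                  # gradient of the toy objective
    theta, hist = adagrad_step(theta, grad, hist, lr=0.5)
```

Parameters that receive large or frequent gradients automatically get smaller effective learning rates, which is convenient when tree structures make gradient magnitudes vary across parameters.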

Experiments
We evaluate the FCN model on two tasks: question classification and sentiment analysis. The evaluation metric is classification accuracy.
Our networks were initialized with the 300-D GloVe word embeddings trained on a corpus of 840B tokens (Pennington et al., 2014). The initial values for a weight matrix were uniformly sampled from the symmetric interval [−1/√n, 1/√n], where n is the total number of input units. In each experiment, a development set was used to tune the model. We ran the model ten times and chose the run with the highest performance on the development set. We employed early stopping: training is halted when performance on the development set does not improve after three consecutive epochs.
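The initialization scheme can be sketched as follows (the helper name is ours):

```python
import numpy as np

def init_weight(n_out, n_in, rng):
    """Uniform samples from [-1/sqrt(n), 1/sqrt(n)], n = number of input units."""
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_out, n_in))

# e.g., a 200-dimensional inner-node layer fed by 300-D GloVe embeddings:
W = init_weight(200, 300, np.random.default_rng(5))
```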

Sentiment Analysis
The Stanford Sentiment Treebank (SST) (Socher et al., 2013) also supports binary sentiment (positive, negative) classification by removing neutral labels, leading to 6920 sentences for training, 872 for development, and 1821 for testing. All sentences were parsed by Liang Huang's dependency parser (Huang and Sagae, 2010). We used this parser because it generates parse forests and because dependency trees are less deep than constituent trees. In addition, because the SST was annotated in a constituency manner, we also employed Charniak's constituent parser (Charniak and Johnson, 2005) with Huang (2008)'s forest pruner. We found that a beam width of 16 for the dependency parser and a log-probability beam of 10 for the other worked best. Lower values harmed the system's performance and higher values were not beneficial.
Our FCN uses 200-dimensional vectors at inner nodes, a window size of 7 for the convolutional kernel, and the tanh activation function. It was trained with a learning rate of 0.01, a regularization parameter of 10^−4, and a mini-batch size of 5. To reduce the average depth of the network, the fully connected layer following the convolutional layer was removed (i.e., p = x; see Figure 3).
Constituent parsing is clearly more helpful than dependency parsing: the FCN's improvements are 0.6% on the fine-grained task and 0.9% on the binary task. We conjecture that, because sentences in the treebank were parsed by a constituent parser (in this case the Stanford parser), training with constituent forests is easier.

Question Classification
In this task we used the TREC question dataset (Li and Roth, 2002), which contains 5952 questions (5452 for training and 500 for testing). The task is to assign a question to one of six types: ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC. The average length of the questions in the training set is 10.2, whereas in the test set it is 7.5. This difference is due to the fact that the questions come from different sources. All questions were parsed by Liang Huang's dependency parser with a beam width of 16.
We randomly picked 5% of the training set (272 questions) for validation. Our FCN uses 200-dimensional vectors at inner nodes, a window size of 5 for the convolutional kernel, and the tanh activation function. It was trained with a learning rate of 0.01, a regularization parameter of 10^−4, and a mini-batch size of 1. The vectors representing the two padding tokens <b> and <e> were fixed to 0.
We compare the FCN against the Convolutional neural network (CNN) (Kim, 2014) and the Dynamic convolutional neural network (DCNN) (Kalchbrenner et al., 2014).[9] We also include the LSTM-RNN (Le and Zuidema, 2015), whose accuracy was computed by running their published source code[11] on binary trees from the Stanford Parser[12] (Klein and Manning, 2003). This network was also initialized with the 300-D GloVe word embeddings. Table 2 shows the results.[13] The FCN achieved the second best accuracy, only slightly lower than that of SVM_S (by 0.2%). This is a promising result because our network used only parse forests and unsupervisedly pre-trained word embeddings, whereas SVM_S used heavily engineered resources. The difference between the FCN and the third best is remarkable (1.2%). Interestingly, LSTM-RNN did not perform well on this dataset. This is likely because the questions are short and the parse trees quite shallow, such that the two problems the LSTM was invented for (long-range dependency and vanishing gradient) do not play much of a role.

[9] LSTM-RNN and CT-LSTM are very similar: they are RNNs using LSTMs for composition. Their difference is that LSTM-RNN uses one input gate for each child whereas CT-LSTM uses only one input gate for all children.

Visualization
We visualize the charts we obtained in the sentiment analysis task as in Figure 6. To identify how important each cell is for determining the final vector at the root, we compute the number of features of each cell that are actually propagated all the way to the root in the successive max pooling operations. The circles in the graph are proportional to this number. Here, to make the contribution of each individual cell clearer, we have set the window size to 1 to avoid direct interactions between cells.

[11] https://github.com/lephong/lstm-rnn
[12] http://nlp.stanford.edu/software/lex-parser.shtml
[13] While finalizing the current paper we discovered a paper by Ma et al. (2015) proposing a convolutional network model for dependency trees. They report a new state-of-the-art accuracy of 95.6%.
At the lexical level, we can see that the FCN can discriminate important words from the others. The two words "most" and "incoherent" are the key to the sentiment of this sentence: if one of them is replaced by another word (e.g., replacing "most" by "few" or "incoherent" by "coherent"), the sentiment will flip. The punctuation mark ".", however, also has a high contribution to the root. This happens in other charts as well. We conjecture that the network uses the vector of "." to store neutral features and propagates them to the root whenever it cannot find more useful features in other vectors. Examining this is left to future work.
At the phrasal level, the network tends to group words into grammatical constituents, such as "most of the action setups" and "are incoherent". Ill-formed constituents such as "of the action" and "incoherent ." receive little attention from the network.
Interestingly, the circle of "incoherent" is larger than the circles of any inner cells, suggesting that the network is able to make use of parses containing direct links from that word to the root. This is evidence that the network is able to select (or combine) parses that are beneficial to this sentiment analysis task.
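The cell-importance statistic described above can be sketched, for a single element-wise max pooling step, by counting per candidate vector how many dimensions it wins (our reading of the method, not the authors' code):

```python
import numpy as np

def propagated_counts(candidates):
    """For an element-wise max pooling over candidate vectors, count how many
    features (dimensions) each candidate contributes to the pooled result."""
    stacked = np.stack(candidates)         # (num_candidates, d)
    winners = stacked.argmax(axis=0)       # winning candidate per dimension
    return np.bincount(winners, minlength=len(candidates))

cands = [np.array([0.9, 0.1, 0.3]), np.array([0.2, 0.8, 0.7])]
counts = propagated_counts(cands)          # candidate 0 wins dim 0; 1 wins dims 1, 2
```

Applying this counting recursively from the root down through the successive max pooling operations yields, for each chart cell, the number of its features that survive all the way to the root.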

Related Work
The idea that a composition function must be able to change its behaviour on the fly according to the input vectors is explored by Socher et al. (2013) and Le and Zuidema (2015), among others. The tensor in the former is multiplied with the vector representations of the phrases it is going to combine, to define a composition function (a matrix) on the fly, which then multiplies again with these vectors to yield a compound representation. In the LSTM architecture of the latter, there is one input gate for each child in order to control how the vector of the child affects the composition at the parent node. Because the input gate is a function of the vector of the child, the composition function has an infinite number of behaviours. In this paper, we instead slide a kernel function along the sequence of children to generate different ways of composition. Although the number of behaviours is limited (and depends on the window size), the kernel simultaneously offers several candidate compositions for the max pooling operation to choose from.

Some approaches try to overcome the problem of varying branching sizes. Le and Zuidema (2014b) use different sets of weight matrices for different branching sizes, thus requiring a large number of parameters. Because rules with large branching sizes are rare, many parameters are infrequently updated during training. Other work on dependency trees uses a weight matrix for each position relative to the head word (e.g., first-left, second-right); Le and Zuidema (2014a) replace relative positions by dependency relations (e.g., OBJ, SUBJ). These approaches strongly depend on input parse trees and are very sensitive to parsing errors. The approach presented in this paper, on the other hand, does not need information about the head word position and is less sensitive to parsing errors. Moreover, its number of parameters is independent of the maximal branching size.
Convolutional networks have been widely applied to solve natural language processing tasks. Collobert et al. (2011), Kalchbrenner et al. (2014), and Kim (2014) use convolutional networks to deal with sequences of varying length. Recently, Zhu et al. (2015) and Ma et al. (2015) have tried to integrate syntactic information by employing parse trees. Ma et al. (2015) extend the work of Kim (2014) by taking into account dependency relations so that long-range dependencies can be captured. The model proposed by Zhu et al. (2015), which is very similar to our Recursive convolutional neural network model, uses a convolutional network for the composition purpose. Our work, although also employing a convolutional network and syntactic information, goes beyond theirs: we address the issue of how to deal with uncertainty about the correct parse inside the neural architecture. Therefore, instead of using a single parse, our proposed FCN model takes as input a forest of parses.
Related to our FCN is the Gated recursive convolutional neural network model proposed by Cho et al. (2014), which stacks n − 1 convolutional layers using a window-size-2 gated kernel (where n is the sentence length). Mapping their network onto a chart, each cell is only connected to the two cells right below it. What makes this network special is the gated kernel, a three-gate switcher for choosing one of three options: directly transmit the left or right child's vector to the parent node, or compose the vectors of the two children. Thanks to this, the network can capture any binary parse tree by setting the gates properly. However, because only one gate is allowed to open in a cell, the network is not able to capture an arbitrary forest. Our FCN is thus more expressive and flexible than their model.

Conclusions
We proposed the Forest Convolutional Network (FCN) model, which addresses three issues: (1) how to make the composition functions adaptive, (2) how to deal with different branching factors of nodes in the relevant syntactic trees, and (3) how to deal with uncertainty about the correct parse inside the neural architecture. The key principle is to carry out many different ways of computation and then choose or combine some of them. In more detail, the first two issues are solved by employing a convolutional network for composition; for the third issue, the network takes as input a forest of parses instead of a single parse as in traditional approaches.
Our future work will focus on how to choose or combine different ways of computation. For instance, we might replace max pooling with different pooling operations such as mean pooling, k-max pooling (Kalchbrenner et al., 2014), or stochastic pooling (Zeiler and Fergus, 2013). We could even bias the selection/combination toward grammatical constituents by weighting cells by their inside probabilities.