Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing

Latent tree learning models represent sentences by composing their words according to an induced parse tree, all based on a downstream task. These models often outperform baselines which use (externally provided) syntax trees to drive the composition order. This work contributes (a) a new latent tree learning model based on shift-reduce parsing, with competitive downstream performance and non-trivial induced trees, and (b) an analysis of the trees learned by our shift-reduce model and by a chart-based model.


Introduction
Popular recurrent neural networks in NLP, such as the Gated Recurrent Unit (Cho et al., 2014) and Long Short-Term Memory (Hochreiter and Schmidhuber, 1997), compute sentence representations by reading their words in a sequence. In contrast, the Tree-LSTM architecture (Tai et al., 2015) processes words according to an input parse tree, and manages to achieve improved performance on a number of linguistic tasks.
Recently, Yogatama et al. (2016), Maillard et al. (2017), and Choi et al. (2017) all proposed sentence embedding models which work similarly to a Tree-LSTM, but do not require any parse trees as input. These models function without the assistance of an external automatic parser, and without ever being given any syntactic information as supervision. Rather, they induce parse trees by training on a downstream task such as natural language inference. At the heart of these models is a mechanism to assign trees to sentences: effectively, a natural language parser. Williams et al. (2017a) have recently investigated the tree structures induced by two of these models, trained for a natural language inference task. Their analysis showed that the model of Yogatama et al. (2016) learns mostly trivial left-branching trees and has inconsistent performance, while that of Choi et al. (2017) outperforms all baselines (including those using trees from conventional parsers), but learns trees that do not correspond to those of conventional treebanks.
In this paper, we propose a new latent tree learning model. Similarly to Yogatama et al. (2016), we base our approach on shift-reduce parsing. Unlike their work, our model is trained via standard backpropagation, which is made possible by exploiting beam search to obtain an approximate gradient. We show that this model performs well compared to baselines, and induces trees that are not as trivial as those learned by the Yogatama et al. model in the experiments of Williams et al. (2017a). This paper also presents an analysis of the trees learned by our model, in the style of Williams et al. (2017a). We further analyse the trees learned by the model of Maillard et al. (2017), which had not yet been done, and perform evaluations on both the SNLI data (Bowman et al., 2015) and the MultiNLI data (Williams et al., 2017b). The former corpus had not been used for the evaluation of trees of Williams et al. (2017a), and we find that it leads to more consistent induced trees.

Related work
The first neural model which learns to both parse a sentence and embed it for a downstream task is by Socher et al. (2011). The authors train the model's parsing component on an auxiliary task, based on recursive autoencoders, while the rest of the model is trained for sentiment analysis. Bowman et al. (2016) propose the "Shift-reduce Parser-Interpreter Neural Network", a model which obtains syntax trees using an integrated shift-reduce parser (trained on gold-standard trees), and uses the resulting structure to drive composition with Tree-LSTMs. The model of Yogatama et al. (2016) is the first to jointly train its parsing and sentence embedding components. They base their model on shift-reduce parsing. Their parser is not differentiable, so they rely on reinforcement learning for training. Maillard et al. (2017) propose an alternative approach, inspired by CKY parsing. The algorithm is made differentiable by using a soft-gating approach, which approximates discrete candidate selection by a probabilistic mixture of the constituents available in a given cell of the chart. This makes it possible to train with backpropagation. Choi et al. (2017) use an approach similar to easy-first parsing. The parsing decisions are discrete, but the authors use the Straight-Through Gumbel-Softmax estimator (Jang et al., 2017) to obtain an approximate gradient and are thus able to train with backpropagation. Williams et al. (2017a) investigate the trees produced by Yogatama et al. (2016) and Choi et al. (2017) when trained on two natural language inference corpora, and analyse the results. They find that the former model induces almost entirely left-branching trees, while the latter performs well but has inconsistent trees across re-runs with different parameter initializations.
A number of other neural models have also been proposed which create a tree encoding during parsing, but, unlike the above architectures, rely on traditional parse trees. Le and Zuidema (2015) propose a sentence embedding model based on CKY, taking as input a parse forest from an automatic parser. Dyer et al. (2016) propose RNNG, a probabilistic model of phrase-structure trees and sentences, with an integrated parser that is trained on gold-standard trees.

Models
CKY The model of Maillard et al. (2017) is based on chart parsing, and effectively works like a CKY parser (Cocke, 1969; Kasami, 1965; Younger, 1967) using a grammar with a single non-terminal A with rules A → A A and A → α, where α is any terminal. The parse chart is built bottom-up incrementally, as in a standard CKY parser. When ambiguity arises, due to the multiple ways to form a constituent, all options are computed using a Tree-LSTM, and scored. The constituent is then represented as a weighted sum of all possible options, using the normalised scores as weights. In order for this weighted sum to approximate a discrete selection, a temperature hyperparameter is used in the softmax. This process is repeated for the whole chart, and the sentence representation is given by the topmost cell.
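The chart-filling procedure can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`compose`, `cky_embed`) are made up, and a toy tanh layer stands in for the Tree-LSTM composition function.

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def compose(left, right, W):
    # Toy stand-in for the binary Tree-LSTM composition function.
    return np.tanh(W @ np.concatenate([left, right]))

def cky_embed(words, W, score_vec, temperature=0.1):
    """Build the chart bottom-up; spans are half-open intervals [i, j)."""
    n = len(words)
    chart = {(i, i + 1): words[i] for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # One candidate constituent per split point k, each scored.
            options = [compose(chart[(i, k)], chart[(k, j)], W)
                       for k in range(i + 1, j)]
            scores = np.array([score_vec @ o for o in options])
            weights = softmax(scores, temperature)
            # Soft gating: the cell holds a probabilistic mixture of candidates,
            # which approximates a discrete selection as temperature -> 0.
            chart[(i, j)] = sum(w * o for w, o in zip(weights, options))
    return chart[(0, n)]  # topmost cell is the sentence representation
```

Because every cell is a differentiable mixture rather than a hard choice, the whole chart can be trained end-to-end with backpropagation.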
We noticed in our experiments that the weighted sum still occasionally assigned non-trivial weight to more than one option. The model was thus able to utilize multiple inferred trees, rather than a single one, which would have potentially given it an advantage over other latent tree models. Hence for fairness, in our experiments we replace the softmax-with-temperature of Maillard et al. (2017) with a softmax followed by a straight-through estimator (Bengio et al., 2013). In the forward pass, this approach is equivalent to an argmax function; while in the backward pass it is equivalent to a softmax. Effectively, this means that a single tree is selected during forward evaluation, but the training signal can still propagate to every path during backpropagation. This change did not noticeably affect performance on development data.
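The forward/backward asymmetry of the straight-through estimator can be made concrete with a small NumPy sketch (a manual illustration under assumed names, not a framework implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def st_argmax_forward(scores):
    """Forward pass: discrete selection, i.e. a one-hot arg-max over candidates."""
    probs = softmax(scores)
    hard = np.zeros_like(probs)
    hard[probs.argmax()] = 1.0
    return hard, probs

def st_argmax_backward(probs, grad_out):
    """Backward pass: pretend the forward op was the softmax itself, so the
    training signal flows to every candidate, not just the selected one."""
    jac = np.diag(probs) - np.outer(probs, probs)  # softmax Jacobian
    return jac @ grad_out
```

In an autodiff framework the same effect is usually obtained with an expression like `hard + probs - probs.detach()`, which evaluates to the one-hot vector but carries the softmax gradient.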
Beam Search Shift-Reduce We propose a model based on beam search shift-reduce parsing (BSSR). The parser works with a queue, which holds the embeddings of the nodes representing individual words still to be processed; and a stack, which holds the embeddings of the nodes which have already been computed. A standard binary Tree-LSTM function (Tai et al., 2015) is used to compute the d-dimensional embeddings of nodes:

    (i, f_L, f_R, o, u) = W w + U_L h_L + U_R h_R + b
    c = σ(f_L) ⊙ c_L + σ(f_R) ⊙ c_R + σ(i) ⊙ tanh(u)
    h = σ(o) ⊙ tanh(c)

where W, U_L, U_R are learned 5d × d matrices, and b is a learned 5d vector. The d-dimensional vectors σ(i), σ(f_L), σ(f_R) are known as the input gate and the left- and right-forget gates, respectively; σ(o) and tanh(u) are known as the output gate and the candidate update. The vector w is a word embedding, while h_L, h_R and c_L, c_R are the children's h- and c-states.

At the beginning, the queue contains embeddings for the nodes corresponding to single words. These are obtained by computing the Tree-LSTM with w set to the word embedding, and h_L, h_R, c_L, c_R set to zero. When a SHIFT action is performed, the topmost element of the queue is popped and pushed onto the stack. When a REDUCE action is performed, the top two elements of the stack are popped; a new node is computed as their parent by passing the children through the Tree-LSTM with w = 0, and is pushed onto the stack.

Parsing actions are scored with a simple multilayer perceptron, which looks at the top two stack elements and the top queue element:

    p = softmax(a + A tanh(W^(1) h_s1 + W^(2) h_s2 + W^(3) h_q1))

where h_s1, h_s2, h_q1 are the h-states of the top two elements of the stack and the top element of the queue, respectively. The three W matrices have dimensions d × d and are learned; a is a learned 2-dimensional vector; and A is a learned 2 × d matrix. The final scores are given by log p, and the best action is greedily selected at every time step. The sentence representation is given by the h-state of the top element of the stack after 2n − 1 steps.
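The greedy variant of the parser can be sketched as follows. This is a forward-pass-only NumPy illustration with randomly initialised toy parameters (the class and function names are ours, and impossible actions are masked out rather than scored, a detail the prose leaves implicit):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TreeLSTM:
    # Toy binary Tree-LSTM (Tai et al., 2015); shapes follow the text:
    # three 5d x d matrices and a 5d bias, producing gates i, f_L, f_R, o, u.
    def __init__(self, d, rng):
        self.d = d
        self.W  = rng.standard_normal((5 * d, d)) * 0.1
        self.UL = rng.standard_normal((5 * d, d)) * 0.1
        self.UR = rng.standard_normal((5 * d, d)) * 0.1
        self.b  = np.zeros(5 * d)

    def node(self, w, hL, cL, hR, cR):
        d = self.d
        g = self.W @ w + self.UL @ hL + self.UR @ hR + self.b
        i, fL, fR, o, u = (g[k * d:(k + 1) * d] for k in range(5))
        c = sigmoid(fL) * cL + sigmoid(fR) * cR + sigmoid(i) * np.tanh(u)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def greedy_shift_reduce(word_embs, cell, Ws1, Ws2, Wq1, A, a):
    d = cell.d
    zero = np.zeros(d)
    # Leaves: Tree-LSTM with w = word embedding and zeroed children.
    queue = [cell.node(w, zero, zero, zero, zero) for w in word_embs]
    stack, actions = [], []
    n = len(word_embs)
    for _ in range(2 * n - 1):
        hs1 = stack[-1][0] if len(stack) >= 1 else zero
        hs2 = stack[-2][0] if len(stack) >= 2 else zero
        hq1 = queue[0][0] if queue else zero
        scores = a + A @ np.tanh(Ws1 @ hs1 + Ws2 @ hs2 + Wq1 @ hq1)
        # SHIFT (0) needs a non-empty queue; REDUCE (1) needs >= 2 stack items.
        can_shift, can_reduce = bool(queue), len(stack) >= 2
        act = 0 if (can_shift and (not can_reduce or scores[0] >= scores[1])) else 1
        if act == 0:
            stack.append(queue.pop(0))
        else:
            (hR, cR), (hL, cL) = stack.pop(), stack.pop()
            stack.append(cell.node(zero, hL, cL, hR, cR))
        actions.append(act)
    return stack[-1][0], actions  # h-state of the root after 2n - 1 steps
```

After 2n − 1 steps exactly n of the actions are SHIFTs, so the stack always ends with a single root node.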
In order to make this model trainable with gradient descent, we use beam search to select the b best action sequences, where the score of a sequence of actions is given by the sum of the scores of the individual actions. The final sentence representation is then a weighted sum of the sentence representations from the elements of the beam. The weights are given by the respective scores of the action sequences, normalised by a softmax and passed through a straight-through estimator. This is equivalent to having an argmax on the forward pass, which discretely selects the top-scoring beam element, and a softmax in the backward pass.

Table 1: Test set accuracy (%).
Model                        SNLI   MultiNLI
… (Williams)                 82.6   69.1
100D Tree-LSTM (Yogatama)    78.5   –
300D SPINN (Williams)        82.2   67.5
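The beam search over action sequences, and the straight-through combination of the beam elements, can be sketched as follows. This is a schematic illustration: `step_scores` is a toy stand-in for the MLP over stack/queue states, and the function names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def beam_search_actions(n, b, step_scores):
    """Return the b best valid shift/reduce action sequences (0 = SHIFT,
    1 = REDUCE) for a sentence of n words. step_scores(prefix) returns a
    length-2 array of per-action log-probabilities for that prefix; sequence
    scores are sums of per-action scores."""
    beam = [((), 0.0)]
    for _ in range(2 * n - 1):
        expanded = []
        for prefix, score in beam:
            shifts = prefix.count(0)
            depth = shifts - prefix.count(1)  # current stack size
            logp = step_scores(prefix)
            if shifts < n:                    # SHIFT legal while queue non-empty
                expanded.append((prefix + (0,), score + logp[0]))
            if depth >= 2:                    # REDUCE legal with >= 2 stack items
                expanded.append((prefix + (1,), score + logp[1]))
        beam = sorted(expanded, key=lambda e: -e[1])[:b]
    return beam

def combine_beam(reps, scores):
    # Straight-through combination: the forward pass selects the top-scoring
    # beam element; the backward pass would use the soft weights w instead.
    w = softmax(np.array(scores))
    hard = np.zeros_like(w)
    hard[w.argmax()] = 1.0
    return sum(h * r for h, r in zip(hard, reps))
```

Because the legality masks force exactly n SHIFTs in 2n − 1 steps, every surviving beam element corresponds to a complete binary tree over the sentence.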

Experiments
For each model and dataset, we train five instances using different random initialisations, for a total of 2 × 2 × 5 = 20 instances.

NLI Accuracy
We measure SNLI and MultiNLI test set accuracy for CKY and BSSR. The aim is to ensure that they perform reasonably, and are in line with other latent tree learning models of a similar size and complexity. Results for the best model instances are shown in Table 1. While our models do not reach the state of the art, they perform at least as well as other latent tree models using 100D embeddings, and are competitive with some 300D models. They also outperform the 100D Tree-LSTM of Yogatama et al. (2016), which is given syntax trees, and match or outperform 300D SPINN, which is explicitly trained to parse.
Self-consistency Next, we examine the consistency of the trees produced for the development sets. Adapting the code of Williams et al. (2017a), we measure the models' self F1, defined as the unlabelled F1 between the trees produced by two instances of the same model (given by different random initializations), averaged over all possible pairs. Results are shown in Table 2. In order to test whether BSSR and CKY learn similar grammars, we calculate the inter-model F1, defined as the unlabelled F1 between instances of BSSR and CKY trained on the same data, averaged over all possible pairs. We find an average F1 of 42.6 for MultiNLI+ and 55.0 for SNLI, both above the random baseline.
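The metric can be sketched in plain Python. This is a simplified illustration under assumptions of our own: trees are nested tuples of word positions, leaf spans are excluded from the bracketings (the exact span conventions of the evaluation code may differ), and the function names are made up.

```python
from itertools import combinations

def spans(tree, start=0):
    """Collect (start, end) spans of the internal constituents of a binary
    tree given as nested tuples, e.g. ((0, 1), 2). Leaf values are ignored;
    positions are derived from the structure."""
    if isinstance(tree, int):
        return start + 1, set()
    end, out = start, set()
    for child in tree:
        end, s = spans(child, end)
        out |= s
    out.add((start, end))
    return end, out

def unlabelled_f1(pred, gold):
    sp, sg = spans(pred)[1], spans(gold)[1]
    overlap = len(sp & sg)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(sp), overlap / len(sg)
    return 2 * p * r / (p + r)

def self_f1(trees_by_run):
    """Average sentence-level unlabelled F1 over all pairs of model instances;
    trees_by_run[k] is the list of trees produced by instance k."""
    pairs = list(combinations(trees_by_run, 2))
    total = sum(
        sum(unlabelled_f1(a, b) for a, b in zip(run_a, run_b)) / len(run_a)
        for run_a, run_b in pairs
    )
    return total / len(pairs)
```

For example, a left-branching and a right-branching tree over three words share only the full-sentence span, giving an unlabelled F1 of 0.5 under these conventions.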
Our Self F1 results are all above the baseline of random trees. For MultiNLI+, they are in line with ST-Gumbel. Remarkably, the models trained on SNLI are noticeably more self-consistent. This shows that the specifics of the training data play an important role, even when the downstream task is the same. A possible explanation is that MultiNLI has longer sentences, as well as multiple genres, including telephone conversations which often do not constitute full sentences (Williams et al., 2017b). This would require the models to learn how to parse a wide variety of styles of data. It is also interesting to note that the inter-model F1 scores are not much lower than the self F1 scores. This shows that, given the same training data, the grammars learned by the two different models are not much more different than the grammars learned by two instances of the same model.

F1 Scores
Finally, we investigate whether these models learn grammars that are recognisably left-branching, right-branching, or similar to the trees produced by the Stanford Parser, which are included in both datasets. We report the unlabelled F1 between these and the trees from our models in Table 2, averaged over the five model instances. We show mean, standard deviation, and maximum.
We find a slight preference from BSSR and the SNLI-trained CKY towards left-branching structures. Our models do not learn anything that resembles the trees from the Stanford Parser, and have an F1 score with them which is at or below the random baseline. Our results match those of Williams et al. (2017a), which show that whatever these models learn, it does not resemble PTB grammar.
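The left- and right-branching comparison trees are deterministic and easy to generate. A small sketch, using the same nested-tuple tree convention assumed earlier (the function names are ours):

```python
def left_branching(n):
    """Fully left-branching binary tree over word positions 0..n-1,
    e.g. n = 4 gives (((0, 1), 2), 3)."""
    tree = 0
    for i in range(1, n):
        tree = (tree, i)
    return tree

def right_branching(n):
    """Fully right-branching binary tree over word positions 0..n-1,
    e.g. n = 4 gives (0, (1, (2, 3)))."""
    tree = n - 1
    for i in range(n - 2, -1, -1):
        tree = (i, tree)
    return tree
```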

Conclusions
First, we proposed a new latent tree learning model based on a shift-reduce parser. Unlike a previous model based on the same parsing technique, we showed that our approach does not learn trivial trees, and performs competitively on the downstream task.
Second, we analysed the trees induced by our shift-reduce model and a latent tree model based on chart parsing. Our results confirmed those of previous work on different models, showing that the learned grammars do not resemble PTB-style trees (Williams et al., 2017a). Remarkably, we saw that the two different models tend to learn grammars which are not much more different than those learned by two instances of the same model.
Finally, our experiments highlight the importance of the choice of training data for latent tree learning models, even when the downstream task is the same. Our results suggest that MultiNLI, which has on average longer sentences coming from different genres, might be hindering the current models' ability to learn consistent grammars. For future work investigating this phenomenon, it may be interesting to train models using only the written genres of MultiNLI, or MultiNLI without the SNLI corpus.