Recursive Top-Down Production for Sentence Generation with Latent Trees

We model the recursive production property of context-free grammars for natural and synthetic languages. To this end, we present a dynamic programming algorithm that marginalises over latent binary tree structures with N leaves, allowing us to compute the likelihood of a sequence of N tokens under a latent tree model, which we maximise to train a recursive neural function. We demonstrate performance on two synthetic tasks: SCAN, where it outperforms previous models on the LENGTH split, and English question formation, where it performs comparably to decoders with the ground-truth tree structure. We also present experimental results on German-English translation on the Multi30k dataset, and qualitatively analyse the induced tree structures our model learns for the SCAN tasks and the German-English translation task.


Introduction
Given the hierarchical nature of natural language, tree structures have long been considered a fundamental part of natural language understanding. In recent years, a number of studies have shown that incorporating these structures into deep learning systems can be beneficial for various natural language tasks (Socher et al., 2013; Bowman et al., 2015; Eriguchi et al., 2016).
Various work has explored the introduction of syntactic structures into recursive encoders, either with explicit syntactic information (Du et al., 2020; Socher et al., 2010; Dyer et al., 2016) or by means of unsupervised latent tree learning (Williams et al., 2018; Shen et al., 2019; Kim et al., 2019b). Some attempts at formulating structured decoders are Zhang et al. (2015a) and Alvarez-Melis and Jaakkola (2016), which propose binary top-down tree LSTM architectures for natural language. Chen et al. (2018) propose a tree-structured decoder for code generation. These methods require ground-truth trees from an external source, and this extra input may not be available for all languages or data sources.

Figure 1: Our generative model is a recursive top-down neural network that recursively splits a root node embedding with some node-dependent probability. When the splitting stops, it emits a word with some probability. The joint probability of a sentence and its associated binary tree is the product of the probability of the tree, (1 − l_1)(1 − l_3)(1 − l_5) l_2 l_4 l_6 l_7, and the probabilities of the words emitted at its leaves. We devise a novel marginalisation algorithm over binary trees to compute the likelihood of a sentence.
In this work, we propose a tree-based probabilistic decoder model for sequence-to-sequence tasks. Our model generates sentences from a latent tree structure that aims to reflect natural language syntax. The method assumes that each token in a sentence is emitted at a leaf of a full but latent binary tree (Fig. 1). The tree is obtained by recursively producing node embeddings from a root embedding with a recursive neural network. Word emission probabilities are functions of the leaf embeddings. We describe a novel dynamic programming algorithm for exact marginalisation over the large number of latent binary trees.
Our generative model parametrizes a prior over binary trees with a stick-breaking process, similar to the "penetration probabilities" defined in Mochihashi and Sumita (2008). It is related to a long tradition of unsupervised grammar induction models that formulate a generative model of sentences (Klein and Manning, 2001;Bod, 2006;Klein and Manning, 2005).
Unlike more recent bottom-up approaches such as Kim et al. (2019a), which require the inside-outside algorithm (Baker, 1979) to marginalise over tree structures, our approach is top-down and comes with an efficient marginalisation algorithm. Top-down models can be useful, as the decoder is encouraged by design to keep global context while generating sentences (Du and Black, 2019; Gū et al., 2018).
In the next section, we will describe the algorithm that marginalises over latent tree structures under some independence assumptions. We first introduce these assumptions and show that by introducing the notion of successive leaves, we can efficiently sum over different tree structures. We then introduce the details of the recursive architecture used. Finally, we present the experimental results of the model in Section 5.

Generative Process
We assume that each sequence is generated by means of an underlying tree structure that takes the form of a full binary tree, i.e. a tree in which each node is either a leaf or has two children. A sequence of tokens is produced with the following generative process: first, sample a full binary tree T from a distribution p(T). Denote the set of leaves of T as L(T). Then, for each leaf v in L(T), sample a token x ∈ V, where V is the vocabulary, from a conditional distribution p(x|v).
Under this model, the probability of a sequence x_1:N can be obtained by marginalising over possible tree structures with N leaves:

p(x_1:N) = Σ_{T : |L(T)| = N} p(T) p(x_1:N | T). (1)

We assume that the probability of sequences with lengths different from the number of leaves in the tree is 0. Our generative process prescribes that, given the tree structure, the probability of each word is independent of the other words, i.e.:

p(x_1:N | T) = ∏_{n=1}^{N} p(x_n | L_n(T)), (2)

where L_n(T) represents the n-th leaf of T. In what follows, we describe an algorithm to efficiently marginalise over possible tree structures, such that the involved distributions can be parametrized by neural networks and trained end-to-end by maximizing the log-likelihood of the observed sequences. We first describe how we model the prior p(T), and then how to compute p(x_1:N) efficiently.
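The generative process above can be rendered as a small sampler. This is an illustrative sketch rather than the paper's implementation: the `split_prob` and `emit` callables are stand-ins for the learned node-dependent and leaf-dependent distributions.

```python
import random

def sample_tree(split_prob, max_depth, depth=0):
    """Sample a full binary tree top-down: a node becomes a leaf with
    probability l = split_prob(depth); otherwise it splits into two children.
    Returns a nested tuple: 'leaf' or (left_subtree, right_subtree)."""
    if depth == max_depth or random.random() < split_prob(depth):
        return "leaf"
    return (sample_tree(split_prob, max_depth, depth + 1),
            sample_tree(split_prob, max_depth, depth + 1))

def leaves(tree):
    """Number of leaves; tokens are emitted at the leaves, left to right."""
    if tree == "leaf":
        return 1
    return leaves(tree[0]) + leaves(tree[1])

def sample_sentence(split_prob, emit, max_depth):
    """Sample a tree, then emit one token per leaf (left to right)."""
    tree = sample_tree(split_prob, max_depth)
    return tree, [emit() for _ in range(leaves(tree))]
```

With a leaf probability of 0 every node splits until the depth bound, giving 2^max_depth leaves; with a leaf probability of 1 the tree is a single leaf emitting one token.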

Probability of a full binary tree
We model the prior probability of a full binary tree p(T) by using a branching process similar to the stick-breaking construction, which can be used to model a series of stochastic binary decisions until success (Sethuraman, 1994). In our model, we perform a series of binary decisions at each vertex, starting at the root and branching downwards. Each decision consists in whether to expand the current node by creating two children or not, and is therefore modeled with a Bernoulli random variable. Let us define a complete binary tree T_C of depth D_C with vertices {v_1, ..., v_M}, M = 2^{D_C+1} − 1. Each vertex v_i is associated with a Bernoulli parameter l_i ∈ [0, 1], θ = {l_1, ..., l_{2^{D_C+1}−1}}, modeling its probability of halting (emitting at a leaf); the probabilities (1 − l_i) are similar to the "penetration probabilities" mentioned in Mochihashi and Sumita (2008). A full binary tree of depth D ≤ D_C is contained in T_C, so we will refer to it as an internal tree from here on. See Fig. 2 for an example of two internal trees with three leaves. The probability of an internal tree T can be expressed using the parameters l_i as p(T) = π(root), where π is defined recursively as:

π(v_i) = l_i if v_i ∈ L(T); π(v_i) = (1 − l_i) π(left(v_i)) π(right(v_i)) otherwise, (3)

where left(v_i) and right(v_i) are the left and right children of v_i respectively.

Figure 2: In this figure, root(T_C) = v_4. Given N = 3, there are two possible trees, T_1 and T_2. The probabilities of the trees can be expressed as the recurrent process described in Equation 3, or as a product of m(·) at the leaf vertices of the internal tree.
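The recursion for π can be written down directly. In this sketch the complete tree is heap-indexed (children of vertex v are 2v+1 and 2v+2), which is a convention of the illustration rather than of the paper, and an internal tree is identified by the set of its leaf vertices:

```python
def tree_prob(l, leaf_set, v=0):
    """pi(v): probability of the internal tree whose leaves are leaf_set,
    rooted at heap-indexed vertex v. l[v] is the Bernoulli parameter for
    stopping (becoming a leaf) at v."""
    if v in leaf_set:
        return l[v]                      # v is a leaf of T: contribute l_v
    # v splits: contribute (1 - l_v) and recurse into both children
    return (1 - l[v]) * tree_prob(l, leaf_set, 2 * v + 1) \
                      * tree_prob(l, leaf_set, 2 * v + 2)
```

For a depth-2 complete tree with leaves at vertices {1, 5, 6}, this returns (1 − l_0) l_1 (1 − l_2) l_5 l_6, matching Eq. (3).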

Memoizing the value at each vertex
We can compute Eq. (3) efficiently by storing a partial computation for each vertex and multiplying the values at the leaves to get the tree probability:

p(T) = ∏_{n=1}^{N} m(L_n(T)), (4)

where L_n(T) denotes the vertex corresponding to the n-th leaf of T. We define this value at the vertex v_i to be m(v_i):

m(v_i) = l_i ∏_{v_j ∈ V_root→v_i \ {v_i}} (1 − l_j)^{1/2^{d(v_i) − d(v_j)}}, (5)

where V_i→j denotes the set of vertices on the path from node v_i to node v_j inclusive, and d(v) denotes the depth of v; the factor (1 − l_j) of each internal node is thus spread across the leaves below it, with exponents summing to one. These values can be efficiently computed with the top-down recurrence relation

m̄(v_i) = ((1 − l_parent(v_i)) m̄(parent(v_i)))^{1/2}, (6)
m(v_i) = l_i m̄(v_i), (7)

where parent(v_i) is the parent of v_i, and m̄(root) = 1. For example, in Fig. 2, m(v_1) = (1 − l_4)^{1/4} (1 − l_2)^{1/2} l_1, and the figure demonstrates the case of two internal trees with D = 2 and N = 3 leaves. We can then use Eq.
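The top-down recurrence can be sketched as follows (heap indexing again; the auxiliary m̄ carries the square-rooted ancestor factors of Eq. (6)). The product of m over an internal tree's leaves should reproduce the tree probability of Eq. (3):

```python
def leaf_values(l, depth):
    """m(v) for every vertex of a heap-indexed complete binary tree:
    mbar(root) = 1, mbar(v) = ((1 - l[parent]) * mbar(parent)) ** 0.5,
    and m(v) = l[v] * mbar(v)."""
    n = 2 ** (depth + 1) - 1
    mbar = [1.0] * n
    for v in range(1, n):
        p = (v - 1) // 2                  # parent in heap indexing
        mbar[v] = ((1 - l[p]) * mbar[p]) ** 0.5
    return [l[v] * mbar[v] for v in range(n)]
```

Multiplying m over the leaves {v_1, v_5, v_6} of a depth-2 complete tree gives (1 − l_0) l_1 (1 − l_2) l_5 l_6: each internal factor is recovered exactly once because the fractional exponents over the leaves below it sum to one.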
(2) and Eq. (4) to write the joint probability of a sequence and a tree:

p(x_1:N, T) = ∏_{n=1}^{N} p(x_n | L_n(T)) m(L_n(T)). (8)

Note that the joint probability factorises as a product of the token probability and the value at each leaf vertex. As we will see later, our method works by traversing the leaves of all possible internal trees, computing the product of the values at the leaves along the way. Therefore, expressing the probability of a full tree as a product of these values ensures that marginalisation stays tractable.

Marginalising over trees
Now that we can compute the probability of a given tree, we need to marginalise over all full binary trees with exactly N leaves. We denote this set by T_N = {T : |L(T)| = N}. The crux of the problem is marginalising over T_N. We know |T_N| ≤ C_{N−1}, where C_n is the n-th Catalan number, with equality occurring when N − 1 ≤ D_C.
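The Catalan bound can be checked by brute-force counting. This helper counts full binary trees with a given number of leaves under a depth budget; it is a verification sketch, not part of the model:

```python
from math import comb

def catalan(n):
    """n-th Catalan number, C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def count_trees(n_leaves, max_depth):
    """Number of full binary trees with n_leaves leaves and depth <= max_depth,
    by splitting the leaves between the two subtrees recursively."""
    if n_leaves == 1:
        return 1
    if max_depth == 0:
        return 0
    return sum(count_trees(k, max_depth - 1) *
               count_trees(n_leaves - k, max_depth - 1)
               for k in range(1, n_leaves))
```

With an unconstrained depth the count equals C_{N−1}; the deepest shape with N leaves (the caterpillar) has depth N − 1, so equality already holds once N − 1 ≤ D_C, and a tighter budget strictly shrinks the set.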
Successive leaves In order to efficiently enumerate all possible internal trees, we define a set of admissible transitions between the vertices of T_C. First, let us define the left and right boundaries of T_C. Starting from the root node and recursively traversing down left children until the leftmost leaf, all vertices visited in this process belong to the left boundary B_l. The right boundary B_r is defined similarly with right children. Given a vertex v, we define the successive leaves of v as the possible next leaves in an internal binary tree in which v is a leaf. As an example, in Figure 3, vertices 5 and 6 are successive leaves of both vertices 2 and 3. Therefore, if we start at a vertex in the left boundary and travel along these allowed transitions until we reach the right boundary, the vertices visited along this path are the leaves of an internal tree. This notion is independent of the length of any sequence, and a traversal from the left boundary of T_C to the right boundary will induce the leaves of a valid internal tree T. As an example, in Figure 3, the admissible transitions 1 → 3 → 6 form a valid internal tree, as do 1 → 3 → 5 → 7.
To list all pairs of allowed transitions v_i to v_j, we compute the Cartesian product of the vertices in the right boundary of the left subtree and the left boundary of the right subtree, and do this recursively for each vertex. See Figure 4 for an illustration of the concept. The pseudo-code for generating all such transitions in a tree is shown in Appendix B: SUCCESSIVELEAVES. The result of SUCCESSIVELEAVES(root) is the set S, which contains pairs of vertices (v_i, v_j) such that v_j is a successive leaf of v_i. Taking N − 1 transitions from the left boundary to the right boundary of T_C results in visiting the N leaves of an internal tree; the proof is in Appendix A.
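A sketch of this construction under heap indexing: per-subtree boundaries are computed, their Cartesian product is taken, and the recursion descends into both subtrees. This reconstructs the idea in our own indexing, not the appendix pseudo-code verbatim:

```python
def boundary(v, depth, side):
    """Left (side=0) or right (side=1) boundary of the subtree rooted at
    heap-indexed vertex v, with `depth` levels below it."""
    out = [v]
    for _ in range(depth):
        v = 2 * v + 1 + side
        out.append(v)
    return out

def successive_leaves(v, depth):
    """All admissible transitions (v_i, v_j) within the subtree rooted at v:
    the Cartesian product of the right boundary of the left subtree with the
    left boundary of the right subtree, collected recursively."""
    if depth == 0:
        return set()
    left, right = 2 * v + 1, 2 * v + 2
    s = {(a, b) for a in boundary(left, depth - 1, 1)
                for b in boundary(right, depth - 1, 0)}
    return s | successive_leaves(left, depth - 1) | successive_leaves(right, depth - 1)
```

For a depth-2 complete tree (vertices 0–6), the six transitions are {(1, 2), (1, 5), (4, 2), (4, 5), (3, 4), (5, 6)}; the path 3 → 4 → 2 from B_l to B_r visits the leaves of one internal tree.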
Marginalisation We can use our transitions S to marginalise over internal trees with N leaves as follows: we fill a table M(v, n) that contains the marginal probability of the prefix x_1:n, where we sum over all partial trees in which vertex v has emitted token x_n:

M(v, n) = Σ_{partial trees v_(1) → ... → v_(n) = v} ∏_{k=1}^{n} p(x_k | v_(k)) m(v_(k)). (9)

We first initialise the values M(v, 1) at the left boundary:

M(v, 1) = p(x_1 | v) m(v) if v ∈ B_l, and 0 otherwise,

which is the state of the table for all prefixes of length 1. Then, for 1 < n ≤ N,

M(v_j, n) = p(x_n | v_j) m(v_j) Σ_{(v_i, v_j) ∈ S} M(v_i, n − 1), (10)

and we see that Eq. (9) can be recovered by pushing the product p(x_n | v_j) · m(v_j) inside the sum in Eq. (10) and unrolling the recursion. The sum accounts for vertices with more than one incoming arrow, as depicted in Fig. 3. It should be noted that a large number of these values will be zero, signifying that no incomplete tree ends on that vertex. In order to compute the marginalisation over T_N, we finally sum over the values at the right boundary:

p(x_1:N) = Σ_{v ∈ B_r} M(v, N), (11)

since valid full binary trees must also end on the right boundary of T_C. Note that the values of any trajectory that does not form a full binary tree within N − 1 transitions, i.e. one that does not reach the right boundary, are not included in the sum. Another interesting property is that full binary trees with fewer than N leaves have their trajectories reach the right boundary earlier, and those values are not propagated forward once they do.

Decoding from the model
During decoding, we can perform the following maximisation with a modification of the marginalisation algorithm:

arg max_{x_1:N, T} p(x_1:N, T).

This technique borrows heavily from Viterbi (1967). We perform the same dynamic programming procedure as above, but replace summations with maximisations, maximise over the emitted token at each vertex, and maintain a backpointer to the highest-scoring predecessor.

Figure 5: Schema of a single production function application. From the representation h_i, (1) compute the context vector c(h_i) by attending on the encoder, (2) compute the distribution over word probabilities and the leaf probability parameter l, (3) apply the Cell(·, ·) function to produce the child representations h_j and h_k. Repeat until the maximum depth is reached.
Since we do not know the length of the sequence being decoded, we need to decide on a stopping criterion. We know that any subsequent multiplication by probabilities can only decrease a trajectory's score. Thus, if the current best complete sequence has probability p*, and all probabilities at the frontier are < p*, then no sequence with a higher probability can be found. We can then stop the search and return the current best solution. Algorithm 2 in Appendix C contains the pseudo-code for decoding.
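A sketch of the max-product search with this early-stopping criterion. The per-vertex scores, transition set, and boundaries below are hard-coded toy values for a depth-2 tree; in practice they come from the learned model, with score(v) = max_w p(w | v) · m(v):

```python
def best_decode(score, S, B_l, B_r, max_len):
    """Viterbi-style search: sums replaced by maxes, with backpointers.
    Stops once every frontier score falls below the best complete score."""
    best_score, best_end, best_len = float("-inf"), None, 0
    V = [{v: score[v] for v in B_l}]       # V[k]: best scores at length k+1
    back = [{}]
    for n in range(1, max_len + 1):
        cur = V[n - 1]
        for v in cur:                      # complete trees must end on B_r
            if v in B_r and cur[v] > best_score:
                best_score, best_end, best_len = cur[v], v, n
        if not cur or max(cur.values()) < best_score:
            break                          # no extension can beat the best
        nxt, bp = {}, {}
        for (a, b) in S:
            if a in cur and cur[a] * score[b] > nxt.get(b, float("-inf")):
                nxt[b], bp[b] = cur[a] * score[b], a
        V.append(nxt)
        back.append(bp)
    leaves = [best_end]                    # follow backpointers
    for k in range(best_len - 1, 0, -1):
        leaves.append(back[k][leaves[-1]])
    return best_score, leaves[::-1]
```

On the toy instance below the search settles on the three-leaf tree 3 → 4 → 2 and then halts, because the only remaining frontier score is lower than the best complete score.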

Connectionist Tree (CTree) Decoder
We parameterize the emission probabilities p(x|v_i) and the splitting probability at each vertex l_i with a recursive neural network. The network recursively splits a root embedding into the internal hidden states of the binary tree structures via a production function f:

(h_left(v), h_right(v)) = f(h_v, c),

where h_v is the embedding of the vertex v and c is a generic context embedding that can optionally be vertex-dependent and carries external information, e.g. it can be used to pass information in an encoder-decoder setting.
We parameterise f(h_v, c) as a gated two-layer neural network with a ReLU hidden layer, followed by layer normalization (Ba et al., 2016). We fix the hidden size to be twice the dimension of the input vertex embedding. The splitting probability l_v and the emission probabilities p(x|v) are defined as functions of the vertex embedding:

l_v = g_l(h_v), p(·|v) = g_x(h_v).

The leaf prediction g_l is a linear transform into a two-dimensional output space followed by a softmax. The specific form of the emission probability function g_x can vary with the task. Unless specified, g_x is an MLP.
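One possible rendering of such a cell in NumPy. The paper's exact gating is not fully specified in the text above, so the sigmoid gate, weight shapes, and initialisation below are assumptions of this illustration (single-vector inputs, no biases):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Layer normalization over the last axis."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class ProductionCell:
    """Sketch of f(h_v, c): a gated two-layer network with a ReLU hidden
    layer of size 2*d, emitting both child embeddings at once."""
    def __init__(self, d, rng):
        self.W1 = rng.normal(0, 0.1, (2 * d, 2 * d))   # input: [h_v; c]
        self.W2 = rng.normal(0, 0.1, (2 * d, 2 * d))   # output: [h_left; h_right]
        self.Wg = rng.normal(0, 0.1, (2 * d, 2 * d))   # gate
        self.d = d

    def __call__(self, h, c):
        x = np.concatenate([h, c], -1)
        hid = np.maximum(0.0, x @ self.W1.T)           # ReLU hidden layer
        gate = 1.0 / (1.0 + np.exp(-(x @ self.Wg.T)))  # sigmoid gate
        out = layernorm(gate * (hid @ self.W2.T))
        return out[..., : self.d], out[..., self.d:]   # (h_left, h_right)
```

Each application doubles the number of embeddings, so a complete tree of depth D_C costs 2^{D_C+1} − 1 cell outputs in total.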

Procedural Description
Starting with the root representation h_ρ and, optionally, its contextual information c_ρ, we recursively apply f. This can be done efficiently in parallel breadth-wise, doubling the number of hidden representations at every level. We apply g_l at each level, and then Eq. (6) and Eq. (7) to get m(v), which depends only on the parent values. We apply f recursively until a pre-defined depth D_C. We then transform all the vertex embeddings with the emission function g_x in parallel, and multiply p(x | v) · m(v) for all vertices and words in the vocabulary. We have now computed the sufficient statistics needed to apply the algorithm described in the previous section and compute the marginal probability of the observed sentence.
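The breadth-wise expansion can be sketched generically; `f` is any production function mapping a parent embedding to its two children (here a toy arithmetic stand-in, so the level-by-level doubling is easy to inspect):

```python
def expand_all(f, h_root, c, D):
    """Expand the complete tree level by level: level k holds 2**k
    embeddings, so a batched f can process each level in parallel.
    Returns all vertex embeddings in breadth-first (heap) order."""
    levels = [[h_root]]
    for _ in range(D):
        nxt = []
        for h in levels[-1]:
            h_left, h_right = f(h, c)
            nxt += [h_left, h_right]
        levels.append(nxt)
    return [h for level in levels for h in level]
```

A depth-D expansion produces 2^{D+1} − 1 embeddings; with a batched neural f, each of the D steps is a single parallel call over one level.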
D_C is a hyper-parameter that depends on memory and time constraints: if D_C is large, the number of representations grows exponentially with it, as does the time for computing the likelihood. If the depth of the latent trees that generated the data has a known upper bound, we can also restrict the class of trees being learned by setting D_C accordingly.

Related Work
Non-parametric Bayesian approaches to learning a hierarchy over observed data have been proposed in the past (Ghahramani et al., 2010; Griffiths et al., 2004). These works generally learn a prior on tree-structured data and assume a common superstructure that generated the corpus, instead of assuming that each observed datapoint may have been produced by a different hierarchical structure. Our generative assumptions are generally stronger, but they allow for tractable marginalisation without costly iterative inference procedures, e.g. MCMC.
Our method shares similarities with the forward algorithm (Baum and Eagon, 1967; Baum and Sell, 1968), which computes likelihoods for Hidden Markov Models (HMMs), and with CTC (Graves et al., 2006). While the forward algorithm factors in the transition probabilities, both CTC and our algorithm place a conditional independence assumption on the factorisation of the likelihood of the output sequence. The inside-outside algorithm (Baker, 1979) is usually employed when it comes to learning parameters for PCFGs. Kim et al. (2019a) give a modern treatment of PCFGs by introducing Compound PCFGs, in which the CFG production probabilities are conditioned on a continuous latent variable and the entire model is trained using amortized variational inference (Kingma and Welling, 2013). This allows the production rules to be conditioned on a sentence-level random variable, enabling the model to capture correlations over rules that are not possible with a standard PCFG. However, all co-dependence between the rules can only be captured through the global latent variable. In CTC, Compound PCFGs, and our work, the fact that the dynamic programming algorithm is differentiable is exploited to train the model. Several works have attempted to model this hierarchy using ground-truth parse trees from a parser. However, such parsers are trained on parses annotated with rules designed by linguists, which presents two challenges: (1) we may not always have these rules, particularly for low-resource languages, and (2) the structure required for different tasks may differ, so enforcing a universal parse structure may not be optimal. Jacob et al.
(2018) attempt to learn a tree structure using discrete split and merge operations with REINFORCE (Williams, 1992). However, this method is known to have high variance (Tucker et al., 2017).
There has also been work that uses sequential models for learning a latent hierarchy (Chung et al., 2016).

Experiments
We evaluate our method on three different sequence-to-sequence tasks. Unless otherwise stated, we are using the Ordered Memory (OM) (Shen et al., 2019) as our encoder. Further details can be found in Appendix D.1.

SCAN
The SCAN dataset (Lake and Baroni, 2017) consists of a set of navigation commands and their corresponding action sequences. As an example, an input of jump opposite left and walk thrice should yield LTURN LTURN JUMP WALK WALK WALK. The dataset is designed as a test bed for examining the systematic generalization of neural models. We follow the experimental settings in Bastings et al. (2018), where the different splits test for different properties of generalisation. We apply our model to the four experimental settings and compare it with the baselines in the literature (see Table 1).
The SIMPLE split has the same data distribution for both the training set and test set. The TURN LEFT split partitions the data so that, while examples like jump left and turn right are present in the training set, turn left is not; the model must learn from the former to produce LTURN when it sees turn left as input. An attention mechanism over the input token embeddings allows the model to make one-to-one mappings between input token embeddings and output token embeddings (e.g., jump in the input always maps to JUMP in the output), resulting in huge improvements in performance on the JUMP split. We refer to this method as lexical attention (LA).

Results
We report results in Table 1. Our model performs well on the SCAN splits. Figure 6 shows one tree induced from a model trained on SIMPLE. The resulting parses hint at the model learning to "reuse" some lower-level concepts when, for instance, twice appears in the input. The two most challenging tasks are the JUMP and LENGTH splits. In JUMP, the input token jump only appears alone during training, and the model has to learn to use it in different contexts during testing. Surprisingly, this model fails to generalise on the JUMP split, suggesting that our model's capability to perform well on JUMP may depend on the hierarchical decoding as well as the leaf attention. The LENGTH split partitions the data so that the output sequences seen in the training set are much shorter than those seen in the test set. Interestingly, our model converges to a solution that results in 19.8% accuracy in 5 of the 10 random seeds we use. In the other runs, the model achieves 25% or higher, with 2 runs achieving > 99% accuracy. The high variance of the model deserves more study, but we suspect that in the failure cases the model does not learn a meaningful concept of thrice. Overall, LENGTH requires some generalisation at the structural level during decoding, and has thus far been the most challenging split for current sequential models. Given these results, we believe our model has made some progress on this front.

English Question Formation
McCoy et al. (2020) proposed synthetic linguistic tasks to test for hierarchical inductive biases in models. One such task is the formation of English questions: the zebra does chuckle → does the zebra chuckle ?. The task becomes challenging when relative clauses are inserted into the sentence: your zebras that don't dance do chuckle. The heuristic that works in the first case, moving the first verb to the front of the sentence, would fail here, since the right output is do your zebras that don't dance chuckle ?. The task involves two modes of generation, depending on the final token of the input sentence. If it ends with DECL, the decoder simply has to copy the input. If it ends with QUEST, the decoder has to produce the question. The authors argue, and provide evidence, that models that do this task well possess syntactic structure. Like SCAN, a generalisation set is included to test on out-of-distribution examples, and only first-word accuracy is reported for the generalisation set.
Results Training our model on this task, we achieve results comparable to their models that are given the syntactic structure of the sentence, especially when compared to the sequential models that they used. The results for this task are reported in Table 2.

Table 2: English Question Formation results. Our models are annotated with †, and we report mean and standard deviation over 5 runs. Models that use attention are noted with *.

Multi30k Translation
The Multi30k English-German translation task (Elliott et al., 2016) is a corpus of short English-German sentence pairs. The original dataset includes a picture for each pair, but we exclude them to focus on the text translation task. Our baseline models include an LSTM sequence-to-sequence model with attention, the Transformer (Vaswani et al., 2017), and a non-autoregressive model, LaNMT (Shu et al., 2020). For a fair comparison, we trained all models with the negative log-likelihood loss, or with knowledge distillation (Kim and Rush, 2016) where applicable.
Results As shown in Table 3, our model achieves performance comparable to its autoregressive counterparts and outperforms the non-autoregressive model. However, we did not observe significant performance improvements from the generalisation capabilities shown in the previous experiments. This suggests further study is needed to overcome the remaining issues before deep learning models can truly exploit productivity in language. On the other hand, the examples in Figure 7 show that our model does acquire some grammatical knowledge. The model tends to generate noun phrases (e.g. an older man, a video game) in separate subtrees, but it also tends to split the sentence before noun phrases. For example, the model splits the sub-clause while in the air into two different subtrees. Similarly, previous latent tree induction models (Shen et al., 2017, 2018) also show a higher affinity for noun phrases compared to adjective phrases.

Conclusion
In this paper, we propose a new algorithm for learning a latent structure for sequences of tokens. Given the current interest in systematic generalisation and compositionality, we hope our work will lead to interesting avenues of research in this direction. Firstly, the connectionist tree decoding framework allows for different architectural designs for the recurrent function used. Secondly, while the dynamic programming algorithm is an improvement over a naive enumeration of different trees, there is room for improvement: exploiting the sparsity of the M(·, ·) table could yield memory and time gains. Finally, the need to recursively expand to a complete tree results in exponential growth with respect to the input length.
These results, while preliminary, suggest that the method holds some potential. The experimental results reveal some interesting behaviours that require further study. Nevertheless, we demonstrate that our model performs comparably to current algorithms, and surpasses current models on synthetic tasks that are known to require structure to perform well.

A Proofs
In this context, all trees are rooted.
Definition 1. A full binary tree is a tree where each vertex has either 0 or 2 children.
Definition 2. A complete binary tree T_C of depth D_C is a full binary tree in which every leaf is at depth D_C.
Definition 3. An internal tree T of a complete binary tree T C is a full binary tree T such that root(T ) = root(T C ) and whose vertices and edges are a subset of T C .
The definitions of the internal tree set T(T_C), the boundaries, and the successive-leaf transitions S(T_C) follow those in the main text. In particular, the left boundary B_l of T_C is the set of vertices visited when recursively following left children from the root down to the leftmost leaf. The notion is similarly defined for the right boundary B_r.
Definition 9. The probability p(T) = π(root), where π is defined recursively as

π(v_i) = l_i if v_i ∈ L(T); π(v_i) = (1 − l_i) π(left(v_i)) π(right(v_i)) otherwise,

where left(v_i) and right(v_i) are the left and right children of v_i respectively.

Proposition 1. If T′ and T′′ are the left and right subtrees of T respectively, and T_C′ and T_C′′ are the corresponding subtrees of T_C, then T′ ∈ T(T_C′) and T′′ ∈ T(T_C′′).

Proof. Since the vertices of T′ and T′′ are subsets of the vertices of T_C′ and T_C′′ respectively, they are each internal trees of T_C′ and T_C′′. Therefore T′ ∈ T(T_C′) and T′′ ∈ T(T_C′′).

Proposition 2. If l_i = 1 for all v_i ∈ L(T_C), then Σ_{T ∈ T(T_C)} p(T) = 1.

Proof. Base case: T_C is of depth 0, so T(T_C) = {T} with T = T_C = root, and since root is a leaf, p(T) = l_root = 1. Inductive case: let the left and right subtrees of T_C be T_C′ and T_C′′ respectively, and assume Σ_{T′ ∈ T(T_C′)} p(T′) = 1 and likewise for T_C′′. A tree T ∈ T(T_C) either consists of the root alone, contributing l_root, or splits at the root; in the latter case p(T) = (1 − l_root) π(root(T′)) π(root(T′′)), so the second term has the common factor (1 − l_root). Hence

Σ_{T ∈ T(T_C)} p(T) = l_root + (1 − l_root) Σ_{T′} p(T′) Σ_{T′′} p(T′′) = l_root + (1 − l_root) = 1.

Proposition 3. If T is an internal tree with leaves V_N = L(T), |V_N| = N, then p(T) = ∏_{v ∈ V_N} m(v).

Proof. If N = 1, then V_1 = {root(T)} and m(root(T)) = l_root(T) = p(T). If N > 1, since T is a full binary tree, there exist two leaves v_i, v_j ∈ V_N such that parent(v_i) = parent(v_j) = v_k. Let V_{N−1} = (V_N \ {v_i, v_j}) ∪ {v_k}. Using the recurrence of Eqs. (6) and (7),

m(v_i) m(v_j) = l_i l_j (1 − l_k) m̄(v_k).

V_{N−1} forms the leaf set of another full binary tree T′ in which v_k is now a leaf; assigning l_k := π(v_k) = (1 − l_k) l_i l_j gives m(v_k) = π(v_k) m̄(v_k) = m(v_i) m(v_j), so the product over the leaves is unchanged. Applying this identity repeatedly reduces the number of factors by one until we reach V_1, which yields p(T).

Proposition 4. If T is an internal tree of T_C, its leftmost leaf is in B_l and its rightmost leaf is in B_r.

Proof. If T = root, then the leftmost leaf is root, which is in B_l by definition. Otherwise, from Definitions 3 and 1, we know that if left(v) = ∅ for a given v, then v is a leaf. We can then find the leftmost leaf of T by recursively setting v := left(v) until left(v) = ∅. Since all vertices of T are vertices of T_C and both trees share the same root, the leftmost leaf of T is in B_l. The argument for the rightmost leaf is symmetric.

Proposition 5. Let T_C′ and T_C′′ be the left and right subtrees of T_C. Then

S(T_C) = S(T_C′) ∪ S(T_C′′) ∪ (B_r(T_C′) × B_l(T_C′′)).

Proof. T_C is a complete tree, so the left and right subtrees T_C′ and T_C′′ are both complete trees. For any T ∈ T(T_C), by Definition 5 we can find T′ and T′′, internal trees of T_C′ and T_C′′ respectively, such that L