Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projective trees.


Introduction
Discrete structures are ubiquitous in the study of natural languages, for example in morphology, syntax and discourse analysis. In natural language processing, they are often used to inject linguistic prior knowledge into statistical models. For example, syntactic structures have been shown to be beneficial in question answering (Cui et al., 2005), sentiment analysis (Socher et al., 2013), machine translation (Bastings et al., 2017) and relation extraction (Liu et al., 2015), among others. However, linguistic tools producing these structured representations (e.g., syntactic parsers) are not available for many languages and are not robust when applied outside of the domain they were trained on (Petrov et al., 2010; Foster et al., 2011). Moreover, linguistic structures do not always seem suitable in downstream applications, with simpler alternatives sometimes yielding better performance (Wang et al., 2018).
Indeed, a parallel line of work focused on inducing task-specific structured representations of language (Naradowsky et al., 2012;Kim et al., 2017;Liu and Lapata, 2018;Niculae et al., 2018). In these approaches, no syntactic or semantic annotation is needed for training: representation is induced from scratch in an end-to-end fashion, in such a way as to benefit a given downstream task. In other words, these approaches provide an inductive bias specifying that (hierarchical) structures are appropriate for representing a natural language, but do not make any further assumptions regarding what the structures represent. Structures induced in this way, though useful for the task, tend not to resemble any accepted syntactic or semantic formalisms (Williams et al., 2018a). Our approach falls under this category.
In our method, projective dependency trees (see Figure 3 for examples) are treated as latent variables within a probabilistic model. We rely on differentiable dynamic programming (Mensch and Blondel, 2018), which allows for efficient sampling of dependency trees (Corro and Titov, 2019). Intuitively, sampling a tree involves stochastically perturbing dependency weights and then running a relaxed form of the Eisner dynamic programming algorithm (Eisner, 1996). A sampled tree (or its continuous relaxation) can then be straightforwardly integrated in a neural sentence encoder for a target task using graph convolutional networks (GCNs, Kipf and Welling, 2017). All parameters of the model, including those of the parser and of the GCN, are estimated jointly while minimizing the loss for the target task.
What distinguishes us from previous work is that we stochastically sample global structures and do it in a differentiable fashion. For example, the structured attention method (Kim et al., 2017; Liu and Lapata, 2018) does not sample entire trees but rather computes arc marginals, and hence does not faithfully represent higher-order statistics. Much of the other previous work either relies on reinforcement learning (Nangia and Bowman, 2018; Williams et al., 2018a) or does not treat the latent structure as a random variable (Peng et al., 2018). Niculae et al. (2018) marginalize over latent structures; however, this necessitates strong sparsity assumptions on the posterior distributions, which may inject undesirable biases in the model. Overall, differentiable dynamic programming has not been actively studied in the task-specific tree induction context. Most previous work also focused on constituent trees rather than dependency ones.
We study properties of our approach on a synthetic structure induction task and experiment on sentiment classification (Socher et al., 2013) and natural language inference (Bowman et al., 2015). Our experiments confirm that the structural bias encoded in our approach is beneficial. For example, our approach achieves a 4.9% improvement on multi-genre natural language inference (MultiNLI) over a structure-agnostic baseline. We show that stochasticity and higher-order statistics given by the global inference are both important. In ablation experiments, we also observe that forcing the structures to be projective dependency trees rather than permitting any general graphs yields substantial improvements without sacrificing execution time. This confirms that our inductive bias is useful, at least in the context of the considered downstream applications.
Our main contributions can be summarized as follows:
1. we show that a latent tree model can be estimated by drawing global approximate samples via Gumbel perturbation and differentiable dynamic programming;
2. we demonstrate that constraining the structures to be projective dependency trees is beneficial;
3. we show the effectiveness of our approach on two standard tasks used in latent structure modelling and on a synthetic dataset.

Background
In this section, we describe the dependency parsing problem and GCNs which we use to incorporate latent structures into models for downstream tasks.

Dependency Parsing
Dependency trees represent bi-lexical relations between words. They are commonly represented as directed graphs with vertices and arcs corresponding to words and relations, respectively. Let $x = x_0 \dots x_n$ be an input sentence with $n$ words where $x_0$ is a special root token. We describe a dependency tree of $x$ with its adjacency matrix $T \in \{0, 1\}^{n \times n}$ where $T_{h,m} = 1$ iff there is a relation from head word $x_h$ to modifier word $x_m$. We write $\mathcal{T}(x)$ to denote the set of trees compatible with sentence $x$.
We focus on projective dependency trees. A dependency tree $T$ is projective iff for every arc $T_{h,m} = 1$, there is a path with arcs in $T$ from $x_h$ to each word $x_i$ such that $h < i < m$ or $m < i < h$. Intuitively, a tree is projective as long as it can be drawn above the words in such a way that arcs do not cross each other (see Figure 3). Similarly to phrase-structure trees, projective dependency trees implicitly encode a hierarchical decomposition of a sentence into spans ('phrases'). Forcing trees to be projective may be desirable as even flat span structures can be beneficial in applications (e.g., encoding multi-word expressions). Note that actual syntactic trees are also, to a large degree, projective, especially for morphologically impoverished languages such as English. Moreover, restricting the space of the latent structures is important to ease their estimation. For all these reasons, in this work we focus on projective dependency trees.
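The projectivity definition above can be checked mechanically. The following is a minimal Python sketch, assuming the tree is given as a list `heads` where `heads[m]` is the index of the head of word `m` and `heads[0]` is `None` for the root token; the function names and this encoding are illustrative choices, not part of the paper. It assumes the input is already a valid tree rooted at position 0.

```python
def is_descendant(heads, i, h):
    """Return True if word i is h itself or a descendant of h in the tree."""
    while i is not None:
        if i == h:
            return True
        i = heads[i]  # move one step towards the root
    return False


def is_projective(heads):
    """Check the definition: for every arc h -> m, every word strictly
    between h and m must be reachable from h.

    Assumption: heads encodes a valid tree (no cycles) rooted at index 0,
    with heads[0] set to None.
    """
    for m, h in enumerate(heads):
        if h is None:
            continue  # the root token has no incoming arc
        for i in range(min(h, m) + 1, max(h, m)):
            if not is_descendant(heads, i, h):
                return False
    return True
```

For instance, the tree 0 → 2, 2 → 1, 2 → 3 is projective, whereas 0 → 2, 2 → 3, 3 → 1 is not, because word 2 lies under the arc 3 → 1 without being a descendant of word 3.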
In practice, a dependency parser is given a sentence $x$ and predicts a dependency tree $T \in \mathcal{T}(x)$ for this input. To this end, the first step is to compute a matrix $W \in \mathbb{R}^{n \times n}$ that scores each dependency. In this paper, we rely on a deep dotted attention network. Let $e_0 \dots e_n$ be embeddings associated with each word of the sentence. We follow Parikh et al. (2016) and compute the score for each head-modifier pair $(x_h, x_m)$ as follows:
$$W_{h,m} = \mathrm{MLP}^{\text{head}}(e_h)^\top \, \mathrm{MLP}^{\text{mod}}(e_m) + b_{h-m} \qquad (1)$$
where $\mathrm{MLP}^{\text{head}}$ and $\mathrm{MLP}^{\text{mod}}$ are multilayer perceptrons, and $b_{h-m}$ is a distance-dependent bias, letting the model encode a preference for long or short-distance dependencies. The conditional probability of a tree $p_\theta(T|x)$ is defined by a log-linear model:
$$p_\theta(T|x) = \frac{\exp\left(\sum_{h,m} W_{h,m} T_{h,m}\right)}{\sum_{T' \in \mathcal{T}(x)} \exp\left(\sum_{h,m} W_{h,m} T'_{h,m}\right)}$$
When tree annotation is provided in data $D$, network parameters $\theta$ are learned by maximizing the log-likelihood of annotated trees (Lafferty et al., 2001). The highest scoring dependency tree can be produced by solving the following mathematical program:
$$\hat{T} = \arg\max_{T \in \mathcal{T}(x)} \sum_{h,m} W_{h,m} T_{h,m}$$
If $\mathcal{T}(x)$ is restricted to be the set of projective dependency trees, this can be done efficiently in $O(n^3)$ using the dynamic programming algorithm of Eisner (1996).
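To make the cubic-time search concrete, here is a minimal Python sketch of a first-order Eisner-style dynamic program. It is an illustrative textbook-style implementation rather than the authors' code: the function name `eisner`, the list-of-lists weight layout `W[h][m]`, and the choice to let the root token take several modifiers are all assumptions made for this example.

```python
def eisner(W):
    """Return (heads, score) of the best projective tree for arc weights W[h][m].

    Token 0 is the root; heads[m] is the head of word m and heads[0] is None.
    Assumption: the root may have several modifiers and never has a head.
    """
    n = len(W)
    NEG = float("-inf")
    # C: complete spans, I: incomplete spans; last index is the direction
    # (1: head is the left boundary i, 0: head is the right boundary j).
    C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    bC = [[[None, None] for _ in range(n)] for _ in range(n)]
    bI = [[[None, None] for _ in range(n)] for _ in range(n)]

    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Incomplete spans: join two complete spans and add one arc.
            best, k = max((C[i][s][1] + C[s + 1][j][0], s) for s in range(i, j))
            I[i][j][1], bI[i][j][1] = best + W[i][j], k
            if i > 0:  # the root token cannot be a modifier
                I[i][j][0], bI[i][j][0] = best + W[j][i], k
            # Complete spans: extend an incomplete span with a complete one.
            C[i][j][0], bC[i][j][0] = max(
                (C[i][s][0] + I[s][j][0], s) for s in range(i, j))
            C[i][j][1], bC[i][j][1] = max(
                (I[i][s][1] + C[s][j][1], s) for s in range(i + 1, j + 1))

    heads = [None] * n

    def backtrack(i, j, direction, complete):
        """Follow backpointers and record the arcs of the best tree."""
        if i == j:
            return
        if complete:
            k = bC[i][j][direction]
            if direction == 1:
                backtrack(i, k, 1, False)
                backtrack(k, j, 1, True)
            else:
                backtrack(i, k, 0, True)
                backtrack(k, j, 0, False)
        else:
            if direction == 1:
                heads[j] = i
            else:
                heads[i] = j
            k = bI[i][j][direction]
            backtrack(i, k, 1, True)
            backtrack(k + 1, j, 0, True)

    backtrack(0, n - 1, 1, True)
    return heads, C[0][n - 1][1]
```

The three nested loops (span width, left boundary, split point) give the cubic running time. The `max` calls over split points are the discrete choices that make the parser non-differentiable.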

Graph Convolutional Networks
Graph Convolutional Networks (GCNs, Kipf and Welling, 2017) compute context-sensitive embeddings with respect to a graph structure. GCNs are composed of several layers where each layer updates vertex representations based on the current representations of their neighbors. In this work, we feed the GCN with word embeddings and a tree sample $T$. For each word $x_i$, a GCN layer produces a new representation relying both on the word embedding of $x_i$ and on the embeddings of its heads and modifiers in $T$. Multiple GCN layers can be stacked on top of each other. Therefore, a vertex representation in a GCN with $k$ layers is influenced by all vertices at a maximum distance of $k$ in the graph. Our GCN is sensitive to arc direction. More formally, let $E^0 = [e_0 \cdots e_n]$ be the input matrix obtained by column-wise concatenation of the embeddings, with each column corresponding to a word in the sentence. At each GCN layer $t$, we compute:
$$E^{t+1} = \sigma\left(f(E^t) + g(E^t)\, T + h(E^t)\, T^\top\right)$$
where $\sigma$ is an activation function, e.g. ReLU. Functions $f()$, $g()$ and $h()$ are distinct multilayer perceptrons encoding different types of relationships: self-connection, head and modifier, respectively (hyperparameters are provided in Appendix A). Note that each GCN layer is easily parallelizable on GPU both over vertices and over batches, either with latent or predefined structures.
Figure 1: The two directed graphical models used in this work. Shaded and unshaded nodes represent observable and unobservable variables, respectively. (a) In the sentence classification task, the output y is conditioned on the input and the latent tree. (b) In the natural language inference task, the output is conditioned on two sentences and their respective latent trees.
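The direction-sensitive GCN layer described in this section can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the three functions are replaced by single linear maps with hypothetical weight matrices `Wf`, `Wg` and `Wh` (the paper uses multilayer perceptrons), matrices are plain lists of rows, and columns of `E` correspond to words.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def gcn_layer(E, T, Wf, Wg, Wh):
    """One layer: ReLU(Wf E + (Wg E) T + (Wh E) T^T).

    E:  d x n matrix whose columns are word representations.
    T:  n x n adjacency matrix, T[h][m] = 1 iff h is the head of m
        (it may also hold relaxed values in [0, 1]).
    Wf, Wg, Wh: hypothetical d x d linear maps standing in for the
        self-connection, head and modifier functions.
    """
    self_term = matmul(Wf, E)
    # Column m of (Wg E) T sums the transformed representations of the heads of m.
    head_term = matmul(matmul(Wg, E), T)
    # Column h of (Wh E) T^T sums the transformed representations of the modifiers of h.
    mod_term = matmul(matmul(Wh, E), transpose(T))
    return [[max(0.0, s + a + b) for s, a, b in zip(rs, ra, rb)]
            for rs, ra, rb in zip(self_term, head_term, mod_term)]
```

With one-dimensional representations `E = [[1, 2, 3]]`, identity weights, and the chain 0 → 1 → 2, the layer outputs `[[3, 6, 5]]`: each word adds the value of its head and of its modifier to its own value.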

Structured Latent Variable Models
In the previous section, we explained how a dependency tree is produced for a given sentence and how we extract features from this tree with a GCN. In our model, we assume that we do not have access to gold-standard trees and that we want to induce the best structure for the downstream task. To this end, we introduce a probability model where the dependency structure is a latent variable (Section 3.1). The distribution over dependency trees must be inferred from the data (Section 3.2). This requires marginalization over dependency trees during training, which is intractable due to the large search space. Instead, we rely on Monte-Carlo (MC) estimation.

Graphical Model
Let $x$ be the input sentence, $y$ be the output (e.g. sentiment labelling) and $\mathcal{T}(x)$ be the set of latent structures compatible with input $x$. We construct a directed graphical model where $x$ and $y$ are observable variables, i.e. their values are known during training. However, we assume that the probability of the output $y$ is conditioned on a latent tree $T \in \mathcal{T}(x)$, a variable that is not observed during training: it must be inferred from the data. Formally, the model is defined as follows:
$$p_\theta(y|x) = \mathbb{E}_{T \sim p_\theta(T|x)}\left[p_\theta(y|x, T)\right] = \sum_{T \in \mathcal{T}(x)} p_\theta(T|x)\, p_\theta(y|x, T) \qquad (3)$$
where $\theta$ denotes all the parameters of the model. An illustration of the network is given in Figure 1a.

Parameter Estimation
Our probability distributions are parameterized by neural networks. Their parameters $\theta$ are learned via gradient-based optimization to maximize the log-likelihood of the (observed) training data. Unfortunately, estimating the log-likelihood of the observations requires computing the expectation in Equation 3, which involves an intractable sum over all valid dependency trees. Therefore, we propose to optimize a lower bound on the log-likelihood, derived by application of Jensen's inequality, which can be efficiently estimated with the Monte-Carlo (MC) method. For the $i$-th training instance $(x^i, y^i)$:
$$\log p_\theta(y^i|x^i) = \log \mathbb{E}_{T \sim p_\theta(T|x^i)}\left[p_\theta(y^i|x^i, T)\right] \geq \mathbb{E}_{T \sim p_\theta(T|x^i)}\left[\log p_\theta(y^i|x^i, T)\right]$$
However, MC estimation introduces a non-differentiable sampling function $T \sim p_\theta(T|x^i)$ in the gradient path. Score function estimators have been introduced to bypass this issue but suffer from high variance (Williams, 1987; Fu, 2006; Schulman et al., 2015). Instead, we propose to reparametrize the sampling process (Kingma and Welling, 2014), making it independent of the learned parameters $\theta$: in that case, the sampling function is outside of the gradient path. To this end, we rely on the Perturb-and-MAP framework (Papandreou and Yuille, 2011). Specifically, we perturb the potentials (arc weights) with samples from the Gumbel distribution and compute the most probable structure with the perturbed potentials:
$$T = \arg\max_{T' \in \mathcal{T}(x)} \sum_{h,m} \left(W_{h,m} + G_{h,m}\right) T'_{h,m} \qquad (7)$$
Each element of the matrix $G \in \mathbb{R}^{n \times n}$ is a random sample from the Gumbel distribution, which is independent of the network parameters $\theta$; hence there is no need to backpropagate through this path in the computation graph. Note that, unlike the Gumbel-Max trick (Maddison et al., 2014), sampling with Perturb-and-MAP is approximate, as the noise is factorizable: we add noise to individual arc weights rather than to scores of entire trees (which would not be tractable). This is the first source of bias in our gradient estimator. The maximization in Equation 7 can be computed using the algorithm of Eisner (1996). We stress that the marginalization in Equation 3 and the MC-estimated sum over trees capture high-order statistics, which is fundamentally different from computing edge marginals, i.e. structured attention (Kim et al., 2017). Unfortunately, the estimated gradient of the reparameterized distribution over parse trees is ill-defined (either undefined or null). We tackle this issue in the following section.
Algorithm 1: This function computes the chart values for items of the form $[i, j, \rightarrow, \bot]$ by searching the set of antecedents that maximizes its score. Because these items assume a dependency from $x_i$ to $x_j$, we add $W_{i,j}$ to the score.
Algorithm 2: [...] has contributed to the optimal objective, this function sets $T_{i,j}$ to 1. Then, it propagates the contribution information to its antecedents.
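The Perturb-and-MAP sampling step described in this section can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: `sample_gumbel`, `perturb_and_map` and the `map_solver` argument are hypothetical names. To keep the block self-contained, the example solver `best_head_per_word` performs unconstrained head selection, which is a stand-in and not the projective parser used in the paper; in the paper's setting a projective MAP solver would be plugged in instead.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel distribution by inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def perturb_and_map(W, map_solver, rng):
    """Add independent Gumbel noise to every arc weight, then run a MAP solver.

    The noise does not depend on the model parameters, so gradients only
    need to flow through W and the solver.
    """
    n = len(W)
    perturbed = [[W[h][m] + sample_gumbel(rng) for m in range(n)]
                 for h in range(n)]
    return map_solver(perturbed)


def best_head_per_word(scores):
    """Stand-in MAP solver (assumption): each word independently picks its
    best head, with no tree or projectivity constraint."""
    n = len(scores)
    heads = [None] * n
    for m in range(1, n):
        candidates = [h for h in range(n) if h != m]
        heads[m] = max(candidates, key=lambda h: scores[h][m])
    return heads
```

With this unconstrained stand-in, each word's head is an independent categorical choice, so the procedure reduces to the exact Gumbel-Max trick. With a tree-constrained solver the same per-arc noise only yields approximate samples, which is the bias discussed above.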

Differentiable Dynamic Programming
Neural network parameters are learned using (variants of) the stochastic gradient descent algorithm. The gradient is computed using the backpropagation algorithm, which relies on the partial derivatives of each atomic operation in the network. The Perturb-and-MAP sampling process relies on the dependency parser (Equation 7), which contains ill-defined derivatives. This is due to the usage of constrained arg max operations (Gould et al., 2016; Mensch and Blondel, 2018) in the algorithm of Eisner (1996). Let $L$ be the training loss. Backpropagation is problematic because of the following operation:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial T} \frac{\partial T}{\partial W}$$
where $\frac{\partial T}{\partial W}$ is the partial derivative with respect to the dependency parser (Equation 7), which is null almost everywhere, i.e. there is no descent direction information. We follow previous work and use a differentiable dynamic programming surrogate (Mensch and Blondel, 2018; Corro and Titov, 2019). The use of the surrogate is the second source of bias in our gradient estimation.
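The claim that the derivative of an arg max is null almost everywhere can be checked numerically on a single arg max operation. The Python sketch below is an illustration only: it uses central finite differences on a three-way choice (the helper names are hypothetical) and compares the one-hot arg max with a softmax, which is the kind of smooth replacement a differentiable surrogate relies on.

```python
import math


def one_hot_argmax(v):
    """Return the one-hot indicator of the largest entry of v."""
    k = max(range(len(v)), key=lambda i: v[i])
    return [1.0 if i == k else 0.0 for i in range(len(v))]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def finite_difference(fn, v, i, j, eps=1e-4):
    """Approximate the derivative of output i of fn with respect to input j."""
    plus = list(v)
    plus[j] += eps
    minus = list(v)
    minus[j] -= eps
    return (fn(plus)[i] - fn(minus)[i]) / (2.0 * eps)
```

For scores such as `[1.0, 2.0, 0.5]`, a small change to any input leaves the one-hot output unchanged, so the finite difference is exactly zero, whereas the softmax output moves with every input and provides a usable descent direction.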

Parsing with Dynamic Programming
The projective dependency parser of Eisner (1996) is a dynamic program that recursively builds a chart of items representing larger and larger spans of the input sentence. Items are of the form $[i, j, d, c]$ where: $0 \leq i \leq j \leq n$ are the boundaries of the span; $d \in \{\rightarrow, \leftarrow\}$ is the direction of the span, i.e. a right span $\rightarrow$ (resp. left span $\leftarrow$) means that all the words in the span are descendants of $x_i$ (resp. $x_j$) in the dependency tree; $c \in \{\top, \bot\}$ indicates if the span is complete ($\top$) or incomplete ($\bot$) in its direction. In a complete right span, $x_j$ cannot have any modifiers on its right side. In a complete left span, $x_i$ cannot have any modifier on its left side. A set of deduction rules defines how the items can be deduced from their antecedents.
The algorithm consists of two steps. In the first step, items are deduced in a bottom-up fashion and the following information is stored in the chart: the maximum weight that can be obtained by each item and backpointers to the antecedents that lead to this maximum weight (Algorithm 1). In the second step, the backpointers are used to retrieve the items corresponding to the maximum score and values in $T$ are set accordingly (Algorithm 2). The second step is often optimized to have linear time complexity instead of cubic. Unfortunately, this optimization is not compatible with the continuous relaxation we propose.

Continuous Relaxation
The one-hot-argmax operation on line 5 in Algorithm 1 can be written as follows, where $v$ denotes the vector of scores of the candidate antecedents and $b$ the resulting selection vector:
$$\arg\max_{b \geq 0} \sum_k b_k v_k \quad \text{s.t.} \quad \sum_k b_k = 1$$
It is known that a continuous relaxation of arg max in the presence of inequality constraints can be obtained by introducing a penalizer $\Omega$ that prevents activation of inequalities at the optimal solutions (Gould et al., 2016):
$$\arg\max_{b \geq 0} \sum_k b_k v_k - \Omega(b) \quad \text{s.t.} \quad \sum_k b_k = 1$$
Several $\Omega$ functions have been studied in the literature for different purposes, including logarithmic and inverse barriers for the interior point method (Den Hertog et al., 1994; Potra and Wright, 2000) and negative entropy for deterministic annealing (Rangarajan, 2000). When using negative entropy, i.e. $\Omega(b) = \sum_k b_k \log b_k$, solving the penalized one-hot-argmax has a closed form solution that can be computed using the softmax function (Boyd and Vandenberghe, 2004), that is:
$$b_k = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$$
Therefore, we replace the non-differentiable one-hot-argmax operation in Algorithm 1 with a softmax in order to build a smooth and fully differentiable surrogate of the parsing algorithm.
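The link between the negative-entropy penalty and the softmax can be verified numerically. The Python sketch below is an illustration with hypothetical helper names: it evaluates the penalized objective (the linear score minus the negative entropy) at the softmax point and at a few other points of the simplex. It relies on the standard fact that the optimal value of this objective equals the log-sum-exp of the scores.

```python
import math


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_objective(b, v):
    """Linear score minus the negative entropy, with 0 * log(0) taken as 0."""
    linear = sum(bk * vk for bk, vk in zip(b, v))
    negative_entropy = sum(bk * math.log(bk) for bk in b if bk > 0.0)
    return linear - negative_entropy


def log_sum_exp(v):
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))
```

For any score vector `v`, `penalized_objective(softmax(v), v)` matches `log_sum_exp(v)` up to rounding, and it is at least as large as the objective at the one-hot arg max or at the uniform distribution.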

Controlled Experiment
We first experiment on a toy task. The task is designed in such a way that there exists a simple projective dependency grammar which turns it into a trivial problem. We can therefore perform thorough analysis of the latent tree induction method.

Dataset and Task
The ListOps dataset (Nangia and Bowman, 2018) has been built specifically to test structured latent variable models. The task is to compute the result of a mathematical expression written in prefix notation. It has been shown to be easy for a Tree-LSTM that follows the gold underlying structure, but most latent variable models fail to induce it. Unfortunately, the task is not compatible with our neural network because it requires propagation of information from the leaves to the root node, which is not possible for a GCN with a fixed number of layers. Instead, we transform the computation problem into a tagging problem: the task is to tag the valency of operations, i.e. the number of operands they have.
We transform the original unlabelled binary phrase-structure into a dependency structure by following a simple head-percolation table: the head of a phrase is always the head of its left argument. The resulting dependencies represent two kinds of relation: operation to argument and operation to closing parenthesis (Figure 2). Therefore, this task is trivial for a GCN trained with gold dependencies: it simply needs to count the number of outgoing arcs minus one (for operation nodes). In practice, we observe 100% tagging accuracy with the gold dependencies.
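The relation between the gold dependencies and the valency tags can be illustrated with a short Python sketch. It assumes a simplified token format in which an operation token starts with `[` (e.g. `[MAX`) and `]` closes the most recent open operation; this format, the function names, and the convention that the outermost operation attaches to the root at index 0 are assumptions made for this example.

```python
def listops_dependencies(tokens):
    """Return heads[pos] for tokens numbered from 1 (index 0 is the root).

    Each argument and each closing bracket attaches to the operation
    that is currently open.
    """
    heads = [None] * (len(tokens) + 1)
    stack = [0]  # positions of the currently open operations, root first
    for pos, tok in enumerate(tokens, start=1):
        heads[pos] = stack[-1]
        if tok.startswith("["):
            stack.append(pos)  # a new operation opens
        elif tok == "]":
            stack.pop()  # the closing bracket ends the current operation
    return heads


def valencies(tokens):
    """Valency of each operation: number of outgoing arcs minus one
    (the arc to the closing bracket is not an operand)."""
    heads = listops_dependencies(tokens)
    result = {}
    for pos, tok in enumerate(tokens, start=1):
        if tok.startswith("["):
            result[pos] = sum(1 for h in heads if h == pos) - 1
    return result
```

For the expression `[MAX 2 9 [MIN 4 7 ] 0 ]`, the outer operation at position 1 has five outgoing arcs and therefore valency 4, and the inner operation at position 4 has valency 2.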

Neural Parametrization
We build a simple network where a BiLSTM is followed by deep dotted attention which computes the dependency weights (see Equation 1). In these experiments, unlike in Section 6, the GCN does not have access to input tokens (or corresponding BiLSTM states): it is fed 'unlexicalized' embeddings (i.e. the same vector is used as input for every token). 7 Therefore, the GCN is forced to rely on tree information alone (see App. A.1 for hyperparameters).
There are several ways to train the neural network. First, we test the impact of MC estimation at training. Second, we choose when to use the continuous relaxation. One option is to use a Straight-Through estimator (ST, Bengio, 2013;Jang et al., 2017): during the forward pass, we use a discrete structure as input of the GCN, but during the backward pass we use the differentiable surrogate to compute the partial derivatives. Another option is to use the differentiable surrogate for both passes (Forward relaxed). As our goal here is to study induced discrete structures, we do not use relaxations at test time. We compare our model with the non-stochastic version, i.e. we set G = 0.
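The difference between the two training options can be illustrated on a single arg max relaxed by a softmax. The Python sketch below is a deliberately reduced example and not the full relaxed parser: the function names are hypothetical, and the gradient of the loss with respect to the selection vector is supplied by the caller as `loss_grad`.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(v):
    k = max(range(len(v)), key=lambda i: v[i])
    return [1.0 if i == k else 0.0 for i in range(len(v))]


def softmax_vjp(v, g):
    """Backpropagate an upstream gradient g through softmax(v)."""
    p = softmax(v)
    dot = sum(pi * gi for pi, gi in zip(p, g))
    return [pi * (gi - dot) for pi, gi in zip(p, g)]


def straight_through_grad(v, loss_grad):
    """Forward pass with the discrete choice, backward pass through softmax."""
    b = one_hot_argmax(v)
    return softmax_vjp(v, loss_grad(b))


def forward_relaxed_grad(v, loss_grad):
    """Both passes use the softmax relaxation."""
    b = softmax(v)
    return softmax_vjp(v, loss_grad(b))
```

Both variants share the same backward computation and differ only in where the loss gradient is evaluated: at the discrete one-hot vector for the Straight-Through estimator, and at the relaxed vector for the forward-relaxed variant.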

Results
The attachment scores and the tagging accuracy are provided in Table 1. We draw two conclusions from these results. First, using the ST estimator hurts performance, even though we do not relax at test time. Second, the MC approximation, unlike the non-stochastic model, produces latent structures almost identical to gold trees. The non-stochastic version is however relatively successful in terms of tagging accuracy: we hypothesize that the LSTM model solved the problem and [...]
7 To put it clearly, we have two sets of learned embeddings: a set of lexicalized embeddings used for the input of the BiLSTM and a single unlexicalized embedding used for the input of the GCN.

Real-world Experiments
We evaluate our method on two real-world problems: a sentence comparison task (natural language inference, see Section 6.1) and a sentence classification problem (sentiment classification, see Section 6.2). Besides using the differentiable dynamic programming method, our approach also differs from previous work in that we use GCNs followed by a pooling operation, whereas most previous work used Tree-LSTMs. Unlike Tree-LSTMs, GCNs are trivial to parallelize over batches on GPU.

Natural Language Inference
The Natural Language Inference (NLI) problem is a task developed to test sentence understanding capacity. Given a premise sentence and a hypothesis sentence, the goal is to predict a relation between them: entailment, neutral or contradiction. We evaluate on the Stanford NLI (SNLI) and the Multi-genre NLI (MultiNLI) datasets. Our network is based on the decomposable attention (DA) model of Parikh et al. (2016). We induce the structure of both the premise and the hypothesis (see Equation 1 and Figure 1b). Then, we run a GCN over the tree structures followed by inter-sentence attention. Finally, we apply max-pooling for each sentence and feed both sentence embeddings into an MLP to predict the label. Intuitively, using GCNs yields a form of intra-attention. See the hyperparameters in Appendix A.2.
SNLI: The dataset contains almost 0.5 million training instances extracted from image captions (Bowman et al., 2015). We report results in Table 2.
Our model outperforms both the no intra-attention and the simple intra-attention baselines with one layer of GCN (+0.8) or two layers (+1.8). (In the simple intra-attention baseline, the attention weights are computed in the same way as scores for tree prediction, i.e. using Equation 1.) The improvements from using multiple GCN hops, here and on MultiNLI (Table 3b), suggest that higher-order information is beneficial. It is hard to compare different tree induction methods as they build on top of different baselines; however, it is clear that our model delivers results comparable with the most accurate tree induction methods (Kim et al., 2017; Liu and Lapata, 2018). The improvements from using latent structure exceed those reported in previous work.
MultiNLI: MultiNLI is a broad-coverage NLI corpus (Williams et al., 2018b): the sentence pairs originate from 5 different genres of written and spoken English. This dataset is particularly interesting because sentences are longer than in SNLI, making it more challenging for baseline models. We follow the evaluation setting in Williams et al. (2018b,a): we include the SNLI training data, use the matched development set for early stopping and evaluate on the matched test set. We use the same network and parameters as for SNLI. We report results in Table 3b.
The DA baseline ('No Intra Attention') performs slightly better (+0.6%) than the original BiLSTM baseline. Our latent tree model significantly improves over our baseline, either with a single-layer GCN (+3.4%) or with a 2-layer GCN (+4.9%). We observe a larger gap than on SNLI, which is expected given that MultiNLI is more complex. We perform extra ablation tests on MultiNLI in Section 6.3.

Sentiment Classification
We experiment on the Stanford Sentiment Classification dataset (Socher et al., 2013). The original dataset contains predicted constituency structure with manual sentiment labeling for each phrase. By definition, latent tree models cannot use the internal phrase annotation. We follow the setting of Niculae et al. (2018) and compare to them in two set-ups: (1) with syntactic dependency trees predicted by CoreNLP (Manning et al., 2014); (2) with latent dependency trees. Results are reported in Table 3a.
First, we observe that the bag of bigrams baseline of Socher et al. (2013) achieves results comparable to all structured models. This suggests that the dataset may not be well suited for evaluating structure induction methods. Our latent dependency model slightly improves (+0.8) over the CoreNLP baseline. However, we observe that while our baseline is better than the one of Niculae et al. (2018), their latent tree model slightly outperforms ours (+0.1). We hypothesize that graph convolutions may not be optimal for this task.

Analysis
(Ablations) In order to test if the tree constraint is important, we do ablations on MultiNLI with two models: one with a latent projective tree variable (i.e. our full model) and one with a latent head selection model that does not impose any constraints on the structure. The estimation approach and the model are identical, except for the lack of the tree constraint (and hence dynamic programming) in the ablated model. We report results on development sets in Table 3c. We observe that the latent tree models outperform the alternatives. Previous work (e.g., Niculae et al., 2018) included comparison with balanced trees, flat trees and left-to-right (or right-to-left) chains. Flat trees are pointless with the GCN + DA combination: the corresponding pooling operation is already done in DA. Though balanced trees are natural with bottom-up computation of TreeLSTMs, for GCNs they would result in embedding essentially random subsets of words. Consequently, we compare only to left-to-right chains of dependencies. This approach is substantially less accurate than our methods, especially for out-of-domain (i.e. mismatched) data.
(Grammar) We also investigate the structure of the induced grammar. We report the latent structure of three sentences in Figure 3. We observe that sentences are divided into spans, where each span is represented with a series of left dependencies. Surprisingly, the model chooses to use only left-to-right dependencies. The neural network does not include an RNN layer, so this may suggest that the grammar is trying to reproduce a recurrent model while also segmenting the sentence into phrases.
(Speed) We use an $O(n^3)$-time parsing algorithm. Nevertheless, our model is efficient: one epoch on SNLI takes 470 seconds, only 140 seconds longer than with the $O(n^2)$-time latent-head version of our model (roughly equivalent to classic self-attention). The latter model is computed on GPU (Titan X) while ours uses CPU (Xeon E5-2620) for the dynamic program and GPU for running the rest of the network.

Related work
Recently, there has been growing interest in providing an inductive bias in neural networks by forcing layers to represent tree structures (Kim et al., 2017; Maillard et al., 2017; Choi et al., 2018; Niculae et al., 2018; Williams et al., 2018a; Liu and Lapata, 2018). Maillard et al. (2017) also operate on a chart but, rather than modeling discrete trees, use a soft-gating approach to mix representations of constituents in each given cell. While these models showed consistent improvement over comparable baselines, they do not seem to explicitly capture syntactic or semantic structures (Williams et al., 2018a). Nangia and Bowman (2018) introduced the ListOps task, where the latent structure is essential for making correct downstream predictions. Surprisingly, the models of Williams et al. (2018a) and Choi et al. (2018) failed. Much recent work in this context relies on latent variables, though we are not aware of any work closely related to ours. Differentiable structured layers in neural networks have been explored for semi-supervised parsing, for example by learning an auxiliary task on unlabelled data (Peng et al., 2018) or using a variational autoencoder (Corro and Titov, 2019).
Besides research focused on inducing task-specific structures, another line of work, grammar induction, focuses on unsupervised induction of linguistic structures. These methods typically rely on unlabeled texts and are evaluated by comparing the induced structures to actual syntactic annotation (Klein and Manning, 2005; Shen et al., 2018; Htut et al., 2018).

Conclusions
We introduced a novel approach to latent tree learning: a relaxed version of stochastic differentiable dynamic programming which allows for efficient sampling of projective dependency trees and enables end-to-end differentiation. We demonstrate the effectiveness of our approach on both synthetic and real tasks. The analyses confirm the importance of the tree constraint. Future work will investigate constituency structures and new neural architectures for latent structure incorporation.

A Neural Parametrization
(Implementation) We implemented our neural networks with the C++ API of the Dynet library (Neubig et al., 2017). The continuous relaxation of the parsing algorithm is implemented as a custom computation node.
(Training) All networks are trained with Adam initialized with a learning rate of 0.0001 and batches of size 64. If the dev score did not improve in the last 5 iterations, we multiply the learning rate by 0.9 and load the best known model on dev. For the ListOps task, we run a maximum of 100 epochs, with exactly 100 updates per epoch. For NLI and SST tasks, we run a maximum of 200 epochs, with exactly 8500 and 100 updates per epoch, respectively.
All MLPs and GCNs have a dropout ratio of 0.2 except for the ListOps task where there is no dropout. We clip the gradient if its norm exceeds 5.

A.1 ListOps Valency Tagging
(Dependency Parser) Embeddings are of size 100. The BiLSTM is composed of two stacks (i.e. we first run a left-to-right and a right-to-left LSTM, then we concatenate their outputs and finally run a left-to-right and a right-to-left LSTM again) with one single hidden layer of size 100. The initial states of the LSTMs are fixed to zero.
The MLPs of the dotted attention have 2 layers of size 100 and a ReLU activation function.
(Tagger) The unique embedding is of size 100. The GCN has a single layer of size 100 and a ReLU activation. Then, the tagger is composed of an MLP with a layer of size 100 and a ReLU activation followed by a linear projection into the output space (i.e. no bias, no non-linearity).

A.2 Natural Language Inference
All activation functions are ReLU. The inter-attention part and the classifier are exactly the same as in the model of Parikh et al. (2016).
(Embeddings) Word embeddings of size 300 are initialized with Glove and are not updated during training. We initialize 100 unknown word embeddings where each value is sampled from the normal distribution. Unknown words are mapped using a hashing method.
(GCN) The embeddings are first passed through a one-layer MLP with an output size of 200. The dotted attention is computed by two MLPs with two layers of size 200 each. Functions f(), g() and h() in the GCN layers are one-layer MLPs without activation function. The σ activation function of a GCN is ReLU. We use dense connections for the GCN.

A.3 Sentiment Classification
(Embeddings) We use GloVe embeddings of size 300. We learn the unknown word embeddings. Then, we compute context-sensitive embeddings with a single-stack/single-layer BiLSTM with a hidden layer of size 100.
(GCN) The dotted attention is computed by two MLPs with one layer of size 300 each. There is no distance bias in this model. Functions f(), g() and h() in the GCN layers are one-layer MLPs without activation function. The σ activation function of a GCN is ReLU. We do not use dense connections in this model.

(Output) We use a max-pooling operation on the GCN outputs followed by a single-layer MLP of size 300.

B Illustration of the Continuous Relaxation
To give an intuition of the continuous relaxation, we plot the arg max function and the penalized arg max in Figure 4. We plot the first output for input $(x_1, x_2, 0)$.

C ListOps Training
We plot tagging accuracy and attachment score with respect to the training epoch in Figure 5. On the one hand, we observe that the non-stochastic version converges much faster in both metrics: we suspect that it develops an alternative protocol to pass information about valencies from the LSTM to the GCN. On the other hand, MC sampling may explore the search space better, but it is slower to converge. We stress that training with MC estimation results in the latent tree corresponding (almost) perfectly to the gold grammar.

D Fast differentiable dynamic program implementation
In order to speed up training, we implement a fast differentiable dynamic program (DDP) as a custom computational node in Dynet and use it in a static graph. Instead of relying on masking, we add an input to the DDP node that contains the sentence size: therefore, even if the size of the graph is fixed, the cubic-time algorithm is run on the true input length only. Moreover, instead of allocating memory with the standard library functionality, we use the fast scratch memory allocator of Dynet.