Learning from Non-Binary Constituency Trees via Tensor Decomposition

Processing sentence constituency trees in binarised form is a common approach in the literature. However, constituency trees are non-binary by nature, and the binarisation procedure deeply changes their structure, pushing apart constituents that are actually close. In this work, we introduce a new approach to non-binary constituency trees which leverages tensor-based models. In particular, we show how a powerful composition function based on the canonical tensor decomposition can exploit such a rich structure. A key point of our approach is the weight sharing constraint imposed on the factor matrices, which limits the number of model parameters. Finally, we introduce a Tree-LSTM model which takes advantage of this composition function, and we experimentally assess its performance on different NLP tasks.

hidden state) of a tree node combining the representation of its constituents (i.e. the hidden states of its child nodes). Then, the hidden state of the root (i.e. the whole sentence) is taken as the sentence encoding. The Matrix-Vector Recurrent Neural Network (MV-RNN) (Socher et al., 2012) and the Recursive Neural Tensor Network (RNTN) (Socher et al., 2013) apply the RecNN architecture to binary constituency trees using complex composition functions. Tai et al. (2015) extend the well-known Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) architecture to tree-structured data. They propose two different Tree-LSTMs (which we discuss in Section 2.1): the N-ary Tree-LSTM defines a composition function which considers constituent order, while the Child-Sum Tree-LSTM ignores such an order. However, only the former model is applied to binary constituency trees. The latter is applied to dependency trees, which are another kind of tree representation for sentences and are out of our scope.
In recent years, Tree-LSTM has been used as a building block to develop more sophisticated models. For example, Liu et al. (2017), Kim et al. (2019) and Shen et al. (2020) build new Tree-LSTM models which define dynamic composition functions depending on syntactic categories (i.e. Part-Of-Speech tags). Instead, Teng and Zhang (2017) introduce a Bidirectional Tree-LSTM which takes advantage of both parsing directions: bottom-up and top-down. As we stated before, constituency trees are intrinsically bottom-up; to this end, the authors introduce a first bottom-up pass, called head lexicalization, to propagate information from leaves to the root. All these models are applied only to binary constituency trees.
Thus far, we have shown that most models compute sentence encodings starting from binary constituency trees. This simplification solves one crucial problem of tree-structured data: the variable number of child nodes. However, the price to pay is the loss of structural information. For example, in Fig. 1b and Fig. 1c we report the constituency tree and the binary constituency tree of the sentence "Effective but too-tepid biopic". Comparing the two representations, we can observe that the binary tree has one more node, which breaks the ternary relation of the non-binary tree; in general, to break a node with L child nodes, we need to add L − 2 new nodes. All these new nodes create a chain which moves the child nodes of the n-ary relation away from their parent. Their composition is obtained by considering one child at a time, as happens in a sequence representation. Hence, binarisation removes the equality among child nodes, with the risk of weakening the contribution of child nodes that are moved far away from their parent and strengthening the contribution of the ones that remain close.
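The node-count argument above can be checked on a toy example. The following is a minimal sketch, not the binarisation used in the experiments: the nested-list tree encoding, the right-branching strategy, the helper names and the bracketing of the example sentence (with POS nodes omitted) are all assumptions made for illustration.

```python
# Trees are nested lists [label, child, child, ...]; leaves are plain strings.

def binarise(tree):
    """Right-branching binarisation: a node with L > 2 children becomes a chain."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [binarise(c) for c in tree[1:]]
    while len(children) > 2:
        # Fold the two rightmost children under a new intermediate node.
        children = children[:-2] + [[label + "|", children[-2], children[-1]]]
    return [label] + children

def internal_nodes(tree):
    """Count the nodes that have children."""
    if isinstance(tree, str):
        return 0
    return 1 + sum(internal_nodes(c) for c in tree[1:])

def depth_of(tree, word, depth=0):
    """Number of edges between the root of `tree` and the leaf `word`."""
    if isinstance(tree, str):
        return depth if tree == word else None
    for child in tree[1:]:
        found = depth_of(child, word, depth + 1)
        if found is not None:
            return found
    return None

# Hypothetical bracketing of the example sentence.
tree = ["NP", ["ADJP", "Effective", "but", "too-tepid"], "biopic"]
binary = binarise(tree)

# One ternary node (L = 3) requires L - 2 = 1 additional node.
added = internal_nodes(binary) - internal_nodes(tree)

# A node with five children requires three additional nodes.
wide = ["X", "a", "b", "c", "d", "e"]
added_wide = internal_nodes(binarise(wide)) - internal_nodes(wide)
```

In this sketch the word "too-tepid" ends up one edge further from the root after binarisation, while "Effective" keeps its depth, which is the asymmetry among siblings discussed above.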
As far as we know, the only work which builds a model suitable for non-binary constituency trees is TreeNet (Cheng et al., 2018). The idea is to consider all child nodes in a chain: the hidden state of a node depends on the hidden state of its left sibling and its rightmost child. Even if the model itself works with non-binary trees, the composition function it expresses is binary, since it always composes two elements. We discuss this observation in detail in Sec. 3.
The definition of models for non-binary constituency trees requires going beyond the standard definition of composition function. Standard RecNNs define learnable composition functions which are based on the summation of the contribution of each constituent. Castellana and Bacciu (2020a) proposed a generalisation of such sum-based composition functions leveraging more expressive multi-affine maps represented as tensors. The number of parameters required by the full-tensorial approach, which is exponential in the tree out-degree (i.e. the maximum number of children for each node in the tree), can be controlled by applying a tensor decomposition. The tensorial models outperform sum-based models, especially when the tree out-degree increases (Castellana and Bacciu, 2020a; Castellana and Bacciu, 2020b).
In this paper, we show that non-binary constituency trees can be effectively exploited to improve predictive performance in NLP tasks, and that more powerful composition functions are necessary to take advantage of such a rich representation. To this end, we introduce two new Tree-LSTM models which leverage canonical tensor decomposition: the former is suitable for binarised constituency trees, while the latter can process general non-binary constituency trees by imposing weight sharing on the tensor decomposition factors. Finally, we test the quality of the sentence encodings produced by our models on different NLP tasks, showing that the combination of a rich representation and a powerful composition function is able to outperform baseline models using the same number of parameters.

Related Models
In this section, we discuss three architectures from the literature that are related to the approach put forward in this paper and are used as baselines in our empirical analysis.

Tree-Structured LSTM
The Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is one of the most popular neural architectures for processing sequential data. Tai et al. (2015) proposed two extensions of this architecture to handle tree-structured information: the L-ary Tree-LSTM and the Child-Sum Tree-LSTM. Both models propagate the information along the input tree structure in a bottom-up fashion (i.e. from the leaves to the root). Hence, at each node, the Tree-LSTM cell aggregates the hidden states of the children to compute its gates, leveraging sum-based composition functions.
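L-ary Tree-LSTM (Tai et al., 2015) assumes that child nodes are ordered, so that they can be indexed from 1 to $L$, where $L$ denotes the maximum node out-degree, i.e. the maximum number of children a node can have. Let $v$ be a generic node; we indicate with $h_{vj} \in \mathbb{R}^d$ and $c_{vj} \in \mathbb{R}^d$ the hidden state and the memory cell of its $j$-th child node, respectively. The L-ary Tree-LSTM cell computation is defined by the following equations:
$$
\begin{aligned}
i_v &= \sigma\Big(W^{i} x_v + \sum_{j=1}^{L} U^{i}_{j} h_{vj} + b^{i}\Big), \\
o_v &= \sigma\Big(W^{o} x_v + \sum_{j=1}^{L} U^{o}_{j} h_{vj} + b^{o}\Big), \\
u_v &= \tanh\Big(W^{u} x_v + \sum_{j=1}^{L} U^{u}_{j} h_{vj} + b^{u}\Big), \\
f_{vk} &= \sigma\Big(W^{f} x_v + \sum_{j=1}^{L} U^{f}_{kj} h_{vj} + b^{f}\Big), \\
c_v &= i_v \odot u_v + \sum_{k=1}^{L} f_{vk} \odot c_{vk}, \\
h_v &= o_v \odot \tanh(c_v),
\end{aligned}
\tag{1}
$$
where $c_v \in \mathbb{R}^d$, $h_v \in \mathbb{R}^d$ and $x_v \in \mathbb{R}^n$ are the memory cell state, the hidden state and the label of node $v$; $\{i_v, o_v, u_v\} \in \mathbb{R}^d$ are the input gate, the output gate and the update value, respectively, and $f_{vk}$ is the forget gate associated with the $k$-th child of $v$. The symbol $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes the elementwise product.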
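We apply this model to binary constituency trees. Hence, the input tree has labels (i.e. the words) attached only to leaf nodes. In particular, we consider $x_v$ as the vector representation of a word in the sentence; internal nodes do not have any input label (see Fig. 2a). In the rest of the paper, we refer to this model as Binary Sum-LSTM.
Child-Sum Tree-LSTM (Tai et al., 2015) assumes that there is no order among child nodes. Let $v$ be a generic node and $Ch(v)$ the set of its child nodes, with $h_j$ and $c_j$ the hidden state and memory cell of child $j \in Ch(v)$; the hidden state $h_v$ is computed as:
$$
\begin{aligned}
i_v &= \sigma\Big(W^{i} x_v + U^{i} \sum_{j \in Ch(v)} h_{j} + b^{i}\Big), \\
o_v &= \sigma\Big(W^{o} x_v + U^{o} \sum_{j \in Ch(v)} h_{j} + b^{o}\Big), \\
u_v &= \tanh\Big(W^{u} x_v + U^{u} \sum_{j \in Ch(v)} h_{j} + b^{u}\Big), \\
f_{vk} &= \sigma\big(W^{f} x_v + U^{f} h_{k} + b^{f}\big), \quad k \in Ch(v), \\
c_v &= i_v \odot u_v + \sum_{k \in Ch(v)} f_{vk} \odot c_{k}, \\
h_v &= o_v \odot \tanh(c_v).
\end{aligned}
\tag{2}
$$
Comparing Eq. (2) with Eq. (1), we observe that the Child-Sum Tree-LSTM can be derived from the L-ary Tree-LSTM by imposing weight sharing across child positions. In fact, in Eq. (2) the parameter $U$ is the same for each child, while in Eq. (1) the subscript $j$ of the parameter $U_j$ denotes that the weights are position-dependent.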
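The effect of weight sharing across child positions can be illustrated numerically. The snippet below is a simplified sketch of a single gate only: it drops the input term and the bias, uses random matrices, and all names (`positional_gate`, `shared_gate`, `U_pos`, `U_shared`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3                                  # hidden size, number of children
children = rng.normal(size=(L, d))           # row j is the hidden state of child j

U_pos = rng.normal(size=(L, d, d))           # one matrix U_j per child position
U_shared = rng.normal(size=(d, d))           # a single matrix U for every child


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def positional_gate(h, U):
    # sigma(sum_j U_j h_j): the contribution of a child depends on its position.
    return sigmoid(sum(U[j] @ h[j] for j in range(len(h))))


def shared_gate(h, U):
    # sigma(U sum_j h_j): every child is transformed by the same matrix.
    return sigmoid(U @ h.sum(axis=0))


perm = [2, 0, 1]                             # reorder the children
shared_is_invariant = np.allclose(
    shared_gate(children, U_shared), shared_gate(children[perm], U_shared)
)
positional_is_invariant = np.allclose(
    positional_gate(children, U_pos), positional_gate(children[perm], U_pos)
)

# Tying every positional matrix to the same U recovers the shared gate.
U_tied = np.stack([U_shared] * L)
tied_equals_shared = np.allclose(
    positional_gate(children, U_tied), shared_gate(children, U_shared)
)
```

With shared weights the gate does not change when the children are permuted, whereas the positional version generally does; this is the sense in which the Child-Sum cell ignores child order.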
In Tai et al. (2015), this model has been applied only to dependency trees. In this work, however, we use it on non-binary constituency trees. Again, we consider $x_v$ as the vector representation of a word in the sentence; therefore, input labels are attached only to leaf nodes (see Fig. 2c). In the remainder of the paper, we refer to this model as Child-Sum LSTM.

TreeNet
TreeNet (Cheng et al., 2018) has been introduced with the aim of learning from unconstrained tree-structured data (i.e. trees whose out-degree is not fixed). The idea is to process child nodes as a sequence and to link only the rightmost child node to the parent node. Hence, each node composes the information of its left sibling and its rightmost child. In the following, $\{h_{vs}, c_{vs}\} \in \mathbb{R}^d$ are the hidden state and memory cell of the left sibling of $v$, while $\{h_{vc}, c_{vc}\} \in \mathbb{R}^d$ are the hidden state and memory cell of the rightmost child node of $v$.
If the node $v$ is a leaf, its hidden state depends solely on the input label $x_v$ (Eq. 4).
Comparing the TreeNet cell with Eq. (1), we can observe that it is a binary Tree-LSTM cell without the input and output gates. In fact, in both cells, all the gates are computed by composing two constituents. TreeNet defines the constituents of a node as its left sibling and its rightmost child, while the binary Tree-LSTM directly uses its left and right child nodes. Hence, we argue that the difference between the two lies solely in how the tree is binarised, rather than in how the tree is processed. In Fig. 1d, we show an example of how a constituency tree is binarised according to TreeNet; the constituent ADJP of the original constituency tree (see Fig. 1a) is composed of three words: "Effective", "but", "too-tepid". TreeNet breaks this ternary relation by processing one word at a time; as we can see in Fig. 1d, the node which holds the first word "Effective" is combined with a bottom node, since it does not have a left sibling. The result is then fused with the word "but". Finally, the word "too-tepid" is combined with the result of the previous composition, obtaining the encoding of the constituent ADJP. Hence, the original ternary relation is broken into a sequence of three binary relations (one for each word), each of which combines the composition of the previous words with the new word.
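The chain of binary compositions described above can be traced symbolically. This is only an illustrative sketch of the composition order: it does not implement the TreeNet cell, and the function name `treenet_order` and the `BOTTOM` placeholder are hypothetical.

```python
def treenet_order(children, bottom="BOTTOM"):
    """Fold the children left to right.

    Each step pairs the accumulated left context (initially a bottom
    element, since the first child has no left sibling) with the next child.
    Returns the final bracketing and the list of binary steps.
    """
    state = bottom
    steps = []
    for child in children:
        steps.append((state, child))
        state = f"({state} {child})"
    return state, steps


encoding, steps = treenet_order(["Effective", "but", "too-tepid"])
for left, right in steps:
    print(f"compose {left!r} with {right!r}")
```

The ternary constituent yields three binary steps, one per word, and the last word is composed with everything accumulated before it.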

Canonical Tree-LSTM
This section introduces new Tree-LSTM models that extend the recurrent approach to tensor-based processing tailored for constituency trees. These two models rely on the Tensor Tree-LSTM (Castellana and Bacciu, 2020a) and on the canonical tensor decomposition. We focus on this decomposition since it can be combined with a weight sharing constraint, yielding a composition function which can exploit non-binary constituency trees without increasing the number of model parameters.
Since these models are applied to constituency trees, we define them only on the nodes which apply composition functions (i.e. the internal nodes). The input label $x_v$, which is attached only to leaf nodes, is processed using the same computation described in Eq. 4.

Canonical Decomposition
The canonical decomposition (usually denoted by CP) factorises a tensor into a sum of rank-one component tensors, i.e. tensors obtained as the outer product of vectors. For example, let $T \in \mathbb{R}^{d_1 \times d_2 \times \dots \times d_L}$ be a general $L$-th order tensor; its CP decomposition is defined by (Kolda and Bader, 2009):
$$
T(i_1, \dots, i_L) = \sum_{z=1}^{r} U_1(i_1, z)\, U_2(i_2, z) \cdots U_L(i_L, z),
$$
where brackets are used to denote entries of vectors, matrices and tensors (e.g. $W(i, j)$ denotes the entry in the $i$-th row and $j$-th column of $W$). $U_1 \in \mathbb{R}^{d_1 \times r}, \dots, U_L \in \mathbb{R}^{d_L \times r}$ are the factor matrices of the decomposition. Each factor matrix contains $r$ vectors, which are the basis of the rank-one tensors. The value of $r$ indicates the number of such tensors which are summed to obtain the original tensor $T$ and is referred to as the tensor rank.
Following Castellana and Bacciu (2020a), we apply the decomposition to a tensor which represents a multi-affine map. Note that in this case the tensor that should be decomposed is not known, since it is the parameter of the recursive model and is learned from data. Instead, we assume that such a tensor is already decomposed, making the decomposition factors the new parameters of the recursive model. Let $\phi^{T} : \mathbb{R}^{d_1} \times \dots \times \mathbb{R}^{d_L} \to \mathbb{R}^{K}$ be a multi-affine map whose parameter is the tensor $T \in \mathbb{R}^{(d_1+1) \times \dots \times (d_L+1) \times K}$; applying the CP decomposition to $T$ we obtain:
$$
\phi^{T}(a_1, \dots, a_L)(k) = \sum_{z=1}^{r} Q(k, z) \prod_{j=1}^{L} \Big( \sum_{i_j=1}^{d_j+1} U_j(i_j, z)\, \bar{a}_j(i_j) \Big),
$$
where $\bar{a} = [a; 1]$ denotes the homogeneous coordinates of vector $a$. From the equation, we can observe that the CP decomposition defines a multi-affine map which applies each factor matrix to the corresponding input vector (i.e. the one on the same mode), obtaining a vector $e_j \in \mathbb{R}^{r}$ for each mode $j \in \{1, \dots, L\}$. Then, these vectors are multiplied element-wise and the result is mapped to the output space $\mathbb{R}^{K}$ by the matrix $Q$, i.e. the factor matrix on the output (the last) mode.
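The equivalence between the full-tensor map and its factored form can be verified numerically. The following NumPy sketch assumes two input modes, small arbitrary dimensions and random factors; the variable names and the convention that factor matrices have shape (mode size) × r are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, K, r = 3, 4, 5, 6

# Factor matrices act on homogeneous inputs, hence the extra row.
U1 = rng.normal(size=(d1 + 1, r))
U2 = rng.normal(size=(d2 + 1, r))
Q = rng.normal(size=(K, r))                  # factor matrix of the output mode

# Full tensor: T(i1, i2, k) = sum_z U1(i1, z) * U2(i2, z) * Q(k, z)
T = np.einsum("iz,jz,kz->ijk", U1, U2, Q)

a1, a2 = rng.normal(size=d1), rng.normal(size=d2)
a1_bar, a2_bar = np.append(a1, 1.0), np.append(a2, 1.0)   # [a; 1]

# Multi-affine map computed by contracting the full tensor with both inputs.
full = np.einsum("ijk,i,j->k", T, a1_bar, a2_bar)

# Same map computed from the factors only.
e1, e2 = U1.T @ a1_bar, U2.T @ a2_bar        # one vector in R^r per input mode
factored = Q @ (e1 * e2)                     # element-wise product, then output map

same_result = np.allclose(full, factored)
```

The factored computation never materialises `T`, so its cost and parameter count grow with the number of modes only through the number of factor matrices.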

Binary Canonical Tree-LSTM
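The Binary Tensor Tree-LSTM (Castellana and Bacciu, 2020a) computes the hidden state of an internal node $v$ by:
$$
\begin{aligned}
i_v &= \sigma\big(\phi^{I}(h_{vl}, h_{vr})\big), \\
o_v &= \sigma\big(\phi^{O}(h_{vl}, h_{vr})\big), \\
u_v &= \tanh\big(\phi^{U}(h_{vl}, h_{vr})\big), \\
f_{vl} &= \sigma\big(\phi^{F_l}(h_{vl}, h_{vr})\big), \qquad f_{vr} = \sigma\big(\phi^{F_r}(h_{vl}, h_{vr})\big), \\
c_v &= i_v \odot u_v + f_{vl} \odot c_{vl} + f_{vr} \odot c_{vr}, \\
h_v &= o_v \odot \tanh(c_v).
\end{aligned}
$$
All the terms (except the hidden state and the memory cell) are computed by applying a multi-affine map $\phi^{T}(\cdot)$ to the left and right child hidden states, i.e. $h_{vl}$ and $h_{vr}$, respectively. The superscript $T$ indicates the parameters of the multi-affine map; hence, the parameters $\{I, O, U, F_l, F_r\}$ define the LSTM cell computation.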
The Binary Canonical Tree-LSTM (Binary CP-LSTM) applies the canonical decomposition to the tensor parameters (see Fig. 2b), with the bias of each factor matrix made explicit; the value $r$ is the decomposition rank. We use the superscript $t \in \{i, o, u, f_l, f_r\}$ to indicate the different parameters of each multi-affine map. The number of parameters required by the Binary CP-LSTM is $O(dr)$.
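To give a feeling for the $O(dr)$ claim, the sketch below counts the parameters of a single composition map under three parameterisations. The counting conventions (explicit bias vectors for the CP factors, no input-label term for the sum-based map) and the sizes $d = r = 100$ are assumptions chosen for illustration, not values taken from the experiments.

```python
def full_tensor_params(d, L):
    """Full multi-affine map: a tensor of shape (d+1) x ... x (d+1) x d."""
    return (d + 1) ** L * d


def cp_params(d, L, r):
    """CP factors: L input factors (d x r plus bias r) and one output factor (r x d plus bias d)."""
    return L * (d + 1) * r + (r + 1) * d


def sum_params(d, L):
    """Sum-based map: one d x d matrix per child position plus one bias."""
    return L * d * d + d


d, r = 100, 100   # hypothetical hidden size and rank
for L in (2, 5):
    print(L, full_tensor_params(d, L), cp_params(d, L, r), sum_params(d, L))
```

For a binary node the full tensor already needs about a million parameters per map, while the CP form stays in the same range as the sum-based one; for larger out-degrees the gap widens exponentially.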

Invariant Canonical Tree-LSTM
The Invariant Canonical Tree-LSTM (Invariant CP-LSTM) makes it possible to encode information from constituency trees which are not binarised. Let $v$ be an internal node; as in Eq. (2), we indicate with $Ch(v)$ the set of child nodes of $v$. $\{I, O, U\}$ are the parameters of the multi-affine maps used to compute the input gate, the output gate and the update value, respectively. Again, we apply the canonical decomposition to the parameter tensors to define the Invariant CP-LSTM. Moreover, we impose weight sharing among the input factor matrices (see Fig. 2d): $U^{t} \in \mathbb{R}^{d \times r}$, $b^{t} \in \mathbb{R}^{r}$ is the factor matrix shared across all input modes and $Q^{t} \in \mathbb{R}^{r \times d}$, $q^{t} \in \mathbb{R}^{d}$ is the factor matrix on the last mode; the value $r$ is the decomposition rank. Hence, the number of parameters required by the Invariant CP-LSTM is still $O(dr)$. Thanks to the weight sharing, we are able to process non-binary trees without adding new parameters. As we highlighted in Sec. 2.1, weight sharing makes the model invariant to the order of child nodes, hence the name Invariant CP-LSTM.
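A weight-shared canonical composition of this kind can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the model's actual gate equations: it computes only the pre-activation of one map, uses random parameters, and the function name `invariant_cp` and the row-vector convention (`h @ U`) are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 6

U, b = rng.normal(size=(d, r)), rng.normal(size=r)   # factor shared by all input modes
Q, q = rng.normal(size=(r, d)), rng.normal(size=d)   # factor on the output mode


def invariant_cp(children):
    """Compose any number of child hidden states (array of shape (n_children, d))."""
    e = children @ U + b              # one vector in R^r per child, same U for all
    return e.prod(axis=0) @ Q + q     # element-wise product over children, then output map


two = rng.normal(size=(2, d))
five = rng.normal(size=(5, d))

out_two = invariant_cp(two)
out_five = invariant_cp(five)

# Reordering the children leaves the result unchanged.
order_invariant = np.allclose(out_five, invariant_cp(five[[4, 2, 0, 3, 1]]))
```

The same four arrays handle nodes with two or five children, so the parameter count does not depend on the out-degree, and the product over children is unaffected by their order.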

Experiments
We test the models introduced in the previous section on two tasks: sentence classification and semantic textual similarity.
Sentence Classification. The goal of this task is to predict the class of the given input sentence. Hence, we use a Tree-LSTM to encode the constituency tree of the input sentence in a succinct representation ($h_{root}$), and then we feed it to a single-layer Neural Network to predict the class $y \in \{1, \dots, m\}$. Here, $s \in \mathbb{R}^{s}$ is the hidden representation of the classifier, and the classifier parameters are the hidden-layer weights and bias, in $\mathbb{R}^{d \times s}$ and $\mathbb{R}^{s}$, and the output-layer weights and bias, in $\mathbb{R}^{s \times m}$ and $\mathbb{R}^{m}$. Following Tai et al. (2015), we use dropout (Srivastava et al., 2014) with rate 0.5 on both $h_{root}$ and $h_s$. We test our models on three different classification datasets:
• SST-5: the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013) contains sentences that are classified with a fine-grained sentiment which goes from 1 to 5 (very negative, negative, neutral, positive and very positive);
• SST-2: identical to SST-5, but with a binary sentiment class; neutral sentences are removed and all negative (positive) sentences are collapsed into one class;
• TREC: the TREC dataset (Li and Roth, 2002) contains questions that are annotated with six classes which indicate the question type.
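Semantic Textual Similarity. The goal of this task is to predict the semantic similarity between two sentences. Let $a$ and $b$ be the two sentences; we produce their encodings $h_a$ and $h_b$ by applying a Tree-LSTM to both sentence constituency trees and taking the hidden state of the root. Then, we compute the similarity score $y_r \in [1, m]$ as in Tai et al. (2015):
$$
\begin{aligned}
h_{\times} &= h_a \odot h_b, \qquad h_{+} = |h_a - h_b|, \\
s &= \sigma\big(W_{\times} h_{\times} + W_{+} h_{+} + b\big), \\
p_r &= \mathrm{softmax}(W_r s + b_r), \qquad y_r = r^{\top} p_r,
\end{aligned}
\tag{12}
$$
where $s \in \mathbb{R}^{s}$ is the hidden representation of the classifier, $\{W_{+}, W_{\times}\} \in \mathbb{R}^{d \times s}$, $b \in \mathbb{R}^{s}$, $W_r \in \mathbb{R}^{s \times m}$, $b_r \in \mathbb{R}^{m}$ are the classifier parameters and $r = [1, 2, \dots, m]$.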
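The similarity head of Tai et al. (2015) can be sketched as follows. The sizes and random parameters are placeholders, the weight matrices are stored transposed with respect to the shapes given in the text so that they can be applied with `@`, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s_dim, m = 6, 5, 5                     # encoding size, classifier size, max score

W_times = rng.normal(size=(s_dim, d))     # weights for the element-wise product
W_plus = rng.normal(size=(s_dim, d))      # weights for the absolute difference
b = rng.normal(size=s_dim)
W_r, b_r = rng.normal(size=(m, s_dim)), rng.normal(size=m)
r = np.arange(1, m + 1)                   # r = [1, 2, ..., m]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()


def similarity(h_a, h_b):
    h_times = h_a * h_b                   # angle-like feature
    h_plus = np.abs(h_a - h_b)            # distance-like feature
    s = sigmoid(W_times @ h_times + W_plus @ h_plus + b)
    p = softmax(W_r @ s + b_r)            # distribution over the scores 1..m
    return float(r @ p), p                # expected score and the distribution


h_a, h_b = rng.normal(size=d), rng.normal(size=d)
score, p = similarity(h_a, h_b)
```

Because the predicted score is the expectation of `r` under `p`, it always lies between 1 and `m`; both input features are symmetric, so swapping the two sentences gives the same score.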
It is also common to represent semantic similarity by attaching an entailment class to each pair of sentences. In this case, we predict the entailment class $y_e = \arg\max(p_e)$ starting from the distribution $p_e = \mathrm{softmax}(W_e s + b_e)$; $s$ is computed as in Eq. (12). We test our models on two different datasets:
• SICK-R: the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014) contains sentence pairs annotated with a relatedness score between 1 and 5;
• SICK-E: identical to SICK-R, but with entailment classes instead of relatedness scores; the entailment class indicates whether one sentence entails or contradicts the other (neutral, entailment, contradiction).

Implementation and training details
We have implemented all the models using PyTorch (Paszke et al., 2019) and the Deep Graph Library (Wang et al., 2019). The source code to reproduce the models and the experiments is publicly released. Constituency trees are built using the PCFG constituency parser of Stanford CoreNLP. We binarise them by computing the Chomsky Normal Form with the Natural Language Toolkit (Bird et al., 2009). To facilitate learning, we collapse all unary relations. In each task, we perform a grid search to find the best hyper-parameter configuration (see Appendix A for further details). For training, we use the AdaDelta (Zeiler, 2012) and Adam (Kingma and Ba, 2015) optimisers. In all experiments, we initialise our word representations using 300-dimensional GloVe vectors (Pennington et al., 2014). We fine-tune the word representations only on the SST dataset.
Table 1: Results obtained on the different tasks. All the values are accuracies except for SICK-R, whose score is Pearson's correlation multiplied by 100. The superscript * indicates that we re-implemented the model.

Results
paper on the TREC dataset. However, to be sure that there are no errors in our implementation, we ran the original code published by Cheng et al. (2018) with our experimental settings (i.e. model selection on the validation set and risk assessment on the test set) and obtained results comparable with those reported in this table. In the next paragraphs, we analyse in detail the results obtained on each dataset.
SST. The results obtained on the SST dataset (both with fine-grained and binary labels) do not show any improvement when using non-binary constituency trees. However, the comparison is unfair, since the original dataset provides labels on the internal nodes of binary constituency trees. By removing the binarisation, it is no longer possible to leverage such information during training. As a reference, note that the binary constituency tree data contains 119,413 labels, while the non-binary constituency tree data contains only 91,536 labels.
SICK. The results obtained on the SICK dataset show the advantage of combining a rich representation (such as non-binary trees) and a more powerful composition function. In fact, the Invariant CP-LSTM outperforms all the other models in both the entailment (SICK-E) and relatedness (SICK-R) tasks. It is worth pointing out that the Invariant CP-LSTM is the only model which benefits from the non-binary representation. In Fig. 3a we report the test accuracy of each model with respect to the input length (for each input pair, we consider the length of the longer sentence). Observing the plot, it is clear that most of the models struggle with long sentences. The Invariant CP-LSTM model, instead, reaches an accuracy of approximately 84%, while the other models stop around 81%. In Fig. 3b, we show the validation results on SICK-R obtained by all the models against the number of parameters they require. Observing the plot, it is clear that, thanks to the canonical decomposition and the weight sharing, we can build powerful composition functions using the same number of parameters as sum-based functions.
TREC. On the TREC dataset, the sum-based composition functions seem to be advantageous over their canonical counterparts. In fact, on both binary and non-binary trees, the sum-based Tree-LSTMs outperform the canonical-based Tree-LSTMs. Also, TreeNet (which is based on summation) reaches results comparable with the Child-Sum LSTM. We argue that summation is preferable for the question classification task, probably due to the intrinsic characteristics of question sentences.

Qualitative analysis on SICK-E
In this section, we analyse in detail the predictions of the Child-Sum LSTM and the Invariant CP-LSTM on the SICK-E dataset. In particular, we report the prediction on example #3991 of the test set. The input pair is composed of the following sentences:
A: The girl has red hair and eyebrows, several piercings in a ear and a tattoo on the back.
B: The girl has red hair and eyebrows, several piercings in a ear and no tattoo on the back.
The two sentences are exactly the same, except for the sub-phrase "a tattoo" in sentence A, which becomes "no tattoo" in sentence B. Hence, the expected output is "contradiction". In Fig. 4a we plot the constituency tree of the input sentence, indicating with ? the position where the two sentences differ. Also, we highlight all the nodes that are on the path between the ? and the root. These are the only nodes which have a different sub-tree in the two sentences. To analyse how the two models predict the final label, we study how the prediction changes going up through the structure. In Fig. 4b we report the output of the classifier fed with the hidden state pair $(h_{ai}, h_{bi})$, where $h_{ai}$ ($h_{bi}$) is the hidden state of node $i$ computed by the tree model on sentence A (B). On node 25, both models correctly predict the contradiction. However, going up through the structure, the Child-Sum LSTM changes the prediction to entailment, which is the final output on node 0. The change of the output label starts at node 6, which is the node where most of the information is aggregated; it seems that the Child-Sum LSTM performs a sort of average which softens the contribution of node 24 (the only one that should be taken into account). On the contrary, the Invariant CP-LSTM correctly propagates the information through the structure. Even if the model processes sub-phrases which are identical, their contribution does not influence the output. In fact, observing Fig. 4b, we can notice that the class predicted by the Invariant CP-LSTM is "contradiction" in all the nodes on the path between ? and the root.

Conclusion
In this paper, we show that using non-binary constituency trees can be beneficial, especially in semantic similarity tasks. Moreover, we highlight the need for powerful composition functions to exploit such a rich representation. To this end, we have introduced a new Tree-LSTM model which leverages the canonical tensor decomposition and weight sharing to process non-binary trees without adding new parameters.
Such results pave the way to the definition of new tensor models which leverage suitable tensor decompositions to take advantage of non-binary constituency trees. To this end, the next step would be the application of other tensor decompositions. Among these, the tensor train decomposition seems promising for defining new composition functions which are sensitive to the order of child nodes.
Ultimately, we would like to test multiple tensor-based models on different NLP tasks, studying the relation between the bias introduced by each different tensor decomposition and the intrinsic property of the task.