KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations

Syntactic parsers have dominated natural language understanding for decades. Yet, their syntactic interpretations are losing centrality in downstream tasks due to the success of large-scale textual representation learners. In this paper, we propose KERMIT (Kernel-inspired Encoder with Recursive Mechanism for Interpretable Trees) to embed symbolic syntactic parse trees into artificial neural networks and to visualize how syntax is used in inference. We experimented with KERMIT paired with two state-of-the-art transformer-based universal sentence encoders (BERT and XLNet) and showed that KERMIT can indeed boost their performance by effectively embedding human-coded universal syntactic representations in neural networks.

Traditional task-independent, symbolic, human-defined syntactic interpretations for sentences, which may be referred to as universal syntactic interpretations, are losing their centrality in language understanding systems due to the success of transformer-based neural networks (Vaswani et al., 2017) that have boosted performance on a wide variety of linguistic tasks (Devlin et al., 2018; Liu et al., 2019). There is evidence that universal sentence embeddings store bits of universal syntactic interpretations. Even if not explicitly designed for encoding syntax, these embeddings implicitly capture syntactic relations among words with different strategies. Transformers (Devlin et al., 2018; Liu et al., 2019) seem to capture syntactic relations among words by "focusing the attention". Yet, to be sure that syntax is encoded, many syntactic probes (Conneau et al., 2018) for neural networks have been designed to test for specific phenomena (Kovaleva et al., 2019; Jawahar et al., 2019; Hewitt and Manning, 2019; Ettinger, 2019; Goldberg, 2019) or for full syntactic trees (Hewitt and Manning, 2019; Mareček and Rosa, 2019). Indeed, some syntax is correctly encoded in these universal sentence embeddings.
However, universal sentence embeddings encode syntax in a way that is opaque and not so universal. Firstly, and perhaps surprisingly, task-adapted universal sentence embeddings encode syntax better than general universal sentence embeddings (Jawahar et al., 2019). Secondly, even if these embeddings contain syntactic information and may be "just another way in which traditional syntactic models are encoded" (Fodor and Pylyshyn, 1988), there is no clear view on how this information is encoded and, hence, on how syntactic information is holistically (Chalmers, 1992) used in inference. It is therefore difficult to envisage ways to symbolically control the behavior of neural networks.
In this paper, we investigate whether explicit universal syntactic interpretations can be used to improve state-of-the-art universal sentence embeddings and to create neural network architectures where syntax decisions are less obscure and, thus, syntactically explainable. For this purpose we propose KERMIT, a Kernel-inspired Encoder with a Recursive Mechanism for Interpretable Trees, and KERMITviz. KERMIT is a lightweight encoder that embeds syntactic parse trees in universal-syntax-encoding vectors by explicitly embedding subtrees in the representation space. KERMITviz is a visualizer to inspect how syntax is used in taking final decisions in specific tasks. We show that KERMIT can effectively embed different syntactic information and that KERMITviz can explain KERMIT's decisions. Furthermore, paired with universal sentence embeddings, KERMIT outperforms state-of-the-art models, BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), in three different downstream tasks, in line with the findings of Kuncoro et al. (2020) showing that traditional syntactic information is not represented in universal sentence embeddings.

Background and Related Work
Embedding symbolic syntactic or structured information within neural networks is a very active research field, given the impression that using pre-existing syntactic knowledge in neural networks can be beneficial for many tasks. Initial attempts tried to recursively encode structures in distributed representations to use them inside neural networks (Pollack, 1990; Goller and Kuechler, 1996). More recently, Socher et al. (2011) defined the notion of Recursive Neural Networks (RecNN), Recurrent Neural Networks applied to binary trees. Initially, these RecNNs were used to parse sentences and not to include pre-existing syntax in a final task (Socher et al., 2011). Then, they were used to encode pre-existing syntax in the specific task of sentiment analysis (Socher et al., 2012, 2013). With the rise of Long Short-Term Memories (LSTMs), Tai et al. (2015) and Zhang et al. (2016) independently proposed TreeLSTM as an adapted version of the LSTM that may use syntactic information. In TreeLSTM, the LSTM is applied following the structure of a binary tree instead of following an input sequence. In semantic relatedness and in sentiment classification, TreeLSTM has outperformed RecNN (Tai et al., 2015) by using pre-existing syntactic information. TreeLSTM has also been used to induce task-specific trees while learning a novel task (Choi et al., 2018). Moreover, Munkhdalai and Yu (2017) specialized LSTMs for binary and n-ary trees with their Neural Tree Indexers, and Strubell et al. (2018) encoded syntactic information by using multi-head attention within a transformer architecture.
However, there is a major problem with these methods for embedding syntactic structures in neural networks: it is unclear which parts of the parse trees are represented, and how. Hence, the behavior of neural networks that use these embeddings is obscure. It is then difficult to understand what kind of syntactic knowledge is encoded in the different layers and how this syntactic knowledge is used.
Some initial attempts to clarify which syntactic parts are encoded in embedding vectors exist. Zhang et al. (2018) have encoded parse trees by means of paths connecting the root of parse trees with words. Yet, these attempts are still far from completely representing parse trees.
For a long time, structural kernel functions have been the standard way to exploit syntactic information in learning, but these functions cannot be used within neural networks. Kernel machines (Cristianini and Shawe-Taylor, 2000) exploit these generally recursive structural kernel functions, which define a similarity measure between two trees by counting common substructures. Hence, these structural kernel functions are built over a clear, although hidden, space of substructures. Structural kernels have been defined for both constituency-based (Collins and Duffy, 2002; Moschitti, 2006) and dependency-based parse trees (Culotta and Sorensen, 2004). As the underlying spaces are well defined, it is even possible to extract back the substructures that are relevant in each decision (Pighin and Moschitti, 2010). However, these structural kernel functions are generally recursive algorithms that hide the real underlying space of features. Thus, structures are never represented as vectors in the target representation spaces, as these spaces are generally huge. It is then generally impossible to use these clear spaces in learning with neural networks.
In the field of structural kernels, distributed tree kernels (DTKs) (Zanzotto and Dell'Arciprete, 2012) have opened an interesting possibility. To reduce the computational cost of tree kernels, DTKs embed the huge space of substructures in a smaller space. This embedding is obtained by using recursive functions that are linear with respect to the tree size. Structures are thus represented as smaller vectors in an embedded space that stands for the original space of structures. Hence, DTKs open an interesting path to include clear syntactic information in neural network architectures (Zanzotto and Ferrone, 2017; Santilli and Zanzotto, 2018).

The model
This section introduces our Kernel-inspired Encoder with a Recursive Mechanism for Interpretable Trees (KERMIT) (Sec. 3.2) along with its visualizer KERMITviz (Sec. 3.3). KERMIT is a lightweight encoder for universal syntactic interpretations which can be used in combination with transformer-based networks such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019) (Fig. 1). Some preliminary notation is given in Section 3.1.

Preliminary notation
This section fixes the notation for parse trees, random vectors and operations on random vectors as these are core representations in our model to deal with universal syntactic interpretations.
Parse trees $T$ and parse subtrees $\tau$ are recursively represented as trees $t = (r, [t_1, \ldots, t_k])$ where $r$ is the label of the root of the tree and $[t_1, \ldots, t_k]$ is the list of child trees $t_i$. Leaves are represented as trees $t = (r, [])$ with an empty list of children, or directly as $t = r$.
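For concreteness, this recursive representation can be sketched directly as a small data structure; this is an illustrative sketch of ours, not the paper's implementation, and all names are assumptions.

```python
# A minimal sketch of the recursive tree representation t = (r, [t_1, ..., t_k]).
from typing import List, NamedTuple

class Tree(NamedTuple):
    label: str              # r, the label of the root node
    children: List["Tree"]  # [t_1, ..., t_k]; empty for leaves

def leaf(label: str) -> Tree:
    return Tree(label, [])

# "the dog" as a tiny constituency fragment: (NP (DT the) (NN dog))
np_tree = Tree("NP", [Tree("DT", [leaf("the")]), Tree("NN", [leaf("dog")])])
```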
On parse trees $T$, our model KERMIT requires the definition of three sets of subtrees: the set $N(T)$, the sets $S(t)$, and the set $S(T)$. The last two sets are defined according to the subtrees we want to model in the embeddings of the universal syntactic interpretations; we use the subtrees defined in Collins and Duffy (2002). The set $N(T)$ contains all the complete subtrees of $T$: given a tree $T$ and one of its nodes $r$, the complete subtree of $T$ from $r$ is the subtree rooted in $r$ that reaches the leaves (see the parse tree in Fig. 1 for an example). The set $S(t)$ contains the Collins and Duffy (2002) subtrees of a tree $t$. Finally, the set $S(T)$ is the union of the sets $S(t)$ for all the trees $t \in N(T)$, that is:

$$S(T) = \bigcup_{t \in N(T)} S(t)$$

and it contains the subtrees used during training and inference.

To build the untrained KERMIT encoder, we use the properties of random vectors drawn from a multivariate Gaussian distribution $\mathbf{v} \sim N(0, \frac{1}{\sqrt{d}}I)$. These vectors guarantee that $\mathbf{u}^T\mathbf{v} \approx 0$ if $\mathbf{u} \neq \mathbf{v}$ and $\mathbf{u}^T\mathbf{u} \approx 1$. This property is extremely important for interpretability. To compose vectors, we use the shuffled circular convolution $\mathbf{u} \otimes \mathbf{v}$, which approximately preserves these properties when its arguments are drawn from the multivariate Gaussian distribution above. This operation is a circular convolution (as for Holographic Reduced Representations (Plate, 1995)) with a permutation matrix $\Phi$: $\mathbf{u} \otimes \mathbf{v} = \mathbf{u} \ast \Phi\mathbf{v}$. This operation is extremely important for soundly composing node vectors.
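To make these properties tangible, here is a minimal numpy sketch of the random node vectors and of the shuffled circular convolution. This is our own illustrative realization, not the KERMIT code, and the helper names are assumptions.

```python
# Random node vectors v ~ N(0, 1/sqrt(d) I) and u ⊗ v = u * Φv.
import numpy as np

d = 4000
rng = np.random.default_rng(0)

def node_vector() -> np.ndarray:
    # entries with std 1/sqrt(d), so that v·v ≈ 1 and u·v ≈ 0 for u != v
    return rng.normal(0.0, 1.0 / np.sqrt(d), size=d)

# A fixed random permutation plays the role of the matrix Φ.
perm = rng.permutation(d)

def shuffled_circular_convolution(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Circular convolution of u with the permuted Φv, computed in
    # O(d log d) with the Fast Fourier Transform.
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v[perm])))

u, v = node_vector(), node_vector()
print(u @ u)   # ≈ 1
print(u @ v)   # ≈ 0 for independently drawn vectors
```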

The encoder for parse trees and its sub-network
KERMIT is a lightweight neural layer that allows the encoding and use of universal syntactic interpretations in neural network architectures. This layer has two main components. The first component is the KERMIT encoder, which encodes parse trees $T$ in embedding vectors:

$$\mathbf{y} = D(T) \quad (1)$$

and corresponds to the gray arrow and the gray box on the KERMIT side of Fig. 1. The second component is a multi-layer perceptron that exploits these embedding vectors:

$$\mathbf{z} = MLP(\mathbf{y}) \quad (2)$$

and corresponds to the green area on the KERMIT side of Fig. 1.

The KERMIT encoder $D$ in Eq. 1 stems from tree kernels (Collins and Duffy, 2002) and distributed tree kernels (Zanzotto and Dell'Arciprete, 2012). It makes it possible to represent parse trees in vector spaces $\mathbb{R}^d$ that embed huge spaces of subtrees $\mathbb{R}^n$, where $n$ is the huge number of different subtrees. Each tree $T$ is represented by using the set of its valid subtrees $S(T)$. The encoder is based on an embedding layer for tree node labels, $\mathbf{x}_r = W_o \mathbf{r} \in \mathbb{R}^d$, and on a recursive encoding function based on the shuffled circular convolution $\otimes$ introduced by Zanzotto and Dell'Arciprete (2012). The embedding layer is an untrained encoding function that maps one-hot vectors $\mathbf{r}$ of tree node labels $r$ to random vectors drawn from the previously introduced multivariate Gaussian distribution $N(0, \frac{1}{\sqrt{d}}I)$. Hence, $W_o \in \mathbb{R}^{d \times m}$ is a matrix of $m$ columns, where $m$ is the cardinality of the set of node labels and each column is one of these random vectors.

The function $D(T)$ is defined as the sum of a recursive function $\Upsilon(t)$ over parse trees:

$$D(T) = \sum_{t \in N(T)} \Upsilon(t)$$

where $N(T)$ is the previously defined set of complete subtrees of $T$. Then, $\Upsilon(t)$ is defined recursively: the vector $\mathbf{x}_r$ of the root label is composed with the codes of the child trees through the shuffled circular convolution $\otimes$, and each application is weighted by $\sqrt{\lambda}$, where $0 < \lambda \leq 1$ is a decaying factor penalizing large subtrees (Collins and Duffy, 2002; Zanzotto and Dell'Arciprete, 2012). By implementing $D(T)$ with a dynamic algorithm, its computational cost is linear with respect to the number of nodes of the tree $T$, and the cost of the basic function $\otimes$ is $d \log d$, where $d$ is the size of the representation space $\mathbb{R}^d$. In fact, the circular convolution can be computed with the Fast Fourier Transform.

Given its nature, the tree neural encoder has a nice interpretation as a very simple embedding layer, that is, a matrix $W^{\Upsilon} \in \mathbb{R}^{d \times n}$ that embeds the space of subtrees in the smaller space $\mathbb{R}^d$. This is in line with the Johnson-Lindenstrauss Transformation (Johnson and Lindenstrauss, 1984). Hence, $D(T)$ can be seen as:

$$D(T) = W^{\Upsilon} \mathbf{x} \quad (3)$$

where $\mathbf{x}$ is the vector representing the set of subtrees $S(T)$, that is, the sum of the vectors $\sqrt{\lambda^k}\,\mathbf{x}_t$, where $\mathbf{x}_t$ is the one-hot vector representing $t \in S(T)$, $\lambda$ is the decaying factor penalizing large trees, and $k$ is the number of nodes of the tree $t$. It is possible and easy to show that the columns $\mathbf{w}_i$ of $W^{\Upsilon}$ encode subtrees $t_i$ as recursive compositions $\Gamma(t_i)$ of their node vectors; for example, $\sqrt{\lambda^8}$ is the decay factor applied to a sample subtree with 8 nodes.
Given the properties of the node vectors drawn from $N(0, \frac{1}{\sqrt{d}}I)$ and the properties of the shuffled circular convolution $\otimes$, it is possible to empirically demonstrate that $\Gamma(t_i)^T\Gamma(t_i) \approx 1$ and $\Gamma(t_i)^T\Gamma(t_j) \approx 0$ for $i \neq j$ (Plate, 1995; Zanzotto and Dell'Arciprete, 2012). Hence, this property can be used to interpret the decisions of the neural network.
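To make the encoder concrete, the following is a simplified sketch under one possible reading of the definitions above. It reuses the illustrative `Tree`, `node_vector()` and `shuffled_circular_convolution()` helpers from the previous snippets, and, unlike the actual KERMIT encoder, it composes each complete subtree into a single code with a plain recursion rather than covering the full Collins and Duffy (2002) subtree space with a dynamic program.

```python
# A hedged, simplified sketch of D(T) = Σ_{t ∈ N(T)} Υ(t); helper names are
# ours, and the recursion encodes each complete subtree as one code.
from functools import reduce
import numpy as np

LAMBDA = 0.4          # decay factor penalizing large subtrees
label_vectors = {}    # lazily built label embedding (the columns of W_o)

def vector_of(label: str) -> np.ndarray:
    if label not in label_vectors:
        label_vectors[label] = node_vector()   # from the snippet above
    return label_vectors[label]

def upsilon(t: Tree) -> np.ndarray:
    # Every node carries a factor sqrt(LAMBDA); since the convolution is
    # bilinear, a subtree with k nodes is penalized by sqrt(LAMBDA**k).
    if not t.children:
        return np.sqrt(LAMBDA) * vector_of(t.label)
    child_code = reduce(shuffled_circular_convolution,
                        (upsilon(c) for c in t.children))
    return np.sqrt(LAMBDA) * shuffled_circular_convolution(
        vector_of(t.label), child_code)

def complete_subtrees(t: Tree):
    # N(T): one complete subtree rooted at every node of T
    yield t
    for c in t.children:
        yield from complete_subtrees(c)

def encode(T: Tree) -> np.ndarray:
    # D(T): the sum of the codes of all complete subtrees
    return sum(upsilon(t) for t in complete_subtrees(T))
```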

Visualizing Neural Network Activation on Syntactic Trees
The definitions of the KERMIT encoder make it possible to devise KERMITviz, which offers prediction interpretability (Jacovi et al., 2018) in the context of text classification. We propose a clear causal relation for explaining (Lipton, 2016) classification decisions where syntax is important, by defining heat parse trees and calculating the relevance of single subtrees with layer-wise relevance propagation (LRP) (Bach et al., 2015). LRP has already been used to explain decisions in natural language processing tasks (Croce et al., 2019b,a).
Heat parse trees (HPTs), similarly to "heat trees" in biology (Foster et al., 2017), are heatmaps over parse trees (see the colored tree in Fig. 1). The underlying representation is an active tree $\hat{t}$, that is, a tree where each node $t = (r, v_r, [t_1, \ldots, t_k])$ has an associated activation value $v_r \in \mathbb{R}$. HPTs are graphical visualizations of active trees $\hat{t}$ where the colors and sizes of nodes $r$ depend on their activation values $v_r$. In this way, HPTs highlight the parts of parse trees that are relevant in final decisions.
To draw HPTs, we compute the activation value $v_r$ of each node $r$ in the active tree $\hat{t}$ by using Layer-wise Relevance Propagation (LRP) (Bach et al., 2015) and the property in Eq. 3 of the KERMIT encoder $D$. LRP is a framework that explains the decisions of a generic neural network using local redistribution rules that propagate decisions back to the activation values of the initial features. In our case, it is used as a sort of inverted function of the multi-layer perceptron in Eq. 2, producing a relevance vector $\mathbf{y}_{LRP}$ over the KERMIT embedding $\mathbf{y}$. The property in Eq. 3 then enables the activation of each subtree $t^{(i)} \in S(T)$ to be computed back by transposing the matrix $W^{\Upsilon}$, that is:

$$\mathbf{x}_{LRP} = (W^{\Upsilon})^T \mathbf{y}_{LRP}$$

To make the computation feasible, $(W^{\Upsilon})^T$ is produced on-the-fly for each tree $T$. Finally, the activation values $v_r$ of nodes $r \in T$ are computed by summing up the values $x^{(i)}_{LRP}$ for all subtrees $t^{(i)}$ with $r \in t^{(i)}$.
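Under the same simplifications as the encoder sketch above, the back-projection can be sketched as follows. Here `y_lrp` stands for the relevance vector LRP assigns to the KERMIT embedding, and the helpers are our illustrative ones, not the actual KERMITviz implementation.

```python
# A hedged sketch of reading activations back through (W^Υ)ᵀ: since the
# subtree codes are nearly orthonormal, Γ(t)ᵀ y_lrp approximates the
# relevance x_lrp of subtree t; node activations v_r then sum over the
# subtrees containing r. Helpers come from the encoder sketch above.
import numpy as np

def node_activations(T: Tree, y_lrp: np.ndarray) -> dict:
    acts = {}
    for t in complete_subtrees(T):          # simplification: complete subtrees only
        relevance = float(upsilon(t) @ y_lrp)
        for node in complete_subtrees(t):   # every node r inside subtree t
            acts[id(node)] = acts.get(id(node), 0.0) + relevance
    return acts
```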

Experiments
We aim to investigate whether KERMIT can be used to create neural network architectures where universal syntactic interpretations are useful: (1) to improve state-of-the-art universal sentence embeddings, especially in computationally light environments, and (2) to syntactically explain decisions.
The rest of the section describes the experimental set-up, the quantitative experimental results of KERMIT and discusses how KERMITviz can be used to explain inferences made by neural networks over examples.

Experimental Set-up
This section describes the general experimental set-up, the specific configurations adopted in the completely universal and task-adapted settings, the computational architecture used, and the datasets.
The general experimental settings are described hereafter. Firstly, the KERMIT encoder at the core of our method has been tested on a distributed representation space $\mathbb{R}^d$ with $d = 4000$ and with the penalizing factor set to $\lambda = 0.4$, a common value in previous work (Moschitti, 2006). Secondly, constituency parse trees for KERMIT have been obtained by using Stanford CoreNLP's probabilistic context-free grammar parser (Manning et al., 2014). Thirdly, the following transformer sub-networks have been used: (1) BERT$_{BASE}$, used in the uncased setting with the pre-trained English model; (2) BERT$_{LARGE}$, used with the same settings as BERT$_{BASE}$; and (3) XLNet base cased. All the models were implemented using Huggingface's transformers library (Wolf et al., 2019). The input text for BERT and XLNet has been preprocessed and tokenized as specified in Devlin et al. (2018) and Yang et al. (2019), respectively. Fourthly, as the experiments are text classification tasks, the decoder layer of our KERMIT+Transformer architecture is a fully connected layer with the softmax activation function, applied to the concatenation of the KERMIT output and the final [CLS] token representation of the selected transformer model. Finally, the optimizer used to train the whole architecture is AdamW (Loshchilov and Hutter, 2019) with the learning rate set to $3 \times 10^{-5}$.
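As a concrete illustration of this decoder, the following is a minimal PyTorch sketch under our own naming; the [CLS] dimension (768 for BERT$_{BASE}$) and the class count are example values, not prescriptions from the paper.

```python
# A hedged sketch of the KERMIT+Transformer decoder: one fully connected
# layer with softmax over [KERMIT output ; final [CLS] representation].
import torch
import torch.nn as nn

class KermitTransformerHead(nn.Module):
    def __init__(self, kermit_dim: int = 4000, cls_dim: int = 768,
                 num_classes: int = 4):
        super().__init__()
        self.classifier = nn.Linear(kermit_dim + cls_dim, num_classes)

    def forward(self, kermit_vec: torch.Tensor,
                cls_vec: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([kermit_vec, cls_vec], dim=-1)
        return torch.softmax(self.classifier(joint), dim=-1)

# Training would then use AdamW as described above, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```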
In the completely universal setting, KERMIT consists only of the first lightweight encoder layer (the grey layer in Figure 1), denoted KERMIT$_{ENC}$. In this setting, we used BERT$_{BASE}$ and XLNet. To study universality, the transformers' weights are kept fixed in order to avoid the representations drifting toward the data distribution of the task. Moreover, we also experimented with BERT$_{BASE}$-Reverse and BERT$_{BASE}$-Random to understand whether syntactic or structural information is important for the specific task: BERT$_{BASE}$-Reverse is BERT$_{BASE}$ with reversed text as input, and BERT$_{BASE}$-Random is BERT$_{BASE}$ with randomly shuffled text as input. Comparing BERT$_{BASE}$ with BERT$_{BASE}$-Reverse and BERT$_{BASE}$-Random is in itself an extremely important test, as it also offers a way to determine whether syntactic information is useful for a specific task. The KERMIT+Transformer model is trained with a batch size of 125 for 50 epochs. In addition, each experiment has been repeated 5 times with 5 different fixed seeds to assess the statistical significance of the experimental results. This setting is designed to assess whether universal syntactic interpretations add different information with respect to universal sentence embeddings and whether they are a viable solution to increase the performance of neural networks on light computational systems.
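The Reverse and Random control inputs can be reproduced with a trivial sketch like the following; this is our reconstruction, as the paper does not spell out the exact perturbation code.

```python
# A minimal sketch of the BERT-Reverse / BERT-Random control inputs:
# the same words, reversed or shuffled before tokenization.
import random

def reverse_words(text: str) -> str:
    return " ".join(reversed(text.split()))

def shuffle_words(text: str, seed: int = 0) -> str:
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)
```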
In the task-adapted setting, we used two different BERT architectures, BERT$_{BASE}$ and BERT$_{LARGE}$, and we trained different numbers of layers of these architectures. In this way, BERT may adapt the universal sentence embeddings to include task-specific information, that is, the specific lexicon that may drive the syntactic analysis. For the KERMIT side of the architecture, we used two different multi-layer perceptrons: (1) a funnel MLP with two linear layers that brings the 4,000 units of the KERMIT encoder down to 200 units through an intermediate layer of 300 units; and (2) a diamond MLP with four linear layers forming a diamond shape: 4,000 units, 5,000 units, 8,000 units, 5,000 units and, finally, 4,000 units. Both variants have ReLU (Agarap, 2018) activation functions and dropout (Srivastava et al., 2014) set to 0.25 for each layer; a sketch of both heads is given below. Due to the computational demand of these architectures and experiments, we used the heavy system and we trained the overall model in two settings: a one-epoch training session and a normal training session. In the one-epoch training session, we trained the architecture for 1 epoch (Komatsuzaki, 2019) to avoid overfitting and to guarantee a relatively light computational burden. In the normal training session, we trained the architecture for 5 epochs. The batch size for these two settings was 32.
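A minimal PyTorch sketch of the two heads follows, with layer sizes taken from the text; everything else (names, placement of dropout) is our assumption.

```python
# Hedged sketches of the funnel and diamond KERMIT heads described above.
import torch.nn as nn

def funnel_head(p: float = 0.25) -> nn.Sequential:
    # 4000 -> 300 -> 200
    return nn.Sequential(
        nn.Linear(4000, 300), nn.ReLU(), nn.Dropout(p),
        nn.Linear(300, 200), nn.ReLU(), nn.Dropout(p))

def diamond_head(p: float = 0.25) -> nn.Sequential:
    # 4000 -> 5000 -> 8000 -> 5000 -> 4000
    return nn.Sequential(
        nn.Linear(4000, 5000), nn.ReLU(), nn.Dropout(p),
        nn.Linear(5000, 8000), nn.ReLU(), nn.Dropout(p),
        nn.Linear(8000, 5000), nn.ReLU(), nn.Dropout(p),
        nn.Linear(5000, 4000), nn.ReLU(), nn.Dropout(p))
```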
We experimented with two hardware systems: a light system and a heavy system. The light system is an affordable older desktop consisting of a 4-core Intel Xeon E3-1230 CPU with 62 GB of RAM and 1 Nvidia 1070 GPU with 8 GB of onboard memory. The heavy system is a more expensive, dedicated server consisting of a 32-core IBM PowerPC CPU with 256 GB of RAM and 2 Nvidia V100 GPUs with 32 GB of onboard memory each.
To verify our model, we experimented with four classification tasks (Zhang et al., 2015) which should be sensitive to syntactic information: (1) AGNews, a news classification task with 4 target classes; (2) DBPedia, a classification task over Wikipedia with 14 classes; (3) Yelp Polarity, a binary sentiment classification task on Yelp reviews; and (4) Yelp Review, a sentiment classification task with 5 classes. Given the computational constraints of the light system setting, we created a smaller version of the original training datasets by randomly sampling 11% of the examples while keeping the datasets as balanced as the original versions (a sketch of this subsampling is given below).
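The balanced subsampling can be sketched as follows; the per-class grouping and the seed are our assumptions.

```python
# A minimal sketch of balanced 11% subsampling of a labeled dataset.
import random
from collections import defaultdict

def subsample_balanced(examples, labels, fraction=0.11, seed=0):
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    rng, sample = random.Random(seed), []
    for y, xs in by_class.items():
        k = int(len(xs) * fraction)        # same fraction per class
        sample.extend((x, y) for x in rng.sample(xs, k))
    rng.shuffle(sample)
    return sample
```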
For reproducibility, the source code of our experiments is publicly available.

Results and Discussion
Results from the completely universal experimental setting suggest that universal syntactic interpretations complement the syntax in universal sentence embeddings. This conclusion is derived from the following observations on Table 1, which reports the accuracy of the different models on the different datasets. All these experiments were carried out on the light system.
Firstly, syntactic or structural information seems to be relevant in three out of four tasks. Syntactic information in AGNews seems to be irrelevant, as there is only a small difference between the results of BERT$_{BASE}$ (82.88 ± 0.09) and BERT$_{BASE}$-Reverse (79.72 ± 0.11). This last observation is a very important indication and, together with the other observations, confirms that universal sentence embeddings encode different syntactic information with respect to that defined in universal syntactic interpretations. Moreover, our KERMIT encoder allows neural networks to positively use universal syntactic interpretations. Hence, using universal syntactic interpretations is a viable solution also when only light computational systems are available.
Experiments in the task-adapted setting: (1) show that universal syntactic interpretations are still useful even when universal sentence embeddings are adapted to the specific task; and (2) confirm the conclusions of Jawahar et al. (2019) that universal sentence embeddings better capture syntactic phenomena when the middle layers of BERT are trained on the task. The results of these experiments are plotted in Figure 2, where system accuracy is plotted against the number of BERT's trained layers, starting from the output layer. In fact, it seems that different BERT layers encode different information (Jawahar et al., 2019). Hence, training different layers in a specific setting means adapting that kind of information. We experimented with two sub-settings: (1) a computationally lighter setting where training is done for only 1 epoch; and (2) a more expensive setting where training is done for 5 epochs. Our results in the task-adapted setting confirm that BERT adapts universal sentence embeddings to include a better syntactic model when its weights in different layers are trained on the specific corpus. Moreover, as shown in Jawahar et al. (2019), the middle layers better cover syntactic phenomena. In fact, when BERT is trained up to the 8th layer, its accuracy comes closer to that of the best model including universal syntactic interpretations (see Figure 2). This suggests that more syntax is encoded in BERT.
All these experiments were also performed using BERT$_{LARGE}$ in place of BERT$_{BASE}$, but in all cases the results were worse than with the base version and are therefore not reported in the paper.
When syntax matters, that is, in Yelp Review and in Yelp Polarity, KERMIT is able to exploit universal syntactic interpretations to compensate for syntactic information missing from the task-adapted sentence embeddings of a trained BERT. In fact, KERMIT+BERT outperforms a trained BERT$_{BASE}$ in both the 1-epoch and 5-epoch settings for any number of trained layers (see Figure 2). In the 1-epoch setting, KERMIT+BERT$_{BASE}$ outperforms BERT$_{BASE}$ and all the other configurations. In the 5-epoch setting, KERMIT$_{ENC}$+BERT$_{BASE}$ is the best model. Moreover, KERMIT-based models behave better with less training: KERMIT-based models trained in the 1-epoch setting outperform models trained in the 5-epoch setting. Plots in Figure 2 report the best 1-epoch model alongside the plots of the 5-epoch models. This can be linked to the fact that KERMIT with more parameters overfits during training. In fact, KERMIT$_{ENC}$+BERT$_{BASE}$ outperforms the funnel and diamond KERMIT-based systems, and KERMIT$_{ENC}$ has fewer parameters than both.
Finally, we explored the interpretative power of KERMITviz by comparing it with the transformer visualizer BERTviz (Vig, 2019). We focused on two examples from Yelp Review where the coordinating conjunction but plays an important role (see Fig. 3): (1) "Unique food, great atmosphere, pricey but worth a trip for special occasions."; (2) "The boba drink was terrible, but the shaved ice was good.". The two sentences have ratings of 4 and 3, respectively. In fact, the but in the first sentence introduces a coordinated clause that does not change the rating. On the contrary, the but in the second sentence introduces a coordinated clause, but the shaved ice was good, that radically changes the polarity. In the case of BERTviz, this causal relationship is extremely difficult to grasp from the visual representation. In fact, BERTviz is a good visualization mechanism for seeing how models assign weights to different input elements (Bahdanau et al., 2015; Belinkov and Glass, 2019), but it is extremely obscure in explaining causal relations in classification predictions (Wiegreffe and Pinter, 2019). Instead, KERMITviz with its heat parse trees shows exactly that the but and its related syntactic structure are irrelevant in the first sentence and extremely relevant in the second. Hence, our heat parse trees can be useful to draw the causal relation between the decision and the information used.

Conclusions
Universal syntactic interpretations are valuable language interpretations, which have been developed over years of study. In this paper, we introduced KERMIT to show that these interpretations can be effectively used in combination with universal sentence embeddings produced from scratch. Moreover, KERMITviz allows us to explain how syntactic information is used in classification decisions within networks combining KERMIT, on the one side, and BERT or XLNet, on the other. We also showed that KERMIT can be easily used in situations where training large transformers is extremely difficult.
As KERMIT has a clear description of the syntactic subtrees it uses and makes it possible to visualize how syntactic information is exploited during inference, it opens the possibility of devising models that include explicit syntactic inference rules in the training process.
Finally, KERMIT is in line with the research direction of Human-in-the-Loop Artificial Intelligence (Zanzotto, 2019), since it gives the opportunity to track how human knowledge is used by learning algorithms.