Enhancing Generalization in Natural Language Inference by Syntax

Pre-trained language models such as BERT have achieved state-of-the-art performance on natural language inference (NLI). However, it has been shown that such models can be tricked by variations of surface patterns such as syntax. We investigate the use of dependency trees to enhance the generalization of BERT on the NLI task, leveraging a graph convolutional network to represent a syntax-based matching graph with heterogeneous matching patterns. Experimental results show that our syntax-based method substantially enhances the generalization of BERT on a test set where the sentence pair has high lexical overlap but diverse syntactic structures, and does not degrade performance on the standard test set. In other words, the proposed method makes BERT more robust to syntactic changes.


Introduction
The task of natural language inference (NLI) is to determine whether one sentence entails another (Condoravdi et al., 2003). Recently, large-scale pre-trained contextualized embeddings such as BERT (Devlin et al., 2018) and XLNet have achieved state-of-the-art accuracy on this task. It has been shown that pre-trained models help to better capture heuristic patterns in a set of training data and therefore enhance in-domain performance (Wang et al., 2018). However, such models still generalize poorly to examples drawn from a different distribution. In particular, it has been shown that seemingly simple types of examples in a carefully designed evaluation set (i.e., HANS) can lead to significant degradation and large variability in performance (McCoy et al., 2019b,a). Table 1 shows a set of test cases from HANS, where the premise and hypothesis have high lexical overlap but different syntactic structures. The BERT model gives incorrect results in most cases. This issue can negatively affect NLI applications such as dialogue (Dziri et al., 2019; Welleck et al., 2019).
* Equal contribution. Work done while at Westlake University. † Corresponding author.
It has been shown that syntactic structures are useful for cross-domain generalization of NLP models (Wang et al., 2017; Strubell and McCallum, 2018). Intuitively, a more robust NLI model can be obtained by making use of structural information. We empirically investigate the effectiveness of syntactic features for enhancing the generalization of BERT-based matching models. In particular, given a pair of sentences, the dependency syntax of each sentence is obtained using a neural parser (Qi et al., 2018). The parse trees are then extended using four types of edge patterns, including a soft co-attention matching pattern that links the sentence pair into an integrated graph. A graph convolutional network (GCN) (Kipf and Welling, 2016) is used to represent the whole matching graph structure.
Experiments show that the proposed model performs much better than BERT and other syntax-based baselines on the HANS category where the premise and the non-entailment hypothesis have high lexical overlap but different syntactic structures, when all models are trained on the MNLI dataset. This indicates that incorporating syntax with the proposed method enhances the generalization of the BERT model to syntactic changes ‡.

Related Work
There has been much work based on deep neural networks for the NLI task. One straightforward solution is to independently encode the premise and the hypothesis into embedding vectors, which are fed to a multi-layer neural network for classification (Bowman et al., 2015). It has been shown that alignment between local words in the premise and hypothesis benefits the aggregation of information (Chen et al., 2016), and that encoding the sentence pair simultaneously can capture more interaction and thus further improve performance (Devlin et al., 2018). We thus adopt this model as our baseline.
Syntax has been shown to be beneficial for semantic tasks such as NLI (Bowman et al., 2016; Pang et al., 2019; Lei et al., 2019). Tree-based SPINN methods encode sentences by combining constituency phrases (Bowman et al., 2016). Recently, Pang et al. (2019) proposed to enhance the token representation by using contextual vector representations from a pretrained parser. The GCN method has also been used to represent syntax for sentence matching (Lei et al., 2019), where the syntax of each sentence is encoded separately. In this paper, we use a GCN to encode a whole matching graph with syntactic information, showing that integrating syntax with our method benefits the generalization of BERT-based methods.
‡ Our code will be available at: https://github.com/heqi2015/CA_GCN
Table 1 caption: In each example, the words in the hypothesis are drawn from the premise but do not form a subsentence of the premise, and thus the syntactic structures of the hypothesis and the premise are quite different. Both the BERT-CLS baseline and our GCN-based BERT model with co-attention links (BERT+CAGCN) are fine-tuned on the MNLI dataset (Williams et al., 2018), and the neutral and contradiction labels are translated into non-entailment for evaluation (McCoy et al., 2019b). E stands for entailment, and N stands for non-entailment.

Method
The overall architecture of the proposed method is shown at the top of Figure 1. At the bottom layer, contextualized representations of the two sentences are obtained using BiLSTM, ELMo or BERT. These representations are then fed into the GCN to initialize the node representations of its first layer.

GCN
The graph structure of each GCN layer is depicted at the bottom of Figure 1. Each node in the graph represents one word in the sentence pair. We define four types of directed edges in the graph, as described in Equation 1, where $E$ denotes the set of syntactic dependency arcs inside sentences, and $S(w_i)$ indicates which sentence the word $w_i$ belongs to. The first two edge types allow information to flow along and against syntactic arcs. Third, a self-loop edge is added to better preserve the information of each word across message-passing iterations (Kipf and Welling, 2016).
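The four edge types described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `build_edges`, the head-index input format (`-1` marks the root), and the `(source, target)` convention with "along" meaning head to dependent are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source node, target node)


def build_edges(heads_a: List[int], heads_b: List[int]) -> Dict[str, List[Edge]]:
    """Build the four edge sets of a matching graph (hypothetical helper).

    heads_x[i] is the index of the syntactic head of word i inside its own
    sentence, or -1 for the root. Sentence A occupies nodes 0..len(A)-1 and
    sentence B is offset by len(A).
    """
    n_a, n_b = len(heads_a), len(heads_b)
    edges: Dict[str, List[Edge]] = {
        "along": [], "reverse": [], "self": [], "co_attention": []
    }
    for offset, heads in ((0, heads_a), (n_a, heads_b)):
        for dep, head in enumerate(heads):
            if head < 0:
                continue  # the root has no incoming dependency arc
            edges["along"].append((head + offset, dep + offset))
            edges["reverse"].append((dep + offset, head + offset))
    # Self-loops keep each word's own information in every layer.
    edges["self"] = [(i, i) for i in range(n_a + n_b)]
    # Cross-sentence links: every word in A with every word in B, both ways.
    for i in range(n_a):
        for j in range(n_a, n_a + n_b):
            edges["co_attention"].append((i, j))
            edges["co_attention"].append((j, i))
    return edges


# "The doctor slept" vs. "doctor slept" (head indices are illustrative).
graph = build_edges([1, 2, -1], [1, -1])
print(graph["along"])
```

Dependency arcs stay sparse, while the co-attention links form a dense bipartite block between the two sentences; in the model these dense links are weighted softly rather than treated as hard edges.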
The last type of relation aims to enforce alignment of words between sentences. The similarity between each word $w_i$ in sentence A and each word $w_j$ in sentence B at the $k$th layer is calculated by a co-attention operation, yielding a score $C^{(k)}_{i,j}$ computed from the feature vectors $h$ of the two words, the affinity weight $W_{co}$, and the sigmoid function $\sigma$. The feature of node $i$ is then updated at the $k$th layer by aggregating messages over its neighbor set $N(\cdot)$ and applying a ReLU activation function, where each message is scaled by a gate function $g^{(k)}_{i,j}$ that is described below.
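The displayed equations for this step are not reproduced in this text. As a hedged sketch only, one form consistent with the symbols named above and with the gated GCN of Marcheggiani and Titov (2017) is given below. The placement of the transpose, the per-edge-type parameters $W^{(k)}_{T(i,j)}$ and $b^{(k)}_{T(i,j)}$ (with $T(i,j)$ the edge type from Equation 1), and the layer superscript on $W_{co}$ are assumptions. How $C^{(k)}_{i,j}$ enters the update along co-attention edges is not recoverable here and is left out.

```latex
% Assumed form of the co-attention similarity between
% word i (sentence A) and word j (sentence B) at layer k
C^{(k)}_{i,j} = \sigma\!\left( {h^{(k)}_{i}}^{\top} \, W^{(k)}_{co} \, h^{(k)}_{j} \right)

% Assumed form of the gated neighborhood aggregation
h^{(k+1)}_{i} = \mathrm{ReLU}\!\left( \sum_{j \in N(i)} g^{(k)}_{i,j}
    \left( W^{(k)}_{T(i,j)} \, h^{(k)}_{j} + b^{(k)}_{T(i,j)} \right) \right)
```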
Note that we only take unlabelled dependencies into account to avoid over-parameterization (Marcheggiani and Titov, 2017), as shown in Equation 1. By bringing in sparse and unlabelled dependency relations, the embedding of each word is influenced by the words most closely related to it semantically or syntactically, which leads to a potentially more robust word representation. We apply a gate $g^{(k)}_{i,j}$ to each edge to weigh the importance of the information exchanged along it (Marcheggiani and Titov, 2017), as defined in Equation 2. In addition, highway units are adopted in each layer to preserve information across multiple stacked GCN layers (Srivastava et al., 2015).
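A gated message-passing step followed by a highway unit can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the scalar gate form sigmoid(w · h_j + b) follows Marcheggiani and Titov (2017), a single shared weight matrix stands in for the per-edge-type parameters, and the function names `gated_gcn_layer` and `highway` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(W: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in W]


def gated_gcn_layer(h: List[Vector], edges: List[Tuple[int, int]],
                    W: Matrix, w_gate: Vector, b_gate: float) -> List[Vector]:
    """One message-passing step over (source, target) edges.

    Assumption: one shared W for all edge types, to keep the sketch short.
    """
    dim = len(h[0])
    agg = [[0.0] * dim for _ in h]
    for src, tgt in edges:
        gate = sigmoid(dot(w_gate, h[src]) + b_gate)  # scalar edge gate
        msg = matvec(W, h[src])
        for t in range(dim):
            agg[tgt][t] += gate * msg[t]
    return [[max(0.0, x) for x in row] for row in agg]  # ReLU


def highway(h_in: List[Vector], h_out: List[Vector],
            W_t: Matrix, b_t: Vector) -> List[Vector]:
    """Highway unit: t * layer_output + (1 - t) * layer_input, per dimension."""
    mixed = []
    for x, y in zip(h_in, h_out):
        t = [sigmoid(z + b) for z, b in zip(matvec(W_t, x), b_t)]
        mixed.append([ti * yi + (1.0 - ti) * xi for ti, yi, xi in zip(t, y, x)])
    return mixed
```

With all gate parameters at zero every gate equals 0.5, so each message is halved; a learned gate can instead push an unreliable edge, such as a parser error, toward zero.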

Co-Attention Layer
We denote the word representations of sentence A and sentence B in the GCN output as $H_A$ and $H_B$, respectively. An affinity matrix between $H_A$ and $H_B$ is calculated and then used to compute the co-attention maps between the sentence pair (Lu et al., 2016), where $W_A$, $W_B$, $w_A$, $w_B$ are weight parameters, and each element of $a_A$ and $a_B$ is the attention probability of a word in sentence A and sentence B, respectively. Finally, the vector representation of each sentence is calculated as the attention-weighted sum $\sum_i a_i H_i$, where $a_i$ denotes the $i$th element of $a$, and $H_i$ the $i$th column of $H$.
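The co-attention pooling step can be sketched by following the parallel co-attention of Lu et al. (2016), which this section cites. Since the exact equations are not reproduced in this text, the formulas in the comments are assumptions taken from that formulation, and the affinity weight `W_c` and the function name `co_attention_pool` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def tanh_m(m: Matrix) -> Matrix:
    return [[math.tanh(x) for x in row] for row in m]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention_pool(H_a: Matrix, H_b: Matrix, W_c: Matrix, W_a: Matrix,
                      W_b: Matrix, w_a: Vector, w_b: Vector
                      ) -> Tuple[Vector, Vector, Vector, Vector]:
    """H_a is d x n_a and H_b is d x n_b; each column is one word."""
    # Assumed affinity matrix: C = tanh(H_a^T W_c H_b), shape n_a x n_b.
    C = tanh_m(matmul(matmul(transpose(H_a), W_c), H_b))
    P_a, P_b = matmul(W_a, H_a), matmul(W_b, H_b)
    # Each sentence's attention features also see the other sentence via C.
    F_a = tanh_m(add(P_a, matmul(P_b, transpose(C))))
    F_b = tanh_m(add(P_b, matmul(P_a, C)))
    a_a = softmax(matmul([w_a], F_a)[0])  # attention over words of A
    a_b = softmax(matmul([w_b], F_b)[0])  # attention over words of B
    # Sentence vector: sum_i a_i * H_i over the columns H_i.
    v_a = [sum(a * x for a, x in zip(a_a, row)) for row in H_a]
    v_b = [sum(a * x for a, x in zip(a_b, row)) for row in H_b]
    return a_a, a_b, v_a, v_b
```

When `w_a` and `w_b` are zero vectors the attention is uniform and each sentence vector reduces to mean pooling, which makes a convenient sanity check.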

Output Classifier
Given the vector representations of the sentence pair, we obtain an overall representation by concatenating them with their element-wise difference and element-wise product, which is fed to a linear layer with softmax activation to obtain the final classification output. The final model is trained using a cross-entropy loss.
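The feature construction and classifier described above can be sketched in a few lines. The function names and the order of the concatenated parts are assumptions made for illustration; the weights would be learned during training.

```python
import math
from typing import List

Vector = List[float]


def match_features(v_a: Vector, v_b: Vector) -> Vector:
    """Concatenate both sentence vectors with their difference and product."""
    diff = [a - b for a, b in zip(v_a, v_b)]
    prod = [a * b for a, b in zip(v_a, v_b)]
    return v_a + v_b + diff + prod


def classify(features: Vector, W: List[Vector], b: Vector) -> Vector:
    """Linear layer followed by softmax; returns one probability per class."""
    logits = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Vector, gold: int) -> float:
    """Negative log-likelihood of the gold class."""
    return -math.log(probs[gold])


feats = match_features([1.0, 2.0], [3.0, 4.0])
probs = classify(feats, [[0.0] * len(feats) for _ in range(3)], [0.0] * 3)
loss = cross_entropy(probs, gold=0)
```

For sentence vectors of dimension d the classifier input has dimension 4d; the difference and product terms expose direct comparisons that a single linear layer could not form from the plain concatenation alone.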

Experiments
Models for Comparison. We consider three variants of the proposed model based on BERT, which link words between the premise and the hypothesis at the GCN layer in different ways: co-attention links as described in Section 3 (BERT+CAGCN), links between words that share the same lemma (BERT+SWGCN), and no links (BERT+SGCN) as in Lei et al. (2019). We also tried combining the outputs of BERT and the GCN, which resulted in little performance improvement. The baseline models include "BERT-CLS", which adds a classifier on top of the vector representation of the [CLS] token in the BERT model (Devlin et al., 2018), "BERT-Attn", which feeds the word outputs of BERT sequentially to the co-attention layer and the classifier, "BERT+LF", which adds syntactic features to the input of the classifier layer (Pang et al., 2019), and "SPINN", which encodes sentences with a parse tree (Bowman et al., 2016).

Datasets and Settings. We train all the models on the MNLI training data (Williams et al., 2018), and evaluate them on MNLI and HANS * (McCoy et al., 2019b). Evaluation examples in MNLI are divided into two categories: in-domain matched (MNLI-m) and cross-domain mismatched (MNLI-mm). The evaluation set HANS is designed to diagnose whether an NLI model has learned specific invalid heuristics from the training data, in order to evaluate its generalization ability. The number of GCN layers is set to 3. The BERT components in all BERT-related models are initialized with the same pre-trained weights. The BERT-related baselines are optimized using the Adam optimizer (Loshchilov and Hutter, 2017). For the proposed models, we adopt two different Adam optimizers for BERT and for the other components of the model, respectively (Liu and Lapata, 2019).
* The labels neutral or contradiction are translated into non-entailment for evaluation on HANS.
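The label translation used for HANS evaluation can be sketched as below: a model trained on three-way MNLI labels is scored on the two-way HANS labels by collapsing neutral and contradiction into non-entailment. The function names and the label strings are assumptions made for this example.

```python
from typing import List


def to_hans_label(mnli_label: str) -> str:
    """Map a three-way MNLI prediction onto the two-way HANS label set."""
    return "entailment" if mnli_label == "entailment" else "non-entailment"


def hans_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Accuracy after collapsing the three-way predictions."""
    correct = sum(to_hans_label(p) == g for p, g in zip(predictions, gold))
    return correct / len(gold)


preds = ["entailment", "neutral", "contradiction", "entailment"]
gold = ["non-entailment", "non-entailment", "non-entailment", "entailment"]
print(hans_accuracy(preds, gold))
```

Collapsing the predicted label is one simple option; summing the predicted probabilities of neutral and contradiction before taking the argmax is an alternative that can give different decisions.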

Results
Intuitively, the benefit of syntax should be most apparent when the two sentences in a pair have high lexical overlap but differ syntactically, which leads to different meanings. In HANS, this category of examples is named non-entailment lexical overlap, where the words in the hypothesis are drawn from the premise but do not form a contiguous subsequence of the premise. The performance comparison on this category is shown in Table 2. It can be observed that our models outperform the baselines, including BERT, by a wide margin. This result indicates that incorporating syntax with a GCN is indeed beneficial for the generalization of BERT, especially for distinguishing different syntactic structures in the sentence pair. Comparing the results of the GCN-based methods shows that linking words between the sentence pair by co-attention leads to better performance. Some examples in this category are shown in Table 1.
The overall results on MNLI and HANS are shown in Table 3. It can be seen that incorporating syntax with a GCN improves the averaged precision over the six categories of HANS, and slightly improves the performance on the in-domain dataset MNLI.

Analysis
Despite the success on the first four subcategories in Table 2, the proposed GCN-based methods do not bring as much improvement over the baselines on the Passive subcategory, nor on the Non-entailment-subsequence and Non-entailment-constituent categories in HANS, as shown in Table 3. One common characteristic of these categories is that the syntactic structures and the relative positions of words in the premise and the hypothesis remain basically unchanged. One example of the Passive subcategory is shown in Figure 2. This may be the reason why both the BERT baselines and the syntax-based methods perform badly on this type of example.
Like HANS, the in-domain dataset MNLI also contains examples in which the sentence pair has high lexical overlap. However, most examples of this kind in MNLI have supporting labels rather than contradicting ones (McCoy et al., 2019b). Furthermore, more than half of the contradicting cases contain negation in the premise but not in the hypothesis, e.g. "I don't care." vs. "I care.". Note that this bias of MNLI is also one main motivation for HANS. Thus, a model that is both trained and evaluated on MNLI may simply exploit this trait, and syntactic features might be learned only as a secondary factor compared to content and negation words. This may be the main reason why the proposed syntactic model improves the performance only slightly on MNLI.

Conclusion
We have investigated the effectiveness of introducing syntax into the NLI task by adopting a GCN to enhance the text representation in existing models such as BERT. Results on HANS show that our method can improve the generalization of BERT, especially on examples where the sentence pair has high lexical overlap but different syntactic structures. This suggests that adding inductive biases such as dependency trees through a GCN can make sentence encoding more robust.