Compare, Compress and Propagate: Enhancing Neural Architectures with Alignment Factorization for Natural Language Inference

This paper presents a new deep learning architecture for Natural Language Inference (NLI). Firstly, we introduce a new architecture where alignment pairs are compared, compressed and then propagated to upper layers for enhanced representation learning. Secondly, we adopt factorization layers for efficient and expressive compression of alignment vectors into scalar features, which are then used to augment the base word representations. The design of our approach is aimed to be conceptually simple, compact and yet powerful. We conduct experiments on three popular benchmarks, SNLI, MultiNLI and SciTail, achieving competitive performance on all. A lightweight parameterization of our model also enjoys a $\approx 3$ times reduction in parameter size compared to the existing state-of-the-art models, e.g., ESIM and DIIN, while maintaining competitive performance. Additionally, visual analysis shows that our propagated features are highly interpretable.


Introduction
Natural Language Inference (NLI) is a pivotal and fundamental task in language understanding and artificial intelligence. More concretely, given a premise and hypothesis, NLI aims to detect whether the latter entails or contradicts the former. As such, NLI is also commonly known as Recognizing Textual Entailment (RTE). NLI is known to be a significantly challenging task for machines whose success often depends on a wide repertoire of reasoning techniques.
In recent years, we observe a steep improvement in NLI systems, largely contributed by the release of the largest publicly available corpus for NLI -the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) which comprises 570K hand labeled sentence pairs. This has improved the feasibility of training complex neural models, given the fact that neural models often require a relatively large amount of training data.
Highly competitive neural models for NLI are mostly based on soft-attention alignments, popularized by (Parikh et al., 2016;Rocktäschel et al., 2015). The key idea is to learn an alignment of sub-phrases in both sentences and learn to compare the relationship between them. Standard feed-forward neural networks are commonly used to model similarity between aligned (decomposed) sub-phrases and then aggregated into the final prediction layers.
Alignment between sentences has become a staple technique in NLI research and many recent state-of-the-art models such as the Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017b) also incorporate the alignment strategy. The difference here is that ESIM considers a nonparameterized comparison scheme, i.e., concatenating the subtraction and element-wise product of aligned sub-phrases, along with two original sub-phrases, into the final comparison vector. A bidirectional LSTM is then used to aggregate the compared alignment vectors.
This paper presents a new neural model for NLI. There are several new novel components in our work. Firstly, we propose a compare, compress and propagate (ComProp) architecture where compressed alignment features are propagated to upper layers (such as a RNN-based encoder) for enhancing representation learning. Secondly, in order to achieve an efficient propagation of alignment features, we propose alignment factorization layers to reduce each alignment vector to a single scalar valued feature. Each scalar valued feature is used to augment the base word representation, allowing the subsequent RNN encoder layers to benefit from not only global but also cross sentence information.
There are several major advantages to our proposed architecture. Firstly, our model is relatively compact, i.e., we compress alignment feature vectors and augment them to word representations instead. This is to avoid large alignment (or match) vectors being propagated across the network. As a result, our model is more parameter efficient compared to ESIM since the width of the middle layers of the network is now much smaller. To the best of our knowledge, this is the first work that explicitly employs such a paradigm.
Secondly, the explicit usage of compression enables improved interpretabilty since each alignment pair is compressed to a scalar and hence, can be easily visualised. Previous models such as ESIM use subtractive operations on alignment vectors, edging on the intuition that these vectors represent contradiction. Our model is capable of visually demonstrating this phenomena. As such, our design choice enables a new way of deriving insight from neural NLI models.
Thirdly, the alignment factorization layer is expressive and powerful, combining ideas from standard machine learning literature (Rendle, 2010) with modern neural NLI models. The factorization layer tries to decompose the alignment vector (constructed from the variations of a − b, a b and [a; b]), learning higher-order feature interactions between each compared alignment. In other words, it models the second-order (pairwise) interactions between each feature in every alignment vector using factorized parameters, allowing more expressive comparison to be made over traditional fully-connected layers (FC). Moreover, factorization-based models are also known to be able to model low-rank structure and reduce risks of overfitting. The effectiveness of the factorization alignment over alternative baselines such as feed-forward neural networks is confirmed by early experiments.
The major contributions of this work are summarized as follows: • We introduce a Compare, Compress and Propagate (ComProp) architecture for NLI.
The key idea is to use the myriad of generated comparison vectors for augmentation of the base word representation instead of simply aggregating them for prediction. Subsequently, a standard compositional encoder can then be used to learn representations from the augmented word representations. We show that we are able to derive meaningful insight from visualizing these augmented features.
• For the first time, we adopt expressive factorization layers to model the relationships between soft-aligned sub-phrases of sentence pairs. Empirical experiments confirm the effectiveness of this new layer over standard fully connected layers.
• Overall, we propose a new neural model -CAFE (ComProp Alignment-Factorized Encoders) for NLI. Our model achieves stateof-the-art performance on SNLI, MultiNLI and the new SciTail dataset, outperforming existing state-of-the-art models such as ESIM. Ablation studies confirm the effectiveness of each proposed component in our model.

Related Work
Natural language inference (or textual entailment recognition) is a long standing problem in NLP research, typically carried out on smaller datasets using traditional methods (Maccartney, 2009;Dagan et al., 2006;MacCartney and Manning, 2008;Iftene and Balahur-Dobrescu, 2007). The relatively recent creation of 570K human annotated sentence pairs (Bowman et al., 2015) have spurred on many recent works that use neural networks for NLI. Many advanced neural architectures have been proposed for the NLI task, with most exploiting some variants of neural attention which learn to pay attention to important segments in a sentence (Parikh et al., 2016;Chen et al., 2017b;Wang and Jiang, 2016b;Rocktäschel et al., 2015;Yu and Munkhdalai, 2017a).
Amongst the myriad of neural architectures proposed for NLI, the ESIM (Chen et al., 2017b) model is one of the best performing models. The ESIM, primarily motivated by soft subphrase alignment in (Parikh et al., 2016), learns alignments between BiLSTM encoded representations and aggregates them with another BiLSTM layer. The authors also propose the usage of subtractive composition, claiming that this helps model contradictions amongst alignments.
Compare-Aggregate models are also highly popular in NLI tasks. While this term was coined by (Wang and Jiang, 2016a), many prior NLI models follow this design (Wang et al., 2017;Parikh et al., 2016;Gong et al., 2017;Chen et al., 2017b).
The key idea is to aggregate matching features and pass them through a dense layer for prediction. (Wang et al., 2017) proposed BiMPM, which adopts multi-perspective cosine matching across sequence pairs. (Wang and Jiang, 2016a) proposed a one-way attention and convolutional aggregation layer. (Gong et al., 2017) learns representations with highway layers and adopts ResNet for learning features over an interaction matrix.
There are several other notable models for NLI. For instance, models that leverage directional selfattention (Shen et al., 2017) or Gumbel-Softmax (Choi et al., 2017). DGEM is a graph based attention model which was proposed together with a new entailment challenge dataset, SciTail (Khot et al., 2018). Pretraining have been known to also be highly useful in the NLI task. For instance, contextualized vectors learned from machine translation (McCann et al., 2017) (CoVe) or language modeling (Peters et al., 2018) (ELMo) have showned to be able to improve performance when integrated with existing NLI models.
Our work compares and compresses alignment pairs using factorization layers which leverages the rich history of standard machine learning literature. Our factorization layers incorporate highly expressive factorization machines (FMs) (Rendle, 2010) into neural NLI models. In standard machine learning tasks, FMs remain a very competitive choice for learning feature interactions (Xiao et al., 2017) for both standard classification and regression problems. Intuitively, FMs are adept at handling data sparsity (typically interactions) by using factorized parameters to approximate a feature matching matrix. This makes it suitable in our model architecture since feature interaction between subphrase alignment pairs is typically very sparse as well.
A recent work (Beutel et al., 2018) reports an interesting empirical study pertaining to the ability of standard FC layers and their ability to model 'cross features' (or multiplicative features). Their overall finding suggests that while standard ReLU FC layers are able to approximate 2-way or 3-way features, they are extremely inefficient in doing so (requiring either very wide or deep layers). This further motivates the usage of FMs in this work and is well aligned with our empirical results, i.e., strong competitive performance with reasonably small parameterization.

Our Proposed Model
In this section, we provide a layer-by-layer description of our model architecture. Our model accepts two sentences as an input, i.e., P (premise) and H (hypothesis). Figure 1 illustrates a highlevel overview of our proposed model architecture.

Input Encoding Layer
This layer aims to learn a k-dimensional representation for each word. Following (Gong et al., 2017), we learn feature-rich word representations by concatenating word embeddings, character embeddings and syntactic (part-of-speech tag) embeddings (provided in the datasets). Character representations are learned using a convolutional encoder with max pooling function and is commonly used in many relevant literature (Wang et al., 2017;Chen et al., 2017c).
Highway Encoder Subsequently, we pass each concatenated word vector into a two layer highway network (Srivastava et al., 2015) in order to learn a k-dimensional representation. Highway networks are gated projection layers which learn adaptively control how much information is being carried to the next layer. Our strategy is similar to (Parikh et al., 2016) which trains the projection layer in place of tuning the embedding matrix. The usage of highway layers over standard projection layers is empirically motivated. However, an intuition would be that the gates in this layer adapt to learn the relative importance of each word to the NLI task. Let H(.) and T (.) be single layered affine transforms with ReLU and sigmoid activation functions respectively. A single highway network layer is defined as: Notably, the dimensions of the affine transform might be different from the size of the input vector.
In this case, an additional nonlinear transform is used to project x to the same dimensionality. The output of this layer isP ∈ R k× P (premise) and H ∈ R k× H (hypothesis), with each word converted to a r-dimensional vector.

Soft-Attention Alignment Layer
This layer describes two soft-attention alignment techniques that are used in our model.

Inter-Attention Alignment Layer
This layer learns an alignment of sub-phrases betweenP and H. Let F (.) be a standard projection layer with ReLU activation function. The alignment matrix of two sequences is defined as follows: where E ∈ R p× h andp i ,h j are the i-th and j-th word in the premise and hypothesis respectively.
where β i is the sub-phrase inP that is softly aligned to h i . Intuitively, β i is a weighted sum across {p j } p j=1 , selecting the most relevant parts ofP to represent h i .

Intra-Attention Alignment Layer
This layer learns a self-alignment of sentences and is applied to bothP andH independently. For the sake of brevity, letS represent eitherP orH, the intraattention alignment is computed as: is a nonlinear projection layer with ReLU activation function. The intra-attention layer models similarity of each word with respect to the entire sentence, capturing long distance dependencies and 'global' context of the entire sentence.

Alignment Factorization Layer
This layer aims to learn a scalar valued feature for each comparison between aligned sub-phrases. Firstly, we introduce our factorization operation, which lives at the core of our neural model.

Factorization Operation Given an input vector
x, the factorization operation (Rendle, 2010) is defined as: where Z(x) is a scalar valued output, .; . is the dot product between two vectors and w 0 is the global bias. Factorization machines model lowrank structure within the matching vector producing a scalar feature. The parameters of this layer are w 0 ∈ R, w ∈ R r and v ∈ R r×k . The first term n i=1 w i x i is simply a linear term. The second term n i=1 n j=i+1 v i , v j x i x j captures all pairwise interactions in x (the input vector) using the factorization of matrix v.
Inter-Alignment Factorization This operation compares the alignment between inter-attention aligned representations, i.e., (β i , h i ) and (α j , p j ). Let (a, b) represent an alignment pair, we apply the following operations: where y c , y s , y m ∈ R, Z(.) is the factorization operation, [.; .] is the concatenation operator and is the element-wise multiplication. The intuition of modeling subtraction is targeted at capturing contradiction. However, instead of simply concatenating the extra comparison vectors, we compress them using the factorization operation. Finally, for each alignment pair, we obtain three scalar-valued features which map precisely to a word in the sequence.
Intra-Alignment Factorization Next, for each sequence, we also apply alignment factorization on the intra-aligned sentences. Let (s, s ) represent an intra-aligned pair from either the premise or hypothesis, we compute the following operations: where v c , v s , v m ∈ R and Z(.) is the factorization operation. Applying alignment factorization to intra-aligned representations produces another three scalar-valued features which are mapped to each word in the sequence. Note that each of the six factorization operations has its own parameters but shares them amongst all words in the sentences.

Propagation and Augmentation
Finally, the six factorized features are then aggregated 1 via concatenation to form a final feature vector that is propagated to upper representation learning layers via augmentation of the word rep-resentationP orH.
where s i is i-th word inP orH, and f i intra and f i inter are the intra-aligned [v c ; v s ; v m ] and interaligned [y c ; y s ; y m ] features for the i-th word in the sequence respectively. Intuitively, f i intra augments each word with global knowledge of the sentence and f i inter augments each word with crosssentence knowledge via inter-attention.

Sequential Encoder Layer
For each sentence, the augmented word representations u 1 , u 2 , . . . u are then passed into a sequential encoder layer. We adopt a standard vanilla LSTM encoder.
where represents the maximum length of the sequence. Notably, the parameters of the LSTM are siamese in nature, sharing weights between both premise and hypothesis. We do not use a bidirectional LSTM encoder, as we found that it did not lead to any improvements on the held-out set. A logical explanation would be because our word representations are already augmented with global (intra-attention) information. As such, modeling in the reverse direction is unnecessary, resulting in some computational savings.
Pooling Layer Next, to learn an overall representation of each sentence, we apply a pooling function across all hidden outputs of the sequential encoder. The pooling function is a concatenation of temporal max and average (avg) pooling.
where x is a final 2k-dimensional representation of the sentence (premise or hypothesis). We also experimented with sum and avg standalone poolings and found sum pooling to be relatively competitive.

Prediction Layer
Finally, given a fixed dimensional representation of the premise x p and hypothesis x h , we pass their concatenation into a two-layer h-dimensional highway network. Since the highway network has been already defined earlier, we omit the technical details here. The final prediction layer of our model is computed as follows: where H 1 (.), H 2 (.) are highway network layers with ReLU activation. The output is then passed into a final linear softmax layer.
where W F ∈ R h×3 and b F ∈ R 3 . The network is then trained using standard multi-class cross entropy loss with L2 regularization.

Experiments
In this section, we describe our experimental setup and report our experimental results.

Experimental Setup
To ascertain the effectiveness of our models, we use the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017) benchmarks which are standard and highly competitive benchmarks for the NLI task. We also include the newly released Sc-iTail dataset (Khot et al., 2018) which is a binary entailment classification task constructed from science questions. Notably, SciTail is known to be a difficult dataset for NLI, made evident by the low accuracy scores even though it is binary in nature.
SNLI The state-of-the-art competitors on this dataset are the BiMPM (Wang et al., 2017), ESIM (Chen et al., 2017b) and DIIN (Gong et al., 2017). We compare against competitors across three settings. The first setting disallows cross sentence at-tention. In the second setting, cross sentence is allowed. The last (third) setting is a comparison between model ensembles while the first two settings only comprise single models. Note that we consider the 1st setting to be relatively less important (since our focus is not on the encoder itself) but still report the results for completeness.

MultiNLI
We compare on two test sets (matched and mismatched) which represent indomain and out-domain performance. The main competitor on this dataset is the ESIM model, a powerful state-of-the-art SNLI baseline. We also compare with ESIM + Read (Weissenborn, 2017).
SciTail This dataset only has one official setting. We compare against the reported results of ESIM (Chen et al., 2017b) and DecompAtt (Parikh et al., 2016) in the original paper. We also compare with DGEM, the new model proposed in (Khot et al., 2018). Across all experiments and in the spirit of fair comparison, we only compare with works that (1) do not use extra training data and (2) do not use external resources (such as external knowledge bases, etc.). However, for the sake of completeness, we still report their scores 2 (McCann et al., 2017;Chen et al., 2017a;Peters et al., 2018).

Implementation Details
We implement our model in TensorFlow (Abadi et al., 2015) and train them on Nvidia P100 GPUs. We use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.0003. L2 regularization is set to 10 −6 . Dropout with a keep probability of 0.8 is applied after each fullyconnected, recurrent or highway layer. The batch size is tuned amongst {128, 256, 512}. The number of latent factors k for the factorization layer is tuned amongst {5, 10, 50, 100, 150}. The size of the hidden layers of the highway network layers are set to 300. All parameters are initialized with xavier initialization. Word embeddings are preloaded with 300d GloVe embeddings (Pennington et al., 2014) and fixed during training. Sequence lengths are padded to batch-wise maximum. The batch order is (randomly) sorted within buckets following (Parikh et al., 2016).

Experimental Results
Table 1 reports our results on the SNLI benchmark. On the cross sentence (single model setting), the performance of our proposed CAFE model is extremely competitive. We report the test accuracy of CAFE at different extents of parameterization, i.e., varying the size of the LSTM encoder, width of the pre-softmax hidden layers and final pooling layer. CAFE obtains 88.5% accuracy on the SNLI test set, an extremely competitive score on the extremely popular benchmark. Notably, competitive results can be also achieved with a much smaller parameterization. For example, CAFE also achieves 88.3% and 88.1% test accuracy with only 3.5M and 1.5M parameters 2 Additionally, we added ELMo (Peters et al., 2018) to our CAFE model at the embedding layer. We report CAFE + ELMo under external resource models. This was done post review after EMNLP. Due to resource constraints, we did not train CAFE + ELMo ensembles but a single run (and single model) of CAFE + ELMo already achieves 89.0 score on SNLI.
respectively. This outperforms the state-of-theart ESIM and DIIN models with only a fraction of the parameter cost. At 88.1%, our model has about three times less parameters than ESIM/DIIN (i.e., 1.4M versus 4.3M/4.4M). Moreover, our lightweight adaptation achieves 87.7% with only 750K parameters, which makes it extremely performant amongst models having the same amount of parameters such as the decomposable attention model (86.8%).
Finally, an ensemble of 5 CAFE models achieves 89.3% test accuracy, the best test scores on the SNLI benchmark to date 3 . Overall, we believe that the good performance of our CAFE can be attributed to (1) the effectiveness of the ComProp architecture (i.e., providing word representations with global and local knowledge for better representation learning) and (2) the expressiveness of alignment factorization layers that are used to decompose and compare word alignments. More details are given at the ablation study. Finally, we emphasize that CAFE is also relatively lightweight, efficient and fast to train given its performance. A single run on SNLI takes approximately 5 minutes per epoch with a batch size of 256. Overall, a single run takes ≈ 3 hours to get to convergence.   (Khot et al., 2018) and (Williams et al., 2017) respectively. Table 2 reports our results on the MultiNLI and SciTail datasets. On MultiNLI, CAFE significantly outperforms ESIM, a strong state-of-the-art model on both settings. We also outperform the ESIM + Read model (Weissenborn, 2017). An ensemble of CAFE models achieve competitive re-sult on the MultiNLI dataset. On SciTail, our proposed CAFE model achieves state-of-the-art performance. The performance gain over strong baselines such as DecompAtt and ESIM are ≈ 10% − 13% in terms of accuracy. CAFE also outperforms DGEM, which uses a graph-based attention for improved performance, by a significant margin of 5%. As such, empirical results demonstrate the effectiveness of our proposed CAFE model on the challenging SciTail dataset.   Table 3 reports ablation studies on the MultiNLI development sets. In (1), we replaced all FM functions with regular full-connected (FC) layers in order to observe the effect of FM versus FC. More specifically, we experimented with several FC configurations as follows: (a) 1-layer linear, (b) 1-layer ReLU (c) 2-layer ReLU. The 1-layer linear setting performs the best and is therefore reported in Table 3. Using ReLU seems to be worse than nonlinear FC layers. Overall, the best combination (option a) still experienced a decline in performance in both development sets.

Ablation Study
In (2-3), we explore the utility of using character and syntactic embeddings, which we found to have helped CAFE marginally. In (4), we remove the inter-attention alignment features, which naturally impact the model performance significantly. In (5-6), we explore the effectiveness of the highway layers (in prediction layers and encoding layers) by replacing them to FC layers. We observe that both highway layers have marginally helped the overall performance. Finally, in (7-9), we remove the alignment features based on their composition type. We observe that the Sub and Concat compositions were more important than the Mul composition. However, removing any of the three will result in some performance degradation. Finally, in (10), we replace the LSTM encoder with a BiLSTM, observing that adding bi-directionality did not improve performance for our model.

Linguistic Error Analysis
We perform a linguistic error analysis using the supplementary annotations provided by the MultiNLI dataset. We compare against the model outputs of the ESIM model across 13 categories of linguistic phenenoma (Williams et al., 2017). Ta  On the mismatched setting, CAFE outperforms ESIM in 12 out of 13 categories, losing only in one percentage point in Active/Passive category. On the matched setting, CAFE is outperformed by ESIM very marginally on coreference and paraphrase categories. Despite generally achieving much superior results, we noticed that CAFE performs poorly on conditionals 4 on the matched setting. Measuring the absolute ability of CAFE, we find that CAFE performs extremely well in handling linguistic patterns of paraphrase detection and active/passive. This is likely to be attributed by the alignment strategy that CAFE and ESIM both exploits.

Interpreting and Visualizing with CAFE
Finally, we also observed that the propagated features are highly interpretable, giving insights to the inner workings of the CAFE model. Figure 2 shows a visualization of the feature values from an example in the SNLI test set. The ground truth is contradiction. Based on the above example we make several observations. Firstly, inter mul features mostly capture identical words (or semantically similar words), i.e., inter mul features for 'river' spikes in both sentences. Secondly, inter sub spikes on conflicting words that might cause contradiction, e.g., 'sedan' and 'land rover' are not the same vehicle. Another interesting observation is that we notice the inter sub features for driven and stuck spiking. This also validates the observation of (Chen et al., 2017b), which shows what the sub vector in the ESIM model is looking out for contradictory information. However, our architecture allows the inspection of these vectors since they are compressed via factorization, leading to larger extents of explainability -a quality that neural models inherently lack. We also observed that intra-attention (e.g., intra cat) features seem to capture the more important words in the sentence ('river', 'sedan', 'land rover').

Conclusion
We proposed a new neural architecture, CAFE for NLI. CAFE achieves very competitive performance on three benchmark datasets. Extensive ablation studies confirm the effectiveness of FM layers over FC layers. Qualitatively, we show how different compositional operators (e.g., sub and mul) behave in NLI task and shed light on why subtractive composition helps in other models such as ESIM.