Co-Stack Residual Affinity Networks with Multi-level Attention Refinement for Matching Text Sequences

Learning a matching function between two text sequences is a long-standing problem in NLP research. This task enables many potential applications such as question answering and paraphrase identification. This paper proposes Co-Stack Residual Affinity Networks (CSRAN), a new and universal neural architecture for this problem. CSRAN is a deep architecture involving stacked (multi-layered) recurrent encoders. Stacked/deep architectures are traditionally difficult to train because of inherent weaknesses such as difficulty with feature propagation and vanishing gradients. CSRAN incorporates two novel components to take advantage of the stacked architecture. Firstly, it introduces a new bidirectional alignment mechanism that learns affinity weights by fusing sequence pairs across stacked hierarchies. Secondly, it leverages a multi-level attention refinement component between stacked recurrent layers. The key intuition is that, by leveraging information across all network hierarchies, we can not only improve gradient flow but also improve overall performance. We conduct extensive experiments on six well-studied text sequence matching datasets, achieving state-of-the-art performance on all.


Introduction
Determining the semantic affinity between two text sequences is a long-standing research problem in natural language processing. This is understandable, given that technical innovations in this domain would naturally benefit a diverse range of applications, from paraphrase detection to standard document retrieval. This work focuses on short textual sequences and on applications such as natural language inference, question answering, reply prediction and paraphrase detection. This paper presents a new deep matching model for universal text matching.
* Denotes equal contribution.
Neural networks are the dominant state-of-the-art approaches for many of these matching problems (Gong et al., 2017; Shen et al., 2017; Wang et al., 2017). Fundamentally, neural networks operate via a concept of feature hierarchy, in which hierarchical representations are constructed as sequences propagate across the network. In the context of matching, representations are often (1) encoded, (2) matched, and then (3) aggregated for prediction. Each key step often comprises several layers, which consequently adds to the overall depth of the network.
Unfortunately, it is well established that deep networks are difficult to train. This is attributed not only to vanishing/exploding gradients but also to an intrinsic difficulty pertaining to feature propagation. To this end, commonly adopted solutions include residual connections and/or highway layers (Srivastava et al., 2015). The key idea in these approaches is to introduce additional (skip/residual) connections, propagating shallower layers to deeper layers via shortcuts. To the best of our knowledge, these techniques are generally applied to single sequences, and the notion of pairwise residual connections has therefore not been explored.
This paper presents Co-Stack Residual Affinity Networks (CSRAN), a stacked multi-layered recurrent architecture for general purpose text matching. Our model proposes a new co-stacking mechanism that computes bidirectional affinity scores by leveraging all feature hierarchies between text sequence pairs. More concretely, word-by-word affinity scores are computed not just from the final encoded representations but across the entire feature hierarchy.
There are several benefits to our co-stacking mechanism. Firstly, co-stacking acts as a form of residual connector, alleviating the intrinsic issues with network depth. Secondly, there are more extensive matching interfaces between text sequences, as the affinity matrix is computed not from just one representation but from multiple representations. Naturally, increasing the opportunities for interactions between sequences is an intuitive method for improving performance. Additionally, our model incorporates a Multi-level Attention Refinement (MAR) architecture in order to fully leverage the stacked recurrent architecture. The MAR architecture is a multi-layered adaptation and extension of the CAFE model (Tay et al., 2017c), in which attention is computed, compressed and then re-fed into the input sequence. In our approach, we use CAFE blocks to repeatedly refine representations at each level of the stacked recurrent encoder.
The overall outcome of the above-mentioned architectural synergies is a highly competitive model that establishes state-of-the-art performance on six well-known text matching datasets such as SNLI and TrecQA. The overall contributions of this work are summarized as follows:
• We propose a new deep stacked recurrent architecture for matching text sequences. Our model is based on a new co-stacking mechanism which learns to align by exploiting matching across feature hierarchies. This can be interpreted as a new way to incorporate shortcut connections within neural models for sequence matching. Additionally, we also propose a multi-level attention refinement scheme to leverage our stacked recurrent model.
• While stacked architectures can potentially lead to considerable improvements in performance, our experiments show that in the absence of our proposed CSRA (Co-stack Residual Affinity) mechanism, stacking may conversely lead to performance degradation. As such, this demonstrates that our proposed techniques are essential for harnessing the potential of deep architectures.

Co-Stack Residual Affinity Networks
In this section, we introduce our proposed model architecture for general/universal text matching.
The key idea of this architecture is to leverage deep stacked layers, while mitigating the inherent weaknesses of going deep. As such, our network is in similar spirit to highway networks, residual networks and DenseNets, albeit tailored specifically for pairwise architectures. Figure 1 illustrates a high-level overview of our proposed model architecture.

Input Encoder
The inputs to our model are standard sequences of words A and B which represent sequence a and sequence b respectively. In the context of different applications, a and b take different roles such as premise/hypothesis or question/answer. Both sequences are converted into word representations (via pretrained word embeddings) and character-based representations. Character embeddings are trainable parameters and a final character-based word representation of d dimensions is learned by passing all characters into a Bidirectional LSTM encoder. This is standard, following many works such as (Wang et al., 2017). Word embeddings and character-based word representations are then concatenated to form the final word representation. Next, the word representation is passed through a 2-layered highway network of d dimensions (optional, and tuned as a hyperparameter).
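As an illustration of the optional highway step, the following is a minimal sketch of a single highway layer in the standard form of Srivastava et al. (2015), y = g ⊙ H(x) + (1 − g) ⊙ x with gate g = sigmoid(W_t x + b_t). The ReLU candidate transform and all function and parameter names (`highway_layer`, `w_h`, `b_h`, `w_t`, `b_t`) are assumptions made for the example, not details taken from the model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: gate * candidate + (1 - gate) * input."""
    # Candidate transform H(x); ReLU is an assumed choice of nonlinearity.
    candidate = [max(0.0, h) for h in affine(w_h, b_h, x)]
    # Transform gate in (0, 1) decides how much of the candidate to use.
    gate = [sigmoid(t) for t in affine(w_t, b_t, x)]
    # The carry path (1 - gate) copies the input through unchanged.
    return [g * c + (1.0 - g) * xi for g, c, xi in zip(gate, candidate, x)]


# With all-zero parameters the gate is 0.5 and the candidate is 0,
# so each layer simply halves its input.
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
one_layer = highway_layer([2.0, -4.0], zeros_w, zeros_b, zeros_w, zeros_b)
two_layers = highway_layer(one_layer, zeros_w, zeros_b, zeros_w, zeros_b)
```

Because input and output share the same dimensionality d, two such layers can be applied back to back, which is how a 2-layered highway network would be composed.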

Stacked Recurrent Encoders
Next, word representations are passed into a stacked recurrent encoder layer. Specifically, we use Bidirectional LSTM encoders at this layer. Let $k$ be the number of layers of the stacked recurrent encoder layer.
$$h^i_t = \text{BiLSTM}_i(h^{i-1}_t), \quad \forall t \in [1, 2, \dots, \ell]$$
where $\text{BiLSTM}_i$ represents the $i$-th BiLSTM layer and $h^i_t$ represents the $t$-th hidden state of the $i$-th BiLSTM layer. $\ell$ is the sequence length. Note that the parameters are shared for both a and b.

Multi-level Attention Refinement (MAR)
Inspired by CAFE (Compare-Align-Factorized Encoders) (Tay et al., 2017c), a top performing model on the SNLI benchmark, we utilize CAFE blocks between the BiLSTM layers. Each CAFE block returns six features, which are generated by a factorization operation using factorization machines (FM). While Tay et al. (2017c) use this operation in a single layer, we utilize it in a multi-layered fashion, which we found to work well. This constitutes our multi-level attention refinement mechanism. More concretely, we apply the CAFE operation to the outputs of each BiLSTM layer, allowing the next BiLSTM layer to process the 'augmented' representations. The next layer retains its dimensionality by projecting the augmented representation back to its original size using the BiLSTM encoder. This can be interpreted as repeatedly refining representations via attention. As such, adding CAFE blocks is a very natural fit to the stacked recurrent architecture.
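The control flow of multi-level refinement can be sketched as a loop that alternates a shared encoder with a refinement step. This is only a schematic under stated assumptions: `encoders` and `refine` are hypothetical callables standing in for the BiLSTM layers and the CAFE block, the sketch collects the unrefined encoder outputs of every layer, and it skips refinement after the last layer; the text does not pin down either of the last two choices.

```python
def encode_with_refinement(seq_a, seq_b, encoders, refine):
    """Run stacked encoders over both sequences, refining between layers.

    encoders: list of callables (sequence -> sequence); the same callable
              is applied to a and b, mirroring shared parameters.
    refine:   callable (seq_a, seq_b) -> (seq_a, seq_b) that augments each
              word with extra features (a stand-in for a CAFE block).
    Returns the per-layer encoder outputs for a and for b.
    """
    outputs_a, outputs_b = [], []
    for depth, encoder in enumerate(encoders):
        seq_a, seq_b = encoder(seq_a), encoder(seq_b)
        outputs_a.append(seq_a)
        outputs_b.append(seq_b)
        # Assumption: no refinement is needed after the final layer.
        if depth < len(encoders) - 1:
            seq_a, seq_b = refine(seq_a, seq_b)
    return outputs_a, outputs_b


# Toy stand-ins (hypothetical): the "encoder" projects each word back to
# two dimensions and adds one; the "refiner" appends one scalar per word.
def toy_encoder(seq):
    return [[value + 1 for value in word[:2]] for word in seq]


def toy_refine(seq_a, seq_b):
    def pad(seq):
        return [word + [0.0] for word in seq]
    return pad(seq_a), pad(seq_b)


layers_a, layers_b = encode_with_refinement(
    [[0, 0]], [[1, 1], [2, 2]], [toy_encoder, toy_encoder], toy_refine
)
```

The toy encoder truncates its input back to two dimensions, which mimics how the next layer restores the original size after the augmented features have been appended.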

CAFE Blocks
This section describes the operation of each CAFE block. The key idea behind CAFE blocks is to align a and b, and to compress alignment vectors such as $b - a$ (subtraction), $b \odot a$ (element-wise multiplication) and $[b; a]$ (concatenation) into scalar features. These scalar features are concatenated to the original input embedding, which can then be passed into another BiLSTM layer for refining representations. Firstly, a and b are modeled via the affinity matrix $E_{ij} = F(a_i)^\top F(b_j)$ and then aligned using $E$. Each resulting matching vector $x$ is compressed by a factorization machine:
$$M(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle \, x_i x_j$$
where $v \in \mathbb{R}^{d \times k}$, $w_0 \in \mathbb{R}$ and $w \in \mathbb{R}^{d}$. The output $M(x)$ is a scalar. Intuitively, this layer tries to learn pairwise interactions between every $x_i$ and $x_j$ using factorized (vector) parameters $v$. Factorization machines model low-rank structure within the matching vector, producing a scalar feature. This enables efficient propagation of these matching features to the next layer. The output of each CAFE block is the original input to the CAFE module, augmented with the output of the factorization machines. As such, if the input sequence is of $d$ dimensions, then the output is $d + 3$ dimensions. Additionally, intra-attention is applied in similar fashion as above to generate three more features for each sequence. As a result, the output dimension for each word becomes $d + 6$.
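To make the compression step concrete, the sketch below implements a standard second-order factorization machine and uses it to turn the three matching vectors (subtraction, element-wise multiplication, concatenation) into three scalars appended to a word vector. The helper names (`factorization_machine`, `cafe_augment`, `fm_params`) are hypothetical, the aligned vector is assumed to be given, and the intra-attention features are left out, so the output here has d + 3 rather than d + 6 dimensions.

```python
from itertools import combinations


def factorization_machine(x, w0, w, v):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i, j in combinations(range(len(x)), 2):
        # Interaction weight is the dot product of the two factor vectors.
        interaction = sum(p * q for p, q in zip(v[i], v[j]))
        pairwise += interaction * x[i] * x[j]
    return w0 + linear + pairwise


def cafe_augment(word, aligned, fm_params):
    """Append three FM-compressed matching scalars to a word vector.

    fm_params maps "sub", "mul" and "cat" to a (w0, w, v) tuple; the
    "cat" parameters must match the doubled input dimension.
    """
    matching = {
        "sub": [b - a for a, b in zip(word, aligned)],
        "mul": [b * a for a, b in zip(word, aligned)],
        "cat": list(aligned) + list(word),
    }
    scalars = [factorization_machine(matching[name], *fm_params[name])
               for name in ("sub", "mul", "cat")]
    return list(word) + scalars
```

Each matching vector gets its own set of FM parameters in this sketch, so the three appended features can specialize on different kinds of comparison.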

Co-Stack Residual Affinity (CSRA)
This layer is the cornerstone of our proposed approach and is represented as the middle segment of Figure 1 (the colorful matrices).
Co-Stacking. Co-stacking refers to the fusion of a and b across multiple hierarchies. Recall that the affinity score between two words is typically computed by $s_{ij} = a_i^\top b_j$. We extend this to a residual formulation. More concretely, the affinity score between both words is now computed as the maximum influence it has over all layers:
$$s_{ij} = \max_{p,q} \; a_{pi}^\top b_{qj}$$
where $a_{pi}$ is the $i$-th word for the $p$-th stacked layer for a and $b_{qj}$ is the $j$-th word for the $q$-th stacked layer for b. The choice of the maximum operator is intuitive and is strongly motivated by the fact that we would like to give a high affinity to each word pair that shows a strong match at any of the different hierarchical stages of learning representations. Note that this layer can be interpreted as constructing a matching tensor based on multi-hierarchical information and selecting the most informative match across all representation hierarchies.
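The maximum over layer pairs can be sketched directly. The snippet assumes plain dot-product affinities and a nested-list layout in which `a_layers[p][i]` is the vector of word i of a at stacked layer p (and likewise for b); the function names are illustrative only.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def co_stack_affinity(a_layers, b_layers):
    """Return S with S[i][j] = max over layer pairs (p, q) of a_pi . b_qj."""
    len_a, len_b = len(a_layers[0]), len(b_layers[0])
    return [
        [
            # Compare word i of a and word j of b at every pair of depths
            # and keep the strongest match.
            max(dot(a_p[i], b_q[j]) for a_p in a_layers for b_q in b_layers)
            for j in range(len_b)
        ]
        for i in range(len_a)
    ]


# Two layers, two words in a, one word in b, two dimensions per word.
a_layers = [[[1, 0], [0, 1]], [[2, 0], [0, 0]]]
b_layers = [[[1, 1]], [[0, 3]]]
affinity = co_stack_affinity(a_layers, b_layers)
```

In this toy input the first word of a matches best at its second layer, while the second word of a matches best at its first layer against the second layer of b, which is exactly the cross-depth behaviour the maximum is meant to capture. With k layers per sequence the sketch evaluates k² dot products per word pair.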
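Bidirectional Alignment. In order to learn (bidirectionally) attentive representations, we first concatenate all stacked outputs to form an $\ell \times kd$ matrix. Next, we apply the following operations to $A \in \mathbb{R}^{\ell_a \times kd}$ and $B \in \mathbb{R}^{\ell_b \times kd}$:
$$\bar{A} = S^\top A \quad \text{and} \quad \bar{B} = S B$$
where $\bar{A} \in \mathbb{R}^{\ell_b \times kd}$ and $\bar{B} \in \mathbb{R}^{\ell_a \times kd}$ are the attentive (aligned) representations and $S$ is the affinity matrix with entries $s_{ij}$.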
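One reading of the alignment step that is consistent with the stated shapes is to weight the rows of A by the transposed affinity matrix and the rows of B by the affinity matrix itself. The sketch below follows that reading; it is an assumption, as is the absence of any softmax normalization of the affinity scores, and all helper names are hypothetical.

```python
def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(x, y):
    y_columns = transpose(y)
    return [[sum(p * q for p, q in zip(row, column)) for column in y_columns]
            for row in x]


def concat_layers(layers):
    """Concatenate each word's vectors across all stacked layers (kd dims)."""
    return [sum((layer[i] for layer in layers), [])
            for i in range(len(layers[0]))]


def bidirectional_align(S, A, B):
    """Assumed form: A_bar = S^T A (len_b x kd), B_bar = S B (len_a x kd)."""
    # Unnormalized affinity weights are used here; normalization is omitted.
    A_bar = matmul(transpose(S), A)
    B_bar = matmul(S, B)
    return A_bar, B_bar


S = [[2], [3]]            # len_a = 2, len_b = 1
A = [[1, 0], [0, 1]]      # len_a x kd
B = [[1, 1]]              # len_b x kd
A_bar, B_bar = bidirectional_align(S, A, B)
stacked = concat_layers([[[1, 0], [0, 1]], [[2, 0], [0, 0]]])
```

Note that the aligned version of a has one row per word of b and vice versa, so each row of `B_bar` can be compared position by position with the corresponding row of A.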

Matching and Aggregation Layer
Next, we match the attentive (aligned) representations using the subtraction, element-wise multiplication and concatenation of each aligned word. Subsequently, we pass this matching vector into a $k$-layered BiLSTM layer.
The final feature representation is learned via the summation across the temporal dimension as follows:
$$z = \Big[ \sum_{t} u^a_t \; ; \; \sum_{t} u^b_t \Big]$$
where $u^a_t$ and $u^b_t$ denote the outputs of the aggregation BiLSTM at position $t$ for a and b respectively, and $[.; .]$ is the concatenation operator.
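The matching and pooling steps can be illustrated without the recurrent aggregator. In the sketch below the aggregation BiLSTM is deliberately omitted, so the pooled inputs are simply whatever per-position vectors are passed in; the function names and the ordering of the concatenated parts are assumptions for the example.

```python
def matching_vector(word, aligned):
    """Concatenate subtraction, element-wise product and both inputs."""
    sub = [b - a for a, b in zip(word, aligned)]
    mul = [b * a for a, b in zip(word, aligned)]
    return sub + mul + list(aligned) + list(word)


def sum_over_time(sequence):
    """Sum a sequence of equal-length vectors across positions."""
    return [sum(column) for column in zip(*sequence)]


def final_feature(states_a, states_b):
    """Concatenate the temporal sums of both sequences into one vector."""
    return sum_over_time(states_a) + sum_over_time(states_b)


match = matching_vector([1, 2], [3, 1])
pooled = final_feature([[1, 2], [3, 4]], [[5, 6]])
```

Summation removes the dependence on sequence length, so a and b may have different numbers of words while the resulting feature vector keeps a fixed size.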

Output and Prediction Layer
Our model predicts using the feature vector z for every given sequence pair. At this layer, we utilize standard fully connected layers. The number of output layers is typically 2-3 and is a tuned hyperparameter. Softmax is applied onto the final layer. The final layer is application specific, e.g., k classes for classification tasks and a two-class softmax for pointwise ranking. For all datasets, we optimize the cross entropy loss.
Quora Duplicate Detection is a well-studied paraphrase identification dataset. We use the splits provided by Wang et al. (2017). The task is to determine if two questions are paraphrases of each other. This task is formulated as a binary classification problem. We compare with L.D.C, BiMPM, the DecompAtt implementation by Tomar et al. (2017) (word and char level) and DIIN.
TwitterURL (Lan et al., 2017) is another dataset for paraphrase identification. It was constructed using Tweets referring to news articles. This task is also a binary classification problem. We compare with (1) MultiP (Xu et al., 2014), a strong baseline, (2) the implementation of (He and Lin, 2016) by (Lan et al., 2017) and (3) the Subword + LM model from (Lan and Xu, 2018).
Ubuntu (Lowe et al., 2015) is a dataset for utterance-response matching and comprises 1 million utterance-response pairs. This dataset is based on the Ubuntu dialogue corpus. The goal is to predict the response to a message. We use the same setup as (Wu et al., 2016). Baselines include CNTN (Qiu and Huang, 2015), APLSTM (dos Santos et al., 2016), MV-LSTM (Wan et al., 2016a) and KEHNN (Wu et al., 2016). Results are reported from (Wu et al., 2016).
Metrics. For all datasets, we follow the evaluation procedure from the original papers. The metric for SNLI, SciTail and Quora is accuracy. The metric for the TwitterURL dataset is the F1 score. The metrics for TrecQA are Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). The metrics for Ubuntu are Recall@K for K = 1, 2, 5 (given 9 negative samples) and the binary classification accuracy score.
Pretrained word embeddings (Pennington et al., 2014) are fixed during training. We implement our model in TensorFlow (Abadi et al., 2015) and use the cuDNN implementation for all BiLSTM layers.
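For reference, the ranking metrics named above can be sketched as follows. Each function takes the 0/1 relevance labels of one query's candidates, already sorted by descending model score; this input convention and the function names are assumptions for the example, and the official evaluation scripts of each dataset may differ in details such as tie handling.

```python
def recall_at_k(ranked_labels, k):
    """Fraction of relevant candidates that appear in the top k."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant candidate (0 if there is none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_labels):
    """Mean of precision values taken at each relevant candidate."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean(values):
    """Average per-query scores to obtain MAP or MRR over a dataset."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# One positive response ranked third among ten candidates (9 negatives).
ubuntu_like = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```

MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all queries, and Recall@K for the response selection setting is the mean of `recall_at_k` over all contexts.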

Experimental Results
Overall, our proposed CSRAN architecture achieves state-of-the-art performance on all six well-established datasets.
On SNLI (Table 2), CSRAN achieves the best single model performance to date on this well-established dataset. This demonstrates the effectiveness of CSRAN, considering the inherent competitiveness of this well-known benchmark. On SciTail (Table 3), CSRAN similarly achieves the best performance to date on this dataset, outperforming the existing CAFE model by +3.4% absolute accuracy.
Overall, CSRAN achieves state-of-the-art performance on six well-studied datasets. On several datasets, our achieved performance is not only the highest reported score but also outperforms the existing state-of-the-art models by a considerable margin.

Training Efficiency
With many BiLSTM layers, it is natural to be skeptical about the training efficiency of our model. However, since we use the cuDNN implementation of the BiLSTM model, the runtime is actually very manageable. On SNLI, with a batch size of 128, our model with 3 stacked recurrent layers and 2 aggregation BiLSTM layers runs at ≈17 minutes per epoch and converges in less than 20 epochs. On SciTail, our model runs at ≈2 minutes per epoch with a batch size of 32. This is benchmarked on a TitanXP GPU. While our model is targeted at performance and not efficiency, this section serves as a reassurance that our model is not computationally prohibitive.

Ablation Study
In order to study the effectiveness of the key components in our proposed architecture, we conduct an extensive ablation study. Table 8 reports the results on several ablation baselines. There are three key ablation baselines as follows: (1) we removed MAR from the stacked recurrent network,

Effect of Stack Depth
In this section, our goals are twofold: (1) studying the effect of stack depth on model performance and (2) determining if the proposed CSRAN model indeed helps with enabling deeper stack depths. In order to do so, we compute the development set performance of two models. The first is the full CSRAN architecture and the second is a baseline stacked model architecture. Note that the bidirectional alignment layer and the remainder of the model architecture (highway layers, etc.) remain completely identical to CSRAN to make this study as fair as possible. Figure 2 illustrates the model performance with varying stack depth. As expected, the performance of the stacked model declines when increasing the stack depth. On the other hand, the performance of CSRAN improves as layers are added. The largest gain occurs when moving from 2 layers to 3 layers. The subsequent performance improvement from 3 to 5 layers is marginal. From this study, the takeaway is that standard stacked architectures are insufficient. As such, our proposed CSRA mechanism can aid in enabling deeper models, which can result in stronger model performance.
Next, we study the general effect of stack depth (number of layers) on model performance. Figure 3 reports the model performance (dev accuracy) of our CSRAN architecture on the Quora and SNLI datasets. We observe that a stacked architecture with 3 layers is significantly better than a single-layered architecture. The optimal development score is obtained with 3-4 layers for SNLI and 3 layers for Quora. The performance on Quora declines after 3 layers (notably, it is still higher than that of an unstacked model), whereas the performance on SNLI remains relatively stable.

Related Work
Learning to match text sequences is a core and fundamental research problem in NLP and Information Retrieval. A wide range of NLP applications fall under this paradigm, such as natural language inference (Bowman et al., 2015; Khot et al., 2018), paraphrase identification (Lan and Xu, 2018), question answering (Severyn and Moschitti, 2015), document search (Shen et al., 2014), social media search (Rao et al., 2018) and entity linking. As such, universal text matching algorithms are generally very attractive, given the prospect of benefitting an entire suite of NLP applications.
Neural networks have been the prominent choice for text matching. Earlier works are mainly concerned with learning a matching function between RNN/CNN encoded representations (Severyn and Moschitti, 2015; Yu et al., 2014; Qiu and Huang, 2015; Tay et al., 2017b, 2018b). Models such as Recursive Neural Networks have also been explored (Wan et al., 2016b). Subsequently, attention-based models were adopted (Rocktäschel et al., 2015; Parikh et al., 2016), demonstrating superior performance relative to their non-attentive counterparts.
Today, the dominant state-of-the-art approaches for text matching are mostly based on neural models configured with bidirectional attention layers (Shen et al., 2017; Tay et al., 2017c). Bidirectional attention comes in various flavours, known as soft alignment (Shen et al., 2017), decomposable attention (Parikh et al., 2016), attentive pooling (dos Santos et al., 2016) and even complex-valued attention (Tay et al., 2018a). The key idea is to jointly soft align text sequences such that they can be compared at the index level. To this end, various comparison functions have been utilized, ranging from feedforward neural networks (Parikh et al., 2016) to factorization machines (Tay et al., 2017c). Notably, these attention (and bi-attention) mechanisms are also widely adopted in (or originated from) many related sub-fields of NLP such as machine translation (Bahdanau et al., 2014) and reading comprehension (Xiong et al., 2016; Seo et al., 2016; Wang and Jiang, 2016b).
Many text matching neural models are heavily grounded in the compare-aggregate architecture (Wang and Jiang, 2016a). In these models, matching and comparisons occur between text sequences, aggregating features for making the final prediction. Recent state-of-the-art models such as BiMPM (Wang et al., 2017) and DIIN (Gong et al., 2017) are representative of this architectural paradigm, utilizing an attention-based matching scheme and then a CNN or LSTM-based feature aggregator. Earlier works (Wan et al., 2016a; He et al., 2015) exploit a similar paradigm, albeit without the usage of attention.
Across many NLP and machine learning applications, utilizing stacked architectures is a common way to enhance the representation capability of the encoder (Sutskever et al., 2014; Graves et al., 2013; Nie and Bansal, 2017), leading to performance improvement. Deep networks suffer from inherent difficulty in feature propagation and/or vanishing/exploding gradients. As a result, residual strategies have often been employed (Srivastava et al., 2015; Huang et al., 2017). However, to the best of our knowledge, this work presents a new form of residual connection that leverages the pairwise formulation of the text matching task.

Conclusion
We proposed a deep stacked recurrent architecture for general-purpose text sequence matching. We proposed a new co-stack residual affinity mechanism for matching sequence pairs, leveraging multi-hierarchical information for learning bidirectional alignments. Our proposed CSRAN model achieves state-of-the-art performance across six well-studied benchmark datasets and four different problem domains.