Enhanced Sentence Alignment Network for Efficient Short Text Matching

Cross-sentence attention has been widely applied in text matching, where a model learns the aligned information between two intermediate sequence representations to capture their semantic relationship. However, the intermediate representations are commonly generated solely from the preceding layers, so models may suffer from error propagation and unstable matching, especially when multiple attention layers are used. In this paper, we propose an enhanced sentence alignment network with simple gated feature augmentation, in which the model can flexibly integrate both original word features and contextual features to improve the cross-sentence attention. Moreover, our model is less complex, with fewer parameters, than many state-of-the-art architectures. Experiments on three benchmark datasets validate our model's capacity for text matching.


Introduction
Modeling the semantic relationship of a sentence pair is a long-standing task in natural language processing, with applications in many scenarios such as paraphrase detection and natural language inference (Wang et al., 2017; Bowman et al., 2015; Lan and Xu, 2018). Neural network approaches have achieved impressive results on text matching tasks, owing to their strong representation learning ability and the availability of large datasets (Rocktäschel et al., 2015; Wang et al., 2017; Gong et al., 2017).
One of the major paradigms is the attention-based neural approach, which adopts a matching-and-fusion method (Chen et al., 2017; Wang and Jiang, 2016; Duan et al., 2018). Specifically, an attention mechanism is used as a key component to compute word or phrase alignments between the two parallel sequences, and the aligned information is then fused to update the sentence representations. Recent work also adopts multiple matching processes to equip models with the ability to gradually refine the attention results (Yang et al., 2019; Liang et al., 2019; Kim et al., 2019).
Unfortunately, conducting cross-sentence attention between two intermediate sentence representations may lead to unstable matching, since different layers aim to capture different semantic information (Liu et al., 2019a). Moreover, each intermediate representation is highly correlated with the previous layers, so error propagation can affect the subsequent representations and lead to incorrect alignments, since the model cannot amend the information without recalling the original semantic features. Furthermore, when multiple alignment blocks are used, models may suffer from training difficulties such as vanishing gradients, and the low-level features cannot be fully trained. Some recent models adopt different connection methods to overcome this problem (Tay et al., 2018a; Yang et al., 2019; Nie and Bansal, 2017).
Recently, pre-trained language models such as BERT have achieved impressive improvements on text matching tasks (Devlin et al., 2019; Liu et al., 2019b). Despite the promising results, their large parameter sizes and growing computational requirements make it hard to deploy these models directly in real-time applications (Sanh et al., 2019). Designing efficient and effective models for text matching has therefore become increasingly important.
In this work, we introduce an Enhanced Sentence Alignment Network with Gated Feature Augmentation (ESAN), in which the model integrates the word features (embedding outputs) and contextual features (encoding outputs) into the intermediate representations before each cross-sentence attention, as shown in Figure 1. The embedding outputs contain the original word information, and the encoding outputs represent each token with its aggregated context; both help guide the attention layer to properly capture the aligned information. A gate operation flexibly controls how much of these two features is added. Incorporating the original semantic features directly into the different levels of representation layers can also be viewed as a shortcut connection, which helps reduce the training difficulty for low-level features. We then apply a simple but effective fusion layer to fuse the aligned features and update the sentence representations gradually. Unlike previous work (Yang et al., 2019; Kim et al., 2019), we do not apply residual connections between alignment layers or use multiple encoders within them, so our architecture is more efficient and less complex than many strong baselines, indicating its feasibility for deployment in real applications.
To demonstrate the effectiveness of our method, we conduct experiments on three text matching datasets: SNLI, MultiNLI and Quora Question Pairs. The results show our model outperforms strong baselines with fast inference speed. We also conduct model analysis including an ablation study and a case study on attention visualization.

Encoding Layer
Given inputs S_a and S_b, the model first passes each sequence through an embedding layer to obtain word representations. We use pre-trained word vectors as word embeddings and keep them fixed during training. Character-based word representations are also leveraged: we apply a 1D convolutional network to the character embeddings, followed by max pooling over the time dimension of each token. The word vectors and character-based vectors are concatenated. Following previous work, we further concatenate syntactic features, including a part-of-speech (POS) tagging feature and a binary exact-match feature, for the NLI task. The embedding outputs are regarded as the final word features:

E_a = {e^a_i | i = 1, ..., m},  E_b = {e^b_j | j = 1, ..., n},

where m, n are the sequence lengths. We then pass E_a and E_b to a bidirectional LSTM encoder to obtain the contextual features H_a = {h^a_i} and H_b = {h^b_j}. Intuitively, the word features contain the original information of each token, and the contextual features represent each word with aggregated context information. Both will be used as additional features to enhance the following alignment process.

Figure 1: Word features (E_a) and contextual features (H_a) are used for the enhanced sentence alignment layer, and multiple alignment layers are stacked with independent parameters. The structure is symmetric, and we omit the right part due to space limitations.
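To make the layer concrete, below is a minimal PyTorch sketch of the embedding and encoding steps. The syntactic features used for NLI are omitted; the class and argument names are our own, and the dimensions are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EncodingLayer(nn.Module):
    """Sketch: frozen pre-trained word embeddings, a character CNN with
    max-over-time pooling, and a BiLSTM encoder."""
    def __init__(self, word_vectors, n_chars, char_dim=50, n_filters=100,
                 hidden_dim=150):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # 1D convolution over the character sequence of each token
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel_size=5, padding=2)
        self.encoder = nn.LSTM(word_vectors.size(1) + n_filters, hidden_dim,
                               batch_first=True, bidirectional=True)

    def forward(self, words, chars):
        # words: (batch, m); chars: (batch, m, c) character ids per token
        b, m, c = chars.size()
        ch = self.char_emb(chars).view(b * m, c, -1).transpose(1, 2)
        ch = torch.relu(self.char_cnn(ch)).max(dim=2).values  # max over time
        e = torch.cat([self.word_emb(words), ch.view(b, m, -1)], dim=-1)
        h, _ = self.encoder(e)  # e: word features E, h: contextual features H
        return e, h
```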

Enhanced Sentence Alignment Layer
The proposed enhanced sentence alignment layer takes the intermediate representations a and b as inputs. As shown in Figure 1, the enhanced alignment layer consists of three components: (1) gated feature augmentation, (2) co-attention, and (3) a fusion layer. Multiple enhanced sentence alignment layers are stacked to enable the model to gradually refine the alignments.

Gated Feature Augmentation.
Given two intermediate sequence representations a and b, which are the inputs of the current alignment layer, we first augment each representation with the word and contextual features as different levels of the original semantic features. Specifically, for sequence representation a = {a_i | a_i ∈ R^d, i = 1, 2, ..., m}, we augment the word feature E_a and contextual feature H_a with a gate operation, which enables the model to selectively keep the features from different parts:

g^e_i = σ(W_e a_i + z_e),    (1)
g^h_i = σ(W_h a_i + z_h),    (2)
ã_i = a_i + g^e_i ∘ e^a_i + g^h_i ∘ h^a_i,    (3)

where W_* ∈ R^{d×d} and z_* ∈ R^d are trainable parameters, σ is the sigmoid function, and ∘ is the element-wise product. The same operation is performed for sequence b. For the inputs of the first alignment layer (the encoding outputs), we only augment the word features. Inspired by residual connections (He et al., 2016), we also try a simplified version of the augmentation without the gate operation:

ã_i = a_i + e^a_i + h^a_i.    (4)
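As an illustration, the gate can be implemented in a few lines of PyTorch. This is a minimal sketch of one plausible reading of Equations 1-3; it assumes the word and contextual features have already been projected to the same dimension d as the intermediate representation.

```python
import torch
import torch.nn as nn

class GatedAugmentation(nn.Module):
    """Sketch of gated feature augmentation: two sigmoid gates decide how
    much of the word and contextual features to add back."""
    def __init__(self, d):
        super().__init__()
        self.gate_e = nn.Linear(d, d)  # gate for word features
        self.gate_h = nn.Linear(d, d)  # gate for contextual features

    def forward(self, a, e, h):
        # a, e, h: (batch, seq_len, d)
        g_e = torch.sigmoid(self.gate_e(a))
        g_h = torch.sigmoid(self.gate_h(a))
        return a + g_e * e + g_h * h  # gated residual augmentation
```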

Co-attention.
We then apply co-attention between the two enhanced sequence representations ã and b̃ to capture their relationship. We first calculate a similarity score e_ij for token ã_i and token b̃_j:

e_ij = (W_c ã_i)^T (W_c b̃_j),    (5)

where W_c is a trainable parameter and the bias term is omitted. Then the attentive representation of each sequence is computed as the weighted sum of the other sequence to highlight the relevant elements:

a'_i = Σ_{j=1..n} [exp(e_ij) / Σ_{k=1..n} exp(e_ik)] · b̃_j,    (6)
b'_j = Σ_{i=1..m} [exp(e_ij) / Σ_{k=1..m} exp(e_kj)] · ã_i,    (7)

where m, n are the lengths of sequences a and b.
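The co-attention step amounts to one similarity matrix followed by two softmax-weighted sums. A PyTorch sketch follows; masking of padded positions is omitted for brevity, and the function signature is our own.

```python
import torch
import torch.nn.functional as F

def co_attention(a, b, w_c):
    """Sketch of the co-attention step. a: (batch, m, d), b: (batch, n, d);
    w_c: a bias-free nn.Linear(d, d) shared by both sequences."""
    e = torch.bmm(w_c(a), w_c(b).transpose(1, 2))             # (batch, m, n)
    a_attn = torch.bmm(F.softmax(e, dim=2), b)                # b aligned to a
    b_attn = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a)  # a aligned to b
    return a_attn, b_attn
```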

Fusion Layer
We apply a simple yet effective fusion layer to fuse the aligned features into the original representations. The output of the fusion layer ā is computed as follows:

ā_i = ReLU(W_f [ã_i; a'_i; ã_i ∘ a'_i] + z_f),    (8)

where W_f and z_f are trainable parameters, [;] represents concatenation, and ∘ is the element-wise product. The output has the same dimension as ã_i and a'_i.
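A sketch of the fusion layer under the formulation above; the exact set of interaction terms in the concatenation is an assumption on our part.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Sketch: concatenate a representation, its aligned counterpart, and
    their element-wise product, then project back to dimension d."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(3 * d, d)

    def forward(self, a, a_attn):
        fused = torch.cat([a, a_attn, a * a_attn], dim=-1)
        return torch.relu(self.proj(fused))
```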

Pooling and Classification Layer
We use both mean and max pooling on each sequence to obtain the corresponding vector representations, which serve as inputs to the classification layer. Mean pooling aggregates global semantics, and max pooling captures the most important semantic features. Then we apply an MLP with softmax to obtain the final distribution. Formally, assuming the outputs of the last fusion layer are V_a and V_b, we first compute the feature vector:

v = [mean(V_a); max(V_a); mean(V_b); max(V_b)].    (9)

Then a multi-layer perceptron (MLP) is used to calculate the final prediction:

ŷ = softmax(W_2 ReLU(W_1 v + z_1) + z_2),    (10)

where W_* and z_* are trainable parameters.
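A minimal sketch of this head in PyTorch; how the two pooled sentence vectors are combined before the MLP is an assumption.

```python
import torch
import torch.nn as nn

class PoolingClassifier(nn.Module):
    """Sketch of the pooling and classification head."""
    def __init__(self, d, n_classes, hidden=300):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    @staticmethod
    def pool(v):
        # v: (batch, seq_len, d) -> (batch, 2d); mean for global semantics,
        # max for the most salient features
        return torch.cat([v.mean(dim=1), v.max(dim=1).values], dim=-1)

    def forward(self, v_a, v_b):
        x = torch.cat([self.pool(v_a), self.pool(v_b)], dim=-1)
        return torch.softmax(self.mlp(x), dim=-1)
```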

Table 1: Test accuracy (%) on the SNLI dataset.

Model                          Test Accuracy (%)
ESIM (Chen et al., 2017)       88.0
BiMPM (Wang et al., 2017)      87.5
DIIN (Gong et al., 2017)       88.0
CAFE (Tay et al., 2018b)       88.5
CSRAN (Tay et al., 2018a)      88.7
ADIN (Liang et al., 2019)      88.8
RE2 (Yang et al., 2019)        88.9
OSOA-DFN (Liu et al., 2019a)   88.8
ESAN (ours)                    89.0

Training Details and Parameters. We tune the number of enhanced alignment layers from 2 to 3 in all experiments; the architecture can easily be extended to more layers. We use 300-dimensional GloVe vectors trained on 840B tokens (Pennington et al., 2014) as pre-trained word embeddings. The 1D convolutional network for the character embeddings uses kernel size 5 and 100 filters. We tune the number of recurrent layers from 1 to 2, and the dimension of the feed-forward layers from 150 to 300, with ReLU (Glorot et al., 2011) as the activation function. The Adam optimizer (Kingma and Ba, 2014) is used with β_1 = 0.9 and β_2 = 0.999. In the character embedding, we crop or pad each token to 16 characters. Dropout with rate 0.2 is applied to prevent overfitting. We set the initial learning rate to 0.001 with exponential decay. The batch size is tuned from 64 to 256. More details are in the Supplementary Material.
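For reference, the reported optimizer settings translate to the following PyTorch setup; the stand-in model and the decay factor are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)  # stand-in for the full ESAN model

# Adam with the reported betas and initial learning rate of 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# exponential learning-rate decay; the per-epoch decay factor is an assumption
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
```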

Quantitative Results
Our model achieves competitive results on all three datasets, outperforming strong baselines. For a fair comparison, we do not include methods based on pre-trained language models such as BERT (Devlin et al., 2019) or ensemble systems.

Table 3: Test accuracy (%) on the MultiNLI matched and mismatched test sets.

Model                         Matched   Mismatched
DIIN (Gong et al., 2017)      78.8      77.8
CAFE (Tay et al., 2018b)      78.7      77.9
AF-DMN (Duan et al., 2018)    76.9      76.3
MwAN (Tan et al., 2018)       78.5      77.7
ADIN (Liang et al., 2019)     78        —

The results for SNLI and Quora are shown in Table 1 and Table 2. For SNLI, our model achieves 89.0% test accuracy, higher than all comparisons including several strong state-of-the-art models. For Quora, our model also achieves the best performance, with 89.3% test accuracy. Table 3 presents the results on MultiNLI, where our model produces higher accuracy on both the in-domain (matched) and out-of-domain (mismatched) test sets, further demonstrating the model's ability on the natural language inference task. Overall, the results on these challenging datasets verify the effectiveness of our model for text matching.

Model Analysis
Ablation Study. To verify the effectiveness of our model components, we conduct an ablation study on Quora, as shown in Table 4. The first row represents the model variant without feature augmentation (using the original co-attention between intermediate representations), and the result drops dramatically, showing that feature augmentation plays a key role in enhancing the alignment process. In the next two settings, removing the word features and the contextual features respectively both hurt the results, with the removal of contextual features causing the larger decrease; the two features are thus complementary in improving the cross-sentence attention. In the last ablation, we apply the simple augmentation without the gate, as in Equation 4, and the performance decreases by 0.3 percentage points, which indicates the usefulness of the gate operation.

Model Efficiency. Figure 2 compares the total number of parameters of our model and the baselines. Strong comparison systems such as CSRAN and MwAN contain more than 10M parameters, while our model has fewer parameters (3.9M) and achieves better results. We also compare the inference time with BERT in Table 5 to show the efficiency of our model. Specifically, we set the sentence length to 20. Both models make predictions for batches of 8 sentence pairs on a MacBook Pro with Intel Core i7 CPUs. For BERT, we add a linear layer on top of the [CLS] token for classification, as in the original paper (Devlin et al., 2019). We report the mean and standard deviation over 1000 batches. The results show that ESAN has a higher inference speed than BERT with lower model complexity, which further indicates that ESAN is efficient enough for many real-world scenarios.
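The latency measurement can be reproduced with a simple loop of timed forward passes; the following sketch mirrors the protocol described above (function and argument names are ours).

```python
import time
import statistics
import torch

@torch.no_grad()
def benchmark(model, inputs, n_batches=1000):
    """Time repeated forward passes on CPU and report the mean and standard
    deviation of per-batch latency in seconds."""
    model.eval()
    times = []
    for _ in range(n_batches):
        start = time.perf_counter()
        model(*inputs)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)
```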

Attention Visualization.
We present a case study through the attention visualization to investigate what our model learns in cross-sentence attention. We take an instance from SNLI, where sentence 1 is "police officer with riot shield stands in front of crowd" and sentence 2 is "a police officer stands in front of a crowd". The attention results are shown in Figure 3.
In the first attention layer, the model tends to align elements mostly at the word level. For example, "stands" and "crowd" in the two sequences are successfully connected. The model also correctly aligns the phrase "police officer", which is one of the key components. In the second attention layer, the model refines the attention distribution, giving "stands" and "crowd" larger weights, and tends to align longer phrases rather than individual words. For example, the phrases "in front of" in the two sequences are connected. Notably, "riot shield" is also aligned to "police officer". We hypothesize that the model learns that this phrase describes the entity "police officer", so correctly aligning the two helps make the final decision. With these proper alignments, the model correctly classifies the relationship as "entailment".
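Heatmaps like those in Figure 3 can be produced directly from a layer's normalized attention matrix; a minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

def plot_attention(weights, tokens_a, tokens_b):
    """weights: (len_a, len_b) array of attention scores from one layer."""
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap='Blues')
    ax.set_xticks(range(len(tokens_b)))
    ax.set_xticklabels(tokens_b, rotation=45, ha='right')
    ax.set_yticks(range(len(tokens_a)))
    ax.set_yticklabels(tokens_a)
    fig.tight_layout()
    plt.show()
```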

Related Work
Text matching is a key technique for many NLP tasks such as natural language inference (Bowman et al., 2015), paraphrase identification (Wang et al., 2017), and machine reading comprehension (Rajpurkar et al., 2016). As a long-standing problem, this area has attracted wide attention and investigation. Benefiting from large-scale datasets, neural networks have achieved much success on this problem. One paradigm uses a sentence encoding structure, in which the two sentences are encoded into vector representations that are then combined to make the final prediction (Conneau et al., 2017; Yin and Schütze, 2015; Mueller and Thyagarajan, 2016). However, the interaction of the two input sequences is not directly considered during the encoding process, which makes it difficult for the model to capture complex relationships.
Later work adopts a matching-and-aggregation method to model the alignments of the two sentences. Wang and Jiang (2016) use a match-LSTM to conduct word-level matching of the two sequences. Parikh et al. (2016) propose a simple attention operation and use a feed-forward network to integrate the aligned representations. BiMPM (Wang et al., 2017) uses a multi-perspective matching operation to compare the two sequences and applies a Bi-LSTM network for aggregation. Gong et al. (2017) use DenseNet as a feature extractor to extract semantic features from the interaction tensor.
To better capture sentence alignments at different levels, multiple attention operations can be stacked together. Yang et al. (2019) propose a simple but effective framework with richer alignment features. Tay et al. (2018a) leverage a multi-level attention refinement component to conduct more extensive matching and improve the results. ADIN (Liang et al., 2019) stacks asynchronous inference layers for a multi-step reasoning process.
Recently, pre-trained language models have achieved state-of-the-art results on text matching tasks through a pre-training and fine-tuning procedure (Devlin et al., 2019). Nevertheless, their large parameter sizes and slow inference speed make it hard to deploy these structures directly in real applications. Different from the above methods, we propose a simple but effective gated augmentation layer that enriches the intermediate representations with the original word and contextual features, thus guiding the model to produce better alignments.

Conclusions
In this work, we present ESAN, an enhanced sentence alignment network for text matching. We flexibly integrate both word and contextual features into the intermediate representations with a gate operation to conduct better co-attention between the two sequences. Our model outperforms strong baselines on three datasets while containing fewer parameters, demonstrating its capacity to produce proper alignments for text matching. In the future, we plan to apply our method to other scenarios such as question answering.