Simple and Effective Text Matching with Richer Alignment Features

In this paper, we present a fast and strong neural approach for general purpose text matching applications. We explore what is sufficient to build a fast and well-performed text matching model and propose to keep three key features available for inter-sequence alignment: original point-wise features, previous aligned features, and contextual features while simplifying all the remaining components. We conduct experiments on four well-studied benchmark datasets across tasks of natural language inference, paraphrase identification and answer selection. The performance of our model is on par with the state-of-the-art on all datasets with much fewer parameters and the inference speed is at least 6 times faster compared with similarly performed ones.


Introduction
Text matching is a core research area in natural language processing with a long history. In text matching tasks, a model takes two text sequences as input and predicts a category or a scala value indicating their relationship. A wide range of tasks, including natural language inference (also known as recognizing textual entailment) (Bowman et al., 2015;Khot et al., 2018), paraphrase identification , answer selection (Yang et al., 2015), and so on, can be seen as specific forms of text matching problems. Research on general purpose text matching algorithm is beneficial to a large number of relevant applications.
Deep neural networks are the most popular choices for text matching nowadays. Semantic alignment and comparison of two text sequences are the keys in neural text matching. Many previous deep neural networks contain a single intersequence alignment layer. To make full use of this only alignment process, the model has to take rich external syntactic features or hand-designed align-ment features as additional inputs of the alignment layer (Chen et al., 2017;Gong et al., 2018), adopt a complicated alignment mechanism Tan et al., 2018), or build a vast amount of post-processing layers to analyze the alignment result (Tay et al., 2018b;Gong et al., 2018).
More powerful models can be built with multiple inter-sequence alignment layers. Instead of making a prediction based on the comparison result of a single alignment process, a stacked model with multiple alignment layers maintains its intermediate states and gradually refines its predictions. However, suffering from inefficient propagation of lower-level features and vanishing gradients, these deeper architectures are harder to train. Recent works have come up with ways of connecting stacked building blocks including dense connection (Tay et al., 2018a;Kim et al., 2018) and recurrent neural networks (Liu et al., 2018), which strengthen the propagation of lower-level features and yield better results than those with a single alignment process. This paper presents RE2, a fast and strong neural architecture with multiple alignment processes for general purpose text matching. We question the necessity of many slow components in text matching approaches presented in previous literature, including complicated multi-way alignment mechanisms, heavy distillations of alignment results, external syntactic features, or dense connections to connect stacked blocks when the model is going deep. These design choices slow down the model by a large amount and can be replaced by much more lightweight and equally effective ones. Meanwhile, we highlight three key components for an efficient text matching model. These components, which the name RE2 stands for, are previous aligned features (Residual vectors), original point-wise features (Embedding vectors), and contextual features (Encoded vectors). The re-maining components can be as simple as possible to keep the model fast while still yielding strong performance.
The general architecture of RE2 is illustrated in Figure 1. An embedding layer first embeds discrete tokens. Several same-structured blocks consisting of encoding, alignment and fusion layers then process the sequences consecutively. These blocks are connected by an augmented version of residual connections (see section 2.1). A pooling layer aggregates sequential representations into vectors which are finally processed by a prediction layer to give the final prediction. The implementation of each layer is kept as simple as possible, and the whole model, as a well-organized combination, is quite powerful and lightweight at the same time.
Our proposed method achieves the performance on par with the state-of-the-art on four benchmark datasets across three different tasks, namely SNLI and SciTail for natural language inference, Quora Question Pairs for paraphrase identification, and WikiQA for answer selection. Furthermore, our model has the least number of parameters and the fastest inference speed in all similarlyperformed models. We also conduct an ablation study to compare with alternative implementations of most components, perform robustness checks to see whether the model is robust to changes of structural hyperparameters, explore what roles the three key features in RE2 play by comparing their occlusion sensitivity and show the evolution of alignment results by a case study. We release the source code 1 of our experiments for reproducibility and hope to facilitate future researches.

Our Approach
In this section, we introduce our proposed approach RE2 for text matching. Figure 1 gives an illustration of the overall architecture. Two text sequences are processed symmetrically before the prediction layer, and all parameters except those in the prediction layer are shared between the two sequences. For conciseness, we omit the part for the other sequence in the figure.
In RE2, tokens in each sequence are first embedded by the embedding layer and then processed consecutively by N same-structured blocks with independent parameters (dashed boxes in Figure  Figure 1: An overview of RE2. There are three parts in the input of alignment and fusion layers: original pointwise features (Embedding vectors, denoted by blank rectangles), previous aligned features (Residual vectors, denoted by rectangles with diagonal stripes), and contextual features (Encoded vectors, denoted by solid rectangles). The architecture on the right is the same as the one on the left so it's omitted for conciseness. 1) connected by augmented residual connections. Inside each block, a sequence encoder first computes contextual features of the sequence (solid rectangles in Figure 1). The input and output of the encoder are concatenated and then fed into an alignment layer to model the alignment and interaction between the two sequences. A fusion layer fuses the input and output of the alignment layer. The output of the fusion layer is considered as the output of this block. The output of the last block is sent to the pooling layer and transformed into a fixed-length vector. The prediction layer takes the two vectors as input and predicts the final target. The cross entropy loss is optimized to train the model in classification tasks.
The implementation of each layer is kept as simple as possible. We use only word embeddings in the embedding layer, without character embeddings or syntactic features. Vanilla multi-layer convolutional networks with same padding (Collobert et al., 2011) are adopted as the encoder. Recurrent networks are slower and do not lead to further improvements, so they are not adopted here. A max-over-time pooling operation (Collobert et al., 2011) is used in the pooling layer.
The details of augmented residual connections and other layers are introduced as follows.

Augmented Residual Connections
To provide richer features for alignment processes, RE2 adopts an augmented version of residual connections to connect consecutive blocks. For a sequence of length l, We denote the input and output of the n-th block as l ), respectively. Let o (0) be a sequence of zero vectors. The input of the first block x (1) , as mentioned before, is the output of the embedding layer (denoted by blank rectangles in Figure 1). The input of the n-th block x (n) (n ≥ 2), is the concatenation of the input of the first block x (1) and the summation of the output of previous two blocks (denoted by rectangles with diagonal stripes in Figure 1): where [; ] denotes the concatenation operation. With augmented residual connections, there are three parts in the input of alignment and fusion layers, namely original point-wise features kept untouched along the way (Embedding vectors), previous aligned features processed and refined by previous blocks (Residual vectors), and contextual features from the encoder layer (Encoded vectors). Each of these three parts plays a complementing role in the text matching process.

Alignment Layer
A simple form of alignment based on the attention mechanism is used following Parikh et al. (2016) with minor modifications. The alignment layer, as shown in Figure 1, takes features from the two sequences as input and computes the aligned representations as output. Input from the first sequence of length l a is denoted as a = (a 1 , a 2 , . . . , a la ) and input from the second sequence of length l b is denoted as The similarity score e ij between a i and b j is computed as the dot product of the projected vectors: (2) F is an identity function or a single-layer feedforward network. The choice is treated as a hyperparameter. The output vectors a and b are computed by weighted summation of representations of the other sequence. The summation is weighted by similarity scores between the current position and the corresponding positions in the other sequence: (3)

Fusion Layer
The fusion layer compares local and aligned representations in three perspectives and then fuse them together. The output of the fusion layer for the first sequenceā is computed bȳ where G 1 , G 2 , G 3 , and G are single-layer feedforward networks with independent parameters and • denotes element-wise multiplication. The subtraction operator highlights the difference between the two vectors while the multiplication highlights similarity. Formulations forb are similar and omitted here.

Prediction Layer
The prediction layer takes the vector representations of the two sequences v 1 and v 2 from the pooling layers as input and predicts the final target following Mou et al. (2016): H is a multi-layer feed-forward neural network. In a classification task,ŷ ∈ R C represents the unnormalized predicted scores for all classes where C is the number of classes. The predicted class isŷ = argmax iŷi . In a regression task,ŷ is the predicted scala value. In symmetric tasks like paraphrase identification, a symmetric version of the prediction layer is used for better generalization: We also provide a simplified version of the prediction layer. Which version to use is treated as a hyperparameter. The simplified prediction layer can be expressed as: 3 Experiments

Datasets
In this section, we briefly introduce datasets used in the experiments and their evaluation metrics. SNLI (Bowman et al., 2015) (Stanford Natural Language Inference) is a benchmark dataset for natural language inference. In natural language inference tasks, the two input sentences are asymmetrical. The first one is called "premise" and the second is called "hypothesis". The dataset contains 570k human annotated sentence pairs from an image captioning corpus, with labels "entailment", "neutral", "contradiction" and "-". The "-" label indicates that the annotators cannot reach an agreement, so we ignore text pairs with this kind of labels in training and testing following Bowman et al. (2015). We use the same dataset split as in the original paper. Accuracy is used as the evaluation metric for this dataset.
SciTail (Khot et al., 2018) (Science Entailment) is an entailment classification dataset constructed from science questions and answers. Since scientific facts cannot contradict with each other, this dataset contains only two types of labels, entailment and neutral. We use the original dataset partition. This dataset contains 27k examples in total. 10k examples are with entailment labels and the remaining 17k are labeled as neutral. Accuracy is used as the evaluation metric for this dataset.
Quora Question Pairs 2 is a dataset for paraphrase identification with two classes indicating whether one question is a paraphrase of the other. The dataset contains more than 400k real question pairs collected from Quora.com. We use the same dataset partition as mentioned in . Accuracy is used as the evaluation metric for this dataset.
WikiQA (Yang et al., 2015) is a retrieval-based question answering dataset based on Wikipedia. It contains questions and their candidate answers, with binary labels indicating whether a candidate sentence is a correct answer to the question it belongs to. This dataset has 20.4k training pairs, 2.7k development pairs, and 6.2k testing pairs. Mean average precision (MAP) and mean reciprocal rank (MRR) are used as the evaluation metrics for this task.

Implementation Details
We implement our model with TensorFlow (Abadi et al., 2016) and train on Nvidia P100 GPUs. We tokenize sentences with the NLTK toolkit (Bird et al., 2009), convert them to lower cases and remove all punctuations. We do not limit the maximum sequence length, and all sequences in a batch are padded to the batch-wise maximum. Word embeddings are initialized with 840B-300d GloVe word vectors (Pennington et al., 2014) and fixed during training. Embeddings of out-ofvocabulary words are initialized to zeros and fixed as well. All other parameters are initialized with He initialization (He et al., 2015) and normalized by weight normalization (Salimans and Kingma, 2016). Dropout with a keep probability of 0.8 is applied before every fully-connected or convolutional layer. The kernel size of the convolutional encoder is set to 3. The prediction layer is a two-layer feed-forward network. The hidden size is set to 150 in all experiments. Activations in all feed-forward networks are GeLU activations (Hendrycks and Gimpel, 2016), and we use √ 2 as an approximation of the variance balancing parameter for GeLU activations in He initialization. We scale the summation in augmented residual connections by 1/ √ 2 when n ≥ 3 to preserve the variance under the assumption that the two addends have the same variance.
The number of blocks is tuned in a range from 1 to 3. The number of layers of the convolutional encoder is tuned from 1 to 3. Although in robustness checks (Table 7) we validate with up to 5 blocks and layers, in all other experiments we deliberately limit the maximum number of blocks and number of layers to 3 to control the size of the model. We use the Adam optimizer (Kingma and Ba, 2015) and an exponentially decaying learning rate with a linear warmup. The initial learning rate is tuned from 0.0001 to 0.003. The batch size is tuned from 64 to 512. The threshold for gradient clipping is set to 5. For all the experiments except for the comparison of ensemble models, we report the average score and the standard deviation of 10 runs.

Results on Natural Language Inference
Results on SNLI dataset are listed in Table 1. We compare single models and ensemble models. For a fair comparison, we only compare with results obtained without external contextualized embed-
Our method obtains a result on par with the state-of-the-art among single models and a highly competitive result among ensemble models, with only a few parameters. Compared to SAN, our model reduces 20% parameters while improves the performance by 0.3% in accuracy, which indicates that our proposed architecture is highly efficient.
Results on Scitail dataset are listed in Table 2. The performance of our method is very close to state-of-the-art. This dataset is considered much more difficult with fewer training data available and generally low accuracy as a binary classification problem. The variance of the results is larger since the size of training and test set is only 4% and 20% compared to those of SNLI.

Results on Paraphrase Identification
Results on Quora dataset are listed in Table 3. Since paraphrase identification is a symmetric task where two input sequences can be swapped with no effect to the label of the text pair, in hyperparameter tuning we validate between two symmet-

Results on Answer Selection
Results on WikiQA dataset are listed in Table 4. Note that some of the previous methods round their reported results to three decimal points, but we choose to align with the original paper (Yang et al., 2015) and round our results to four decimal points. In hyperparameter tuning, we choose the best hyperparameters including early stopping according to MRR on WikiQA development set. We obtain a result on par with the state-of-the-art reported on this dataset. It's worth mentioning that we still train our model by point-wise binary classification loss, unlike some of the previous methods (including HCRN) which are trained by the pairwise ranking loss. Our method can perform well in the answer selection task without any taskspecific modifications.

Inference Time
To show the efficiency of our proposed model, we compare the inference time with some other models whose code is open-source. Table 5 shows the comparison results. All the compared models are implemented in TensorFlow in the original implementations. The † mark indicates that the model uses POS tags as external syntactic features and the computation time of POS tagging is not included. In our RE2 model, the number of en-Model time(s/batch) BiMPM  0.05 ± 0.00 CAFE † (Tay et al., 2018b) 0.07 ± 0.01 DIIN † (Gong et al., 2018) 0.85 ± 0.11 DIIN with EM feature † 1.79 ± 0.22 CSRAN † (Tay et al., 2018a) 0.28 ± 0.02 RE2 (1 block) 0.03 ± 0.00 RE2 (2 blocks) 0.04 ± 0.00 RE2 (3 blocks) 0.05 ± 0.00 coder layers is set to 3, the largest possible number in all previously reported experiments. Besides, since all the reported results of our proposed method are obtained with no more than 3 blocks, we only measure the inference time of RE2 with 1-3 blocks. We train all the compared models using the official training code and commands released by the authors on Nvidia P100 GPUs and save model checkpoints to disk. After training, all the models are required to make predictions for a batch of 8 pairs of sentences on a MacBook Pro with Intel Core i7 CPUs. The lengths of these sentences are 20 and the maximum number of characters in a word is 12. The reported statistics are the average and the standard deviation of processing 100 batches. The comparison results in Table 5 show that our method has very high CPU inference speed, even with multiple stacked blocks. Compared with similarly performed methods, ours is 6 times faster than CSRAN and at least 17 times faster than DIIN. With the highly efficient design, our method can perform well without any strong but slow building blocks like recurrent neural networks, dense connections or any syntactic features. Compared with models of similar inference speed, BiMPM and CAFE, ours obtains much higher prediction scores according to Table 1, Table 2, Table 3 and Table 4.
In summary, our proposed method achieves performance on par with the state-of-the-art on all four well-studied datasets across three different tasks with only a few parameters and fast inference speed.

Analysis
Ablation study. We present an ablation study of our model, comparing the original model with 6 ablation baselines: (1) "w/o enc-in": use directly  the output of the encoder as the input of the alignment and fusion layers like in most previous approaches without concatenating the encoder input; (2) "residual conn.": use vanilla residual connec- ) in place of the augmented version; (3) "simple fusion": use simplȳ as the fusion layer; (4) "alignment alt.": use the alternative version of the alignment layer where F in Equation 2 is a single-layer feed-forward network or an identity function; (5) "prediction alt.": use the alternative version (Equation 5/6 or Equation 7) of the prediction layer; (6) parallel blocks: feed the embeddings directly to all the blocks and sum up their outputs as the input of the pooling layer instead of processing input sequences consecutively by each block. The last setting is designed to study whether the improvement is due to deeper architecture or just a larger amount of parameters.
The ablation study is conducted on the development set of SNLI, Quora, Scitail, and WikiQA. In WikiQA we choose MRR as the evaluation metric. Note that on SciTail, F in Equation 2 in alignment layers is an identity function while on all other datasets F is a single-layer feed-forward network. On WikiQA, the simplified version (Equation 7) is used as the prediction layer while on all other datasets the full version (Equation 5 or 6) is used. The reported results are the average of 10 runs and the standard deviations are omitted for clarity.
The result is shown in Table 6. The first ablation baseline shows that without richer features as the alignment input, the performance on all datasets degrades significantly. This is the key component in the whole model. The results of the second baseline show that vanilla residual connections without direct access to the original pointwise features are not enough to model the relations in many text matching tasks. The simpler implementation of the fusion layer leads to evidently worse performance, indicating that the fu-   sion layer cannot be further simplified. On the other hand, the alignment layer and the prediction layer can be simplified on some of the datasets. In the last ablation study, we can see that parallel blocks perform worse than stacked blocks, which supports the preference for deeper models over wider ones.

Robustness checks.
To check whether our proposed method is robust to different variants of structural hyperparameters, we experiment with (1) the number of blocks varying from 1 to 5 with the number of encoder layers set to 2; (2) the number of encoder layers varying from 1 to 5 with the number of blocks set to 2. Robustness checks are performed on the development set of SNLI, Quora and Scitail. The result is presented in Table 7. We can see in the table that fewer blocks or layers may not be sufficient but adding more blocks or layers than necessary hardly harms the performance. On WikiQA dataset, our method does not seem to be robust to structural hyperparameter changes. Crane (2018) mentions that on WikiQA dataset a neural matching model (Severyn and Moschitti, 2015) trained with different random seeds can result in differences up to 0.08 in MAP and MRR. We leave the further investigation of the high variance on the WikiQA dataset for further work.
Occlusion sensitivity. To better understand what roles the three alignment features play, we perform an analysis of occlusion sensitivity similar to those in computer vision (Zeiler and Fergus, 2014). We use a three-block RE2 model to predict on SNLI dev set, mask one feature in one block to zeros at a time and report changes in accuracy of the three categories: entailment, neutral and contradiction. Occlusion sensitivity can help to reveal how much the model depends on each part when deciding on a specific category and we can make some speculations about how the model works based on the observations. Figure 2 shows the result of occlusion sensitivity. Previous aligned features are absent in the first block and thus left blank.
The text matching process can be abstracted, with moderate simplifications, to three stages: aligning tokens between the two sequences, focusing on a subset of the aligned pairs, discerning the semantic relations between the attended pairs. Each of the three key features in RE2 has a closer connection with one of the stages.
As we can see in Figure 2a, contextual features, represented by the output of the encoder, are indispensable when predicting entailment. These features connect with the first stage of text matching. The sequence encoder, implemented by convolutional networks, models local and phrase-level semantics, which helps to build correct alignment for each position. For example, consider the pair "A red car is next to a green house" and "A red car is parked near a house". If the noun phrases in the two sentences are not correctly modeled by the contextual encoding and "green" is incorrectly aligned with another color word "red", the pair looks much less like entailment.
In Figure 2b and Figure 2c, we can see that lacking direct access of previous aligned features (residual vectors), especially in the final block, results in significant degradation when predicting neutral and contradiction. Previous aligned features are related to the second stage of focusing on a subset of the aligned pairs. Without correct focus, the model may ignore non-entailing pairs and attend to other trivially aligned and semantically matched pairs, which results in failure in predicting neutral and contradiction. The importance of each position can be distilled and stored in previous aligned features and helps the model to focus in latter blocks.
We can conclude from Figure 2b and Figure  2c that when original point-wise features represented by embedding vectors are not directly accessible by alignment layers and fusion layers, the model is struggling to predict neutral and contradiction correctly. Original point-wise features connect with the final stage where semantic differences between aligned pairs are compared. Intact point-wise representations of the aligned pairs facilitate the model in the comparison of their semantic differences, which plays a vital role in predicting neutral and contradiction.
Case study. We present a case study of our model to show how inter-sequence alignment results evolve in our stacked architecture. An example pair of sentences are chosen from the development set of the SNLI dataset. The premise is "A green bike is parked next to a door", and the hypothesis is "The bike is chained to the door". In the first block, the alignment results are almost word-or phrase-level. "parked next to" is associated mostly with "bike" and "door" since there is a weaker direct connection between "parked" and "chained". In the final block, the alignment results take consideration of the semantics and structures of the whole sentences. The word "parked" is strongly associated with "chained" and "next to" is aligned with "to the" following "chained". With correct alignment, the model is able to tell that although most parts in the premise entail the aligned parts in the hypothesis, "parked" does not entail "chained", so it correctly predicts that the relation between the two sentences is neutral. Our model keeps the lower-level alignment results as intermediate states and gradually refines them to higherlevel ones.  Figure 3: A case study of the natural language inference task. The premise is "A green bike is parked next to a door", and the hypothesis is "The bike is chained to the door".

Related Work
Deep neural networks are dominant in the text matching area. Semantic alignment and comparison between two text sequences lie in the core of text matching. Early works explore encoding each sequence individually into a vector and then building a neural network classifier upon the two vectors. In this paradigm, recurrent (Bowman et al., 2015), recursive (Tai et al., 2015) and convolutional (Yu et al., 2014;Tan et al., 2016) networks are used as the sequence encoder. The encoding of one sequence is independent of the other in these models, making the final classifier hard to model complex relations.
Later works, therefore, adopt the matching aggregation framework to match two sequences at lower levels and aggregate the results based on the attention mechanism. DecompAtt (Parikh et al., 2016) uses a simple form of attention for alignment and aggregate aligned representations with feed-forward networks. ESIM (Chen et al., 2017) uses a similar attention mechanism but employs bidirectional LSTMs as encoders and aggregators.
Three major paradigms are adopted to further improve performance. First is to use richer syntactic or hand-designed features. HIM (Chen et al., 2017) uses syntactic parse trees. POS tags are found in many previous works including Tay et al. (2018b) and Gong et al. (2018). The exact match of lemmatized tokens is reported as a powerful binary feature in Gong et al. (2018) and Kim et al. (2018). The second way is adding complexity to the alignment computation. BiMPM  utilizes an advanced multiperspective matching operation, and MwAN (Tan et al., 2018) applies multiple heterogeneous attention functions to compute the alignment results.
The third way to enhance the model is building heavy post-processing layers for the alignment results. CAFE (Tay et al., 2018b) extracts additional indicators from the alignment process using alignment factorization layers. DIIN (Gong et al., 2018) adopts DenseNet as a deep convolutional feature extractor to distill information from the alignment results. More effective models can be built if intersequence matching is allowed to be performed more than once. CSRAN (Tay et al., 2018a) performs multi-level attention refinement with dense connections among multiple levels. DRCN (Kim et al., 2018) stacks encoding and alignment layers. It concatenates all previously aligned results and has to use an autoencoder to deal with exploding feature spaces. SAN (Liu et al., 2018) utilizes recurrent networks to combine multiple alignment results. This paper also proposes a deep architecture based on a new way to connect consecutive blocks named augmented residual connections, to distill previous aligned information which serves as an important feature for text matching.

Conclusion
We propose a highly efficient approach, RE2, for general purpose text matching. It achieves the performance on par with the state-of-the-art on four well-studied datasets across three different text matching tasks with only a small number of parameters and very high inference speed. It highlights three key features, namely previous aligned features, original point-wise features, and contextual features for inter-sequence alignment and simplifies most of the other components. Due to its fast speed and strong performance, the model is quite suitable for a wide range of related applications.