Original Semantics-Oriented Attention and Deep Fusion Network for Sentence Matching

Sentence matching is a key issue in natural language inference and paraphrase identification. Despite recent progress on multi-layered neural networks with cross-sentence attention, in such models one sentence attends to the intermediate representations of another sentence, which are propagated from preceding layers and are therefore uncertain and unstable targets for matching, particularly at risk of error propagation. In this paper, we present an original semantics-oriented attention and deep fusion network (OSOA-DFN) for sentence matching. Unlike existing models, each attention layer of OSOA-DFN is oriented to the original semantic representation of the other sentence, which captures the relevant information from a fixed matching target. The multiple attention layers allow one sentence to repeatedly read the important information of the other sentence for better matching. We additionally design a deep fusion to propagate the attention information at each matching layer. Finally, we introduce a self-attention mechanism that captures global context to enhance the attention-aware representation within each sentence. Experimental results on three sentence matching benchmark datasets, SNLI, SciTail and Quora, show that OSOA-DFN models sentence matching more precisely.


Introduction
Natural language sentence matching is a key technique for comparing two sentences and identifying the semantic relationship between them, which is usually viewed as a classification problem (Wang et al., 2017). The technique has applications in natural language inference, to judge whether a hypothesis sentence can be inferred from a premise sentence (Bowman et al., 2015), and in paraphrase identification, to determine whether two sentences express equivalent meanings (Yin et al., 2015). The core issue for sentence matching is to model the relatedness between two sentences (Rocktäschel et al., 2015; Parikh et al., 2016; Wang et al., 2017; Duan et al., 2018).
Recently, neural network-based models for sentence matching have attracted much attention for their powerful ability to learn sentence representations (Bowman et al., 2015; Wang et al., 2017; Duan et al., 2018). There are mainly two types of frameworks: sentence-encoding-based and attention-based. For the first type, a simple and effective model is proposed using two sentence vectors (Bowman et al., 2015), but the interaction between the two sentences is neglected. For the second type, an attention mechanism is introduced to model word-level interaction between two sentences, achieving higher accuracy (Rocktäschel et al., 2015; Parikh et al., 2016; Wang et al., 2017). In particular, multi-layered deep matching networks with attention show that deeper models outperform shallower ones (Duan et al., 2018).
However, the existing attention mechanism still has some limitations. When one sentence attends to another, the attention is performed between two parallel layers and oriented to the intermediate representations produced by the preceding layer of the other sentence. As a result, the semantics being attended to are uncertain and unstable for matching, because they change from layer to layer. Moreover, the intermediate representations are prone to error propagation in multi-layered attention: if the first attention aligns to the wrong position, the second attention takes the incorrect information as input for alignment.
To address these problems, we propose an original semantics-oriented attention and deep fusion network (OSOA-DFN) for sentence matching. OSOA-DFN mainly consists of three sub-components: (1) original semantics-oriented cross-sentence attention; (2) deep fusion; and (3) a self-attention mechanism. The cross-sentence attention is oriented to the original semantic representation of the other sentence, so that it can capture inherent semantics by relying on a fixed matching target. The multiple cross-attention operations allow one sentence to repeatedly read the important information of the other sentence for better interaction. We then design a deep fusion, in addition to the usual fusion, to augment the propagation of attention information at each matching layer. Finally, a self-attention mechanism is introduced to capture global context and enhance the attention-aware representation within each sentence. Experimental results demonstrate that OSOA-DFN models sentence matching more precisely on the SNLI, SciTail, and Quora datasets.
Our contributions can be summarized as follows:
• We perform attention over the original semantic representations for cross-sentence interaction, so the matching target of attention for a given sentence remains fixed in spite of multiple layers. The multiple cross-attention operations allow one sentence to repeatedly read the important information of the other sentence for better interaction.
• We design a deep fusion, in addition to the usual fusion, to augment the propagation of attention information for matching, and finally introduce a self-attention mechanism to capture global context and enhance the attention-aware representation within each sentence.
• We evaluate our model on three challenging datasets and show that it models sentence matching more precisely and significantly improves performance.

General Neural Attention-Based Model for Sentence Matching
Formally, we can define sentence matching as follows. Given two sentences P = [p_1, ..., p_i, ..., p_m] and Q = [q_1, ..., q_j, ..., q_n], the goal is to predict a label y* ∈ Y, where Y = {entailment, contradiction, neutral} in natural language inference and Y = {0, 1} in paraphrase identification, indicating the logical semantic relationship between the two sentences P and Q (Wang et al., 2017).
Generally, the architecture of neural attention-based models for sentence matching includes three components (Wang et al., 2017; Duan et al., 2018): (1) an input encoding layer encodes each sentence into a semantic representation; (2) an attention-based matching layer models word-level alignment between the two sentences and produces an attention-aware representation for each sentence; and (3) a prediction layer predicts the semantic relation between the two sentences. Figure 1(a) illustrates the general model.

Input Encoding Layer
For the given sentence pair P = [p_1, ..., p_i, ..., p_m] and Q = [q_1, ..., q_j, ..., q_n], where p_i and q_j indicate the i-th and j-th words in P and Q respectively, the input encoding layer first converts the words of P and Q into vectors [e_{p_1}, ..., e_{p_i}, ..., e_{p_m}] and [e_{q_1}, ..., e_{q_j}, ..., e_{q_n}] by looking up the embedding table M ∈ R^{d×|V|}, where d is the dimension of the embeddings and |V| is the size of the vocabulary.
In order to encode contextual information into word representations, we use a BiLSTM neural network (Hochreiter and Schmidhuber, 1997) to encode two sentences P and Q. The sequential BiLSTM calculates a new hidden state conditioned on the previous states to incorporate contextual information, and several previous works have shown its effectiveness for sentence matching (Rocktäschel et al., 2015;Wang et al., 2017;Duan et al., 2018).
The two sentences are then converted to H^0_P = [h^0_{p_1}, ..., h^0_{p_i}, ..., h^0_{p_m}] and H^0_Q = [h^0_{q_1}, ..., h^0_{q_j}, ..., h^0_{q_n}]. Hereafter, we refer to H^0_P and H^0_Q as the original semantic representations of sentences P and Q respectively. In this paper, we will use them as the targets of cross-sentence attention.
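As a concrete illustration, the encoding step can be sketched in NumPy. This is a minimal single-example sketch: the parameter layout, gate ordering, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gate pre-activations are stacked as [i; f; o; g]."""
    z = W @ x + U @ h + b
    d = h.shape[0]
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def bilstm_encode(E, params_fwd, params_bwd):
    """Encode one sentence's embeddings E (seq_len x d) into its original
    semantic representation H0 (seq_len x 2h) by running an LSTM in each
    direction and concatenating the per-position hidden states."""
    h_dim = params_fwd[1].shape[1]
    hf = np.zeros(h_dim); cf = np.zeros(h_dim)
    hb = np.zeros(h_dim); cb = np.zeros(h_dim)
    fwd, bwd = [], []
    for x in E:                    # left-to-right pass
        hf, cf = lstm_step(x, hf, cf, *params_fwd)
        fwd.append(hf)
    for x in E[::-1]:              # right-to-left pass
        hb, cb = lstm_step(x, hb, cb, *params_bwd)
        bwd.append(hb)
    # Concatenate the two directions per position: this is H0.
    return np.concatenate([np.stack(fwd), np.stack(bwd[::-1])], axis=1)
```

In practice both sentences share one BiLSTM, producing H^0_P and H^0_Q with identical parameters.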

Attention-Based Matching Layer
Generally, this layer employs the attention mechanism to model the interaction between the two sentences. It can be formulated as V_P, V_Q = Match(H^0_P, H^0_Q), where Match(·) is a neural attention-based matching function, and V_P = [v_{p_1}, ..., v_{p_i}, ..., v_{p_m}] and V_Q = [v_{q_1}, ..., v_{q_j}, ..., v_{q_n}] are the new attention-aware representations of P and Q, respectively. This is the core layer for sentence matching; the design of Match(·) has been the main focus of research, and several effective frameworks have been proposed (Rocktäschel et al., 2015; Wang et al., 2017; Duan et al., 2018). In this paper, we also focus on this layer and propose an original semantics-oriented attention and deep fusion network; the details are described in Section 3.

Prediction Layer
A pooling layer converts the resulting representations of all positions in P and Q into a fixed-length vector and feeds it into a classifier to determine the semantic relationship between the two sentences.
Mean pooling is usually adopted on each sentence to capture all of the information, and max pooling to highlight the most significant properties. In this paper, we obtain a fixed-dimensional representation V by concatenating them together, as in (Chen et al., 2017; Duan et al., 2018).
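The pooling scheme can be sketched as follows (a minimal NumPy illustration; `pool_and_concat` is a hypothetical helper name):

```python
import numpy as np

def pool_and_concat(v_p, v_q):
    """Build the fixed-length matching vector V from variable-length
    attention-aware representations (one row per word position).

    Mean pooling captures all of the information; max pooling highlights
    the most significant properties; the four results are concatenated.
    """
    return np.concatenate([v_p.mean(axis=0), v_p.max(axis=0),
                           v_q.mean(axis=0), v_q.max(axis=0)])
```

The resulting V has fixed dimension regardless of the sentence lengths m and n, so it can feed a standard MLP classifier.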
Finally, we pass the representation V into a multi-layer perceptron (MLP) classifier to calculate the probability Pr(·) of each label.

Original Semantics-Oriented Attention and Deep Fusion Network
In this paper, we mainly focus on the structure of the attention-based matching layer. Inspired by recent successful deep models (He et al., 2016; Duan et al., 2018), we propose an original semantics-oriented attention and deep fusion network (OSOA-DFN) for sentence matching, as shown in Figure 1.

Original Semantics-Oriented Cross Sentence Attention
Cross-sentence attention is utilized to model the relevance between the two sentences. In the t-th attention layer, we use P(t)→Q to denote that sentence P attends to sentence Q to extract the relevant information from Q. Each cross attention P(t)→Q uses the original semantics H^0_Q of Q for interaction, where t ∈ {1, ..., T} and t = 1 means that P uses its original representation H^0_P. We first compute the unnormalized attention weights as the similarity between P(t) and Q; the alignment matrix A^t ∈ R^{m×n} is defined as

A^t_{ij} = ⟨W^t h^{t-1}_{p_i}, h^0_{q_j}⟩ + ⟨U^t_p, h^{t-1}_{p_i}⟩ + ⟨U^t_q, h^0_{q_j}⟩,

where W^t ∈ R^{h×h} and U^t_p, U^t_q ∈ R^h are learnable parameters, ⟨·, ·⟩ denotes the inner product, and p_i and q_j are the i-th and j-th words in P and Q respectively. Next, the semantics of sentence Q related to h^{t-1}_{p_i} are extracted to compute h^t_{p_i} according to A^t:

h^t_{p_i} = Σ_{j=1}^{n} ( exp(A^t_{ij}) / Σ_{k=1}^{n} exp(A^t_{ik}) ) h^0_{q_j}.
Intuitively, h^t_{p_i} is a representation built from the attentive information in H^0_Q that is softly aligned to h^{t-1}_{p_i}; the semantics of H^0_Q are more likely to be selected when they are more related to h^{t-1}_{p_i}.
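A hedged NumPy sketch of one attention step follows. Since the exact alignment score is partially garbled in our copy, the bilinear-plus-bias form below is an assumption consistent with the stated parameter shapes W^t ∈ R^{h×h} and U^t_p, U^t_q ∈ R^h; function names are illustrative.

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_attention(h_prev_p, h0_q, W, u_p, u_q):
    """One original-semantics-oriented attention step P(t) -> Q.

    h_prev_p: (m, h) representation of P from the preceding layer.
    h0_q:     (n, h) the FIXED original semantics H0_Q (never updated),
              so the matching target stays stable across all T layers.
    Returns an (m, h) attention-aware representation aligned to h0_q.
    """
    # Alignment matrix A^t: bilinear term plus per-word bias terms.
    A = h_prev_p @ W @ h0_q.T + (h_prev_p @ u_p)[:, None] + (h0_q @ u_q)[None, :]
    # Row-wise softmax over Q's positions, then a weighted sum of H0_Q.
    return softmax_rows(A) @ h0_q
```

Note that `h0_q` is passed in unchanged at every layer t; only `h_prev_p` evolves, which is the key difference from inter-attention over intermediate representations.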

Deep Fusion
To further enrich the interaction, we first perform the usual fusion and then design a deep fusion for each cross attention to augment the propagation of attention information. The usual fusion (Wang and Jiang, 2016a; Duan et al., 2018) concatenates the comparison features

m^t_{p_i} = [h^{t-1}_{p_i}; h^t_{p_i}; h^{t-1}_{p_i} - h^t_{p_i}; h^{t-1}_{p_i} ∘ h^t_{p_i}],

where [·; ·; ·; ·] refers to the concatenation operation, which can retain all the information in the matching operation (Wang and Jiang, 2016a; Chen et al., 2017). We use a nonlinear transformation with ReLU activation (Glorot et al., 2011) as the local comparison function; this operation helps the model better fuse the attention information and also reduces the complexity of the vector representation. Since the understanding of some word-level alignments may rely on contextual matching information, we then apply a BiLSTM to incorporate sequential matching information, which further gathers interactive features between the two sentences.

We design the deep fusion layer as follows. A gated connection layer learns to adaptively control how much information is stored and carried to the next attention layer:

r^t_{p_i} = σ(W^t_r [h^t_{p_i}; h^{t-1}_{p_i}] + b^t_r),
z^t_{p_i} = σ(W^t_z [h^t_{p_i}; h^{t-1}_{p_i}] + b^t_z),

where W^t_* and b^t_* are learnable parameters, h^t_{p_i} is the result of the current layer, h^{t-1}_{p_i} is the result from the preceding layer, σ is the sigmoid function, and the values of r^t_{p_i} and z^t_{p_i} lie between 0 and 1. Intuitively, the model can learn to set r^t_{p_i} and z^t_{p_i} close to 1, so that more attention information from the preceding layers is propagated to the following attention layers for matching; values close to 0 imply that less information from the preceding layers is propagated.
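The two fusion stages can be sketched as follows. This is a simplified single-word NumPy illustration; the exact gate-combination equations are not fully recoverable from our copy, so the GRU-style interpolation below is a labeled assumption, and the function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def usual_fusion(h_prev, h_att, W_f, b_f):
    """Local comparison: concatenate the heuristic matching features
    [h; h_att; h - h_att; h * h_att], then a ReLU projection that both
    fuses the attention information and reduces the dimensionality."""
    feats = np.concatenate([h_prev, h_att, h_prev - h_att, h_prev * h_att],
                           axis=-1)
    return relu(feats @ W_f + b_f)

def gated_deep_fusion(h_t, h_prev, W_r, b_r, W_z, b_z):
    """Gated connection (GRU-style sketch, an assumption) that adaptively
    controls how much attention information from the preceding layer is
    carried forward. Gates r, z lie in (0, 1): near 1 propagates more of
    the preceding layers' information, near 0 less."""
    x = np.concatenate([h_t, h_prev], axis=-1)
    r = sigmoid(x @ W_r + b_r)
    z = sigmoid(x @ W_z + b_z)
    # Interpolate between the current result and the reset-gated carry-over.
    return z * h_t + (1.0 - z) * (r * h_prev)
```

In the full model a BiLSTM runs over the `usual_fusion` outputs of all positions before the gated connection is applied.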
After the t-th layer of the original semantics-oriented cross attention, each word p_i in sentence P is newly represented by h^t_{p_i}. Similarly, we conduct cross attention for Q(t)→P, meaning that sentence Q attends to sentence P, oriented to the original semantic representation H^0_P of P, to derive the attention-aware representation h^t_{q_j} for each word q_j of Q.

Self-Attention Mechanism
We additionally introduce a self-attention mechanism after cross sentence attention. It captures long-distance context information to learn word representation within each sentence and further enhances the attention-aware representation.
For sentence P, its attentive representation H^T_P = [h^T_{p_1}, ..., h^T_{p_i}, ..., h^T_{p_m}] is computed after T layers of original semantics-oriented cross-sentence attention. We first compute a self-attention matrix S^s ∈ R^{m×m}, where S^s_{ij} indicates the relevance between the i-th word and the j-th word in P. Then the self-attention vector for each word in P is computed as

h^s_{p_i} = Σ_{j=1}^{m} ( exp(S^s_{ij}) / Σ_{k=1}^{m} exp(S^s_{ik}) ) h^T_{p_j}.

Intuitively, h^s_{p_i} augments each word representation with the global context of sentence P.
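A minimal sketch of the self-attention step follows, assuming a plain dot-product score for S^s (the paper's exact score function is not fully specified in our copy, so this form is an assumption):

```python
import numpy as np

def self_attention(H):
    """Dot-product self-attention within one sentence (a hedged sketch).

    S[i, j] scores the relevance between words i and j; each word vector
    is then augmented with global sentence context via a softmax-weighted
    sum over all positions of the same sentence."""
    S = H @ H.T
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)) @ H
```

Unlike the cross attention, both the queries and the attended values here come from the same sentence, which is what captures long-distance dependencies within it.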
After that, the usual fusion augmented by the deep fusion, as described in Section 3.2, is also applied to further enhance the self-attention information within each sentence. The deep fusion after the self-attention layer fuses h^T_{p_i} from the original semantics-oriented cross attention with h^s_{p_i} from self-attention to obtain the final attention-aware representation.
Similarly, we conduct self-attention and deep fusion operations on sentence Q to derive the attention-aware representation h^s_{q_j} for each word q_j of Q. The two sentences are then converted to H^s_P = [h^s_{p_1}, ..., h^s_{p_i}, ..., h^s_{p_m}] and H^s_Q = [h^s_{q_1}, ..., h^s_{q_j}, ..., h^s_{q_n}]. Finally, H^s_P and H^s_Q are passed into the prediction layer as the inputs V_P and V_Q for deciding the semantic relationship.

The model is trained by minimizing the cross-entropy loss L(θ) = -(1/N) Σ_{i=1}^{N} log Pr(y^(i) | P^(i), Q^(i)), where θ denotes all the learnable parameters of our model, N is the number of instances in the training set, (P^(i), Q^(i)) are the sentence pairs, and y^(i) denotes the corresponding annotated label for the i-th instance.
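The cross-entropy objective over the N training pairs can be sketched as (function name is a hypothetical helper):

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    """Mean negative log-likelihood of the gold labels over N instances.

    probs:  (N, |Y|) predicted distribution Pr(.) for each sentence pair.
    labels: (N,) integer gold label indices y^(i)."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))
```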
Word Embedding. Following (Tay et al., 2017), to represent each input word we concatenate three types of vectors: a pre-trained vector, a learnable vector for each word type, and a learnable vector for the POS tag of the word. We use NLTK to acquire POS tags. Finally, we apply a nonlinear transformation with ReLU to the concatenated vector to obtain the final word embedding.

Dataset
We evaluate our model on natural language inference and paraphrase identification tasks with three datasets: the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), the SciTail dataset (Khot et al., 2018), and the Quora Questions Pairs dataset (Quora).
SNLI is a natural language inference dataset (Bowman et al., 2015). The original data set contains 570,152 sentence pairs, each labeled with one of the following relationships: Y = {entailment, contradiction, neutral}. We follow the same data split as in (Bowman et al., 2015).
SciTail is a binary entailment classification task with Y = {entailment, neutral}. We use the same data split as in (Khot et al., 2018). Notably, the premise and the corresponding hypothesis have high lexical similarity for both entailed and non-entailed pairs, which makes the task particularly difficult.
Quora consists of over 400,000 question pairs and Y = {0, 1} indicating whether two questions are paraphrases of each other. We have the same data split as in (Wang et al., 2017).

Implementation Details
We set word embeddings and all of the hidden states of the BiLSTMs and MLPs to 300 dimensions. Pre-trained word vectors are 300-dimensional GloVe 840B vectors (Pennington et al., 2014), which are not updated during training. The learnable word vectors and POS vectors have 30 dimensions. For all datasets, there are 3 cross-sentence attention layers and 1 self-attention layer. The batch size is set to 64 for SNLI and Quora, and 32 for SciTail. We use the Adam optimizer (Kingma and Ba, 2014) for model training. We set the initial learning rate to 5e-4 with a decay ratio of 0.95 per epoch, and the l2 regularizer strength to 6e-5. To prevent overfitting, we use dropout regularization (Srivastava et al., 2014) with a drop rate of 0.2 for all MLPs.

Ensemble
Ensemble strategies have been shown to effectively improve model accuracy. Following (Duan et al., 2018), our ensemble model averages the probability distributions of three individual single OSOA-DFNs, each of which has the same architecture but different parameter initialization.
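The averaging step is straightforward; a minimal sketch follows (`ensemble_predict` is a hypothetical helper name):

```python
import numpy as np

def ensemble_predict(prob_dists):
    """Average the class distributions of the individual models (same
    architecture, different initializations), then take the argmax."""
    avg = np.mean(np.asarray(prob_dists), axis=0)
    return int(np.argmax(avg))
```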

Comparison on Natural Language Inference
SNLI We compare our model with previous models on the SNLI dataset and show the results in Table 2. LstmAtt (Rocktäschel et al., 2015) extends the general LSTM model with attention. AF-DMN (Duan et al., 2018) adopts an attention-fused deep matching network using multiple stacked cross-attention and self-attention layers.
In Table 2, our single OSOA-DFN achieves 88.8% test accuracy. We also report the ensemble result, with a test accuracy of 89.3%. Comparative results show that our model outperforms the previous models in both the single and ensemble scenarios on the SNLI dataset. ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) are well-known pre-trained language models for acquiring contextual word vectors. Our model has much lower computational complexity (340M parameters in BERT versus 10M in our model) but obtains competitive performance; we will conduct a comparison with them in the future. In this paper, we evaluated the contribution of the original semantics-oriented cross attention and the deep fusion to our model.
SciTail We compare our model with the following previous models on the SciTail dataset and show the results in Table 3. The first five models in Table 3 are all implemented in (Khot et al., 2018). DGEM is a graph-based attention model that uses syntactic structure for improved performance (Khot et al., 2018). CAFE (Tay et al., 2017) improves previous comparison operations by compressing alignment vectors into scalar-valued features. DEISTE (Yin et al., 2018) proposes deep explorations of inter-sentence interaction. AF-DMN (re-imp) is our re-implementation of the multi-layered attention model in (Duan et al., 2018), which did not report results on this dataset.
On this dataset, our single OSOA-DFN significantly outperforms these strong baselines, achieving state-of-the-art performance with 86.8% accuracy on the test set. This demonstrates that our model improves semantic matching on the challenging SciTail dataset.

Comparison on Paraphrase Identification
Quora We compare our model with the following previous models on the Quora dataset and show the results in Table 4. The Siamese-CNN and Siamese-LSTM models encode sentences with a CNN and an LSTM respectively, and then predict the relationship between them based on cosine similarity (Wang et al., 2017). Multi-Perspective-CNN and Multi-Perspective-LSTM adopt a multi-perspective cosine matching function (Wang et al., 2017). L.D.C (Wang et al., 2016) and BiMPM (Wang et al., 2017) adopt attention-based frameworks that perform word-level matching.
As we can see, our single OSOA-DFN outperforms the baselines and achieves 89.03% accuracy on the test set. The results show that our model is very effective for the paraphrase identification task.

Effect of Original Semantics-Oriented Cross Sentence Attention
To verify the effect of original semantics-oriented cross-sentence attention, we first implement a variant of our model, namely OSOA-DFN (inter-attention), as shown in Table 5. As in the model of (Duan et al., 2018), we make the cross-sentence attention oriented to the intermediate representations from the preceding layer of the other sentence, where cross attention is performed on parallel layers between the two sentences. The results show that our method achieves higher accuracy on the SciTail test set, which demonstrates the effect of original semantics-oriented cross attention in extracting expressive features from the other sentence for semantic matching.
We further examine the effect of the depth of cross-sentence attention on performance, as shown in Table 6. As the number of stacked attention layers increases from 1 to 5, performance increases on both the development and test sets of the SciTail dataset, but the gains from 3 to 5 layers are smaller. We can conclude that multiple original semantics-oriented attention layers are effective in improving matching performance. However, the number of parameters grows rapidly as the number of stacked attention layers increases, and a large number of parameters increases model complexity. Because of the computational cost, we set the number of cross-attention layers to 3 in our experiments.

Effect of Deep Fusion and Self-Attention Mechanism
We verify the effect of the deep fusion and the self-attention for a better understanding of the performance improvement of OSOA-DFN, as shown in Table 7. Using only BiLSTM fusion without the deep fusion, the accuracy drops by 2.1% on the SciTail test set. This indicates that the augmented deep fusion at different layers is important in propagating attention information through the network for deep interaction. Without the self-attention, the accuracy degrades to 84.8%, indicating that the self-attention mechanism is effective in capturing global context information to augment the attention-aware semantic representation. We also verify the effect of the BiLSTM fusion: without it, the accuracy degrades to 83.2%, showing that the contextual information gathered by the BiLSTM fusion is important for interaction between the two sentences.

What is Learned by Attention?
We further investigate the results of the multi-layered cross-sentence attention and the self-attention, and visualize them in Figure 2. This is an instance from the test set of the SciTail dataset: {P: all living cells have a plasma membrane that encloses their contents. Q: all types of cells are enclosed by a membrane. Label y: entailment.}. The results are produced by OSOA-DFN with 3 original semantics-oriented cross-sentence attentions P(t)→Q and 1 self-attention. We visualize the attention matrix of each layer to show the dynamic attention changes.
From the results, we observe that the first attention layer may produce erroneous alignments: the premise word "encloses" is incorrectly aligned with the hypothesis word "all". In the second attention layer, the alignment quality improves dramatically, with "encloses" correctly aligned to "enclosed", showing that the second attention layer effectively revises the errors of the first. In the second and third attention layers, the attention gradually tends to capture phrase-level alignments, such as "that encloses their contents" with "enclosed", and "cells have a plasma membrane" with "membrane". Meanwhile, as the interaction increases, the higher attention layers also tend to capture new alignment information from the other sentence that is not captured in the lower attention layers.
In the self-attention layer, we observe that the phrase "plasma membrane that encloses their contents" is strongly aligned to the phrase "living cells". This layer captures long-distance semantic dependency within the sentence. The visualization of attention further shows that our proposed model is capable of capturing alignment information from two sentences for better semantic matching.

Related Works
Recently, deep neural network models have achieved promising results in modeling sentence matching. A standard practice is to encode each sentence as a vector with a neural network (Bowman et al., 2015;Mou et al., 2015;Tan et al., 2015), and then the relation is decided based on the two sentence vectors. This kind of framework ignores the interaction between two sentences.
Most recent works (Wang and Jiang, 2016a; Chen et al., 2017; Duan et al., 2018) employ attention mechanisms to model the interaction between two sentences. The attention-based framework matches two sentences at the word level. (Wang and Jiang, 2016b) design a specific LSTM called matching-LSTM that performs word-by-word matching of the hypothesis with the premise. Furthermore, (Wang et al., 2017) and (Chen et al., 2017) propose a new framework that models the relationship between two sentences by performing the matching in two directions. To improve the attention-based framework, (Duan et al., 2018) propose an attention-fused deep matching network (AF-DMN), based on a multi-layered attention mechanism, and show that multiple stacked attention layers can improve matching performance. Besides cross-sentence attention, the self-attention mechanism has been proposed to address the limitations of RNN models on long-term dependencies; it aligns a sentence with itself and has been used in a variety of tasks (Lin et al., 2017; Duan et al., 2018).
Our proposed OSOA-DFN conducts original semantics-oriented cross-sentence attention to model the matching. We design a deep fusion to augment the propagation of attention information, and finally introduce a self-attention mechanism to capture global context and enhance the semantic representation. Compared to AF-DMN (Duan et al., 2018), we use only one self-attention layer instead of multiple layers, which reduces model complexity while achieving higher accuracy.

Conclusions and Future Work
In this paper, we propose an original semantics-oriented attention and deep fusion network (OSOA-DFN) for sentence matching. It jointly leverages original semantics-oriented cross-sentence attention, deep fusion, and a self-attention mechanism. We compare our model with previous models on two sentence matching tasks: natural language inference and paraphrase identification. Experimental results show that OSOA-DFN models sentence matching more precisely and significantly improves performance.
In the future, we will further investigate the effect of network depth on sentence matching and explore introducing external knowledge, such as the pre-trained language model BERT (Devlin et al., 2018) and the paraphrase database (Ganitkevitch et al., 2013), to help learn more accurate and robust sentence representations.