Inter-Weighted Alignment Network for Sentence Pair Modeling

Sentence pair modeling is a crucial problem in natural language processing. In this paper, we propose a model that measures the similarity of a sentence pair by focusing on interaction information. We utilize a word-level similarity matrix to discover fine-grained alignments between the two sentences. It should be emphasized that each word in a sentence has a different importance from the perspective of semantic composition, so we exploit two novel and efficient strategies to explicitly calculate a weight for each word. Although the proposed model uses only a sequential LSTM for sentence modeling, without any external resources such as syntactic parse trees or additional lexicon features, experimental results show that it achieves state-of-the-art performance on three datasets covering two tasks.


Introduction
Given two sentences S and T, sentence pair modeling (SPM) is a fundamental task whose applications include question answering (Lin, 2007), natural language inference (Bowman et al., 2015), paraphrase identification (Socher et al., 2011a), sentence completion, and so on. In general, each of the two sentences is first mapped to a representation, and a model is then designed to determine the relation between them. Traditional methods map sentences with lexicon features such as Bag-of-Words (BOW). As is well known, feature design and selection are time-consuming, and high-dimensional features may suffer from sparsity because of linguistic variation. Recently, deep learning techniques have been applied to develop end-to-end models for NLP tasks, such as sentence modeling (Socher et al., 2011b; Kim, 2014), relation classification (Socher et al., 2012) and machine translation (Sutskever et al., 2014). These works show that deep learning models can be comparable with models based on hand-crafted features and often outperform them.
Existing DNN models are based on pre-trained word embeddings, which map each word to a low-dimensional vector, and compose word embeddings to represent sentences. Some models are developed directly from sentence models: they obtain a single vector representation for each sentence separately and then determine the relation based on the two vectors (Huang et al., 2013; Qiu and Huang, 2015; Palangi et al., 2016). Because of the absence of interaction, these models cannot achieve state-of-the-art performance.
Inspired by the attention mechanism in computer vision and machine translation, some elaborate models that take interaction information into consideration have been proposed (Rocktäschel et al., 2016; Wang and Jiang, 2016). Meanwhile, to grasp fine-grained information for semantic similarity, some prior works first compute a word-level similarity matrix from the word representations and then utilize multiple convolution layers to extract features from the similarity matrix, from the perspective of image recognition.
In this paper, we focus on solving the SPM problem by measuring semantic similarity between two sentences. We propose a new deep learning model based on two facts that previous works have often neglected. As we know, from a semantic point of view, each word in a sentence is of different importance; when calculating a sentence representation, we should endow each word with a weight indicating its importance. Take the following sentences as an example:

A: a man with a red helmet is riding a motorbike along a roadway.
B: a man is riding a motorbike along a roadway.
C: a man with a red helmet is riding a bicycle along a roadway.

Sentence A is more similar to sentence B than to sentence C, while a conventional model will probably reach the opposite conclusion: the phrase "with a red helmet" biases the meaning of A toward C, while the difference between "motorbike" and "bicycle" is not large enough. If the model can recognize that the phrase "with a red helmet" has little effect on semantic composition, this mistake can be avoided. Since we analyse a pair of sentences, the weights should depend not only on the sentence itself but also on its partner. From this observation, we propose a novel inter-weighted layer to measure the importance of each word.
On the other hand, the more similar two sentences are, the more probably we can align each word of sentence S with several words of sentence T, and vice versa. On account of the variety of expression, the positions and lengths of two aligned parts are very likely to differ, so we apply a soft-alignment mechanism and build an effective alignment layer.
In summary, our contributions are as follows:

1. We propose an Inter-Weighted Alignment Network (IWAN) for SPM, which builds an alignment layer to compute a similarity score according to the degree of alignment.

2. Considering that each word in a sentence has a different importance, we argue that an inter-weighted layer evaluating the weight of each word is crucial to semantic composition. We propose two strategies for calculating these weights, and experimental results demonstrate their effectiveness.

3. Experimental results on the semantic relatedness benchmark SICK and two answer selection datasets show that the proposed model achieves state-of-the-art performance without any external information.
Related Work

Sentence Models
For sentence modeling, RNNs (Elman, 1990; Mikolov et al., 2010) and CNNs (Kim, 2014) are both powerful and widely used. An RNN models a sentence sequentially by recurrently updating a hidden state that represents the context. As sentence length grows, RNNs suffer from the vanishing gradient problem; gated mechanisms such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) were introduced to address it. RecNNs exploit syntactic information and model sentences under a tree structure; gated mechanisms can also improve the performance of RecNNs (Tai et al., 2015). CNNs can extract and combine important local context while modeling sentences hierarchically (Kim, 2014; Kalchbrenner et al., 2014). All of the above models can be adapted to SPM by modeling the two sentences separately.

Attentive Models
Hermann et al. (2015) first introduced the attention mechanism into question answering under an RNN architecture. Rocktäschel et al. (2016) applied a similar model to natural language inference, attending over the premise conditioned on the hypothesis. Other work combines the attention mechanism with a tree-structured RecNN encoder. Some prior works (Wang et al., 2016b; Parikh et al., 2016; Wang et al., 2017) compute a soft-alignment representation for each word attentively with word-level similarity and then compose the alignment representations to determine the relation. Our model also falls under this framework; however, we focus on explicitly calculating weights for each word to obtain a more reasonable semantic composition. Another line of work adopts a CNN on the word-level similarity matrix to extract fine-grained matching patterns from different text granularities, or uses a similar architecture with a 19-layer CNN in order to make full use of its power. Yin and Schütze (2015) propose a hierarchical architecture to model representations of different granularities and compute several similarity matrices for interaction.

Proposed Model
Given two sentences S and T, we aim to calculate a score measuring their similarity. Figure 1 shows the architecture of the IWAN model. To learn representations with context information, we first use a bidirectional LSTM sentence model that takes word embeddings as inputs and produces a context-aware representation for each position (Sec. 3.1). The context-aware representations are used to compute the word-level similarity matrix (Sec. 3.2). Inspired by the attention mechanism, we exploit soft-alignment to find, for each position in one sentence, its semantic counterpart in the other, computing a weighted sum of one sentence's representations as the alignment representation for each position of the other with an alignment layer (Sec. 3.3). Meanwhile, taking the context-aware representations of S and T as inputs, we apply an inter-weighted layer to compute a weight for each position in S and T. We argue that this weight indicates the importance in semantic interaction, and that a weighted summation of the representations at each position is more interpretable than other composition methods, including max or average pooling and an LSTM layer. We propose two strategies for computing these weights (Sec. 3.4). The weighted vectors are fed to fully connected layers, and a softmax layer gives the final prediction (Sec. 3.5).
As Figure 1 illustrates, our model is symmetric in S and T. For simplicity, we therefore describe only the left part of the IWAN model, which mainly models S. The right part is exactly the same except that the roles of S and T are exchanged.

BiLSTM Layer
With pre-trained d-dimensional word embeddings, we can obtain sentence matrices S_e = [s_e^1, ..., s_e^m] and T_e = [t_e^1, ..., t_e^n], where s_e^i ∈ R^d is the embedding of the i-th word in sentence S, and m and n are the lengths of S and T respectively. In order to capture contextual information, we run a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) on the two matrices. Let the hidden layer dimension of the LSTM be u. Given the word embedding x_t at time step t, the previous hidden vector h_{t−1} and cell state c_{t−1}, the LSTM recurrently computes h_t and c_t as follows:

i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
f_t = σ(W_f x_t + V_f h_{t−1} + b_f)
o_t = σ(W_o x_t + V_o h_{t−1} + b_o)
c̃_t = ϕ(W_c x_t + V_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ ϕ(c_t)

where all W ∈ R^{u×d}, V ∈ R^{u×u} and b ∈ R^u. σ is the sigmoid function and ϕ is the tanh function; ⊙ denotes element-wise multiplication of two vectors. The input gates i, forget gates f and output gates o control the information flow self-adaptively, and the cell state c_t can memorize long-distance information. h_t is regarded as the representation of time step t.
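As a rough illustration (not the paper's implementation), one LSTM step with the gate equations above can be sketched in plain Python with scalar weights standing in for the matrices W, V and biases b:

```python
import math

# Toy sketch of one LSTM step: scalar parameters replace the matrices
# W, V and biases b; all values here are illustrative only.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, V, b):
    # W, V, b each hold parameters for the i, f, o gates and candidate c.
    i = sigmoid(W["i"] * x_t + V["i"] * h_prev + b["i"])
    f = sigmoid(W["f"] * x_t + V["f"] * h_prev + b["f"])
    o = sigmoid(W["o"] * x_t + V["o"] * h_prev + b["o"])
    c_tilde = math.tanh(W["c"] * x_t + V["c"] * h_prev + b["c"])
    c_t = f * c_prev + i * c_tilde      # gated cell-state update
    h_t = o * math.tanh(c_t)            # hidden state for this time step
    return h_t, c_t

W = {"i": 1.0, "f": 1.0, "o": 1.0, "c": 1.0}
V = {"i": 0.5, "f": 0.5, "o": 0.5, "c": 0.5}
b = {"i": 0.0, "f": 0.0, "o": 0.0, "c": 0.0}
h, c = lstm_step(1.0, 0.0, 0.0, W, V, b)
```

Running this step repeatedly over a sequence, once left-to-right and once right-to-left, yields the bidirectional hidden states used in the following sections.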
We feed S_e and T_e separately into a parameter-shared LSTM sentence model. Running an LSTM on S_e from left to right gives the forward hidden vectors S_fh = [s_fh^1, ..., s_fh^m]; for the bidirectional LSTM, we also run another LSTM backward and get S_bh = [s_bh^1, ..., s_bh^m]. We then concatenate them into one vector representation per position, so after the bidirectional LSTM layer we obtain S_h = [s_h^1, ..., s_h^m], where s_h^i = [s_fh^i; s_bh^i] ∈ R^{2u}.

Word Level Similarity Matrix
As mentioned above, the word-level similarity matrix is crucial to making use of the interaction information. Prior works, including Wang et al. (2016b), compute the similarity matrix between word embeddings. We have argued that word embeddings cannot express word meaning in context. From the view of an RNN, s_fh^i contains the most semantic information about the i-th word in S and less about the leftmost words, while s_bh^i also contains the most semantic information about the i-th word and less about the rightmost words. Therefore, the hidden vector of the BiLSTM keeps most of the information of the corresponding word while integrating context information, and computing the similarity matrix between BiLSTM hidden vectors is expected to improve the interaction results. We regard the inner product of two vectors as their similarity. For the similarity matrix M, its element M_ij indicates the similarity between s_h^i and t_h^j:

M_ij = (s_h^i)^T t_h^j
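The similarity matrix can be sketched as follows; this is a minimal illustration with toy 2-dimensional hidden states, not the paper's code:

```python
# Sketch of the word-level similarity matrix: M[i][j] is the inner
# product of the i-th hidden state of S and the j-th hidden state of T.
# All vectors and dimensions are toy values.

def similarity_matrix(S_h, T_h):
    return [[sum(s_k * t_k for s_k, t_k in zip(s, t)) for t in T_h]
            for s in S_h]

# Toy 2-dimensional hidden states for a 3-word S and a 2-word T.
S_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T_h = [[1.0, 0.0], [0.5, 0.5]]
M = similarity_matrix(S_h, T_h)
# M is a 3x2 matrix; e.g. M[0][0] = 1.0 since the two vectors coincide.
```

Each row of M is later normalized into attention weights over T's positions.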

Alignment Layer
We design the alignment layer based on an intuitive idea: the more similar S and T are, the more probably we can find a semantic counterpart in T for each part of S, and vice versa. To some degree, people likewise find semantic correspondences between two sentences when evaluating their similarity. Prior works are inspired by a similar intuition but use deep CNNs to recognize the alignment patterns implicitly. In contrast, for each sentence we explicitly calculate an alignment representation and an alignment residual, which we believe are good indicators of sentence pair similarity. To calculate the alignment representation, we apply the attention mechanism (Bahdanau et al., 2014) to conduct a soft alignment. The original attention mechanism obtains the alignment weights from an extra fully connected layer, whereas we believe the inner product represents the semantic relatedness adequately. Therefore, we treat the i-th row of M as the similarities between the i-th position of S and each position in T and normalize it as follows:

a_ij = exp(M_ij) / ∑_{k=1}^{n} exp(M_ik),   s_a^i = ∑_{j=1}^{n} a_ij t_h^j

where s_a^i is the alignment representation of the i-th position of S. For the T counterpart, the alignment representation is T_a = [t_a^1, ..., t_a^n]. In order to measure the gap between the alignment representation and the original representation, a direct strategy is to compute the absolute value of their difference: s_r^i = |s_h^i − s_a^i|. We call S_r = [s_r^1, ..., s_r^m] the alignment residual, which is used as an alignment feature for subsequent processing.
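A minimal sketch of this soft alignment, again with toy values rather than the paper's setup:

```python
import math

# Sketch of the alignment layer: each row of M is softmax-normalized
# into attention weights over T's positions, the weighted sum of T's
# hidden states gives the alignment representation s_a^i, and the
# residual is |s_h^i - s_a^i|. All numbers are toy values.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def align(S_h, T_h, M):
    S_a, S_r = [], []
    for i, s in enumerate(S_h):
        a = softmax(M[i])                       # attention over T's positions
        s_a = [sum(a[j] * T_h[j][k] for j in range(len(T_h)))
               for k in range(len(T_h[0]))]     # alignment representation
        S_a.append(s_a)
        S_r.append([abs(s[k] - s_a[k]) for k in range(len(s))])  # residual
    return S_a, S_r

S_h = [[1.0, 0.0], [0.0, 1.0]]
T_h = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0]]
S_a, S_r = align(S_h, T_h, M)
# Each s_a^i is pulled toward the T position most similar to s_h^i.
```

When the two sentences match closely, the residuals shrink toward zero, which is exactly what the layer is meant to signal.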
We also utilize an orthogonal decomposition strategy, first proposed by Wang et al. (2016b), which decomposes s_h^i into a parallel component s_p^i and an orthogonal component s_o^i with respect to its alignment representation s_a^i, giving the features S_p and S_o.
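The decomposition can be sketched as below. This formulation, a projection onto the alignment representation, is our reading of the strategy of Wang et al. (2016b), not the paper's code:

```python
# Sketch of the orthogonal decomposition: each hidden state s_h^i is
# split, relative to its alignment representation s_a^i, into a parallel
# component s_p^i (the projection onto s_a^i, the "similar" part) and an
# orthogonal component s_o^i (the "dissimilar" part). Toy values only.

def decompose(s_h, s_a, eps=1e-12):
    dot_ha = sum(h * a for h, a in zip(s_h, s_a))
    dot_aa = sum(a * a for a in s_a)
    scale = dot_ha / (dot_aa + eps)             # projection coefficient
    s_p = [scale * a for a in s_a]              # parallel component
    s_o = [h - p for h, p in zip(s_h, s_p)]     # orthogonal component
    return s_p, s_o

s_p, s_o = decompose([3.0, 4.0], [1.0, 0.0])
# s_p ≈ [3.0, 0.0] and s_o ≈ [0.0, 4.0]: the part along s_a and the rest.
```

By construction s_p and s_o are (numerically) orthogonal, so the two components carry complementary information about the pair.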

Inter-Weighted Layer
Inter-Attention Layer

Lin et al. (2017) first proposed a self-attention sentence model that explicitly computes a weight for each word and uses the weighted summation of word representations as the sentence embedding. Inspired by this work, we apply a fully connected neural network to measure the importance of every word to the semantic interaction. We extend the self-attention model to an inter-attention layer in order to compute weights that incorporate interaction information, which benefits the composition of the alignment representations. As the name suggests, the weights of S depend not only on S but also on T, and the parameters of the inter-attention layer are shared between S and T. Formally, taking S_h and T_h as inputs, the inter-attention layer outputs a vector w_s of size m for S:

w_s = softmax(w_2 ϕ(W_1 [S_h; t_avg ⊗ e_m]))

where t_avg = (1/n) ∑_{k=1}^{n} t_h^k and S_h ∈ R^{2u×m}. We use the average of {t_h^k} as the representation of T; we also tried replacing the average operator with a self-attention layer (Lin et al., 2017) but obtained worse performance. e_m is a vector of 1s of size m, and ⊗ denotes the outer product, so the concatenated matrix fed into the 2-layer neural network contains pairwise information. The parameter W_1 ∈ R^{s×4u} projects the inputs into a hidden layer with s units. The output layer is parameterized by a vector w_2 of size s, and a softmax operator ensures that the elements of the output sum to 1. We then use w_s to sum up S_r, S_p and S_o weightedly across the position dimension:

s_wr = ∑_{i=1}^{m} w_s^i s_r^i,   s_wp = ∑_{i=1}^{m} w_s^i s_p^i,   s_wo = ∑_{i=1}^{m} w_s^i s_o^i

We obtain t_wr, t_wp and t_wo in the same way, and call these inter-features for the final prediction.
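A toy sketch of the inter-attention weighting, assuming tanh as the hidden activation; the layer sizes and parameter values are illustrative, not the trained ones:

```python
import math

# Sketch of the inter-attention layer: each position of S is
# concatenated with the broadcast average of T's hidden states, passed
# through a small 2-layer network, and the softmax over positions
# gives the weight vector w_s. Parameters here are toy values.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def inter_attention(S_h, T_h, W1, w2):
    n, dim = len(T_h), len(T_h[0])
    t_avg = [sum(t[k] for t in T_h) / n for k in range(dim)]
    scores = []
    for s in S_h:
        x = s + t_avg                            # [s_h^i ; t_avg]
        hidden = [math.tanh(sum(wr[k] * x[k] for k in range(len(x))))
                  for wr in W1]                  # hidden layer of size s
        scores.append(sum(w2[j] * hidden[j] for j in range(len(w2))))
    return softmax(scores)                       # weights over S's positions

# Toy setup: 2-dimensional hidden states, hidden layer of size 2.
S_h = [[1.0, 0.0], [0.0, 1.0]]
T_h = [[1.0, 0.0]]
W1 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
w2 = [1.0, -1.0]
w_s = inter_attention(S_h, T_h, W1, w2)
```

Because t_avg enters every score, the weight a word receives depends on the partner sentence, which is the point of the layer.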

Inter-Skip Layer
We also explore another novel strategy to compute w_s, based on the intuition that if the i-th word in S contributes little to semantic composition, we will obtain a similar representation s_skip^i if we feed all word embeddings except s_e^i sequentially into the BiLSTM. Unfortunately, the O(m^2) complexity of running the BiLSTM model m times is too high, so we exploit an approximate method to compute {s_skip^i}. From {s_skip^i} and the BiLSTM hidden representation of T, we then compute a skip feature sf_skip^i for each position; Figure 2 illustrates how sf_skip^i is computed. We believe the difference between s_skip^i and s_h^i approximately reflects the contribution the i-th word makes to semantic composition. On the one hand, if the difference is small or even close to zero, the importance of the corresponding word should be small. On the other hand, if the difference (a vector) is not similar to the representation of T, the corresponding word is probably of less importance in measuring semantic similarity. For these two reasons, we consider sf_skip = [sf_skip^1, ..., sf_skip^m] a good feature for measuring word importance. The computation of w_s from sf_skip mirrors that of the inter-attention layer, and the w_s output by the inter-skip layer yields the inter-features in the same way.
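The underlying leave-one-out intuition, in its naive O(m^2) form that the paper approximates, can be made concrete with a toy additive "encoder" standing in for the BiLSTM (our illustration, not the paper's method):

```python
# Illustrative sketch of the inter-skip intuition: re-encode S with the
# i-th word removed and compare the result with the full encoding. The
# additive "encoder" below is a stand-in for the BiLSTM, used purely to
# make the leave-one-out idea concrete.

def encode(words):
    # Stand-in for the BiLSTM: the element-wise sum of embeddings.
    dim = len(words[0])
    return [sum(w[k] for w in words) for k in range(dim)]

def skip_differences(S_e):
    s_h = encode(S_e)
    diffs = []
    for i in range(len(S_e)):
        s_skip = encode(S_e[:i] + S_e[i + 1:])   # leave word i out
        diffs.append(sum(abs(s_h[k] - s_skip[k]) for k in range(len(s_h))))
    return diffs  # larger value -> word i contributes more to the encoding

S_e = [[1.0, 0.0], [0.0, 0.1], [2.0, 2.0]]
d = skip_differences(S_e)
# The third "word" changes the encoding most, so it gets the largest score.
```

The paper's approximation avoids this m-fold re-encoding while keeping the same signal: how much the representation moves when a word is skipped.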

Output Layer
For richer information, we combine the alignment information with sentence embeddings of S and T for the final prediction. We run the simple but effective self-attention model (Lin et al., 2017) on S_h to obtain its embedding s_wh:

w'_s = softmax(w'_2 ϕ(W'_1 S_h)),   s_wh = ∑_{i=1}^{m} w'_s^i s_h^i

where W'_1 and w'_2 are trainable. We compute s_wh and t_wh with a parameter-shared self-attention layer, which is similar to the inter-attention layer except for its inputs.
Following Tai et al. (2015), we compute the element-wise product h_× = s_wh ⊙ t_wh and the absolute difference h_+ = |s_wh − t_wh| as self-features. If we use the direct strategy, we combine the features as follows:

h_di = [h_×; h_+; s_wr; t_wr]

Table 1: Performances of our model with different strategies in the alignment layer on three datasets.
If we use the orthogonal decomposition strategy, we combine the features as follows:

h_od = [h_×; h_+; s_wp; s_wo; t_wp; t_wo]

Following previous works, the sentence pair modeling problem can be treated as a classification task, so we finally calculate a probability distribution with a 2-layer neural network:

p̂_θ = softmax(W^(2) ReLU(W^(1) h + b^(1)) + b^(2))

where h can be h_di or h_od, the hidden size is l, and we use rectified linear units (ReLU) as the activation function.
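The output layer under the orthogonal decomposition strategy can be sketched as follows; the shapes, parameters and the exact concatenation order are toy assumptions for illustration:

```python
import math

# Sketch of the output layer: self-features from the two sentence
# embeddings plus the inter-features are concatenated and fed to a
# 2-layer network with a softmax output. All values are toy values.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def combine_od(s_wh, t_wh, inter_features):
    h_mul = [a * b for a, b in zip(s_wh, t_wh)]        # element-wise product
    h_abs = [abs(a - b) for a, b in zip(s_wh, t_wh)]   # absolute difference
    h = h_mul + h_abs
    for f in inter_features:        # e.g. s_wp, s_wo, t_wp, t_wo
        h = h + f
    return h

def predict(h, W1, b1, W2, b2):
    hidden = [max(0.0, sum(w * x for w, x in zip(row, h)) + b)
              for row, b in zip(W1, b1)]               # ReLU hidden layer
    logits = [sum(w * x for w, x in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return softmax(logits)

s_wh, t_wh = [1.0, 2.0], [1.0, 0.0]
h = combine_od(s_wh, t_wh, [[0.5], [0.1], [0.4], [0.2]])
p = predict(h, [[0.1] * len(h)], [0.0], [[1.0], [-1.0]], [0.0, 0.0])
```

The softmax output p is the probability distribution used either for the relatedness classes on SICK or the yes/no decision in answer selection.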

Dataset and Evaluation Metric
To evaluate the proposed model, we conduct experiments on two tasks: semantic relatedness and answer selection. For the semantic relatedness task, we use the Sentences Involving Compositional Knowledge (SICK) dataset (Marelli et al., 2014), which consists of 9927 sentence pairs in a 4500/500/4927 train/dev/test split. The sentences are derived from existing image and video descriptions, and each sentence pair has a relatedness score y ∈ [1, 5], where a larger score indicates greater similarity between the two sentences. As the goal of this task is to calculate sentence pair similarity, we can evaluate our model on SICK directly. Following previous works, we use Pearson's correlation r, Spearman's correlation ρ and mean squared error (MSE) as evaluation metrics.
For the answer selection task, we experiment on two datasets: TrecQA and WikiQA. The TrecQA dataset (Wang et al., 2007) from the Text Retrieval Conferences has been widely used for the answer selection task over the past decade. The original TrecQA training set consists of 1,229 questions with 53,417 question-answer pairs, with 82 questions and 1,148 pairs in the development set and 100 questions with 1,517 pairs in the test set. Recent works (dos Santos et al., 2016; Rao et al., 2016; Wang et al., 2016b) removed questions in the development and test sets with no answers or with only positive/negative answers, leaving 65 questions with 1,117 pairs in the Clean version development set and 68 questions with 1,442 pairs in the Clean version test set. Rao et al. (2016) showed that performances on the Original and Clean versions of TrecQA are not comparable. Therefore, for a fair comparison, we only report results on Clean version TrecQA, as posted on the Wiki of the Association for Computational Linguistics. The open-domain question-answering dataset WikiQA (Yang et al., 2015) is constructed from real Bing queries and Wikipedia. We follow Yang et al. (2015) in removing all questions with no correct candidate answers. The filtered WikiQA has 873/126/243 questions and 8627/1130/2351 question-answer pairs in the train/dev/test split. To adapt our model to this task, we use semantic similarity to measure the probability of a match between a question and a candidate answer. We evaluate models by mean average precision (MAP) and mean reciprocal rank (MRR).

Training Details
For the experiments on SICK, we follow Tai et al. (2015) in transforming the relatedness score y into a sparse target distribution p over the five integer classes:

p_i = y − ⌊y⌋ if i = ⌊y⌋ + 1;   p_i = ⌊y⌋ − y + 1 if i = ⌊y⌋;   p_i = 0 otherwise

The training objective is to minimize the KL-divergence loss between p and p̂_θ:

J(θ) = (1/|D|) ∑_{k=1}^{|D|} KL(p^(k) ‖ p̂_θ^(k))

where |D| is the number of training examples.
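The score-to-distribution transformation, as we understand it from Tai et al. (2015), splits the mass of y between its two neighbouring integer classes and can be sketched as:

```python
import math

# Sketch of the sparse target distribution of Tai et al. (2015):
# a relatedness score y in [1, 5] becomes a distribution over the
# five integer classes, with the mass of a fractional score split
# between its two neighbouring integers.

def to_sparse_target(y, num_classes=5):
    p = [0.0] * num_classes
    floor_y = int(math.floor(y))
    if floor_y == y:                 # integer score: all mass on one class
        p[floor_y - 1] = 1.0
    else:
        p[floor_y - 1] = floor_y + 1 - y
        p[floor_y] = y - floor_y
    return p

p = to_sparse_target(3.2)
# p ≈ [0.0, 0.0, 0.8, 0.2, 0.0]: the score 3.2 splits between classes 3 and 4.
```

The KL-divergence loss is then computed between this target and the model's softmax output.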
We regard the answer selection problem as a binary ("yes"/"no") classification task, and the training objective is to minimize the negative log-likelihood:

J(θ) = −(1/|D|) ∑_{k=1}^{|D|} log p̂_θ(y^(k) | x^(k))

where x^(k) represents a question-answer pair and y^(k) indicates whether the candidate answer is correct for the question. At test time, we sort the candidate answers for the same question in descending order of the probability of the "yes" category and calculate MAP and MRR.
In all experiments, we use 300-dimensional GloVe word embeddings (Pennington et al., 2014) and fix the embeddings during training. The LSTM hidden size u is set to 150. The hidden sizes of the inter-attention and self-attention layers (s) and of the fully connected network (l) are both set to 50. The L2 regularization strength is set to 3 × 10^−5. We train the model with the Adagrad (Duchi et al., 2011) optimization algorithm with a learning rate of 0.05. The minibatch size is always 25. We exploit an early stopping strategy according to MSE on the development set for SICK and MAP on the development sets for TrecQA and WikiQA.

Table 2: Test results on SICK. The symbol * indicates models with pre-training. The symbol • indicates models with a data augmentation strategy.

Table 3: Test results on Clean version TrecQA (MAP / MRR):

Wang and Ittycheriah (2015): 0.746 / 0.820
QA-LSTM (Tan et al., 2015): 0.728 / 0.832
Att-pooling (dos Santos et al., 2016): 0.753 / 0.851
LDC (Wang et al., 2016b): 0.771 / 0.845
MPCNN (He et al., 2015): 0.777 / 0.836
PWIM: 0.738 / 0.827
NCE-CNN (Rao et al., 2016): 0.801 / 0.877
BiMPM (Wang et al., 2017): 0.802 / 0.875
IWAN-att (proposed): 0.822 / 0.889
IWAN-skip (proposed): 0.801 / 0.861
The orthogonal decomposition (OD) strategy has a superior performance to the direct (DI) strategy on all datasets; the comparison results are posted in Table 1. In the following experiments, we always choose the OD strategy in the alignment layer. Table 2 shows the performances of our model and compared models on the SICK dataset. IWAN-att and IWAN-skip denote our models using the inter-attention layer and the inter-skip layer respectively; IWAN-skip outperforms IWAN-att in all metrics by a small margin. The traditional feature-engineering models in the first group perform much worse than the deep learning models. MaLSTM (Mueller and Thyagarajan, 2016) benefits from a data augmentation strategy using WordNet information and a pre-training process; the ablation experiments of Mueller and Thyagarajan (2016) show a 0.04 degradation in Pearson's r without the data augmentation strategy. It is therefore unfair to compare with this model directly, but our models achieve comparable performance with it. Our models both outperform all other deep learning models; in particular, IWAN-skip outperforms Attentive Tree-LSTM.

Semantic Relatedness
Answer Selection

We compare our model with several state-of-the-art models on Clean version TrecQA and WikiQA in Table 3 and Table 4 respectively. Both of our models achieve state-of-the-art performance on the two datasets. IWAN-att outperforms all previous works on TrecQA and improves the state of the art significantly. On WikiQA, IWAN-skip and IARNN (Wang et al., 2016a), which addresses the bias problem of the attention mechanism, beat all other models, though the latter is trained on an augmented dataset with negative sampling. Wang et al. (2016b) first proposed the orthogonal decomposition, but their LDC model computes the similarity matrix between word embeddings, which lack context information; IWAN-att outperforms it dramatically, by 0.02-0.05 in MAP and MRR on both datasets. The PWIM model is still competitive on WikiQA but performs worse on TrecQA. In contrast, our models both achieve state-of-the-art performances on all three datasets, which demonstrates their excellent generalization ability across datasets.

In the ablation study, we found a large decline when removing the BiLSTM layer, which confirms our conjecture that context information is useful. It is worth mentioning that prior work reports a degradation of 0.1225 in r from removing a BiLSTM, much larger than our 0.0387. Removing the inter-attention layer means we perform mean-pooling on the inter-features instead of a weighted summation; a 0.0075 degradation in r shows that importance weighting yields a significant improvement. If the weights depend only on single-sentence information, the performance still degrades. The last two settings show that both components from the orthogonal decomposition are informative; somewhat unexpectedly, the parallel component is almost as useful as the orthogonal component.

Visualization of Inter-Weighted Layer
In order to illustrate the effect of the inter-weighted layer in the proposed model, we take a sentence pair from the SICK test set as an example and display the weights output by the inter-attention layer for each word in Figure 3. The ground truth for this pair is 3.2, and the prediction given by the IWAN-att model is 3.507, which is much more accurate than the 4.356 given by the model without the inter-attention layer. The inter-attention layer gives very high weights, over 0.25 (while the average weight is about 0.14), to "sleeping" and "eating", which are the only difference between the two sentences; this difference is therefore attended to in subsequent processing. Meanwhile, the weights of the article "the" and the preposition "with", which matter less for semantic composition than content words, are much lower. This shows that the inter-weighted mechanism is reasonable and effective.

Conclusion
This work proposes a weighted alignment model for sentence pair modeling. We utilize an alignment layer to measure the similarity of sentence pairs according to their degree of alignment. Moreover, we propose an inter-weighted layer to measure the importance of different parts of the sentences; two strategies for this layer have been explored, and both are effective. The composition of alignment features benefits from the inter-weighted weights. Experimental results show that the proposed models achieve state-of-the-art performance on three datasets. In future work, we will improve the inter-weighted layer with more sophisticated modules and evaluate our model on other large-scale datasets.