Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement

Textual similarity measurement is a challenging problem, as it requires understanding the semantics of input sentences. Most previous neural network models use coarse-grained sentence modeling, which has difficulty capturing fine-grained word-level information for semantic comparisons. As an alternative, we propose to explicitly model pairwise word interactions and present a novel similarity focus mechanism to identify important correspondences for better similarity measurement. Our ideas are implemented in a novel neural network architecture that demonstrates state-of-the-art accuracy on three SemEval tasks and two answer selection tasks.


Introduction
Given two pieces of text, measuring their semantic textual similarity (STS) remains a fundamental problem in language research and lies at the core of many language processing tasks, including question answering (Lin, 2007), query ranking (Burges et al., 2005), and paraphrase generation (Xu, 2014).
Traditional NLP approaches, e.g., developing hand-crafted features, suffer from sparsity because of language ambiguity and the limited amount of annotated data available. Neural networks and distributed representations can alleviate such sparsity, so neural network-based models are widely used by recent systems for the STS problem (He et al., 2015; Tai et al., 2015; Yin and Schütze, 2015).
However, most previous neural network approaches are based on sentence modeling, which first maps each input sentence into a fixed-length vector and then performs comparisons on these representations. Despite its conceptual simplicity, researchers have raised concerns about this approach (Mooney, 2014): Will fine-grained word-level information, which is crucial for similarity measurement, get lost in the coarse-grained sentence representations? Is it really effective to "cram" whole sentence meanings into fixed-length vectors?
In contrast, we focus on capturing fine-grained word-level information directly. Our contribution is twofold. First, instead of using sentence modeling, we propose pairwise word interaction modeling that encourages explicit word context interactions across sentences. This is inspired by our own intuitions of how people recognize textual similarity: given two sentences $sent_1$ and $sent_2$, a careful reader might look for corresponding semantic units, which we operationalize in our pairwise word interaction modeling technique (Sec. 5). Second, based on the pairwise word interactions, we describe a novel similarity focus layer which helps the model selectively identify the word interactions that matter most for similarity measurement. Since not all words are created equal, words that contribute more deserve extra "focus" from the model (Sec. 6).
We conducted thorough evaluations on ten test sets from three SemEval STS competitions (Agirre et al., 2012; Marelli et al., 2014; Agirre et al., 2014) and two answer selection tasks (Yang et al., 2015; Wang et al., 2007). We outperform the recent multi-perspective convolutional neural networks of He et al. (2015) and demonstrate state-of-the-art accuracy on all five tasks. In addition, we conducted ablation studies and visualized our models to show the clear benefits of modeling pairwise word interactions for similarity measurement.

Related Work
Feature engineering was the dominant approach in most previous work; different types of sparse features were explored and found useful, for example n-gram overlap features at the word and character levels (Madnani et al., 2012; Wan et al., 2006), syntax features (Das and Smith, 2009; Xu et al., 2014), knowledge-based features using WordNet (Fellbaum, 1998; Fern and Stevenson, 2008), and word-alignment features (Sultan et al., 2014).
The recent shift from sparse feature engineering to neural network model engineering has significantly improved accuracy on STS datasets. Most previous work uses sentence modeling with a "Siamese" structure (Bromley et al., 1993). For example, Hu et al. (2014) used convolutional neural networks that combine hierarchical structures with layer-by-layer composition and pooling. Tai et al. (2015) and Zhu et al. (2015) concurrently proposed tree-structured long short-term memory networks, which recursively construct sentence representations following their syntactic trees. There are many other examples of neural network-based sentence modeling approaches for the STS problem (Yin and Schütze, 2015; Huang et al., 2013; Andrew et al., 2013; Weston et al., 2011; Socher et al., 2011; Zarrella et al., 2015).
Sentence modeling is coarse-grained by nature. Most recently, despite still using a sentence modeling approach, He et al. (2015) moved toward fine-grained representations by exploiting multiple perspectives of input sentences with different types of convolution filters and pooling, generating a "matrix" representation where rows and columns capture different aspects of the sentence; comparisons over local regions of the representation are then performed. He et al. (2015) achieve highly competitive accuracy, suggesting the usefulness of fine-grained information. However, these multiple perspectives are obtained at the cost of increased model complexity, resulting in slow model training. In this work, we take a different approach by focusing directly on pairwise word interaction modeling.

Model Overview
Figure 1 shows our end-to-end model with four major components:
1. Bidirectional Long Short-Term Memory Networks (Bi-LSTMs) (Graves et al., 2005; Graves et al., 2006) are used for context modeling of input sentences, which serves as the basis for all following components (Sec. 4).
2. A novel pairwise word interaction modeling technique encourages direct comparisons between word contexts across sentences (Sec. 5).
3. A novel similarity focus layer helps the model identify important pairwise word interactions across sentences (Sec. 6).
4. A 19-layer deep convolutional neural network (ConvNet) converts the similarity measurement problem into a pattern recognition problem for final classification (Sec. 7).
To the best of our knowledge, this is the first neural network model, a novel hybrid architecture combining Bi-LSTMs and a deep ConvNet, that uses a similarity focus mechanism with selective attention to important pairwise word interactions for the STS problem. Our approach only uses pretrained word embeddings, and unlike several previous neural network models (Yin and Schütze, 2015; Tai et al., 2015), we do not use sparse features, unsupervised model pretraining, syntactic parsers, or external resources like WordNet. We describe details of each component in the following sections.

Context Modeling
Different words occurring in similar semantic contexts of respective sentences have a higher chance to contribute to the similarity measurement. We therefore need word context modeling, which serves as a basis for all following components of this work.
LSTM (Hochreiter and Schmidhuber, 1997) is a special variant of Recurrent Neural Networks (Williams and Zipser, 1989). It can capture long-range dependencies and nonlinear dynamics between words, and has been successfully applied to many NLP tasks (Sutskever et al., 2014; Filippova et al., 2015). LSTM has a memory cell that can store information over a long history, as well as three gates that control the flow of information into and out of the memory cell. At time step $t$, given an input $x_t$, previous output $h_{t-1}$, input gate $i_t$, output gate $o_t$ and forget gate $f_t$, $LSTM(x_t, h_{t-1})$ outputs the hidden state $h_t$ based on the equations below:

$$
\begin{aligned}
i_t &= \sigma(W^i x_t + U^i h_{t-1} + b^i) \\
f_t &= \sigma(W^f x_t + U^f h_{t-1} + b^f) \\
o_t &= \sigma(W^o x_t + U^o h_{t-1} + b^o) \\
u_t &= \tanh(W^u x_t + U^u h_{t-1} + b^u) \\
c_t &= i_t \odot u_t + f_t \odot c_{t-1} \\
h_t &= o_t \odot \tanh(c_t) \\
LSTM(x_t, h_{t-1}) &= h_t
\end{aligned}
\tag{1}
$$

where $\sigma$ is the logistic sigmoid activation, $u_t$ is the candidate update, $c_t$ is the memory cell state, $\odot$ denotes element-wise multiplication, and $W^*$, $U^*$ and $b^*$ are learned weight matrices and biases. LSTMs are better suited than plain RNNs to context modeling, in that their memory cells and gating mechanisms mitigate the vanishing gradient problem in training.

We use bidirectional LSTMs (Bi-LSTMs) for context modeling in this work. Bi-LSTMs consist of two LSTMs that run in parallel in opposite directions: one (forward $LSTM^f$) on the input sequence and the other (backward $LSTM^b$) on the reverse of the sequence. At time step $t$, the Bi-LSTMs hidden state $h^{bi}_t$ is a concatenation of the hidden state $h^{for}_t$ of $LSTM^f$ and the hidden state $h^{back}_t$ of $LSTM^b$, representing the neighbor contexts of input $x_t$ in the sequence. We define the unpack operation below:

$$
h^{for}_t, \; h^{back}_t = unpack(h^{bi}_t)
$$

Context modeling with Bi-LSTMs allows all the following components to be built over word contexts, rather than over individual words.
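The context modeling step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`lstm_step`, `run_lstm`, `bi_lstm`, `unpack`, `init_params`), the stacked gate layout, and the random initialization are assumptions made only for illustration, and no training is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. Assumed layout: W (4d, m), U (4d, d), b (4d,),
    with the gates stacked in the order input, forget, output, candidate."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:d])            # input gate
    f = sigmoid(z[d:2 * d])        # forget gate
    o = sigmoid(z[2 * d:3 * d])    # output gate
    u = np.tanh(z[3 * d:4 * d])    # candidate update
    c = i * u + f * c_prev         # new memory cell
    h = o * np.tanh(c)             # new hidden state
    return h, c

def run_lstm(xs, params, d):
    """Run an LSTM over a sequence xs of shape (T, m); returns (T, d)."""
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, *params)
        states.append(h)
    return np.stack(states)

def bi_lstm(xs, fwd_params, bwd_params, d):
    """Concatenate forward states with time-aligned backward states."""
    h_for = run_lstm(xs, fwd_params, d)
    h_back = run_lstm(xs[::-1], bwd_params, d)[::-1]
    return np.concatenate([h_for, h_back], axis=1)

def unpack(h_bi):
    """Split Bi-LSTM states back into forward and backward halves."""
    d = h_bi.shape[-1] // 2
    return h_bi[..., :d], h_bi[..., d:]

def init_params(m, d, rng):
    """Hypothetical small random initialization, for illustration only."""
    return (rng.normal(scale=0.1, size=(4 * d, m)),
            rng.normal(scale=0.1, size=(4 * d, d)),
            np.zeros(4 * d))
```

With word embeddings of size `m` stacked into `xs`, `bi_lstm` returns one row of size `2d` per word, and `unpack` recovers the two directional states that the later components compare.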

Pairwise Word Interaction Modeling
From our own intuitions, given two sentences in an STS task, a careful human reader might compare words and phrases across the sentences to establish semantic correspondences and from these infer similarity. Our pairwise word interaction model is inspired by such behavior: whenever the next word of a sentence is read, the model compares it and its context against all words and their contexts in the other sentence. Figure 2 illustrates this model.
We first define a comparison unit for comparing two hidden states $h_1$ and $h_2$; it collects the values of three similarity functions, $\cos(h_1, h_2)$, $L2Euclid(h_1, h_2)$, and $DotProduct(h_1, h_2)$. Cosine distance (cos) measures the distance of two vectors by the angle between them, while L2 Euclidean distance (L2Euclid) and dot-product distance (DotProduct) measure magnitude differences. We use three similarity functions for richer measurement.
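Algorithm 1 provides details of the modeling process. Given the input $x^a_t \in sent_a$ at time step $t$, where $a \in \{1, 2\}$, its Bi-LSTMs hidden state $h^{bi}_{at}$ is the concatenation of the forward state $h^{for}_{at}$ and the backward state $h^{back}_{at}$. For each pair of time steps $(t, s)$, the comparison unit is applied to four pairs of vectors: 1) the Bi-LSTMs hidden states $h^{bi}_{1t}$ and $h^{bi}_{2s}$; 2) the forward hidden states $h^{for}_{1t}$ and $h^{for}_{2s}$; 3) the backward hidden states $h^{back}_{1t}$ and $h^{back}_{2s}$; and 4) the addition of forward and backward hidden states $h^{add}_{1t}$ and $h^{add}_{2s}$. The output of Algorithm 1 is a similarity cube simCube with size $\mathbb{R}^{13 \cdot |sent_1| \cdot |sent_2|}$, where $|sent_*|$ is the number of words in the sentence $sent_*$. The 13 values collected from each word pair $(s, t)$ are the 12 similarity distances (three similarity functions applied to each of the four pairs of vectors), plus one extra dimension for the padding indicator. Note that all word interactions are modeled over word contexts in Algorithm 1, rather than individual words.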
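The construction of the similarity cube can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the authors' Algorithm 1. The channel order (Bi-LSTM, forward, backward, added states, each with cosine, L2 and dot product) is an assumption; it was chosen so that the cosine and L2 values of the added states land on 1-based channels 10 and 11. The value 1 for the padding indicator and the helper names are assumptions as well.

```python
import numpy as np

def comparison_unit(h1, h2, eps=1e-8):
    """Cosine, L2 Euclidean distance, and dot product of two vectors."""
    cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2) + eps)
    l2 = np.linalg.norm(h1 - h2)
    dot = h1 @ h2
    return np.array([cos, l2, dot])

def build_sim_cube(h_bi_1, h_bi_2):
    """h_bi_a has shape (|sent_a|, 2d). Returns a (13, |sent_1|, |sent_2|) cube."""
    d = h_bi_1.shape[1] // 2

    def views(h):
        h_for, h_back = h[:, :d], h[:, d:]
        # Assumed order: concatenated, forward, backward, added states.
        return [h, h_for, h_back, h_for + h_back]

    v1, v2 = views(h_bi_1), views(h_bi_2)
    n1, n2 = len(h_bi_1), len(h_bi_2)
    cube = np.zeros((13, n1, n2))
    for t in range(n1):
        for s in range(n2):
            for k in range(4):
                cube[3 * k:3 * k + 3, t, s] = comparison_unit(v1[k][t], v2[k][s])
    # Assumption: the padding indicator is 1 for every real word pair.
    cube[12] = 1.0
    return cube
```

The double loop makes the pairwise nature explicit: every word context in one sentence is compared against every word context in the other, so the cost grows with the product of the sentence lengths.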
Our pairwise word interaction model shares similarities with recent popular neural attention models (Bahdanau et al., 2014; Rush et al., 2015). However, there are important differences: for example, we do not use attention weight vectors or weighted representations, which are the core of attention models. The other difference is that attention weights are typically interpreted as soft degrees with which the model attends to particular words; in contrast, our word interaction model directly utilizes multiple similarity metrics, and thus is more explicit.

Figure 3: The similarity focus layer helps identify important pairwise word interactions (in black dots) depending on their importance for similarity measurement. The example pairs the sentences "Cats Sit On the Mat" and "On the Mat There Sit Cats".

Similarity Focus Layer
Since not all words are created equal, important pairwise word interactions between the sentences (Sec. 5) that can better contribute to the similarity measurement deserve more model focus. We therefore develop a similarity focus layer which can identify important word interactions and increase their model weights correspondingly. This similarity focus layer is directly incorporated into our end-to-end model and is placed on top of the pairwise word interaction model, as in Figure 1. Figure 3 shows one example where each cell of the matrix represents a pairwise word interaction. The similarity focus layer introduces re-weightings to word interactions depending on their importance for similarity measurement. The ones tagged with black dots are considered important, and are given higher weights than those without.
Algorithm 2 shows the forward pass of the similarity focus layer. Its input is the similarity cube simCube (Sec. 5). Algorithm 2 is designed to incorporate two different aspects of similarity based on cosine (angular) and L2 (magnitude) similarity, and thus has two symmetric components: the first one is based on cosine similarity (Line 5 to Line 13), and the second one is based on L2 similarity (Line 15 to Line 23). We also aim for the similarity values of all found important word interactions to be maximized. To achieve this, we sort the similarity values in descending order (Line 5 for cosine, Line 15 for L2). Note that channels 10 and 11 of the simCube contain cosine and L2 values, respectively; the padding indicator is in Line 24.

Algorithm 2 Forward Pass: Similarity Focus Layer
1: Input: simCube ∈ R^(13·|sent1|·|sent2|)
2: Initialize: mask ∈ R^(13·|sent1|·|sent2|) to all 0.1
3: Initialize: s1tag ∈ R^(|sent1|) to all zeros
4: Initialize: s2tag ∈ R^(|sent2|) to all zeros
5: sortIndex1 = sort(simCube[10])
6: for each id = 1...|sent1| + |sent2| do
[...]
We start with the cosine part first, then L2. For each, we check word interaction candidates moving down the sorted list. Function calcPos is used to calculate the relative sentence positions $pos_{s*}$ in the simCube given one interaction pair. We follow the constraint that no word in either sentence should be tagged as important more than once. We increase the weights of important word interactions to 1 (in Line 11 based on cosine and Line 21 based on L2), while unimportant word interactions receive weights of 0.1 (in Line 2).
We use a mask matrix, mask, to hold the weight of each word interaction. The final output of the similarity focus layer is a focus-weighted similarity cube focusCube, which is the element-wise multiplication (Line 25) of the matrix mask and the input simCube.
The similarity focus layer is based on the following intuition: given each word in one sentence, we look for its semantically similar twin in the other sentence; if found then this word is considered important, otherwise it contributes to a semantic difference. Though technically different, this process shares conceptual similarity with finding translation equivalences in statistical machine translation (Al-Onaizan et al., 1999). The backward pass of the similarity focus layer is straightforward: we reuse the mask matrix as generated in the forward pass and apply the element-wise multiplication of mask and inflow gradients, then propagate the resulting gradients backward.
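The forward and backward passes described above can be sketched in NumPy. This is a reading of the prose, not a transcription of Algorithm 2, and it rests on several assumptions. The zero-based channel indices 9, 10 and 12 stand in for the cosine channel, the L2 channel and the padding indicator. The word tags are reset between the cosine and L2 passes. The padding channel is given weight 1. The function names are hypothetical.

```python
import numpy as np

def similarity_focus(sim_cube, cos_ch=9, l2_ch=10, pad_ch=12, low=0.1):
    """Return (focus_cube, mask) for a cube of shape (13, n1, n2)."""
    _, n1, n2 = sim_cube.shape
    mask = np.full(sim_cube.shape, low)            # unimportant pairs: 0.1
    for ch in (cos_ch, l2_ch):                     # cosine pass, then L2 pass
        s1_tag = np.zeros(n1, dtype=bool)          # assumed reset per pass
        s2_tag = np.zeros(n2, dtype=bool)
        # Candidate pairs in descending order of the channel's value.
        order = np.argsort(-sim_cube[ch], axis=None)
        for flat in order[: n1 + n2]:
            p1, p2 = np.unravel_index(flat, (n1, n2))
            # Each word may be tagged as important at most once.
            if not s1_tag[p1] and not s2_tag[p2]:
                s1_tag[p1] = s2_tag[p2] = True
                mask[:, p1, p2] = 1.0              # important pair: weight 1
    mask[pad_ch] = 1.0                             # assumed: padding unscaled
    return mask * sim_cube, mask

def similarity_focus_backward(grad_out, mask):
    """Backward pass: reuse the forward mask on the incoming gradients."""
    return grad_out * mask
```

Because the mask is fixed once the forward pass has run, the layer has no parameters of its own; it only rescales values on the way forward and gradients on the way back.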

Similarity Classification with Deep Convolutional Neural Networks
The focusCube contains focus-weighted fine-grained similarity information. In the final model component we use the focusCube to compute the final similarity score. If we treat the focusCube as an "image" with 13 channels, then semantic similarity measurement can be converted into a pattern recognition (image processing) problem, where we are looking for patterns of strong pairwise word interactions in the "image". The stronger the overall pairwise word interactions are, the higher the similarity of the sentence pair. Recent advances from successful systems at ImageNet competitions (Simonyan and Zisserman, 2014; Szegedy et al., 2015) show that the depth of a neural network is a critical component for achieving competitive performance. We therefore use a deep homogeneous architecture which has repetitive convolution and pooling layers.
Our network architecture (Table 1) is composed of spatial max pooling layers and spatial convolutional layers (Conv) with a small filter size of 3 × 3, stride 1, and padding 1. We adopt this filter size because it is the smallest one that captures the notions of left/right, up/down, and center; the padding and stride are used to preserve the spatial input resolution. We then use fully-connected layers followed by the final softmax layer for the output. After each spatial convolutional layer, a rectified linear unit (ReLU) non-linearity layer (Krizhevsky et al., 2012) is used.
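The claim that this stride and padding preserve the spatial resolution follows from the standard output-size formula for a convolution. As a short worked check, with $n_{in}$ the input width or height, $k = 3$ the filter size, $s = 1$ the stride, and $p = 1$ the padding (the symbol names are chosen here for illustration):

```latex
\[
\begin{aligned}
n_{out} &= \left\lfloor \frac{n_{in} + 2p - k}{s} \right\rfloor + 1 \\
        &= \left\lfloor \frac{n_{in} + 2 \cdot 1 - 3}{1} \right\rfloor + 1 \\
        &= n_{in}
\end{aligned}
\]
```

For example, a 32 × 32 input stays 32 × 32 after each such convolution, so only the pooling layers reduce the spatial size.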
The input to this deep ConvNet is the focusCube, which does not always have the same size because the lengths of input sentences vary. To address this, we use zero padding. For computational reasons we provide two configurations in Table 1, for length padding up to either 32 × 32 or 48 × 48. The only difference between the two configurations is the last pooling layer. If sentences are longer than the padding length limit, we only use the words up to the limit. In our experiments we found the 48 × 48 padding limit to be acceptable since most sentences in our datasets are only 1 to 30 words long.
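A minimal sketch of the padding and truncation step is given below. It is a simplification under stated assumptions: padding is applied directly to the cube rather than to the word sequences, every channel (including the padding indicator) is filled with zeros in the padded region, and the helper name `pad_or_truncate` is hypothetical.

```python
import numpy as np

def pad_or_truncate(cube, limit=48):
    """Fit a (channels, n1, n2) cube into a fixed (channels, limit, limit) grid.

    Positions beyond each sentence's length are zero; words beyond the
    limit are dropped.
    """
    n_ch, n1, n2 = cube.shape
    out = np.zeros((n_ch, limit, limit), dtype=cube.dtype)
    k1, k2 = min(n1, limit), min(n2, limit)
    out[:, :k1, :k2] = cube[:, :k1, :k2]
    return out
```

Calling `pad_or_truncate(cube, limit=32)` gives the smaller of the two configurations mentioned above.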

Experimental Setup
Datasets. We conducted five separate experiments on ten different datasets: three recent SemEval competitions and two answer selection tasks. Note that the answer selection task, which is to rank candidate answer sentences based on their similarity to the questions, is essentially the similarity measurement problem. The five experiments are as follows:
1. Sentences Involving Compositional Knowledge (SICK) is from Task 1 of the 2014 SemEval competition (Marelli et al., 2014) and consists of 9,927 annotated sentence pairs, with 4,500 for training, 500 as a development set, and 4,927 for testing.
[...]
4. WikiQA [...] Yang et al. (2015). We followed the same pre-processing steps as Yang et al. (2015), where questions with no correct candidate answer sentences are excluded and answer sentences are truncated to 40 tokens. The resulting dataset consists of 873 questions with 8,672 question-answer pairs in the training set, 126 questions with 1,130 pairs in the development set, and 243 questions with 2,351 pairs in the test set.
5. The TrecQA dataset (Wang et al., 2007) from the Text Retrieval Conferences has been widely used for the answer selection task during the past decade. To enable direct comparison with previous work, we used the same training, development, and test sets as released by Yao et al. (2013). The TrecQA data consists of 1,229 questions with 53,417 question-answer pairs in the TRAIN-ALL training set, 82 questions with 1,148 pairs in the development set, and 100 questions with 1,517 pairs in the test set.
Training. For experiments on SICK, MSRVID, and STS2014, the training objective is to minimize the KL-divergence loss between the ground truth $f$ and the predicted distribution $f_\theta$ with model weights $\theta$, averaged over the $n$ training examples. We used a hinge loss for the answer selection task on WikiQA and TrecQA data. The training objective is to minimize this loss, summed over examples $(x, y_{gold})$, where $y_{gold}$ is the ground truth label, the input $x$ is the pair of sentences $x = \{S_1, S_2\}$, $\theta$ is the model weight vector, and the function $f_\theta(x, y')$ is the output of our model for a label $y'$.
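The two objectives can be sketched in NumPy as follows. The exact loss equations are not reproduced in the text above, so both functions are assumptions consistent with the stated definitions: the KL term is averaged over examples, and the hinge loss sums margin violations over every label other than the gold one. The margin of 1 and the function names are illustrative choices, not values taken from the source.

```python
import numpy as np

def kl_loss(targets, preds, eps=1e-12):
    """Mean KL(f || f_theta) over n examples; each row is a distribution."""
    targets = np.asarray(targets, dtype=float)
    preds = np.asarray(preds, dtype=float)
    log_ratio = np.log((targets + eps) / (preds + eps))
    return float(np.mean(np.sum(targets * log_ratio, axis=1)))

def hinge_loss(scores, gold, margin=1.0):
    """Hinge loss for one example.

    scores[y] plays the role of f_theta(x, y); the loss sums
    max(0, margin + scores[y'] - scores[gold]) over labels y' != gold.
    The margin value is an assumption.
    """
    scores = np.asarray(scores, dtype=float)
    violations = np.maximum(0.0, margin + scores - scores[gold])
    violations[gold] = 0.0   # the gold label does not compete with itself
    return float(violations.sum())
```

The hinge loss is zero once the gold label outscores every other label by at least the margin, which suits a ranking task where only the relative order of candidate answers matters.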
We used the SICK development set for tuning and then applied exactly the same hyperparameters to all ten test sets. For the answer selection task (WikiQA and TrecQA), we used the official trec_eval scorer to compute the metrics Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) and selected the best development model based on MRR for final testing. Our timing experiments were conducted on an Intel Xeon E5-2680 CPU. Due to sentence length variations, for the SICK and MSRVID data we padded the sentences to 32 words; for the STS2014, WikiQA, and TrecQA data, we padded the sentences to 48 words.

Results
SICK Results (Table 3). Our model outperforms previous neural network models, most of which are based on sentence modeling. The ConvNet work (He et al., 2015) and TreeLSTM work (Tai et al., 2015) achieve comparable accuracy; for example, their difference in Pearson's r is only 0.1%. In comparison, our model outperforms both by 1% in Pearson's r, over 1.1% in Spearman's ρ, and 2-3% in MSE. Note that we used the same word embeddings, sparse distribution targets, and loss function as in He et al. (2015) and Tai et al. (2015), thereby representing comparable experimental conditions.

Table 3
Model                         r        ρ        MSE
Lai and Hockenmaier (2014)    0.7993   0.7538   0.3692
[...]                         0.8070   0.7489   0.3550
Bjerva et al. (2014)          0.8268   0.7721   0.3224
Zhao et al. (2014)            0[...]

MSRVID Results (Table 4). Our model outperforms the work of He et al. (2015), which already reports a Pearson's r score of over 0.9.

Table 4
Model                         Pearson's r
Beltagy et al. (2014)         0.8300
Bär et al. (2012)             0.8730
Šarić et al. (2012)           0.8803
He et al. (2015)              0.9090
This work                     0.9112

STS2014 Results (Table 5). Systems in the competition are ranked by the weighted mean (the of- [...] (Kashyap et al., 2014), 3rd (Lynum et al., 2014) systems in the STS2014 competition, all of which are based on heavy feature engineering. Our model does not use any sparse features, WordNet, or parse trees, but still performs favorably compared to the STS2014 winning system (Sultan et al., 2014).
WikiQA Results (Table 6). We compared our model to competitive baselines prepared by Yang et al. (2015) and also evaluated He et al. (2015)'s multi-perspective ConvNet on the same data. The neural network models in the table, paragraph vector (PV) (Le and Mikolov, 2014), CNN (Yu et al., 2014), and PV-Cnt/CNN-Cnt with word matching features (Yang et al., 2015), are mostly based on sentence modeling. Our model outperforms them all.
Table 6
Model                               MAP      MRR
Word Cnt (Yang et al., 2015)        0.4891   0.4924
Wgt Word Cnt (Yang et al., 2015)    0.5099   0.5132
PV (Le and Mikolov, 2014)           0.5110   0.5160
PV-Cnt (Yang et al., 2015)          0.5976   0.6058
LCLR (Yih et al., 2013)             0.5993   0.6086
CNN (Yu et al., 2014)               0.6190   0.6281
CNN-Cnt (Yang et al., 2015)         0.6520   0.6652
He et al. (2015)                    0.6930   0.7090
This work                           0.7090   0.7234

TrecQA Results (Table 7). This is the largest dataset in our experiments, with over 55,000 question-answer pairs. Only recently have neural network approaches (Yu et al., 2014) started to show promising results on this decade-old dataset. Previous approaches with probabilistic tree-edit techniques or tree kernels (Wang and Manning, 2010; Heilman and Smith, 2010; Yao et al., 2013) have been successful since tree structure information per[...]

Table 7
Model                               MAP      MRR
Cui et al. (2005)                   0.4271   0.5259
Wang et al. (2007)                  0.6029   0.6852
Heilman and Smith (2010)            0.6091   0.6917
Wang and Manning (2010)             0.5951   0.6951
Yao et al. (2013)                   0.6307   0.7477
Severyn and Moschitti (2013)        0.6781   0.7358
Yih et al. (2013)                   0.7092   0.7700
Wang and Nyberg (2015)              0.7134   0.7913
Severyn and Moschitti (2015)        [...]

Analysis
Ablation Studies. Table 8 shows the results of ablation studies on SICK and WikiQA data. We removed or replaced one component at a time from the full system and performed re-training and re-testing. We found large drops when removing the context modeling component, indicating that the context information provided by the Bi-LSTMs is crucial for the following components (e.g., interaction modeling). The use of our similarity focus layer is also beneficial, especially on the WikiQA data. When we replaced the entire similarity focus layer with a random dropout layer (p = 0.3), accuracy dropped; this shows the importance of directing the model to focus on important pairwise word interactions, to better capture similarity.
The ConvNet model of He et al. (2015) uses multiple types of convolution and pooling for sentence modeling. This results in a wide architecture with around 10 million tunable parameters. Our approach only models pairwise word interactions and does not require such a complicated architecture. Compared to that previous work, Table 9 shows that our model is 3.4× faster in training and has 83% fewer tunable parameters.
Visualization. Table 10 visualizes the cosine value channel of the focusCube for pairwise word interactions given two sentence pairs in the SICK test set. Note that, for easier visualization, the values are multiplied by 10. Darker red areas indicate stronger pairwise word interactions. From these visualizations, we see that our model is able to identify important word pairs (in dark red) and tag them with proper similarity values, which are significantly higher than those of their neighboring unimportant pairs. This shows that our model is able to recognize important fine-grained word-level information for better similarity measurement, which helps explain why our model performs well.

Conclusion
In summary, we developed a novel neural network model based on a hybrid of ConvNet and Bi-LSTMs for the semantic textual similarity measurement problem. Our pairwise word interaction model and the similarity focus layer can better capture fine-grained semantic information, compared to previous sentence modeling approaches that attempt to "cram" all sentence information into a fixed-length vector. We demonstrated the state-of-the-art accuracy of our approach on data from three SemEval competitions and two answer selection tasks.