Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks

Modeling sentence similarity is complicated by the ambiguity and variability of linguistic expression. To cope with these challenges, we propose a model for comparing sentences that uses a multiplicity of perspectives. We first model each sentence using a convolutional neural network that extracts features at multiple levels of granularity and uses multiple types of pooling. We then compare our sentence representations at several granularities using multiple similarity metrics. We apply our model to three tasks, including the Microsoft Research paraphrase identification task and two SemEval semantic textual similarity tasks. We obtain strong performance on all tasks, rivaling or exceeding the state of the art without using external resources such as WordNet or parsers.


Introduction
Measuring the semantic relatedness of two pieces of text is a fundamental problem in language processing tasks like plagiarism detection, query ranking, and question answering. In this paper, we address the sentence similarity measurement problem: given a query sentence S 1 and a comparison sentence S 2 , the task is to compute their similarity in terms of a score sim(S 1 , S 2 ). This similarity score can be used within a system that determines whether two sentences are paraphrases, e.g., by comparing it to a threshold.
Measuring sentence similarity is challenging because of the variability of linguistic expression and the limited amount of annotated training data. This makes it difficult to use sparse, hand-crafted features as in conventional approaches in NLP. Recent successes in sentence similarity have been obtained by using neural networks (Tai et al., 2015; Yin and Schütze, 2015). Our approach is also based on neural networks: we propose a modular functional architecture with two components, sentence modeling and similarity measurement.
For sentence modeling, we use a convolutional neural network featuring convolution filters with multiple granularities and window sizes, followed by multiple types of pooling. We experiment with two types of word embeddings as well as part-of-speech tag embeddings (Sec. 4). For similarity measurement, we compare pairs of local regions of the sentence representations, using multiple distance functions: cosine distance, Euclidean distance, and element-wise difference (Sec. 5).
We demonstrate state-of-the-art performance on two SemEval semantic relatedness tasks (Agirre et al., 2012; Marelli et al., 2014), and highly competitive performance on the Microsoft Research paraphrase (MSRP) identification task (Dolan et al., 2004). On the SemEval-2014 task, we match the state-of-the-art dependency tree Long Short-Term Memory (LSTM) neural networks of Tai et al. (2015) without using parsers or part-of-speech taggers. On the MSRP task, we outperform the recently-proposed convolutional neural network model of Yin and Schütze (2015) without any pretraining. In addition, we perform ablation experiments to show the contribution of our modeling decisions for all three datasets, demonstrating clear benefits from our use of multiple perspectives both in sentence modeling and structured similarity measurement.

Related Work
Most previous work on modeling sentence similarity has focused on feature engineering. Several types of sparse features have been found useful, including: (1) string-based, including n-gram overlap features on both the word and character levels (Wan et al., 2006) and features based on machine translation evaluation metrics (Madnani et al., 2012); (2) knowledge-based, using external lexical resources such as WordNet (Fellbaum, 1998; Fern and Stevenson, 2008); (3) syntax-based, e.g., modeling divergence of dependency syntax between the two sentences (Das and Smith, 2009); (4) corpus-based, using distributional models such as latent semantic analysis to obtain features (Hassan, 2011; Guo and Diab, 2012).
Several strongly-performing approaches used system combination (Das and Smith, 2009;Madnani et al., 2012) or multi-task learning. Xu et al. (2014) developed a feature-rich multi-instance learning model that jointly learns paraphrase relations between word and sentence pairs. Recent work has moved away from handcrafted features and towards modeling with distributed representations and neural network architectures. Collobert and Weston (2008) used convolutional neural networks in a multitask setting, where their model is trained jointly for multiple NLP tasks with shared weights. Kalchbrenner et al. (2014) introduced a convolutional neural network for sentence modeling that uses dynamic k-max pooling to better model inputs of varying sizes. Kim (2014) proposed several modifications to the convolutional neural network architecture of Collobert and Weston (2008), including the use of both fixed and learned word vectors and varying window sizes of the convolution filters.
For the MSRP task, Socher et al. (2011) used a recursive neural network to model each sentence, recursively computing the representation for the sentence from the representations of its constituents in a binarized constituent parse. Ji and Eisenstein (2013) used matrix factorization techniques to obtain sentence representations, and combined them with fine-tuned sparse features using an SVM classifier for similarity prediction. Both Socher et al. and Ji and Eisenstein incorporated sparse features to improve performance, which we do not use in this work. Hu et al. (2014) used convolutional neural networks that combine hierarchical sentence modeling with layer-by-layer composition and pooling. While they performed comparisons directly over entire sentence representations, we instead develop a structured similarity measurement layer to compare local regions. A variety of other neural network models have been proposed for similarity tasks (Weston et al., 2011;Huang et al., 2013;Andrew et al., 2013;Bromley et al., 1993).
Most recently, Tai et al. (2015) and Zhu et al. (2015) concurrently proposed a tree-based LSTM neural network architecture for sentence modeling. Unlike them, we do not use syntactic parsers, yet our performance matches Tai et al. (2015) on the similarity task. This result is appealing because high-quality parsers are difficult to obtain for low-resource languages or specialized domains. Yin and Schütze (2015) concurrently developed a convolutional neural network architecture for paraphrase identification, which we compare to in our experiments. Their best results rely on an unsupervised pretraining step, which we do not need to match their performance.
Our model architecture differs from previous work in several ways. We exploit multiple perspectives of input sentences in order to maximize information utilization and perform structured comparisons over particular regions of the sentence representations. We now proceed to describe our model in detail, and we compare to the above related work in our experimental evaluation.

Model Overview
Modeling textual similarity is complicated by the ambiguity and variability of linguistic expression. We designed a model with these phenomena in mind, exploiting multiple types of input which are processed by multiple types of convolution and pooling. Our similarity architecture likewise uses multiple similarity functions.
To summarize, our model (shown in Figure 1) consists of two main components:
1. A sentence model for converting a sentence into a representation for similarity measurement; we use a convolutional neural network architecture with multiple types of convolution and pooling in order to capture different granularities of information in the inputs.
2. A similarity measurement layer using multiple similarity measurements, which compare local regions of the sentence representations from the sentence model.
Our model has a "Siamese" structure (Bromley et al., 1993) with two subnetworks each processing a sentence in parallel. The subnetworks share all of their weights, and are joined by the similarity measurement layer, then followed by a fully connected layer for similarity score output.  Importantly, we do not require resources like WordNet or syntactic parsers for the language of interest; we only use optional part-of-speech tags and pretrained word embeddings. The main difference from prior work lies in our use of multiple types of convolution, pooling, and structured similarity measurement over local regions. We show later in our experiments that the bulk of our performance comes from this use of multiple "perspectives" of the input sentences.
We describe our sentence model in Section 4 and our similarity measurement layer in Section 5.

Sentence Modeling
In this section we describe our convolutional neural network for modeling each sentence. We use two types of convolution filters defined on different perspectives of the input (Sec. 4.1), and also use multiple types of pooling (Sec. 4.2).
Our inputs are streams of tokens, which can be interpreted as a temporal sequence where nearby words are likely to be correlated. Let $\mathit{sent} \in \mathbb{R}^{\mathit{len} \times \mathit{Dim}}$ be a sequence of $\mathit{len}$ input words represented by $\mathit{Dim}$-dimensional word embeddings, where $\mathit{sent}_i \in \mathbb{R}^{\mathit{Dim}}$ is the embedding of the $i$-th word in the sequence and $\mathit{sent}_{i:j}$ represents the concatenation of embeddings from word $i$ up to and including word $j$. We denote the $k$-th dimension of the $i$-th word vector by $\mathit{sent}_i^{[k]}$ and we denote the vector containing the $k$-th dimension of words $i$ to $j$ by $\mathit{sent}_{i:j}^{[k]}$.
Figure 2: Left: a holistic filter matches entire word vectors (here, $\mathit{ws} = 2$). Right: per-dimension filters match against each dimension of the word embeddings independently.

Convolution on Multiple Perspectives
We define a convolution filter $F$ as a tuple $\langle \mathit{ws}, w_F, b_F, h_F \rangle$, where $\mathit{ws}$ is the sliding window width, $w_F \in \mathbb{R}^{\mathit{ws} \times \mathit{Dim}}$ is the weight vector for the filter, $b_F \in \mathbb{R}$ is the bias, and $h_F$ is the activation function (a nonlinear function such as tanh).
When filter $F$ is applied to sequence $\mathit{sent}$, the inner product is computed between $w_F$ and each possible window of word embeddings of length $\mathit{ws}$ in $\mathit{sent}$, then the bias is added and the activation function is applied. This results in an output vector $\mathit{out}_F \in \mathbb{R}^{1+\mathit{len}-\mathit{ws}}$ where entry $i$ equals
$$\mathit{out}_F[i] = h_F(w_F \cdot \mathit{sent}_{i:i+\mathit{ws}-1} + b_F)$$
for $i \in [1, 1 + \mathit{len} - \mathit{ws}]$. This filter can be viewed as performing "temporal" convolution, as it matches against regions of the word sequence. Since these filters consider the entirety of each word embedding at each position, we call them holistic filters; see the left half of Figure 2.
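As a minimal sketch of the holistic filter described above, the following pure-Python function slides a window of ws word vectors over the sentence, takes the inner product with the filter weights, adds the bias, and applies tanh. The function name `holistic_filter`, the list-of-lists layout, and the toy values are illustrative assumptions, not taken from the authors' code.

```python
import math

def holistic_filter(sent, w, b, h=math.tanh):
    """Apply one holistic filter to a sentence.

    sent: list of len word vectors, each a list of Dim floats.
    w:    list of ws weight rows, each a list of Dim floats.
    b:    scalar bias.
    Returns a list of length 1 + len - ws.
    """
    ws = len(w)
    out = []
    for i in range(len(sent) - ws + 1):
        # Inner product between the filter and the window sent[i : i + ws].
        total = b
        for row, word in zip(w, sent[i:i + ws]):
            total += sum(wk * xk for wk, xk in zip(row, word))
        out.append(h(total))
    return out

# Toy input (hypothetical values): 4 words with Dim = 2, one filter with ws = 2.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w = [[0.5, 0.5], [-0.5, 0.5]]
out = holistic_filter(sent, w, b=0.0)
```

With len = 4 and ws = 2 the output has 1 + 4 - 2 = 3 entries, one per window position.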
In addition, we target information at a finer granularity by constructing per-dimension filters $F^{[k]}$ for each dimension $k$ of the word embeddings, where $w_{F^{[k]}} \in \mathbb{R}^{\mathit{ws}}$. See the right half of Figure 2. The per-dimension filters are similar to "spatial convolution" filters except that we limit each to a single, predefined dimension. We include separate per-dimension filters for each dimension of the input word embeddings. Applying a per-dimension filter $F^{[k]} = \langle \mathit{ws}, w_{F^{[k]}}, b_{F^{[k]}}, h_{F^{[k]}} \rangle$ for dimension $k$ results in an output vector $\mathit{out}_{F^{[k]}} \in \mathbb{R}^{1+\mathit{len}-\mathit{ws}}$ where entry $i$ (for $i \in [1, 1 + \mathit{len} - \mathit{ws}]$) equals
$$\mathit{out}_{F^{[k]}}[i] = h_{F^{[k]}}(w_{F^{[k]}} \cdot \mathit{sent}^{[k]}_{i:i+\mathit{ws}-1} + b_{F^{[k]}})$$
Our use of word embeddings in both ways allows more information to be extracted for richer sentence modeling. While we typically do not expect individual dimensions of neural word embeddings to be interpretable to humans, there may still be distinct information captured by the different dimensions that our model could exploit. Furthermore, if we update the word embeddings during learning, different dimensions could be encouraged further to capture distinct information.
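To make the contrast with holistic filters concrete, the sketch below applies a per-dimension filter: it first extracts dimension k of every word vector and then runs a one-dimensional window of width ws over that single column. The name `per_dimension_filter` and the toy values are assumptions made for illustration.

```python
import math

def per_dimension_filter(sent, k, w, b, h=math.tanh):
    """Apply a per-dimension filter to dimension k of a sentence.

    sent: list of len word vectors, each a list of Dim floats.
    k:    index of the embedding dimension this filter is tied to.
    w:    list of ws weights (one scalar per window position).
    b:    scalar bias.
    Returns a list of length 1 + len - ws.
    """
    ws = len(w)
    column = [word[k] for word in sent]  # dimension k of every word
    return [
        h(sum(wj * xj for wj, xj in zip(w, column[i:i + ws])) + b)
        for i in range(len(column) - ws + 1)
    ]

# Toy input (hypothetical values): 4 words with Dim = 2, ws = 2.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_dim0 = per_dimension_filter(sent, k=0, w=[1.0, -1.0], b=0.0)
out_dim1 = per_dimension_filter(sent, k=1, w=[1.0, -1.0], b=0.0)
```

The same weights produce different outputs on the two dimensions because each filter only ever sees its own column of the embedding matrix.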
We define a convolution layer as a set of convolution filters that share the same type (holistic or per-dimension), activation function, and width ws. The type, width, activation function, and number of filters numFilter in the layer are chosen by the modeler and the weights of each filter (w F and b F ) are learned.

Multiple Pooling Types
The output vector out F of a convolution filter F is typically converted to a scalar for subsequent use by the model using some method of pooling. For example, "max-pooling" applies a max operation across the entries of out F and returns the maximum value. In this paper, we experiment with two additional types of pooling: "min-pooling" and "mean-pooling".
A group, denoted $\mathit{group}(\mathit{ws}, \mathit{pooling}, \mathit{sent})$, is an object that contains a convolution layer with width $\mathit{ws}$, uses pooling function $\mathit{pooling}$, and operates on sentence $\mathit{sent}$. We define a building block to be a set of groups. We use two types of building blocks, $\mathit{block}_A$ and $\mathit{block}_B$, as shown in Figure 3. We define $\mathit{block}_A$ as
$$\mathit{block}_A = \{\mathit{group}_A(\mathit{ws}_a, p, \mathit{sent}) : p \in \{\max, \min, \mathrm{mean}\}\}$$
That is, an instance of $\mathit{block}_A$ has three convolution layers, one corresponding to each of the three pooling functions; all have the same window size $\mathit{ws}_a$. An alternative choice would be to use the multiple types of pooling on the same filters (Rennie et al., 2014); we instead use independent sets of filters for the different pooling types. We use blocks of type A for all holistic convolution layers.
We define $\mathit{block}_B$ as
$$\mathit{block}_B = \{\mathit{group}_B(\mathit{ws}_b, p, \mathit{sent}) : p \in \{\max, \min\}\}$$
That is, $\mathit{block}_B$ contains two groups of convolution layers of width $\mathit{ws}_b$, one with max-pooling and one with min-pooling. Each $\mathit{group}_B(*)$ contains a convolution layer with $\mathit{Dim}$ per-dimension convolution filters. That is, we use blocks of type B for convolution layers that operate on individual dimensions of word vectors. We use these multiple types of pooling to extract different types of information from each type of filter. The design of each $\mathit{group}(*)$ allows a pooling function to interact with its own underlying convolution layers independently, so each convolution layer can learn to recognize distinct phenomena of the input for richer sentence modeling.
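For a $\mathit{group}_A(\mathit{ws}_a, \mathit{pooling}_a, \mathit{sent})$ with a convolution layer with $\mathit{numFilter}_A$ filters, we define the output $oG_A$ as a vector of length $\mathit{numFilter}_A$ where entry $j$ is
$$oG_A[j] = \mathit{pooling}_a(\mathit{out}_{F_j})$$
where filters are indexed as $F_j$. That is, the output of $\mathit{group}_A(*)$ is a $\mathit{numFilter}_A$-length vector containing the output of applying the pooling function on each filter's vector of filter match outputs. A component $\mathit{group}_B(*)$ of $\mathit{block}_B$ contains $\mathit{numFilter}_B$ filters for each of the $\mathit{Dim}$ dimensions, each operating on a particular dimension of the word embeddings. We define the output $oG_B$ of $\mathit{group}_B(\mathit{ws}_b, \mathit{pooling}_b, \mathit{sent})$ as a $\mathit{Dim} \times \mathit{numFilter}_B$ matrix where entry $[k][j]$ is
$$oG_B[k][j] = \mathit{pooling}_b(\mathit{out}_{F_j^{[k]}})$$
where filter $F_j^{[k]}$ is filter $j$ for dimension $k$.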
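The group outputs can be sketched as follows: each filter's output vector is reduced to a scalar by the group's pooling function, giving a vector for a type-A group and a Dim-by-numFilter_B matrix for a type-B group. The helper names (`holistic_out`, `per_dim_out`, `group_a`, `group_b`), the `(weights, bias)` filter representation, and the toy filters are hypothetical; tanh is used as the activation throughout.

```python
import math

POOLING = {"max": max, "min": min, "mean": lambda v: sum(v) / len(v)}

def holistic_out(sent, w, b):
    """Output vector of one holistic filter (tanh activation)."""
    ws = len(w)
    return [
        math.tanh(b + sum(wk * xk
                          for row, word in zip(w, sent[i:i + ws])
                          for wk, xk in zip(row, word)))
        for i in range(len(sent) - ws + 1)
    ]

def per_dim_out(sent, k, w, b):
    """Output vector of one per-dimension filter tied to dimension k."""
    ws = len(w)
    col = [word[k] for word in sent]
    return [math.tanh(b + sum(wj * xj for wj, xj in zip(w, col[i:i + ws])))
            for i in range(len(col) - ws + 1)]

def group_a(filters, pooling, sent):
    """oG_A: one pooled scalar per holistic filter (length numFilter_A)."""
    pool = POOLING[pooling]
    return [pool(holistic_out(sent, w, b)) for (w, b) in filters]

def group_b(filters, pooling, sent):
    """oG_B: Dim x numFilter_B matrix; filters[k][j] is filter j for dimension k."""
    pool = POOLING[pooling]
    return [[pool(per_dim_out(sent, k, w, b)) for (w, b) in row]
            for k, row in enumerate(filters)]

# Toy setup (hypothetical values): len = 3, Dim = 2, ws = 1.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters_a = [([[1.0, 0.0]], 0.0), ([[0.0, 1.0]], 0.0)]  # numFilter_A = 2
filters_b = [[([1.0], 0.0)], [([-1.0], 0.0)]]           # numFilter_B = 1

og_a_max = group_a(filters_a, "max", sent)
og_b_min = group_b(filters_b, "min", sent)
```

Because each group owns its filters, swapping the pooling key changes only how that group's filter outputs are summarized, not the filters of any other group.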

Multiple Window Sizes
Similar to traditional n-gram-based models, we use multiple window sizes $\mathit{ws}$ in our building blocks in order to learn features of different lengths. For example, in Figure 4 we use four building blocks, each with one window size $\mathit{ws} = 1$ or $2$ for its own convolution layers. In order to retain the original information in the sentences, we also include the entire matrix of word embeddings in the sentence, which essentially corresponds to $\mathit{ws} = \infty$.
Figure 4: Example neural network architecture for a single sentence, containing 3 instances of $\mathit{block}_A$ (with 3 types of pooling) and 2 instances of $\mathit{block}_B$ (with 2 types) on varying window sizes $\mathit{ws} = 1, 2$ and $\mathit{ws} = \infty$; $\mathit{block}_A$ operates on entire word vectors while $\mathit{block}_B$ contains filters that operate on individual dimensions independently.
The width ws represents how many words are matched by a filter, so using larger values of ws corresponds to matching longer n-grams in the input sentences. The ranges of ws values and the numbers of filters numFilter of block A and block B are empirical choices tuned based on validation data.

Similarity Measurement Layer
In this section we describe the second part of our model, the similarity measurement layer.
Given two input sentences, the first part of our model computes sentence representations for each of them in parallel. One straightforward way to compare them is to flatten the sentence representations into two vectors, then use standard metrics like cosine similarity. However, this may not be optimal because different regions of the flattened sentence representations are from different underlying sources (e.g., groups of different widths, types of pooling, dimensions of word vectors, etc.). Flattening might discard useful compositional information for computing similarity. We therefore perform structured comparisons over particular regions of the sentence representations.
One important consideration is how to identify suitable local regions for comparison so that we can best utilize the compositional information in the sentence representations. There are many possible ways to group local comparison regions. In doing so, we consider the following four aspects: 1) whether from the same building block; 2) whether from convolutional layers with the same window size; 3) whether from the same pooling layer; 4) whether from the same filter of the underlying convolution layers. We focus on comparing regions that share at least two of these conditions.
To concretize this, we provide two algorithms below to identify meaningful local regions. While there exist other sets of comparable regions that share the above conditions, we do not explore them all due to concerns about learning efficiency; we find that the subset we consider performs strongly in practice.

Similarity Comparison Units
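We define two comparison units for comparing two local regions in the sentence representations:
$$\mathit{comU}_1(x, y) = \{\cos(x, y),\ L_2\mathit{Euclid}(x, y),\ |x - y|\}$$
$$\mathit{comU}_2(x, y) = \{\cos(x, y),\ L_2\mathit{Euclid}(x, y)\}$$
Cosine distance ($\cos$) measures the distance of two vectors according to the angle between them, while $L_2$ Euclidean distance ($L_2\mathit{Euclid}$) and element-wise absolute difference ($|x - y|$) measure magnitude differences.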
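A minimal sketch of the two comparison units follows. It assumes that the first unit returns all three measures (cosine, Euclidean distance, element-wise absolute difference) and the second only the two scalar measures, that cos is the standard cosine of the angle between the vectors, and that a zero vector yields a cosine of 0.0; the function names and the flat-list return format are illustrative choices.

```python
import math

def cos(x, y):
    """Cosine of the angle between x and y (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0.0 and ny > 0.0 else 0.0

def l2_euclid(x, y):
    """Euclidean (L2) distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def abs_diff(x, y):
    """Element-wise absolute difference, same length as the inputs."""
    return [abs(a - b) for a, b in zip(x, y)]

def com_u1(x, y):
    """Assumed first unit: cosine, Euclidean distance, and |x - y|, flattened."""
    return [cos(x, y), l2_euclid(x, y)] + abs_diff(x, y)

def com_u2(x, y):
    """Assumed second unit: cosine and Euclidean distance only."""
    return [cos(x, y), l2_euclid(x, y)]

x, y = [1.0, 0.0], [0.0, 1.0]
u1 = com_u1(x, y)
u2 = com_u2(x, y)
```

Note that the first unit's output grows with the length of the compared region, while the second always yields two features per comparison.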

Comparison over Local Regions
Algorithms 1 and 2 show how the two sentence representations are compared in our model. Algorithm 1 works on the output of block A only, while Algorithm 2 deals with both block A and block B , focusing on regions from the output of the same pooling type and same block type, but with different filters and window sizes of convolution layers. Given two sentences S 1 and S 2 , we set the maximum window size ws of block A and block B to be n, let regM * represent a numFilter A by n + 1 matrix, and assume that each group * outputs its corresponding oG * . The output features are accumulated in a final vector fea.
In Figure 5, each column of the max/min/mean groups is compared with all columns of the same pooling group for the other sentence. This is shown in red dotted lines in the Figure and listed in lines 2 to 9 in Algorithm 2. Note that both ws 1 and ws 2 columns within each pooling group should be compared using red dotted lines, but we omit this from the figure for clarity.
In the horizontal direction, each equal-sized max/min/mean group is extracted as a vector and is compared to the corresponding one for the other sentence. This process is repeated for all rows and comparisons are shown in green solid lines, as performed by Algorithm 1.
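The two traversal orders described above can be sketched for block A outputs only, following the textual description rather than the algorithm listings. Here `rep[p][c]` is assumed to hold the group output vector for pooling type p and window-size column c, and `compare` is a stand-in comparison unit returning cosine and Euclidean distance; all names, the data layout, and the toy values are hypothetical, and the block B part of the vertical comparison is omitted.

```python
import math

def compare(x, y):
    """Stand-in comparison unit: [cosine, Euclidean distance]."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    cosine = dot / (nx * ny) if nx > 0.0 and ny > 0.0 else 0.0
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return [cosine, dist]

def horizontal(rep1, rep2):
    """Compare row i (same filter index, all window sizes) across sentences."""
    fea = []
    for p in rep1:                          # same pooling type
        num_filters = len(rep1[p][0])
        for i in range(num_filters):
            row1 = [col[i] for col in rep1[p]]
            row2 = [col[i] for col in rep2[p]]
            fea += compare(row1, row2)
    return fea

def vertical(rep1, rep2):
    """Compare every window-size column of one sentence with every column of the other."""
    fea = []
    for p in rep1:                          # same pooling type
        for col1 in rep1[p]:
            for col2 in rep2[p]:
                fea += compare(col1, col2)
    return fea

# Toy representations: 3 pooling types, 2 window sizes, 3 filters each.
rep1 = {"max": [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]],
        "min": [[-0.9, -0.8, -0.7], [-0.6, -0.5, -0.4]],
        "mean": [[0.1, 0.2, 0.3], [0.2, 0.1, 0.3]]}
rep2 = {"max": [[0.8, 0.8, 0.6], [0.5, 0.5, 0.3]],
        "min": [[-0.7, -0.8, -0.9], [-0.4, -0.5, -0.6]],
        "mean": [[0.2, 0.2, 0.1], [0.1, 0.1, 0.2]]}

fea_h = horizontal(rep1, rep2)
fea_v = vertical(rep1, rep2)
```

In this toy setting the horizontal pass yields 3 x 3 comparisons and the vertical pass 3 x 2 x 2, each contributing two features.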

Other Model Details
Output Fully-Connected Layer. On top of the similarity measurement layer (which outputs a vector containing all fea * ), we stack two linear layers with an activation layer in between, followed by a log-softmax layer as the final output layer, which outputs the similarity score.
Figure 5: Simplified example of local region comparisons over two sentence representations that use $\mathit{block}_A$ only. The "horizontal comparison" (Algorithm 1) is shown with green solid lines and "vertical comparison" (Algorithm 2) with red dotted lines. Each sentence representation uses window sizes $\mathit{ws}_1$ and $\mathit{ws}_2$ with max/min/mean pooling and $\mathit{numFilter}_A = 3$ filters.
Activation Layers. We used element-wise tanh as the activation function for all convolution filters and for the activation layer placed between the final two layers.
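The output stack described in this section (two linear layers with a tanh activation in between, followed by log-softmax) can be sketched in a few lines. The weight shapes and values below are toy assumptions chosen only to make the example runnable, and the function names are illustrative.

```python
import math

def linear(x, W, b):
    """Affine map: one output per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def log_softmax(z):
    """Numerically stable log-softmax."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def output_layer(fea, W1, b1, W2, b2):
    """Linear -> tanh -> linear -> log-softmax over the similarity features."""
    hidden = [math.tanh(v) for v in linear(fea, W1, b1)]
    return log_softmax(linear(hidden, W2, b2))

# Toy feature vector and weights (hypothetical values).
fea = [0.5, -1.0, 2.0]
W1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
log_probs = output_layer(fea, W1, b1, W2, b2)
```

Exponentiating the returned values gives a probability distribution over the output classes.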

Experiments and Results
Everything necessary to replicate our experimental results can be found in our open-source code repository.

Tasks and Datasets
We consider three sentence pair similarity tasks:

1. Microsoft Research Paraphrase Corpus (MSRP). This data was collected from news sources (Dolan et al., 2004) and contains 5,801 pairs of sentences, with 4,076 for training and the remaining 1,725 for testing. Each sentence pair is annotated with a binary label indicating whether the two sentences are paraphrases, so the task here is binary classification.
2. Sentences Involving Compositional Knowledge (SICK) dataset. This data was collected for the 2014 SemEval competition (Marelli et al., 2014) and consists of 9,927 sentence pairs, with 4,500 for training, 500 as a development set, and the remaining 4,927 in the test set. The sentences are drawn from image and video descriptions. Each sentence pair is annotated with a relatedness score ∈ [1, 5], with higher scores indicating the two sentences are more closely-related.
3. Microsoft Video Paraphrase Corpus (MSRVID). This dataset was collected for the 2012 SemEval competition and consists of 1,500 pairs of short video descriptions which were then annotated (Agirre et al., 2012). Half of it is for training and the other half is for testing. Each sentence pair has a relatedness score ∈ [0, 5], with higher scores indicating the two sentences are more closely-related.

Training
We use a hinge loss for the MSRP paraphrase identification task. This is simpler than log loss since it only penalizes misclassified cases. The training objective is to minimize the following loss (summed over examples $\langle x, y_{gold} \rangle$):
$$\mathit{loss}(\theta, x, y_{gold}) = \sum_{y' \neq y_{gold}} \max(0,\ 1 + f_\theta(x, y') - f_\theta(x, y_{gold}))$$
where $y_{gold}$ is the ground truth label, input $x$ is the pair of sentences $x = \{S_1, S_2\}$, $\theta$ is the model weight vector to be trained, and the function $f_\theta(x, y)$ is the output of our model.
We use regularized KL-divergence loss for the semantic relatedness tasks (SICK and MSRVID), since the goal is to predict the similarity of the two sentences. The training objective is to minimize the KL-divergence loss plus an $L_2$ regularizer:
$$\mathit{loss}(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}(f^k \,\|\, f_\theta^k) + \frac{\lambda}{2} \|\theta\|_2^2$$
where $f_\theta$ is the predicted distribution with model weight vector $\theta$, $f$ is the ground truth, $m$ is the number of training examples, and $\lambda$ is the regularization parameter. Note that we use the same KL-loss function and same sparse target distribution technique as Tai et al. (2015).
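The two training losses can be illustrated per example as follows. The hinge loss is written in its multiclass margin form with a margin of 1 (an assumption about the exact form), and the sparse target follows the scheme of Tai et al. (2015), which splits a real-valued score between its two neighboring integer classes; the sketch assumes scores in [1, 5] as on SICK, so it would need adapting for MSRVID's [0, 5] range. Function names are illustrative, and the regularizer is omitted.

```python
import math

def hinge_loss(scores, gold):
    """Margin loss for one example; scores[y] plays the role of f_theta(x, y)."""
    return sum(max(0.0, 1.0 + s - scores[gold])
               for y, s in enumerate(scores) if y != gold)

def sparse_target(score, num_classes=5):
    """Distribution over classes 1..num_classes whose expected value is score."""
    p = [0.0] * num_classes
    lo = math.floor(score)
    if lo == score:
        p[lo - 1] = 1.0               # integer score: all mass on one class
    else:
        p[lo - 1] = lo + 1 - score    # mass on class floor(score)
        p[lo] = score - lo            # mass on class floor(score) + 1
    return p

def kl_divergence(target, predicted):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(t * math.log(t / q) for t, q in zip(target, predicted) if t > 0.0)

correct = hinge_loss([0.2, 1.5], gold=1)   # gold class wins by more than the margin
wrong = hinge_loss([0.2, 1.5], gold=0)     # gold class loses
target = sparse_target(3.5)
uniform = [0.2] * 5
kl = kl_divergence(target, uniform)
```

The hinge loss is zero whenever the gold class outscores every other class by at least the margin, which is why only violating examples contribute gradient.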

Experiment Settings
We conduct experiments with $\mathit{ws}$ values in the range [1, 3] as well as $\mathit{ws} = \infty$ (no convolution). We use multiple kinds of embeddings to represent each sentence, both on words and part-of-speech (POS) tags. We use the $\mathit{Dim}_g = 300$-dimensional GloVe word embeddings (Pennington et al., 2014) trained on 840 billion tokens. We use $\mathit{Dim}_k = 25$-dimensional PARAGRAM vectors (Wieting et al., 2015) only on the MSRP task since they were developed for paraphrase tasks, having been trained on word pairs from the Paraphrase Database (Ganitkevitch et al., 2013). For POS embeddings, we run the Stanford POS tagger on the English side of the Xinhua machine translation parallel corpus, which consists of Xinhua news articles with approximately 25 million words. We then train $\mathit{Dim}_p = 200$-dimensional POS embeddings using the word2vec toolkit (Mikolov et al., 2013). Adding POS embeddings is expected to retain syntactic information which is reported to be effective for paraphrase identification (Das and Smith, 2009). We use POS embeddings only for the MSRP task.
Therefore for MSRP, we concatenate all word and POS embeddings and obtain Dim = Dim g + Dim p + Dim k = 525-dimension vectors for each input word; for SICK and MSRVID we only use Dim = 300-dimension GloVe embeddings.
We use 5-fold cross validation on the MSRP training data for tuning, then largely re-use the same hyperparameters for the other two datasets. However, there are two changes: 1) for the MSRP task we update word embeddings during training but not so on SICK and MSRVID tasks; 2) we set the fully connected layer to contain 250 hidden units for MSRP, and 150 for SICK and MSRVID. These changes were done to speed up our experimental cycle on SICK and MSRVID; on SICK data they are the same experimental settings as used by Tai et al. (2015), which makes for a cleaner empirical comparison.
We set the number of holistic filters in block A to be the same as the input word embeddings, therefore numFilter A = 525 for MSRP and numFilter A = 300 for SICK and MSRVID. We set the number of per-dimension filters in block B to be numFilter B = 20 per dimension for all three datasets, which corresponds to 20 * Dim filters in total.
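The dimension and filter counts stated above follow from simple arithmetic, sketched here for reference. The function name `model_sizes` and the dictionary keys are illustrative; the numbers are the ones given in the text.

```python
# Embedding sizes stated in the text.
DIM_GLOVE, DIM_POS, DIM_PARAGRAM = 300, 200, 25
NUM_FILTER_B = 20  # per-dimension filters for each embedding dimension

def model_sizes(task):
    """Input dimension and filter counts for a given task."""
    if task == "MSRP":
        # MSRP concatenates GloVe, POS, and PARAGRAM embeddings.
        dim = DIM_GLOVE + DIM_POS + DIM_PARAGRAM
    else:
        # SICK and MSRVID use GloVe only.
        dim = DIM_GLOVE
    return {
        "Dim": dim,
        "numFilter_A": dim,                          # holistic filters match Dim
        "per_dimension_filters": NUM_FILTER_B * dim,  # 20 * Dim
    }

sizes = {task: model_sizes(task) for task in ("MSRP", "SICK", "MSRVID")}
```

For MSRP this gives Dim = 525 and 20 * 525 = 10,500 per-dimension filters; for SICK and MSRVID, Dim = 300 and 6,000.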
We perform optimization using stochastic gradient descent (Bottou, 1998). The backpropagation algorithm is used to compute gradients for all parameters during training (Goller and Kuchler, 1996). We fix the learning rate to 0.01 and regularization parameter λ = 10 −4 .
When comparing to the model of Yin and Schütze (2015) without pretraining, we outperform it by 6% absolute in accuracy and 3% in F1. Our model is also superior to other recent neural network models (Hu et al., 2014; Socher et al., 2011) without requiring sparse features or unlabeled data as in (Yin and Schütze, 2015; Socher et al., 2011). The best result on MSRP is from Ji and Eisenstein (2013), which uses unsupervised learning on the MSRP test set and rich sparse features.
Results on SICK Data. Our results on the SICK task are summarized in Table 2, showing Pearson's r, Spearman's ρ, and mean squared error (MSE). We include results from the literature as reported by Tai et al. (2015), including prior work using recurrent neural networks (RNNs), the best submissions in the SemEval-2014 competition, and variants of LSTMs. When measured by Pearson's r, the previous state-of-the-art approach uses a tree-structured LSTM (Tai et al., 2015); note that their best results require a dependency parser.
In contrast, our approach does not rely on parse trees, nor do we use POS/PARAGRAM embeddings for this task. The word embeddings, sparse distribution targets, and KL loss function are exactly the same as used by Tai et al. (2015), therefore representing comparable conditions.
Results on MSRVID Data. Our results on the MSRVID data are summarized in Table 3, which includes the top 2 submissions in the Semantic Textual Similarity (STS) task from SemEval-2012. We find that we outperform the top system from the task by nearly 3 points in Pearson's r.

Model Ablation Study
We report the results of an ablation study in Table 4. We identify nine major components of our approach, remove one at a time (if applicable), and perform re-training and re-testing for all three tasks. We use the same experimental settings as in Sec. 6.3 and report differences (in accuracy for MSRP, Pearson's r for SICK/MSRVID) compared to our results in Tables 1-3.
Table 4: Ablation study over test sets of all three datasets. Nine components are divided into four groups. We remove components one at a time and show differences.
From Table 4 we find drops in performance for all components, with the largest differences appearing when removing components of the similarity measurement layer. For example, conducting comparisons over flattened sentence representations (removing component 9) leads to large drops across tasks, because this ignores structured information within sentence representations. Groups (1) and (2) are also useful, particularly for the MSRP task, demonstrating the extra benefit obtained from our multi-perspective approach in sentence modeling.
We see consistent drops when ablating the Vertical/Horizontal algorithms that target particular regions for comparison. Also, removing group (3) hinders both the Horizontal and Vertical algorithms (as described in Section 5.1), so its removal similarly causes large drops in performance. Though convolutional neural networks already perform strongly when followed by flattened vector comparison, we are able to leverage the full richness of the sentence models by performing structured similarity modeling on their outputs.

Discussion and Conclusion
On the SICK dataset, the dependency tree LSTM (Tai et al., 2015) and our model achieve comparable performance despite taking very different approaches. Tai et al. use syntactic parse trees and gating mechanisms to convert each sentence into a vector, while we use large sets of flexible feature extractors in the form of convolution filters, then compare particular subsets of features in our similarity measurement layer.
Our model architecture, with its many paths of information flow, is admittedly complex. Though we have removed hand engineering of features, we have added a substantial amount of functional architecture engineering. This may be necessary when using the small training sets provided for the tasks we consider here. We conjecture that a simpler, deeper neural network architecture may outperform our model when given large amounts of training data, but we leave an investigation of this direction to future work.
In summary, we developed a novel model for sentence similarity based on convolutional neural networks. We improved both sentence modeling and similarity measurement. Our model achieves highly competitive performance on three datasets. Ablation experiments show that the performance improvement comes from our use of multiple perspectives in both sentence modeling and structured similarity measurement over local regions of sentence representations. Future work could extend this model to related tasks including question answering and information retrieval.