Learning Semantic Textual Similarity from Conversations

We present a novel approach for learning sentence-level semantic similarity representations from conversational data. Our method trains an unsupervised model to predict conversational responses. The resulting sentence embeddings perform well on the Semantic Textual Similarity (STS) Benchmark and on SemEval 2017's Community Question Answering (CQA) question similarity subtask. Performance is further improved by introducing multitask training that combines conversational response prediction and natural language inference. Extensive experiments show that the proposed model achieves the best performance among all neural models on the STS Benchmark and is competitive with state-of-the-art feature-engineered and mixed systems on both tasks.


Introduction
We propose a novel approach to sentence-level semantic similarity based on unsupervised learning from conversational data. We observe that semantically similar sentences have a similar distribution of potential conversational responses, and that a model trained to predict conversational responses should implicitly learn useful semantic representations. As illustrated in Figure 1, "How old are you?" and "What is your age?" are both questions about age, which can be answered by similar responses such as "I am 20 years old". In contrast, "How are you?" and "How old are you?" use similar words but have different meanings and lead to different responses.
Deep learning models have been shown to predict conversational responses with increasingly good accuracy (Henderson et al., 2017; Kannan et al., 2016). The internal representations of such models resolve the semantics necessary to predict the correct response across a broad selection of input messages. Meaning similarity between sentences can then be obtained by comparing the sentence-level representations learned by such models. We follow this approach and assess the quality of the resulting similarity scores on the Semantic Textual Similarity (STS) Benchmark (Cer et al., 2017) and a question similarity subtask from SemEval 2017's Community Question Answering (CQA) evaluation. The STS Benchmark scores sentence pairs based on their degree of meaning similarity. The CQA subtask B (Nakov et al., 2017) ranks questions based on their similarity with a target question.

Figure 1: Sentences have similar meanings if they can be answered by a similar distribution of conversational responses.
We first assess representations learned from unsupervised conversational input-response pairs. We then explore augmenting our model with multitask training over a combination of unsupervised conversational response prediction and supervised training on Natural Language Inference (NLI) data, as training on NLI has been shown to independently yield useful general-purpose representations (Conneau et al., 2017). Unsupervised training over conversational data yields representations that perform well on STS and CQA question similarity. The addition of supervised SNLI data leads to further improvements and reaches state-of-the-art performance for neural STS models, surpassing training on NLI data alone.

Figure 2: The conversational response selection problem attempts to identify the correct response from a collection of candidate responses. We train using batch negatives, with each candidate response serving as a positive example for one input and a negative sample for the remaining inputs.

Approach
This section describes the conversational learning task and our architecture for predicting conversational responses. We detail two encoding methods for converting sentences into sentence embeddings and describe multitask learning over conversational and NLI data.

Conversational Response Prediction
We formulate the conversational learning task as response prediction given an input (Kannan et al., 2016; Henderson et al., 2017). Following prior work, the prediction task is cast as a response selection problem. As shown in Figure 2, the model P(y|x) attempts to identify the correct response y from among K − 1 randomly sampled alternatives.

Model Architecture
Our model architecture encodes input and response sentences into fixed-length vectors u and v, respectively. The preference of an input described by u for a response described by v is scored by the dot product of the two vectors. The dot-product scores are converted into probabilities using a softmax over the scores from all candidate responses. Model parameters are trained to maximize the log-likelihood of the correct responses. Figure 3 illustrates the input-response scoring model architecture. Tied parameters are used for the input and response encoders. In order to model the mapping between inputs and their expected responses, the response embeddings are passed through an additional feed-forward network to get the final response vector v before computing the dot product with the input sentence embedding. Training is performed using batches of K randomly shuffled input-response pairs. Within a batch, each response serves as the correct answer to its corresponding input and as an incorrect response to the remaining K − 1 inputs in the batch. In the remaining sections, this architecture is referred to as the input-response model. Figure 4 illustrates the encoders we explore for obtaining sentence embeddings: DANs (Iyyer et al., 2015) and Transformer (Vaswani et al., 2017).

Figure 3: Conversational response prediction model. The sentence encoders are in red and use shared parameters. Fully connected DNN layers perform the mapping between the semantics of the input sentence and the candidate response.
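As a concrete sketch of this training objective, the in-batch softmax loss can be written as follows. This is our own minimal numpy illustration, not the paper's implementation; function and variable names are ours.

```python
import numpy as np

def batch_response_loss(U, V):
    """Mean negative log-likelihood over a batch of K input-response
    pairs, using the other K-1 responses in the batch as negatives.

    U: (K, d) input sentence embeddings.
    V: (K, d) final response vectors (after the response-side DNN).
    Entry (i, j) of the score matrix is the dot product of input i
    with candidate response j; the diagonal holds the true pairings.
    """
    scores = U @ V.T                                     # (K, K) dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))
```

In the full model, gradients of this loss would update both the shared encoder and the response-side feed-forward network.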

DAN
Deep averaging networks (DANs) compute sentence-level embeddings by first averaging word-level embeddings and then feeding the averaged representation to a deep neural network (DNN) (Iyyer et al., 2015). We provide our encoder with input embeddings for both the words and bigrams in the sentence being encoded. This simple architecture has been found to outperform LSTMs on email response prediction (Henderson et al., 2017). The embeddings for words and bigrams are learned during training of the input-response model. Our implementation sums the input embeddings and then divides by sqrt(n), where n is the sentence length.3 The resulting vector is passed as input to the DNN. DANs perform well in practice on sentence-level prediction and encoding tasks (Iyyer et al., 2015; Henderson et al., 2017). However, they lack any explicit network structure for encoding long-range relationships between words.
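A minimal sketch of the sqrt(n) combiner described above (our own illustration; the paper uses TensorFlow's built-in sqrtn combiner rather than this code):

```python
import numpy as np

def sqrtn_combine(token_embeddings):
    """sqrt(n) combiner: sum the word/bigram embeddings for a sentence
    and divide by sqrt(n), where n is the number of input embeddings.
    The result sits between a plain sum (fully length-sensitive) and a
    mean (fully length-invariant)."""
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    n = token_embeddings.shape[0]
    return token_embeddings.sum(axis=0) / np.sqrt(n)
```

The combined vector would then be fed to the DNN layers of the encoder.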

Transformer
Transformer (Vaswani et al., 2017) is a recent network architecture that uses attention mechanisms to explicitly capture relationships between words appearing at any position in a sentence. The architecture achieves state-of-the-art performance on translation tasks and is available as open source.4 While the original Transformer architecture contains both an encoder and a decoder, we only need the encoder component in our training procedure. The encoder is constructed as a series of attention layers, each consisting of a multi-headed self-attention operation over all input positions followed by a feed-forward layer that processes each position independently (see Figure 4b). Positional information is captured by injecting a "timing signal" into the input embeddings based on sine/cosine functions at different frequencies.

3 sqrtn is one of TensorFlow's built-in embedding combiners. The intuition behind dividing by sqrt(n) is as follows: we want our input embeddings to be sensitive to length, but we also want to ensure that for short sequences the relative differences in the representations are not dominated by sentence-length effects.

4 https://github.com/tensorflow/tensor2tensor
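The sinusoidal timing signal can be sketched as follows. This is our approximation of the construction in Vaswani et al. (2017); the exact timescale constants in the tensor2tensor implementation may differ.

```python
import numpy as np

def timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal position signal: sin/cos of each position at
    geometrically spaced frequencies, concatenated along the channel
    axis. Returns an array of shape (length, channels) that is added
    to the input embeddings."""
    position = np.arange(length)[:, None].astype(float)      # (length, 1)
    num_timescales = channels // 2
    log_inc = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(-np.arange(num_timescales) * log_inc)
    scaled = position * inv_timescales[None, :]              # (length, channels/2)
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1)
```

Because the signal is deterministic, the model can attend to relative positions without any learned position embeddings.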
The transformer encoder output is a variablelength sequence. We reduce it to fixed length by averaging across all sequence positions. Intuitively, this is similar to building a bag-of-words representation, except that the words have had a chance to interact with their contexts through the attention layers. In practice, we see that the learned attention masks focus largely on nearby words in the first layer, and attend to progressively more distant context in the higher layers.

Multitask Encoder
We anticipate that learning good semantic representations may benefit from the inclusion of multiple distinct tasks during training. Multiple tasks should improve the coverage of semantic phenomena that are critical to one task but less essential to another. We explore multitask models that use a shared encoder for learning conversational response prediction and natural language inference (NLI). The NLI data are from the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015). The sentences are mostly non-conversational, providing a complementary learning signal. Figure 5 illustrates the multitask model with SNLI. We keep the input-response model the same and build another two encoders for the SNLI pairs, sharing parameters with the input-response encoders. Following Conneau et al. (2017), we encode a sentence pair into vectors u1, u2 and construct the feature vector (u1, u2, |u1 − u2|, u1 ∗ u2). The feature vector is fed into a 3-way classifier consisting of a feed-forward network culminating in a softmax layer. Following prior work, we use a single 512-unit hidden layer for our experiments.
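The feature-vector construction is straightforward; a minimal sketch (function name is ours):

```python
import numpy as np

def nli_features(u1, u2):
    """Build the feature vector (u1, u2, |u1 - u2|, u1 * u2) for a
    sentence pair, following Conneau et al. (2017). The result is fed
    to the 3-way entailment classifier."""
    u1, u2 = np.asarray(u1, dtype=float), np.asarray(u2, dtype=float)
    return np.concatenate([u1, u2, np.abs(u1 - u2), u1 * u2])
```

The element-wise difference and product expose simple symmetric and asymmetric comparisons between the two encodings that a shallow classifier can exploit directly.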

Conversational Data
Our unsupervised model relies on structured conversational data. The data for our experiments are drawn from Reddit conversations spanning 2007 to 2016, extracted by Al-Rfou et al. (2016). This corpus contains 133 million posts and a total of 2.4 billion comments. The comments are mostly conversational and well structured, making it a good resource for training conversational models. Figure 6 provides an example of a Reddit comment chain. Comment B is a child of comment A if comment B is a reply to comment A. We extract comments and their children to form the input-response pairs described above. Several rules are applied to filter out noisy data. A comment is removed if any of the following conditions holds: the number of characters is ≥ 350; the percentage of alphabetic characters is ≤ 70%; it starts with "https", "/r/" or "@"; or the author's name contains "bot". The total number of extracted pairs is around 600 million.

Figure 5: Architecture of the multitask model. Sentence encoders are in red and share parameters.

We adjust the batch size to 256 and the learning rate to 0.001 after 30 million and 20 million steps for the Reddit and Reddit+SNLI models, respectively. When training the multitask model, we initialize the shared parameters with a pretrained Reddit model. We employ a distributed training system with multiple workers, where 95% of the workers continue training the Reddit task and 5% of the workers train the SNLI task. We use a sentence embedding size of 500 in all experiments and normalize sentence embeddings prior to use in subsequent network layers. The parameters were only lightly tuned to prevent overfitting on the SNLI task. The encoder configurations are taken from the default parameters of previous work. For DAN, we employ a 3-layer DNN with layers containing 300, 300, and 500 hidden units. For the transformer encoder, our experiments make use of 6 attention layers (num_hidden_layers) and 8 attention heads (num_heads). Within each attention layer, the feed-forward network applied to each head has an input and output size of 512 (hidden_size) and makes use of a 2048-unit inner layer (filter_size).
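The filtering rules can be sketched as a simple predicate. This is our own illustration; details such as whether whitespace counts toward the alphabetic-character percentage are assumptions not specified in the text.

```python
def keep_comment(text, author):
    """Return True if a Reddit comment passes the paper's filters:
    drop comments that are too long, too non-alphabetic, start with a
    link/subreddit/mention marker, or whose author looks like a bot."""
    if len(text) >= 350:                       # too long
        return False
    alpha = sum(c.isalpha() for c in text)     # alphabetic characters
    if alpha / max(len(text), 1) <= 0.70:      # <= 70% alphabetic
        return False
    if text.startswith(("https", "/r/", "@")):  # links, subreddits, mentions
        return False
    if "bot" in author:                        # bot authors
        return False
    return True
```

Applied over the comment chains, a surviving (comment, child) pair becomes one input-response training example.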

Experiments
We first evaluate the different encoders on the response prediction task. For the multitask models, we then examine their performance on SNLI. Finally, we evaluate the encoders on the STS Benchmark (Cer et al., 2017) and on SemEval 2017 Community Question Answering (CQA) subtask B (Nakov et al., 2017). We refer to the model trained over Reddit input-response pairs as Reddit and the multitask model as Reddit+SNLI.

Response Prediction
Following Henderson et al. (2017), we use precision at N (P@N) as the evaluation metric for the conversational response prediction task. Given an input, the task is to select the true response (positive) from 99 randomly selected responses (negatives). We rank all 100 candidate responses by their dot-product scores from the input-response model. The P@N score evaluates whether the true response appears in the top N responses. For the evaluation, the Reddit data is randomly split into train (90%) and test (10%) sets. Table 1 shows the P@N results of Reddit models trained with different encoders, for N = 1, 3, 10. The DAN encoder (with n-grams), as investigated by Henderson et al. (2017), provides a strong baseline. We observe that the transformer encoder outperforms DAN for all values of N, achieving a P@1 of 65.7% versus 56.1% for DAN. Given its stronger performance, we use a transformer encoder for the remainder of the experiments reported in this work.

SNLI

SNLI (Bowman et al., 2015) annotates the inferential relationship between paired sentences as entailment, contradiction, or neutral. One sentence is entailed by another if its meaning can be inferred from the other. Sentences contradict each other if the meaning of one implies that the other is not true. The sentence pairs in the dataset are partitioned into train (550,152), dev (10,000), and test (10,000) sets. Model performance is evaluated based on classification accuracy.

Our multitask model learns a shared encoder for the conversational response prediction and SNLI tasks. We report evaluation results on the SNLI task in order to facilitate comparison with InferSent (Conneau et al., 2017), which served as the inspiration for the inclusion of the SNLI task within a multitask model. For reference, we provide the results of Gumbel TreeLSTM (Williams et al., 2017), the best sentence-encoder-based model, and KIM Ensemble (Chen et al., 2017), the current state of the art.

Table 2: SNLI classification performance for the Reddit+SNLI model using the transformer encoder, with reference evaluation numbers from prior work. We note that, similar to InferSent, our goal is to use SNLI to obtain better sentence representations rather than to achieve state-of-the-art performance on the SNLI task itself.
Sentence-encoder-based models first encode the two sentences in an SNLI input pair separately and then feed the encodings into a classifier. By comparison, other models explicitly consider word-level interactions between the paired sentences (e.g., using cross-attention). We note that our model is sentence-encoder based. Table 2 shows the accuracy on the test set of the joint model and baselines. The multitask model achieves 84.1% accuracy, close to the performance of InferSent. There are two significant differences between our model and prior work. First, the proposed model learns all model parameters from scratch, including the word embeddings. Due in part to the size of the SNLI training set, InferSent uses a large pre-trained word embedding model fit via GloVe (Pennington et al., 2014) on 840 billion tokens of web crawl data, which results in fewer out-of-vocabulary words. For our multitask model, the Reddit dataset is large enough that we do not necessarily require pre-trained word embeddings. However, it is possible the pretrained GloVe embeddings provide slightly better performance on the SNLI task.5 Second, our multitask model learns two tasks simultaneously, balancing performance between them, while InferSent only optimizes performance on SNLI. As presented below, our multitask model performs better on STS. We suspect multitask training both increases coverage of different language phenomena and acts as a regularizer across tasks, preventing the resulting sentence embeddings from overfitting any particular task and thus improving transfer performance to new tasks.6

STS Benchmark
The proposed models encode text into a sentence-level embedding space. We evaluate the extent to which the embeddings accurately encode sentence-level meaning using the Semantic Textual Similarity (STS) Benchmark. The benchmark includes English datasets from the SemEval/*SEM STS shared tasks between 2012 and 2017 (Cer et al., 2017; Agirre et al., 2016, 2015, 2014, 2013, 2012). The data include 8,628 sentence pairs from three categories: captions, news, and forums. Each pair is annotated with a human-labeled degree of meaning similarity, ranging from 0 to 5. The dataset is divided into train (5,749), dev (1,500), and test (1,379) sets. We report results using two configurations for the evaluation of the Reddit and Reddit+SNLI models. The first configuration is "out-of-the-box", with no adaptation for the STS task. Rather, we take the original sentence embeddings u, v and directly score the sentence-pair similarity based on the angular distance between the two vectors, − arccos(uv / (||u|| ||v||)).7 We suspect the original sentence embeddings from the Reddit and Reddit+SNLI models will not necessarily weight all semantic distinctions in a way that is consistent with the annotations for STS. The second configuration uses a single transformation matrix to fine-tune the sentence embedding representations for the STS task. The matrix, which is parameterized using the STS training data, transforms the original sentence embedding vectors u, v to u∗, v∗.

Table 3 presents results on the dev and test sets of the STS Benchmark. For model comparisons, we include the state-of-the-art neural STS model CNN (HCTI) (Shao, 2017) and other systems from Cer et al. (2017).8 The untuned Reddit model is competitive with many of the other neural representation models, demonstrating that the sentence embeddings learned on Reddit conversations do keep text with similar semantics close in embedding space. The "out-of-the-box" multitask model, Reddit+SNLI, achieves an r of 0.814 on the dev set and 0.782 on test. Using a transformation matrix to adapt the Reddit model trained without SNLI to STS, we achieve a Pearson's r of 0.809 on dev and 0.781 on test. This surpasses InferSent and is close to the performance of the best neural representation approach, CNN (HCTI).9 The adapted multitask model achieves the best performance among all neural models, with an r of 0.835 on the dev data and 0.808 on test. The results are competitive with state-of-the-art feature-engineered and mixed systems, e.g., ECNU and BIT. However, our models are simpler and require no feature engineering.10

5 Preliminary experiments with pre-trained embeddings on a P@N Reddit response prediction evaluation revealed no performance advantage over embeddings learned directly from the data.

6 We note that, if our model is reduced to just training on SNLI without multitask training on Reddit, it would be equivalent to InferSent but without the use of pretrained word embeddings. We do not provide results for this configuration, as preliminary experiments suggested it performed poorly.

7 arccos is used to convert the cosine similarity scores into angular distances that obey the triangle inequality.
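The "out-of-the-box" scoring function can be sketched as follows (our own illustration; the clip against [-1, 1] guards against floating-point drift and is an implementation detail we add):

```python
import numpy as np

def angular_similarity(u, v):
    """Score a sentence pair by negated angular distance,
    -arccos(u.v / (||u|| ||v||)). Unlike raw cosine similarity, the
    underlying angle obeys the triangle inequality."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return -np.arccos(np.clip(cos, -1.0, 1.0))
```

Identical directions score 0 (the maximum) and orthogonal directions score −π/2, so ranking by this value ranks pairs from most to least similar.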

CQA Subtask B
To further validate the effectiveness of the sentence representations learned from conversational data, we assess the proposed models on subtask B of SemEval Community Question Answering (CQA) (Nakov et al., 2017). In this task, given an "original" question Q and the top ten related questions from a forum (Q1, . . . , Q10) as retrieved by a search engine, the goal is to rank the related questions according to their similarity with respect to the original question. Mean average precision (MAP) is used to evaluate candidate models. Each pairing of an original question and a related question (Q, Qi) is labeled "PerfectMatch", "Relevant", or "Irrelevant". Both "PerfectMatch" and "Relevant" are considered good questions, which should rank above "Irrelevant" ones.

Similar to the STS experiments, we use cosine similarity between the original question and the related questions, without considering any other interaction between the two questions.11 Given a related question Qi and its original question Q, we first encode them into vectors ui and u. The related questions are then ranked based on their cosine similarity with respect to the original question, cos(ui, u). Results are shown in Table 6. SimBow (Charlet and Damnati, 2017) and KeLP (Filice et al., 2017), the best systems on the 2017 task, are used as baselines. Even without tuning on the training data provided by the task, our models show competitive performance. Reddit+SNLI outperforms SimBow-primary, which officially ranked first in the 2017 shared task.

8 InferSent (Conneau et al., 2017), Sent2Vec (Pagliardini et al., 2017), SIF (Arora et al., 2017), PV-DBOW (Lau and Baldwin, 2016), C-PHRASE (Kruszewski et al., 2015), ECNU (Tian et al., 2017) and BIT.

9 For both the STS shared task and the STS Benchmark leaderboard, systems are allowed to use external datasets as long as they do not make use of supervised annotations on data that overlap with the evaluation sets. InferSent introduced the use of SNLI for STS. However, we discovered that 4 of the 1,500 pairs in the STS Benchmark dev set and 5 of the 1,379 pairs in the STS Benchmark test set overlap with the SNLI training set. We do not believe this minimal overlap had a meaningful impact on the results presented here.

10 As summarized by Cer et al. (2017), ECNU makes use of a large feature set that includes: n-gram overlap; edit distance; longest common prefix/suffix/substring; tree kernels; word-alignment-based similarity; summarization and MT evaluation metrics; kernel similarity of bags-of-words and bags-of-dependency triples; and pooled word embeddings. The manually engineered features are combined with scores from DAN- and LSTM-based deep learning models. BIT relies primarily on a measure of sentence information content (IC) with a non-trivial derivation that is optionally combined with either an alignment-based similarity score or the cosine similarity of IDF-weighted summed word embeddings.

11 Our model also excludes the use of comments and user profiles provided by CQA as optional contextual features.

Table 4: Pearson's r of the proposed models on the STS Benchmark with a breakdown by category.

Table 5: Examples of good and bad similarity predictions (model score vs. gold STS label).

      Score   Label   STS Input Sentences
Good  −0.51   4.2     S1: a small bird sitting on a branch in winter. S2: a small bird perched on an icy branch.
Good  −1.23   0.0     S1: microwave would be your best bet. S2: your best bet is research.
Bad   −0.42   2.2     S1: a little boy is singing and playing a guitar. S2: a man is singing and playing the guitar.
Bad   −0.45   1.0     S1: yes, you have to file a tax return in canada. S2: you are not required to file a tax return in canada if you have no taxable income.
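The ranking step can be sketched as follows (our own illustration; row-normalizing the candidate matrix makes the dot product equal to cosine similarity):

```python
import numpy as np

def rank_related(q_vec, related_vecs):
    """Rank related questions by cosine similarity to the original
    question. Returns candidate indices, most similar first."""
    q = q_vec / np.linalg.norm(q_vec)
    R = related_vecs / np.linalg.norm(related_vecs, axis=1, keepdims=True)
    return np.argsort(-(R @ q))  # descending cosine similarity
```

MAP is then computed over this ranking against the "PerfectMatch"/"Relevant" gold labels.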

Analysis
Model performance on the STS Benchmark can be partitioned by sentence-pair source. The test set contains 625 sentence pairs drawn from captions, 500 pairs from news data, and 254 from online forums. Table 4 provides results for each subgroup. For the captions category, adding the SNLI data improves the baseline Reddit model by about 8% absolute. Even with tuning to STS, mixing in SNLI data still helps dramatically on captions: the STS-tuned Reddit+SNLI model is 5% absolute higher than the STS-tuned Reddit model on this category. The improvement is likely attributable to the fact that the SNLI sentences are drawn from image captions, while Reddit does not contain much caption-style data. Training with the SNLI data has a smaller impact on performance for the other categories, with even a slight decrease for the STS-tuned models on news test data.
We observe that the STS-tuned models show only modest performance improvements on the forum data over the untuned models, with much larger improvements for captions and news. Moreover, for the Reddit+SNLI models, tuning produces a large performance increase for news, with smaller increases for both captions and forums. This suggests tuning is in part compensating for domain limitations within the training data.13 Further improvements on the STS Benchmark could likely be achieved by including additional encoder training data sourced from news data. Figure 7 plots predicted similarity scores against the ground-truth labels within the STS Benchmark test data. The figure shows that while the predicted scores are correlated with human judgment, there is still a sizable range of predicted similarity values for any given gold STS label.

13 E.g., the Reddit+SNLI model is trained on image caption and discussion forum data but not news.
We provide examples of good and bad similarity predictions in Table 5. For the two good examples, the model correctly assigns a relatively high similarity score to the first pair and a relatively low score to the second. For the first bad example, the model fails to penalize its similarity score for the semantic distinction between "boy" and "man" as much as human raters did. For the second bad example, simply being on the topic of whether it is necessary to file Canadian tax returns was apparently enough for the model to assign a high similarity score. Human raters correctly assigned a low similarity score, since the two sentences make very different claims.

Quantity of SNLI data and Performance
The experiments in the previous section show that supervised in-domain data, SNLI's image captions, can be used to improve the semantic representations of in-domain (caption) sentences. However, supervised data is difficult to obtain, especially on the order of SNLI's 570,000 sentence pairs. To learn how much supervised data is needed, we train multitask models on Reddit with varying amounts of SNLI data, ranging from 10% to 90% of the full dataset. Figure 8 shows the STS Benchmark results for all data and for caption data only, on both the dev and test sets. When the SNLI data is first added to the training task, Pearson's r increases rapidly across all measures. Even with only 10% of the SNLI data, r reaches around 0.85 for captions on both dev and test. The curves mostly flatten out once 40% of the data is used, with performance improving only slightly past this point. This suggests encoders trained primarily on Reddit data can be efficiently adapted to perform well on other domains using a small sample of in-domain data.

Related Work
The STS task was first introduced by Agirre et al. (2012). Early methods focused on lexical semantics, surface-form matching, and basic syntactic similarity (Bär et al., 2012; Jimenez et al., 2012). More recently, deep learning based methods have become competitive (Shao, 2017; Tai et al., 2015). One approach to this task is to encode sentences into sentence-level embeddings and then calculate the cosine similarity between the encoded representations of the sentence pair. The encoding model can be trained directly on the STS task (Shao, 2017), or it can be trained on an alternative supervised (Conneau et al., 2017) or unsupervised (Pagliardini et al., 2017) task. The primary contribution of this paper falls into the latter category, introducing a new unsupervised task based on conversational data that achieves good performance on predicting semantic similarity scores. Training on input-response data has previously been shown to be effective for email response prediction (Kannan et al., 2016; Henderson et al., 2017). We extend prior work by exploring the effectiveness of representations learned from conversations in capturing general-purpose semantic information. The approach is similar to Skip-Thought (Kiros et al., 2015), which learns sentence-level representations through prior- and next-sentence prediction within a document. However, in our work, the adjacent sentences are drawn from turns in a conversation.

Conclusion
In this paper, we propose using conversational response prediction models to obtain sentence-level embeddings. We find that encodings learned for conversational response prediction perform well on sentence-level semantic similarity tasks. Sentence embeddings extracted from a model trained on conversational data achieve results on the STS Benchmark that are competitive with well-performing models based on sentence-level encoders. A multitask model trained on response prediction and SNLI achieves state-of-the-art performance among sentence encoding based models on the STS Benchmark, surpassing prior work trained on SNLI alone (InferSent). Finally, even without any task-specific training, the sentence embeddings obtained from both the conversational response prediction model and the multitask model that includes SNLI are competitive on CQA subtask B.