Bayesian Supervised Domain Adaptation for Short Text Similarity

Identification of short text similarity (STS) is a high-utility NLP task with applications in a variety of domains. We explore adaptation of STS algorithms to different target domains and applications. A two-level hierarchical Bayesian model is employed for domain adaptation (DA) of a linear STS model to text from different sources (e.g., news, tweets). This model is then further extended for multitask learning (MTL) of three related tasks: STS, short answer scoring (SAS) and answer sentence ranking (ASR). In our experiments, the adaptive model demonstrates better overall cross-domain and cross-task performance than two non-adaptive baselines.


Short Text Similarity: The Need for Domain Adaptation
Given two snippets of text, neither longer than a few sentences, short text similarity (STS) determines how semantically close they are. STS has a broad range of applications: question answering (Yao et al., 2013; Severyn and Moschitti, 2015), text summarization (Dasgupta et al., 2013; Wang et al., 2013), machine translation evaluation (Chan and Ng, 2008; Liu et al., 2011), and grading of student answers in academic tests (Mohler et al., 2011; Ramachandran et al., 2015). STS is typically viewed as a supervised machine learning problem (Bär et al., 2012; Lynum et al., 2014; Hänig et al., 2015). SemEval contests (Agirre et al., 2012; Agirre et al., 2015) have spurred recent progress in STS and have provided valuable training data for these supervised approaches. However, similarity varies across domains, as does the underlying text; e.g., syntactically well-formed academic text versus informal English in forum QA.
Our goal is to effectively use domain adaptation (DA) to transfer information from these disparate STS domains. While "domain" can take a range of meanings, we consider adaptation to different (1) sources of text (e.g., news headlines, tweets), and (2) applications of STS (e.g., QA vs. answer grading). Our goal is to improve performance in a new domain with few in-domain annotations by using many out-of-domain ones (Section 2).
In Section 3, we describe our Bayesian approach, which posits that per-domain parameter vectors share a common Gaussian prior centered on a global parameter vector. Importantly, this idea extends with little effort to a nested domain hierarchy (domains within domains), which allows us to create a single, unified STS model that generalizes across domains as well as tasks, capturing the nuances that an STS system must have for tasks such as short answer scoring or question answering.
We compare our DA methods against two baselines: (1) a domain-agnostic model that uses all training data and does not distinguish between in-domain and out-of-domain examples, and (2) a model that learns only from in-domain examples. Section 5 shows that across ten different STS domains, the adaptive model consistently outperforms the first baseline while performing at least as well as the second across training datasets of different sizes. Our multitask model also yields better overall results over the same baselines across three related tasks: (1) STS, (2) short answer scoring (SAS), and (3) answer sentence ranking (ASR) for question answering.

Tasks and Datasets
Short Text Similarity (STS) Given two short texts, STS provides a real-valued score that represents their degree of semantic similarity. Our STS datasets come from the SemEval 2012-2015 corpora, containing over 14,000 human-annotated sentence pairs (via Amazon Mechanical Turk) from domains like news, tweets, forum posts, and image descriptions.
For our experiments, we select ten datasets from ten different domains, containing 6,450 sentence pairs. This selection is intended to maximize (a) the number of domains, (b) domain uniqueness: of three different news headlines datasets, for example, we select the most recent (2015), discarding the older ones (2013, 2014), and (c) the amount of per-domain data available: we exclude the FNWN (2013) dataset with 189 annotations, for example, because it limits per-domain training data in our experiments. Sizes of the selected datasets range from 375 to 750 pairs. Average correlation (Pearson's r) among annotators ranges from 58.6% to 88.8% on individual datasets (above 70% for most) (Agirre et al., 2012; Agirre et al., 2013; Agirre et al., 2014; Agirre et al., 2015).
Short Answer Scoring (SAS) SAS comes in different forms; we explore a form where, for a short-answer question, a gold answer is provided, and the goal is to grade student answers based on how similar they are to the gold answer (Ramachandran et al., 2015). We use a dataset of undergraduate data structures questions and student responses graded by two judges (Mohler et al., 2011). These questions are spread across ten assignments and two examinations, each on a related set of topics (e.g., programming basics, sorting algorithms). Inter-annotator agreement is 58.6% (Pearson's ρ) and 0.659 (RMSE on a 5-point scale). We discard assignments with fewer than 200 pairs, retaining 1,182 student responses to forty questions spread across five assignments and tests.

Answer Sentence Ranking (ASR) Given a factoid question and a set of candidate answer sentences, ASR orders the candidates so that sentences containing the answer are ranked higher. Text similarity is the foundation of most prior work: a candidate sentence's relevance is based on its similarity with the question (Wang et al., 2007; Yao et al., 2013; Severyn and Moschitti, 2015).
For our ASR experiments, we use the factoid questions developed by Wang et al. (2007).

Bayesian Domain Adaptation for STS
We first discuss our base linear models for the three tasks: Bayesian L2-regularized linear regression (for STS and SAS) and logistic regression (for ASR). We extend these models for (1) adaptation across different short text similarity domains, and (2) multitask learning of short text similarity (STS), short answer scoring (SAS), and answer sentence ranking (ASR).

Base Models
In our base models (Figure 1), the feature vector f combines with the feature weight vector w (including a bias term w_0) to form predictions. Each parameter w_i ∈ w has its own zero-mean Gaussian prior with its standard deviation σ_{w_i} distributed uniformly in [0, m_{σ_w}]; the covariance matrix Σ_w is diagonal, and the zero-mean prior L2-regularizes the model. In the linear model (Figure 1a), the output S (similarity score for STS; answer score for SAS) is normally distributed around the dot product w^T f. The model error σ_S has a uniform prior over a prespecified range [0, m_{σ_S}]. In the logistic model (Figure 1b) for ASR, the probability p that the candidate sentence answers the question is the sigmoid of w^T f; p then serves as the Bernoulli prior for the binary variable A, which indicates whether or not the candidate answers the question.
The common vectors w and f in these models enable joint parameter learning and consequently multitask learning (Section 3.3).

Figure 1: Base models. Plates represent replication across sentence pairs. Each model learns the weight vector w. For STS and SAS, the real-valued output S (similarity or student score) is normally distributed around the weight-feature dot product w^T f. For ASR, the sigmoid of this dot product is the Bernoulli prior for the binary output A, the relevance of the question's answer candidate.
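As a minimal sketch of the generative story above (not the paper's actual PyMC implementation; all names and the toy hyperparameter values are illustrative), the linear base model's unnormalized log-posterior can be written directly:

```python
import numpy as np

def log_posterior_linear(w, sigma_S, sigma_w, f, s,
                         m_sigma_w=100.0, m_sigma_S=100.0):
    """Unnormalized log-posterior of the Bayesian L2-regularized
    linear model: S ~ Normal(w.f, sigma_S), w_i ~ Normal(0, sigma_w_i),
    sigma_w_i ~ Uniform(0, m_sigma_w), sigma_S ~ Uniform(0, m_sigma_S)."""
    # Uniform hyperpriors: zero density outside the allowed ranges
    if sigma_S <= 0 or sigma_S > m_sigma_S:
        return -np.inf
    if np.any(sigma_w <= 0) or np.any(sigma_w > m_sigma_w):
        return -np.inf
    # Zero-mean Gaussian prior on each weight (the L2 regularizer)
    log_prior = -0.5 * np.sum((w / sigma_w) ** 2) - np.sum(np.log(sigma_w))
    # Gaussian likelihood of the observed scores around w.f
    resid = s - f @ w
    log_lik = -0.5 * np.sum((resid / sigma_S) ** 2) - len(s) * np.log(sigma_S)
    return log_prior + log_lik
```

A Metropolis-within-Gibbs sampler, as used in the experiments, would propose updates to w, σ_{w_i} and σ_S and accept or reject them using this quantity.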

Adaptation to STS Domains
Domain adaptation for the linear model (Figure 1a) learns a separate weight vector w_d for each domain d (applied to similarity computations for test pairs in domain d) alongside a common, global domain-agnostic weight vector w*, which has a zero-mean Gaussian prior and serves as the Gaussian prior mean for each w_d. Figure 2 shows the model. Both w* and w_d have hyperpriors identical to those of w in Figure 1a. Each w_d depends not just on its domain-specific observations but also on information derived from the global, shared parameter w*. The balance between capturing in-domain information and inductive transfer is regulated by Σ_w; larger variance allows w_d more freedom to reflect the domain.
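The two-level prior can be sketched as follows. This is an illustrative numpy fragment, not the paper's implementation; for brevity it replaces the diagonal Σ_w with scalar standard deviations (sigma_star for w*, sigma_d for the domain level), which is what controls shrinkage toward the shared mean:

```python
import numpy as np

def hierarchical_log_prior(w_star, w_domains, sigma_star, sigma_d):
    """Log-density (up to a constant) of the two-level prior:
    w* ~ Normal(0, sigma_star), and each domain vector
    w_d ~ Normal(w*, sigma_d). Larger sigma_d lets each w_d
    drift further from the shared mean w*."""
    lp = -0.5 * np.sum((w_star / sigma_star) ** 2)
    for w_d in w_domains:
        lp += -0.5 * np.sum(((w_d - w_star) / sigma_d) ** 2)
    return lp
```

Under this prior, posterior inference shrinks each w_d toward w* when in-domain evidence is weak, which is exactly the inductive transfer described above.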

Multitask Learning
An advantage of hierarchical DA is that it extends easily to arbitrarily nested domains. Our multitask learning model (Figure 3) models topical domains nested within one of three related tasks: STS, SAS, and ASR (Section 2). This model adds a level to the hierarchy of weight vectors: each domain-level w_d is now normally distributed around a task-level weight vector (e.g., w_STS), which in turn has the global Gaussian mean w*. As in the DA model, all weights at the same level share common variance hyperparameters while those at different levels have separate ones. Again, this hierarchical structure (1) jointly learns global, task-level and domain-level feature weights, enabling inductive transfer among tasks and domains, while (2) retaining the distinction between in-domain and out-of-domain annotations. A task-specific model (Figure 1) that learns only from in-domain annotations supports only (2). On the other hand, a non-hierarchical joint model (Figure 4) supports only (1): it learns a single shared w applied to any test pair regardless of task or domain. We compare these models in Section 5.

Figure 3: Multitask learning: STS, SAS and ASR. Global (w*), task-specific (w_STS, w_SAS, w_ASR) and domain-specific (w_d) weight vectors are jointly learned, enabling transfer across domains and tasks.
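Adding the task level changes the prior only slightly; the hypothetical sketch below (again with isotropic standard deviations s_star, s_task, s_dom in place of the diagonal covariances) shows the three-level nesting concretely:

```python
import numpy as np

def mtl_log_prior(w_star, task_weights, domain_weights, s_star, s_task, s_dom):
    """Three-level prior (up to a constant): w* ~ Normal(0, s_star);
    each task vector w_t ~ Normal(w*, s_task); each domain vector
    within that task ~ Normal(w_t, s_dom).
    task_weights: {task: vector}; domain_weights: {task: [vectors]}."""
    lp = -0.5 * np.sum((w_star / s_star) ** 2)
    for task, w_t in task_weights.items():
        lp += -0.5 * np.sum(((w_t - w_star) / s_task) ** 2)
        for w_d in domain_weights.get(task, []):
            lp += -0.5 * np.sum(((w_d - w_t) / s_dom) ** 2)
    return lp
```

Each domain vector is shrunk toward its task's vector, and each task vector toward the global one, so annotations flow upward and regularization flows downward through the hierarchy.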
Figure 4: A non-hierarchical joint model for STS, SAS and ASR. A common weight vector w is learned for all tasks and domains.

Features
Any feature-based STS model can serve as the base model for a hierarchical Bayesian adaptation framework. For our experiments, we adopt the feature set of the ridge regression model in Sultan et al. (2015), the best-performing system at SemEval-2015 (Agirre et al., 2015).
Input sentences S(1) = (w_1(1), ..., w_n(1)) and S(2) = (w_1(2), ..., w_m(2)) (where each w is a token) produce two similarity features. The first is the proportion of content words in S(1) and S(2) (combined) that have a semantically similar word in the other sentence, identified using a monolingual word aligner (Sultan et al., 2014). The overall semantic similarity of a word pair (w_i(1), w_j(2)) is a weighted sum of their lexical and contextual similarities: a paraphrase database (Ganitkevitch et al., 2013, PPDB) identifies lexically similar words, and contextual similarity is the average lexical similarity of the words in the two words' contexts. Lexical similarity scores of pairs in PPDB, as well as the weights of the lexical and contextual similarities, are optimized on an alignment dataset (Brockett, 2007). To avoid penalizing long answer snippets (that still have the desired semantic content) in SAS and ASR, word alignment proportions outside the reference (gold) answer (SAS) and the question (ASR) are ignored.
The second feature captures finer-grained similarities between related words (e.g., cell and organism). Given the 400-dimensional embedding (Baroni et al., 2014) of each content word (lemmatized) in an input sentence, we compute a sentence vector by adding its content lemma vectors. The cosine similarity between the S(1) and S(2) vectors is then used as an STS feature. Baroni et al. develop the word embeddings using word2vec from a corpus of about 2.8 billion tokens, using the Continuous Bag-of-Words (CBOW) model proposed by Mikolov et al. (2013).
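Both features are simple to compute once the aligner output and word vectors are available. The sketch below is illustrative only: the aligner itself (Sultan et al., 2014) is an external component whose counts we take as given, and `embeddings` is a hypothetical stand-in for the Baroni et al. vectors:

```python
import numpy as np

def alignment_proportion(n_content_1, n_content_2, n_aligned_1, n_aligned_2):
    """Feature 1 (sketch): of all content words in the two sentences
    combined, the fraction the monolingual aligner paired with a
    semantically similar word in the other sentence."""
    total = n_content_1 + n_content_2
    return (n_aligned_1 + n_aligned_2) / total if total else 0.0

def sentence_cosine(lemmas1, lemmas2, embeddings):
    """Feature 2 (sketch): sum each sentence's content-lemma vectors,
    then take the cosine of the two sentence vectors. `embeddings`
    maps a lemma to its vector; out-of-vocabulary lemmas are skipped."""
    def sent_vec(lemmas):
        vecs = [embeddings[w] for w in lemmas if w in embeddings]
        return np.sum(vecs, axis=0) if vecs else None
    v1, v2 = sent_vec(lemmas1), sent_vec(lemmas2)
    if v1 is None or v2 is None:
        return 0.0
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```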

Experiments
For each of the three tasks, we first assess the performance of our base model to (1) verify our sampling-based Bayesian implementations, and (2) compare to the state of the art. We train each model with a Metropolis-within-Gibbs sampler with 50,000 samples using PyMC (Patil et al., 2010; Salvatier et al., 2015), discarding the first half of the samples as burn-in. The hyperparameters m_{σ_w} and m_{σ_S} are both set to 100. Base models are evaluated on the entire test set for each task, and the same training examples as in the state-of-the-art systems are used. Table 1 shows the results. Following SemEval, we report a weighted sum of correlations (Pearson's r) across all test sets for STS, where the weight of a test set is proportional to its number of pairs. Our model and Sultan et al. (2015) are almost identical on all twenty test sets from SemEval 2012-2015, supporting the correctness of our Bayesian implementation.
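The weighted summary metric is straightforward to compute; a small sketch (function name is ours):

```python
def weighted_correlation(correlations, sizes):
    """Summary STS metric (sketch): sum of per-test-set Pearson r
    values, each weighted by the test set's share of all pairs."""
    total = sum(sizes)
    return sum(r * n / total for r, n in zip(correlations, sizes))
```

For example, two test sets with r = 0.8 (100 pairs) and r = 0.6 (300 pairs) yield 0.8 * 0.25 + 0.6 * 0.75 = 0.65.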
Following Mohler et al. (2011), for SAS we use RMSE and Pearson's r with gold scores over all answers. These metrics are complementary: correlation measures consistency across students while error measures deviation from individual scores. Our model beats the state-of-the-art text matching model of Mohler et al. (2011) on both metrics.

Finally, for ASR, we adopt two metrics widely used in information retrieval: mean average precision (MAP) and mean reciprocal rank (MRR). MAP assesses the quality of the ranking as a whole whereas MRR evaluates only the top-ranked answer sentence. Severyn and Moschitti (2015) report a convolutional neural network model of text similarity with top ASR results on the Wang et al. (2007) dataset.
Our model outperforms this model on both metrics.
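For reference, both ranking metrics can be computed from per-question relevance labels given in ranked order; the following is a standard sketch, not tied to any particular evaluation toolkit:

```python
def mean_reciprocal_rank(rankings):
    """MRR: average over questions of 1/rank of the first relevant
    answer sentence. Each ranking is a list of 0/1 relevance labels
    in ranked order; questions with no relevant sentence score 0."""
    rr = []
    for labels in rankings:
        rank = next((i + 1 for i, rel in enumerate(labels) if rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def mean_average_precision(rankings):
    """MAP: mean over questions of average precision, i.e. the mean
    of precision values at each rank where a relevant sentence appears."""
    aps = []
    for labels in rankings:
        hits, precisions = 0, []
        for i, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)
        aps.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(aps) / len(aps)
```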

Adaptation to STS Domains
Ideally, our domain adaptation (DA) should allow the application of large amounts of out-of-domain training data along with few in-domain examples to improve in-domain performance. Given data from n domains, two other alternatives in such scenarios are: (1) to train a single global model using all available training examples, and (2) to train n individual models, one for each domain, using only in-domain examples. We present results from our DA model and these two baselines on the ten STS datasets discussed in Section 2. We fix the training set size per domain and split each domain into train and test folds randomly.
Models have access to training data from all ten domains (thus nine times more out-of-domain examples than in-domain ones). Each model (global, individual, and adaptive) is trained on the relevant annotations and applied to the test pairs, and Pearson's r with gold scores is computed for each model on each individual test set. Since performance can vary across different splits, we average over 20 splits with the same train/test ratio per dataset. Finally, we evaluate each model with a weighted sum of average correlations across all test sets, where the weight of a test set is proportional to its number of pairs.

Figure 5 shows how the models adapt as the training set grows. The global model clearly falters with larger training sets in comparison to the other two models. On the other hand, the domain-specific model (i.e., the ten individual models) performs poorly when in-domain annotations are scarce. Importantly, the adaptive model performs well across different amounts of available training data.
To summarize Figure 5, we compute each model's average correlation in each domain over all seven training set sizes. We then normalize each score by dividing by the best score in that domain. Each cell in Table 2 shows this score for a model-domain pair.
For example, Row 1 shows that-on average-the individual model performs the best (hence a correlation ratio of 1.0) on QA forum answer pairs while the global model performs the worst. While the adaptive model is not the best in every domain, it has the best worst-case performance across domains. The global model suffers in domains that have unique parameter distributions (e.g., MSRpartest: a paraphrase dataset). The individual model performs poorly with few training examples and in domains with noisy annotations (e.g., SMT: a machine translation evaluation dataset). The adaptive model is much less affected in such extreme cases. The summary statistics (weighted by dataset size) confirm that it not only stays the closest to the best model on average, but also deviates the least from its mean performance level.

Qualitative Analysis
We further examine the models to understand why the adaptive model performs well in different extreme scenarios, i.e., when one of the two baseline models performs worse than the other. Table 3 shows feature weights learned by each model from a split with seventy-five training pairs per domain and how well each model does.

Table 3: Feature weights and correlations of different models in three extreme scenarios. In each case, the adaptive model learns relative weights that are more similar to those in the best baseline model.

All three domains have very different outcomes for the baseline models. We show weights for the alignment (w_1) and embedding (w_2) features. In each domain, (1) the relative weights learned by the two baseline models are very different, and (2) the adaptive model learns relative weights that are closer to those of the best model. In SMT, for example, the predictor weights learned by the adaptive model have a ratio very similar to the global model's, and it does just as well. On Answers-students, however, it learns weights similar to those of the in-domain model, again approaching the best results for the domain.

Table 4: Two example sentence pairs with gold similarity scores and absolute model errors (∆G: global, ∆I: individual, ∆A: adaptive). Pair 1 (SMT): "Now, the labor of cleaning up at the karaoke parlor is realized." / "Up till now on the location the cleaning work is already completed." Gold=.52, ∆G=.1943, ∆I=.2738, ∆A=.2024. Pair 2 (MSRpar-test): "The Chelsea defender Marcel Desailly has been the latest to speak out." / "Marcel Desailly, the France captain and Chelsea defender, believes the latter is true." Gold=.45, ∆G=.2513, ∆I=.2222, ∆A=.2245.

Table 4 shows the effect of this on two specific sentence pairs. The first pair is from SMT; the adaptive model has a much lower error than the individual model on this pair, as it learns a higher relative weight for the embedding feature in this domain (Table 3) via inductive transfer from out-of-domain annotations. The second pair, from MSRpar-test, shows the opposite: in-domain annotations help the adaptive model fix the faulty output of the global model by upweighting the alignment feature and downweighting the embedding feature.
The adaptive model gains from the strengths of both in-domain (higher relevance) and out-of-domain (more training data) annotations, leading to good results even in extreme scenarios (e.g., in domains with unique parameter distributions or noisy annotations).

Multitask Learning
We now analyze performance of our multitask learning (MTL) model on each of the three tasks: STS, SAS and ASR. The multitask baselines resemble those for DA: (1) a global model trained on all available training data (Figure 4), and (2) nineteen task-specific models, each trained on an individual dataset from one of the three tasks (Figure 1). The smallest of these datasets has only 204 pairs (SAS assignment #1); therefore, we use training sets with up to 175 pairs per dataset. Because the MTL model is more complex, we use stronger regularization for this model (m_{σ_w}=10) while keeping the number of MCMC samples unchanged. As in the DA experiments, we compute average performance over twenty random train/test splits for each training set size. Figure 6 shows STS results for all models across different training set sizes.

For SAS, the adaptive model again has the best overall performance for both correlation and error (Figure 7). The correlation plot is qualitatively similar to the STS plot, but the global model has a much higher RMSE across all training set sizes, indicating a parameter shift across tasks. Importantly, the adaptive model remains unaffected by this shift.
The ASR results in Figure 8 show a different pattern. Contrary to all results thus far, the global model performs the best in this task. Going back to STS, this finding also offers an explanation of why adaptation might have been less useful in multitask learning than in domain adaptation, as only the former has ASR annotations.

Discussion and Related Work
For a variety of short text similarity tasks, domain adaptation improves average performance across different domains, tasks, and training set sizes. Our adaptive model is also by far the least affected by adverse factors such as noisy training data and scarcity or coarse granularity of in-domain examples. This combination of excellent average-case and very reliable worst-case performance makes it the model of choice for new STS domains and applications.
Although STS is a useful task with sparse data, few domain adaptation studies have been reported. Among those is the supervised model of Heilman and Madnani (2013a; 2013b) based on the multilevel model of Daumé III (2007). Gella et al. (2013) report a two-level stacked regressor whose second level combines predictions from n level-1 models, each trained on data from a separate domain. Unsupervised models use techniques such as tagging examples with their source datasets (Gella et al., 2013; Severyn et al., 2013) and computing vocabulary similarity between source and target domains (Arora et al., 2015). To the best of our knowledge, ours is the first systematic study of supervised DA and MTL techniques for STS with detailed comparisons against comparable non-adaptive baselines.

Conclusions and Future Work
We present hierarchical Bayesian models for supervised domain adaptation and multitask learning of short text similarity models. In our experiments, these models show improved overall performance across different domains and tasks. We intend to explore adaptation to other STS applications and with additional STS features (e.g., word and character n-gram overlap) in the future. Unsupervised and semi-supervised domain adaptation techniques, which do not assume the availability of in-domain annotations or which learn effective domain splits (Hu et al., 2014), provide another avenue for future research.