Multi-View Domain Adapted Sentence Embeddings for Low-Resource Unsupervised Duplicate Question Detection

We address the problem of Duplicate Question Detection (DQD) in low-resource domain-specific Community Question Answering forums. Our multi-view framework MV-DASE combines an ensemble of sentence encoders via Generalized Canonical Correlation Analysis, using unlabeled data only. In our experiments, the ensemble includes generic and domain-specific averaged word embeddings, domain-finetuned BERT and the Universal Sentence Encoder. We evaluate MV-DASE on the CQADupStack corpus and on additional low-resource Stack Exchange forums. Combining the strengths of different encoders, we significantly outperform BM25, all single-view systems as well as a recent supervised domain-adversarial DQD method.


Introduction
Duplicate Question Detection is the task of finding questions in a database that are equivalent to an incoming query. Many Community Question Answering (CQA) forums leave this task to the collective memory of their users. This results in unnecessary manual work for community members as well as delayed access to answers (Hoogeveen et al., 2015).
Automatic DQD is often approached as a supervised problem with community-generated training labels. However, smaller CQA forums may suffer from label sparsity: On Stack Exchange, 50% of forums have fewer than 160 user-labeled duplicates, and 25% have fewer than 50 (see Figure 1). 1 To overcome this problem, two avenues have been explored: The first is supervised domain-adversarial training on a label-rich source forum (Shah et al., 2018), which works best when source and target domains are related. The second is unsupervised DQD via representation learning (Charlet and Damnati, 2017; Lau and Baldwin, 2016), which requires only unlabeled questions. In this paper, we take the unsupervised avenue.
A major challenge in the context of domain-specific CQA forums is that language usage may differ from the "generic" domains of existing representations. To illustrate this point, compare the following Nearest Neighbor lists of the word "tree", based either on generic GloVe embeddings (Pennington et al., 2014) or on FastText embeddings (Bojanowski et al., 2017) that were trained on specific CQA forums:

generic (GloVe): trees, branches, leaf
chess: searches, prune, modify
outdoors: trees, trunk, trunks
gis: strtree, rtree, btree
wordpress: trees, hierachy, hierarchial
gaming: trees, treehouse, skills

Charlet and Damnati (2017) and Lau and Baldwin (2016) report that representations trained on in-domain data perform better on unsupervised DQD than generic representations. But in a low-resource setting, the amount of unlabeled in-domain data is limited. This can result in low coverage or quality, as illustrated by the in-domain embedding neighbors of "tree" in the smallest forum from our dataset:

windowsphone: dreamspark, l535ds, generally

It is therefore desirable to combine the overall quality and coverage of generic representations with the domain-specificity of in-domain representations via multi-view learning. There is a large body of work on multi-view word embeddings (see Section 2.3), including domain adapted word embeddings (Sarma et al., 2018).
Recent representation learning techniques go beyond the word level and embed larger contexts (e.g., sentences) jointly (Peters et al., 2018;Devlin et al., 2019;Cer et al., 2018). To reflect this paradigm shift, we take multi-view representation learning from the word to the sentence level and propose MV-DASE (Multi-View Domain Adapted Sentence Embeddings), a framework that combines an ensemble of sentence encoders via Generalized Canonical Correlation Analysis (see Section 3.1).
MV-DASE uses unlabeled in-domain data only, making it applicable to the problem of unsupervised DQD. As a framework, it is agnostic to the internal specifics of its ensemble. In Section 3.2, we describe an ensemble of different sentence encoders: domain-specific and generic, contextualized and non-contextualized (see Table 1). In Sections 4 and 5, we demonstrate that MV-DASE is effective at duplicate retrieval on the CQADupStack corpus (Hoogeveen et al., 2015) and on additional low-resource Stack Exchange forums. Significance tests show significant gains over BM25, all single-view systems and domain-adversarial supervised training as proposed by Shah et al. (2018). In Sections 6 and 7, we successfully evaluate MV-DASE on two additional benchmarks: the SemEval-2017 DQD shared task (Nakov et al., 2017) as well as the unsupervised STS Benchmark (Cer et al., 2017).


Related Work

Duplicate question detection

Most prior work on DQD (see the survey by Hoogeveen et al. (2018)) focuses on supervised architectures. As discussed, these approaches are not applicable to forums with few or no labeled duplicates. Shah et al. (2018) tackle label sparsity by domain-adversarial training (ADA). More specifically, they train a bidirectional Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) on a label-rich source forum, while minimizing the distance between source and target domain representations. Their approach beats BM25 and a simple transfer baseline in cases where source and target domain are closely related (e.g., AskUbuntu→SuperUser), but not on more distant pairings. This is not ideal, as the existence of a large related source forum is not guaranteed.
An alternative is unsupervised DQD via representation learning, which does not require any labels. Charlet and Damnati (2017) use a word embedding-based soft cosine distance for duplicate ranking. In a recent DQD shared task (SemEval-2017 task 3B, Nakov et al. (2017)), their best unsupervised system trails the best supervised system by only 2% Mean Average Precision (MAP). This seems reasonable, given that the implicit objective of many representation learning methods (similar representations for similar objects) is closely related to the notion of a duplicate. Charlet and Damnati (2017) report overall better results when embeddings are trained on domain-specific data rather than Wikipedia. However, they make no attempts to combine the two domains. Lau and Baldwin (2016) evaluate two representation learning techniques (doc2vec (Le and Mikolov, 2014) and word2vec (Mikolov et al., 2013a)) on CQADupStack. They also report better results when representations are learned on domain-specific rather than generic data.

Sentence embeddings and STS
Unsupervised DQD is related to the task of unsupervised Semantic Textual Similarity (STS), i.e., sentence similarity scoring (Cer et al., 2017). Arora et al. (2017) show that a weighted average over pre-trained word embeddings, followed by principal component removal, is a strong baseline for STS. We use their weighting scheme, Smooth Inverse Frequency (SIF), in Section 3.2.
Averaged word embeddings are insensitive to word order. This stands in contrast to contextualized encoders, such as LSTMs or Transformers (Vaswani et al., 2017). Contextualized encoders are typically trained as unsupervised language models (Peters et al., 2018; Devlin et al., 2019) or on supervised transfer tasks (Conneau et al., 2017; Cer et al., 2018). At the time of writing, weighted averaged word embeddings achieve better results than contextualized encoders on unsupervised STS. 2

Multi-view word embeddings
Multi-view representation learning is an umbrella term for methods that transform different representations of the same entities into a common space. In NLP, it has typically been applied to word embeddings. A famous example is the cross-lingual projection of word embeddings (Mikolov et al., 2013b; Faruqui and Dyer, 2014). Monolingually, Rastogi et al. (2015) use Generalized Canonical Correlation Analysis (GCCA) to project different word representations into a common space, and Yin and Schütze (2016) learn word meta-embeddings from an ensemble of embedding sets. All of these methods are post-training, i.e., they are applied to fully trained word embeddings. MV-DASE falls into the same category, albeit at the sentence level (see Section 3.1). Other methods, which we will call in-training, encourage the alignment of embeddings during training (e.g., Bollegala et al. (2015); Yang et al. (2017)).

Multi-view sentence embeddings
Multi-view sentence embeddings are less frequently explored than multi-view word embeddings. One exception is Tang and de Sa (2019), who train a recurrent neural network and an average word embedding encoder jointly on an unlabeled corpus. This method is in-training, i.e., it cannot be used to combine pre-existing encoders. Kiela et al. (2018) dynamically integrate an ensemble of word embeddings into a task-specific LSTM. They require labeled data and the resulting embeddings are task-specific. Closest to our work, Sarma et al. (2018) compute domain adapted word embeddings (see Section 2.3) and feed them into InferSent (Conneau et al., 2017), which is trained on SNLI (Bowman et al., 2015). They initialize InferSent with the adapted embeddings and then retrain it on SNLI. Note that this approach is not feasible when the training regime of an encoder cannot be reproduced, e.g., when the original training data is not publicly available.

Method
We now describe MV-DASE as a general framework. For details on the ensemble used in this paper, see Section 3.2.

Framework
GCCA basics. Given zero-mean random vectors x_1 ∈ R^{d_1} and x_2 ∈ R^{d_2}, Canonical Correlation Analysis (CCA) finds projection vectors θ_1, θ_2 such that θ_1^T x_1 and θ_2^T x_2 are maximally correlated. Bach and Jordan (2002) show that CCA reduces to a generalized eigenvalue problem, i.e., to finding scalar-vector pairs (ρ, θ) that satisfy

A θ = ρ B θ,   with   A = [0, Σ_{1,2}; Σ_{2,1}, 0],   B = [Σ_{1,1}, 0; 0, Σ_{2,2}],   (1)

where Σ_{1,1}, Σ_{2,2} are the covariance matrices of x_1, x_2 and Σ_{1,2}, Σ_{2,1} are their cross-covariance matrices. We stack all d eigenvectors into an operator Θ ∈ R^{d × (d_1 + d_2)}. Using this operator, multi-view representations are projected as x̂ = Θ [x_1; x_2].

Generalized CCA (GCCA) generalizes CCA to three or more random vectors x_1, ..., x_J. There are several variants of GCCA (Kettenring, 1971); we follow Bach and Jordan (2002) and solve a multi-view version of Equation 1, where A contains all cross-covariance matrices Σ_{j,j'} (j ≠ j') in its off-diagonal blocks and B is block-diagonal with the covariance matrices Σ_{j,j}. For stability, we add τ σ_j I_j to every covariance matrix Σ_{j,j}, where τ is a hyperparameter (here: τ = 0.1), I_j is the identity matrix and σ_j is the average variance of x_j. Like in the two-view case, we stack all d eigenvectors into an operator Θ ∈ R^{d × Σ_j d_j}.

GCCA application. Assume that we have an ensemble of J sentence encoders, where the j'th encoder is denoted f_j : S → R^{d_j}, S is the set of all possible in-domain strings (here: in-domain questions) and d_j is determined by f_j. Assume also that we have a sample from S, i.e., a corpus of unlabeled in-domain strings, denoted S = {s_1, ..., s_N}. From this corpus, we create one training matrix per encoder:

X_j = [f_j(s_1), ..., f_j(s_N)] ∈ R^{d_j × N}.

We then apply GCCA as described above, yielding Θ ∈ R^{d × Σ_j d_j}. The multi-view embedding of a new input q (e.g., a test query) is:

f_MV(q) = Θ [f_1(q); ...; f_J(q)].
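The GCCA step can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration under the generalized-eigenvalue formulation above, not the authors' implementation; the function names are ours, and views are assumed to be given as zero-mean matrices of shape (d_j, N):

```python
import numpy as np
from scipy.linalg import eigh

def gcca(views, d=300, tau=0.1):
    """Fit GCCA on zero-mean view matrices X_j of shape (d_j, N).

    Solves A theta = rho * B theta, where A holds the cross-covariances
    and B the (regularized) per-view covariances, and returns the
    projection operator Theta of shape (d, sum_j d_j).
    """
    N = views[0].shape[1]
    dims = [X.shape[0] for X in views]
    D = sum(dims)
    A = np.zeros((D, D))
    B = np.zeros((D, D))
    offsets = np.concatenate([[0], np.cumsum(dims)])
    for i, Xi in enumerate(views):
        for j, Xj in enumerate(views):
            C = Xi @ Xj.T / N  # (cross-)covariance of views i and j
            r, c = offsets[i], offsets[j]
            if i == j:
                # regularize with tau * sigma_j * I_j,
                # where sigma_j is the average variance of view j
                C = C + tau * (np.trace(C) / dims[i]) * np.eye(dims[i])
                B[r:r + dims[i], c:c + dims[j]] = C
            else:
                A[r:r + dims[i], c:c + dims[j]] = C
    rho, theta = eigh(A, B)          # eigenvalues in ascending order
    top = np.argsort(rho)[::-1][:d]  # keep the top-d eigenvectors
    return theta[:, top].T

def embed(theta, encodings):
    """Multi-view embedding: project the stacked view encodings."""
    return theta @ np.concatenate(encodings)
```

At test time, `embed` is applied to the concatenated outputs of all encoders for a query, mirroring f_MV above.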

Ensemble
We use MV-DASE on the following ensemble:

• weighted averaged generic GloVe vectors (Pennington et al., 2014)
• weighted averaged domain-specific FastText vectors (Bojanowski et al., 2017)
• the Universal Sentence Encoder (USE) (Cer et al., 2018)
• domain-finetuned BERT (Devlin et al., 2019)

In this section, we describe the encoders in detail. Note that the choice of encoders is orthogonal to the framework and other resources could be used. Where possible, we base our selection on the literature: We choose USE over InferSent due to better performance on STS (Perone et al., 2018), and BERT over ELMo (Peters et al., 2018) due to better performance on linguistic probing tasks (Liu et al., 2019a). The choice of GloVe for generic word embeddings is based on Sarma et al. (2018).

Weighted averaged word embeddings. We denote the generic and domain-specific word embeddings of some word type i as w_{G,i} ∈ R^{d_G} and w_{D,i} ∈ R^{d_D}, respectively. For w_{G,i}, we use pre-trained 300-d GloVe vectors. 3 w_{D,i} are trained using skipgram FastText 4 (100-d, default parameters) on the in-domain corpus S. We SIF-weight all word embeddings by a · (a + p(i))^{-1}, where p(i) is the unigram probability of the word type and the smoothing factor (here: a = 10^{-3}) is taken from Arora et al. (2017). We find that probabilities estimated on S produce better results than the Wikipedia-based probabilities provided by Arora et al. (2017) (see Table 2, top), hence this is what we use below. After weighting, we perform top-3 principal component removal on the embedding matrices, which is beneficial for word-level similarity tasks (Mu et al., 2018). We denote the new embeddings of word type i as ŵ_{G,i}, ŵ_{D,i}. The embedding of a tokenized string s = (s_1, ..., s_T) is computed by averaging:

f_G(s) = (1/T) Σ_t ŵ_{G,s_t},   and analogously for f_D.

Contextualized encoders. USE and BERT are downloaded as pre-trained models. 5 6 USE is a Transformer trained on SkipThought (Kiros et al., 2015), conversation response prediction (Henderson et al., 2017) and SNLI. It outputs a single 512-d sentence embedding, which we use as-is. Below, USE is denoted f_U.
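The SIF weighting and top-3 principal component removal described above can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names, not the authors' code; it assumes the vocabulary is indexed into an embedding matrix of shape (vocab, dim):

```python
import numpy as np

def sif_weight(W, probs, a=1e-3):
    """SIF-weight each row of word-embedding matrix W (vocab x dim)
    by a / (a + p(word)); probs holds unigram probabilities."""
    w = a / (a + probs)
    return w[:, None] * W

def remove_top_pcs(W, n=3):
    """Top-n principal component removal (Mu et al., 2018):
    subtract the projection onto the top-n right singular vectors."""
    Wc = W - W.mean(0)
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    return W - (W @ Vt[:n].T) @ Vt[:n]

def average_embed(token_ids, W_hat):
    """Embedding of a tokenized string: average of its (weighted,
    PC-removed) word vectors."""
    return W_hat[token_ids].mean(0)
```

The same pipeline is applied once with generic GloVe vectors (f_G) and once with domain-specific FastText vectors (f_D).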
BERT is a Transformer that was pre-trained as a masked language model with next sentence prediction. We find that domain-finetuning BERT on S results in improvements over generic BERT (see Table 2, bottom). Note that domain-finetuning refers to unsupervised training as a masked language model, i.e., we only require unlabeled data (Howard and Ruder, 2018). We use default parameters 7 except for a reduced batch size of 8.
At test time, we take the following approach: BERT segments a token sequence s = (s_1, ..., s_T) into a subword sequence s' = ([CLS], s'_1, ..., s'_{T'}, [SEP]), where [CLS] and [SEP] are special tokens that were used during pre-training, and T' ≥ T. BERT produces one 768-d vector v_{l,t} per subword s'_t and layer l ∈ [1, ..., L], where L is the total number of layers (here: 12). We SIF-weight all vectors according to the probability of their subword (estimated on S) and average over layers and subwords, excluding the special tokens:

f_B(s) = (1 / (L T')) Σ_l Σ_t a · (a + p(s'_t))^{-1} v_{l,t}.


Data

We evaluate on the CQADupStack corpus (Hoogeveen et al., 2015), which is based on a 2014 Stack Exchange dump. CQADupStack contains 12 forums that have enough duplicates for supervised training; as a consequence, it may not be representative of low-resource domains. We therefore supplement it with 12 low-resource forums from the 2018-12-20 Stack Exchange dump. 8 For our purposes, low-resource means a forum with 100-200 duplicates, which we consider sufficient for evaluation but insufficient for supervised training. All duplicates in the datasets were labeled by unpaid community members. As a result, false negatives (i.e., unflagged duplicates) are common in the gold standard.

Data split. We split every forum into a test and training set, such that the test set contains all duplicates and the training set contains the remaining unlabeled questions. 9 The unlabeled training set is used for FastText training, BERT domain-finetuning, SIF weight estimation and GCCA. Test queries are never seen during training, not even in an unsupervised way. For hyperparameter choices, we hold out two high-resource and two low-resource forums (highlighted in Table 3). They are not used for the final evaluation and significance tests.
Preprocessing. Every question object consists of a title (average length 9 words), a body (average length 125 words), any number of answers or comments, and metadata (e.g., upvotes, view counts). We preprocess the data with the CQADupStack package. 10 To calculate question representations, we use the concatenation of question title and body. We always ignore answers, comments and metadata, as this information is not usually available at the time a question is posted.

Evaluation and Metrics
Given a test query q, we rank all candidates c ≠ q from the same forum by cos(f(q), f(c)), where f is an encoder (e.g., MV-DASE). Our metrics are MAP, AUC(.05), Normalized Discounted Cumulative Gain (NDCG), Recall@3 (R@3) and Precision@3 (P@3). AUC(.05), the area under the ROC curve up to a false positive rate of .05, is used by Shah et al. (2018). Note that the upper bounds on P@3 and R@3 are not 1, since most duplicates have only one original and a few have more than three.
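The ranking step and the MAP metric can be sketched as follows (a simplified illustration with our own function names, not the authors' evaluation scripts):

```python
import numpy as np

def rank_candidates(query_vec, cand_vecs):
    """Rank candidate questions by cosine similarity to the query.

    Returns candidate indices, most similar first.
    """
    q = query_vec / np.linalg.norm(query_vec)
    C = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    return np.argsort(C @ q)[::-1]

def average_precision(ranking, relevant):
    """AP for one query; MAP is the mean of AP over all test queries."""
    hits, score = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            score += hits / rank
    return score / max(len(relevant), 1)
```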

Baselines
Unsupervised. Our IR baseline is BM25 (Robertson et al., 1995) as implemented in Elasticsearch 6.5.4 (Gormley and Tong, 2015) with default parameters. We test against all single-view encoders from our ensemble. The remaining unsupervised baselines are: • ELMo (Peters et al., 2018). 11 We treat ELMo like BERT (Section 3.2), i.e., we finetune 12 the language model on the in-domain corpus (3 epochs, batch size 8), SIF-weight all vectors according to in-domain word probability and then average over layers and tokens.
• Doc2vec (Le and Mikolov, 2014) trained on the in-domain corpus, using the best DQD hyperparameters reported in Lau and Baldwin (2016).

Ablation studies
We perform a set of experiments where we omit views from the ensemble. We also replace GCCA with naive view concatenation or view averaging. When averaging, we pad lower-dimensional vectors (Coates and Bollegala, 2018).
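The two naive view-combination baselines can be sketched in a few lines of NumPy. This is our own illustration (`concat_views` and `average_views` are hypothetical names); averaging zero-pads lower-dimensional views to the maximum dimensionality, following Coates and Bollegala (2018):

```python
import numpy as np

def concat_views(vecs):
    """Naive multi-view baseline: concatenate the view embeddings."""
    return np.concatenate(vecs)

def average_views(vecs):
    """Naive averaging baseline: zero-pad lower-dimensional views
    before taking the element-wise mean."""
    d = max(v.shape[0] for v in vecs)
    padded = [np.pad(v, (0, d - v.shape[0])) for v in vecs]
    return np.mean(padded, axis=0)
```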

Significance tests
We perform paired t-tests, using the 20 test set forums as data points. 15 We then find groups of equivalent methods by transitive closure of a ∼ b ≡ p ≥ .05. Group A being ranked higher than group B means that every method in A performs significantly better than every method in B. Note that two methods within the same group may still differ significantly; group membership only guarantees a chain of pairwise insignificant differences between them.
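The grouping procedure can be sketched as follows, using SciPy's paired t-test and a small union-find structure for the transitive closure (a simplified illustration, not the authors' scripts):

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel

def equivalence_groups(scores, alpha=0.05):
    """Group methods by transitive closure of 'not significantly
    different' under a paired t-test over per-forum scores.

    scores: dict method -> array of per-forum results (paired points).
    Returns groups of method names, best-scoring group first.
    """
    methods = list(scores)
    parent = {m: m for m in methods}
    def find(m):                       # union-find root lookup
        while parent[m] != m:
            m = parent[m]
        return m
    for a, b in combinations(methods, 2):
        if ttest_rel(scores[a], scores[b]).pvalue >= alpha:
            parent[find(a)] = find(b)  # a ~ b: merge their groups
    groups = {}
    for m in methods:
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values(),
                  key=lambda g: -np.mean([scores[m].mean() for m in g]))
```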

Comparison with baselines
BM25. BM25 is a tough baseline for DQD: In terms of MAP, it is better than or comparable to every single view (see Table 5). MV-DASE on the other hand, which is built from the same views, outperforms BM25 significantly and almost consistently (19 out of 20 test forums), regardless of the metric. This underlines the usefulness of our multi-view approach.

Single views. MV-DASE outperforms the views that make up its ensemble significantly and almost consistently. There are two exceptions (out of 20 test forums): On law and outdoors, f_U (USE) performs slightly better on its own (Table 4, row 15). Since these forums are less "technical" than most, we hypothesize that they may be less in need of domain adaptation.

14 We also experiment with Multinomial Adversarial Networks (MAN) (Chen and Cardie, 2018), a multi-source multi-target framework that can be trained on all 24 forums jointly. Initial results were not competitive with ADA, so we do not include them here. See supplementary material for details.
15 Ten forums for t-tests involving ADA.
Word-level CCA. The word-level CCA baseline by Sarma et al. (2018) outperforms f_G and f_D on their own (see Table 4, rows 10, 21), which validates the approach. The method is directly comparable to MV-DASE¬(f_U, f_B), i.e., MV-DASE on generic and domain-specific averaged word embeddings only (see Table 6). The main differences between them are (a) the order in which CCA and averaging are performed and (b) whether the CCA "vocabulary" is composed of word types or sentences. Note that in contrast to MV-DASE, word-level CCA is incompatible with contextualized embeddings, since it requires a context-independent one-to-one mapping between word types and vectors.
ADA. Supervised domain-adversarial ADA performs significantly worse than unsupervised MV-DASE (see Table 5). It is comparable to BM25 in terms of AUC(.05) (the metric used by Shah et al. (2018)), but not in terms of MAP. Recall that we restricted the choice of source domains to the 12 CQADupStack forums. As a consequence, some target forums were paired with non-ideal source forums (e.g., english→buddhism). It is possible that the baseline would have performed better with a wider choice of source domains. Nonetheless, this observation highlights a key advantage of our approach: It does not depend on the availability of a label-rich related source domain (or indeed, any labels at all).

Other baselines. InferSent performs poorly on the DQD task, which is surprising given its similarity to USE. Recall that InferSent and USE are both pre-trained on sentence-level SNLI, but that the training regime of USE also contains conversation response prediction. USE is therefore expected to be better equipped to handle (a) multi-sentence documents and (b) forum-style language.
Doc2vec is trained on the same data as f D , but performs significantly worse. The difference between them may be due to the ability of FastText to exploit orthography. Domain-finetuned ELMo performs comparably to domain-finetuned BERT on some forums but not consistently.

Ablation study
View ablation. On the low-resource forums, omitting f_D has a beneficial effect (Table 6, row 10). This suggests that the in-domain FastText embeddings have insufficient quality when learned on the smallest forums and/or that domain-finetuned BERT subsumes any positive effect. On the high-resource CQADupStack forums, domain-specific embeddings contribute positively, while generic GloVe does not (rows 1, 2). Table 5 shows that omitting either f_G or f_D from the ensemble does not lead to a significant drop in MAP, but omitting both does.
USE has the biggest positive effect on MV-DASE (Table 6, rows 3,11), also evidenced by the fact that omitting it is significantly more harmful than omitting any other single view (Table 5). Recall from Section 3.2 that USE is trained on supervised transfer tasks, while the remaining encoders are fully unsupervised.
GCCA ablation. The naive concatenation or averaging of views is significantly less effective than view correlation by GCCA (Table 6, rows 7,8,15,16, and Table 5). This underlines that multi-view learning is not just about which views are combined, but also about how. Intuitively, GCCA discovers which features from the different encoders "mean the same thing" in the domain. By contrast, concatenation treats views as orthogonal, while averaging mixes them in an unstructured way.

Evaluation on SemEval-2017 3B
In this section, we evaluate MV-DASE on SemEval-2017 3B, a DQD shared task based on the QatarLiving CQA forum. The benchmark provides manually labeled question pairs for training as well as additional unlabeled in-domain data. Since MV-DASE is unsupervised, we discard all training labels and concatenate training and unlabeled data into a text corpus (≈ 1.5M tokens). This corpus is used for FastText training, BERT domain-finetuning, SIF weight estimation and GCCA, as described in Section 3.
The test set contains 88 queries q with ten candidates c 1 . . . c 10 each. We preprocess all data with the CQADupStack package and concatenate question subjects and bodies, before encoding them. We rank candidates by cos(f (q), f (c)) and evaluate the result with the official shared task scorer. 16 In keeping with the original leaderboard, we report MAP and MRR (Mean Reciprocal Rank). We compare against previous literature as well as all individual views, view concatenation and averaging. See Table 7 for results. As on the Stack Exchange data, MV-DASE outperforms its individual views, their concatenation and average. It beats the previous State of the Art (a supervised system) by a margin of 2.5% MAP.

Evaluation on unsupervised STS
While this paper focuses on Duplicate Question Detection, MV-DASE is also applicable to other unsupervised sentence-pair tasks. As proof of concept, we test it on the unsupervised STS Benchmark (Cer et al., 2017). Here, the task is to predict similarity scores y ∈ R for sentence pairs (s_1, s_2). We treat the benchmark training set as an unlabeled corpus, i.e., we discard all labels and destroy the original sentence pairings by shuffling. The resulting corpus is used for BERT domain-finetuning, SIF weight estimation and GCCA. At test time, we measure Pearson's r between cos(f(s_1), f(s_2)) and y, where f is an encoder (e.g., MV-DASE) and y is the ground truth similarity of test set pair (s_1, s_2).
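The test-time evaluation can be sketched as follows (our own minimal illustration; `sts_pearson` is a hypothetical name, and `encode` stands for any of the encoders under comparison):

```python
import numpy as np

def sts_pearson(encode, pairs, gold):
    """Pearson's r between cosine similarities and gold STS scores.

    encode: sentence -> vector; pairs: list of (s1, s2) sentence pairs;
    gold: list of ground-truth similarity scores.
    """
    sims = []
    for s1, s2 in pairs:
        v1, v2 = encode(s1), encode(s2)
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return np.corrcoef(sims, gold)[0, 1]
```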
In this experiment, the ensemble contains USE (f_U), domain-finetuned BERT (f_B) and f_G. For f_G, we either use SIF-weighted averaged GloVe vectors (Section 3.2), or unweighted averaged ParaNMT 17 word and trigram vectors (Wieting and Gimpel, 2018), which are the current State of the Art on the unsupervised STS Benchmark test set (Ethayarajh, 2018). The unlabeled training set is very small (64K tokens); hence, we do not include f_D in the ensemble, and we finetune the BERT language model for 10K rather than 100K steps to avoid overfitting. As on the DQD tasks, MV-DASE beats its individual views as well as naive view concatenation and averaging (see Table 8). After adding ParaNMT to the ensemble, we achieve competitive results.

Future Work
Non-Linear GCCA. In Section 3.1, we assumed that relationships between representations are linear. This is probably reasonable for word embeddings (most cross-lingual word embedding methods are linear projections, e.g., Artetxe et al. (2018)), but it is unclear whether it holds for sentence embeddings. Potential avenues for non-linear GCCA include Kernel GCCA (Tenenhaus et al., 2015) and Deep GCCA (Benton et al., 2017).

More views. A major advantage of MV-DASE is that it is agnostic to the number and specifics of its views. We plan to investigate whether additional or different views (e.g., encoders learned on related domains) can increase performance.

17 github.com/jwieting/para-nmt-50m

Conclusion
We have presented a multi-view approach to unsupervised Duplicate Question Detection in low-resource, domain-specific Community Question Answering forums. MV-DASE is a multi-view sentence embedding framework based on Generalized Canonical Correlation Analysis. It combines domain-specific and generic weighted averaged word embeddings with domain-finetuned BERT and the Universal Sentence Encoder, using unlabeled in-domain data only. Experiments on the CQADupStack corpus and additional low-resource forums show significant improvements over BM25 and all single-view baselines. MV-DASE sets a new State of the Art on a recent DQD shared task (SemEval-2017 3B), with a 2.5% MAP improvement over the best supervised system. Finally, an experiment on the STS Benchmark suggests that MV-DASE has potential on other unsupervised sentence-pair tasks.